Wednesday, August 06, 2008

Code is the easy part: Preventing data corruptions

Coding, coding, coding. There's an awful lot of attention paid to coding in the software engineering world. I read reddit's programming category a lot these days, and most software engineering posts are specifically about code. I think this is not because coding is the most important part of being a programmer, but because it is the most interesting and fun part.

Code is easy to change. If you have a bug, you reproduce it and fix it, and then it goes away. Sometimes it's difficult, but even if it is, you fix it, make sure it doesn't happen again by writing test cases (preferably before you fix it), and then you forget about it.

Let's contrast fixing bugs in code with fixing corrupted data. Corrupted data in any reasonable system is probably going to be caused by a software bug. So, you first have to find out why the data is corrupted. Of course, maybe the corruption is old, and the bug has been fixed. Hard to say. You have to investigate. You could try to reproduce it, if you can figure out a likely scenario. Most likely you will fail to reproduce it, and even more likely you know you will fail to reproduce it, so you don't try. So instead you read the code, looking for how it could happen. This is a useful practice. You can sometimes find the cause here. Then you can write a test for it and fix it.

After you fix the bug, your work isn't done, of course. You have to actually fix the corruptions. This could be as simple as running a SQL query, or as complicated as writing a script to patch things up using a code library to do the work. Of course, sometimes you never can find out why your data is corrupted, so you just have to fix it, if you can figure out how.


Some amount of corrupted data is an inevitability, and in fact some may come from design decisions. For example, some database systems cannot do two-phase commits, and if you need to hook into one, you may have to accept a certain amount of data corruption due to not having atomicity in your transactions. If the error rate is very low, some corruption may be a fair price to pay for whatever benefits the second system is getting you. Even so, this is dangerous: a low error rate today may be a severe error rate tomorrow, leaving you with a lot of angry customers with corrupted data, and a few dejected developers who have to clean it all up.

There are a few best practices you can follow to avoid data corruption:
  • Use one transactional system with ACID properties, and use it correctly.
  • When using SQL, use foreign key references whenever possible. 
  • Before saving data, assert its correctness as a precondition. This includes both the values stored and the relationships between the data and other data: whatever links to it, and whatever it links to.
  • Create a data checker that will run a series of tests on your data. This is basically like unit tests for data, but you can run it not only after a series of tests, but in your production system too. Run this program regularly, and pay attention to the output. You want to be alerted of any changes in the sanity of your data. As with unit tests, the more invariants you encode into this tool, the better off you will be. When changing or adding data, modify the data checker code. After each QA cycle, run the data checker. Any errors should be entered as bugs.
  • If your data can be repaired, have an automated data repairer.  This shouldn't be run regularly, because you don't want to get too complacent about your corruptions.  Instead, if you notice that the data checker has picked up some new errors, then you run this, modifying it first if the errors are of a new type.
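
To make the data checker idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything specific in it is an assumption made up for illustration: the customers and orders tables, the three invariants, and the names CHECKS, check_data, and demo. The idea is only that each invariant is a query that returns the ids of the rows that violate it, so adding a new invariant means adding one entry to a list.

```python
import sqlite3

# Hypothetical schema, invented for this example.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL
);
"""

# Each check is a description plus a query returning the ids of bad rows.
CHECKS = [
    ("order references a missing customer",
     "SELECT o.id FROM orders o "
     "LEFT JOIN customers c ON c.id = o.customer_id "
     "WHERE c.id IS NULL ORDER BY o.id"),
    ("order total is negative",
     "SELECT id FROM orders WHERE total_cents < 0 ORDER BY id"),
    ("customer email is blank",
     "SELECT id FROM customers WHERE TRIM(email) = '' ORDER BY id"),
]


def check_data(conn):
    """Run every check and return a list of (description, row id) pairs."""
    problems = []
    for description, query in CHECKS:
        for (row_id,) in conn.execute(query):
            problems.append((description, row_id))
    return problems


def demo():
    """Build a small in-memory database with two corrupt rows and check it."""
    conn = sqlite3.connect(":memory:")
    # Turn foreign key enforcement off so we can simulate corruption that
    # slipped past the database (for example, data written by older code).
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 500)")  # valid
    conn.execute("INSERT INTO orders VALUES (11, 2, 700)")  # no customer 2
    conn.execute("INSERT INTO orders VALUES (12, 1, -5)")   # negative total
    return check_data(conn)


if __name__ == "__main__":
    for description, row_id in demo():
        print(f"{description}: row {row_id}")
```

In a real system you would point check_data at a production connection from a scheduled job and alert when the list of problems changes, rather than printing it. A repairer could reuse the same list of checks, pairing each one with a fix for the rows it finds.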

Doing a good job on all these tips should prevent most data corruption, but not all. As with bugs, even the best preventive measures will not guarantee success.

Having clean data is extremely important. This data is not yours; it is your customers', and they trust you with it. You need to protect it, and it isn't easy, but preventing data corruptions is always the right thing to do.

In org-mode, abandoning GTD

I've already mentioned the problem with contexts in the GTD system. In using org-mode, I've come to realize that another important concept of GTD is either flawed or redundant: the next action.

Next actions seem good at first. Every project has a next action: a short, doable task that will advance the state of the project. Of course, some projects have several actions you can do in parallel, so there are several next actions. As the number of projects you have grows, you get more and more next actions. At some point, there are too many next actions to keep track of. Some people might say this goes away with proper contexts. However, I've never found a good use for contexts, because in truth at any point I can do any of my next actions. Another strategy for getting rid of excessive projects is to move some to a someday/maybe folder. I think this is reasonable, but sometimes you know that in a particular week you are just going to work on a few different projects, perhaps because of deadlines or other prioritization. What then?

In org-mode, I solved this problem by using the agenda, and scheduling the next actions I wanted to work on for the current day. I would then see a list of the next actions I had to accomplish that day. If I didn't get them done that day, the next day I'd move them up a day, to the current day. It ended up being a daily-planner-like system, a lot like the one Tom Limoncelli recommends in Time Management for System Administrators. But with next actions.

Last week I had a revelation: scheduled items, for me, were the same as next actions! So I removed all next actions, just having states TODO, WAITING, and DONE. Like next actions, when I finish a scheduled item, the next item in the project becomes scheduled. I like my new system. It combines the flexibility of Limoncelli's day-planner system with the project-planning of GTD.

I spend my day as before, taking tasks from the day view of org-mode, and using a weekly review to schedule or de-schedule items from that view. I also add notes to tasks all the time, which proves to be helpful. I think I've benefited from next actions, since they force you to think of actions at a level of granularity such that each task is specific, concrete, and immediately actionable. I treat tasks the exact same way now, even without next actions themselves.

For more info about org-mode, check out the talk on org-mode which happened after I invited Carsten Dominik to give a talk at Google about his system.