Monday, 28 October 2013

Eureka moment (not phone)

At around 1600 on Friday 25th, an Excel spreadsheet which I used failed to start in the usual way, a cell on one of the its larger worksheets having been corrupted. It took around 20 hours of worrying at it, off and on, before we got started again around 1200 the following day. The whole business reminding me of the occasion when a project management contractor from Pcubed expressed frustration, the same chap, as it happens, as I mentioned on 6th August 2011 in the other place (http://pumpkinstrokemarrow.blogspot.co.uk/). The background to the frustration was the failure of a program, a failure which was ultimately caused, I believe, by giving too many different contractors fingers in this one, not very big, civil service pie. But this belief went no way towards running down this particular failure, leaving the project management consultant frustrated because we had no process to follow to sort out failure. He believed in having a process for everything; just follow the process and all will be well with the world. Now while in general terms I agree with him, processes are a good way to get wayward human activities under control, sometimes those activities are too wayward or too unusual for it to be worth while developing a process which works. Cheaper and more effective to rely on skill and experience and to hell with the process; a problem which television detectives such as Chief Inspector Morse know all about.

Coming back to my corrupt cell, I find that it is the outward expression of a property of a row in a Visual Basic (VB) table (called fred in what follows) with some 10,000 rows. The corrupt cell had acquired its value from the wrong row, a value which was inappropriate and which the program refused to accept when I next tried to start it.

First step was to look at all the places in the VB code which addressed this particular property of table fred, places of which about a dozen were suspects. I thought about and dismissed the possibility of a public variable (called cod in what follows) getting corrupted. The code only looked at cod in one or two places and looked entirely harmless. None of the other suspects yielded fruit at this point.

Second step was to correct the corruption and see what happened then. Answer nothing, which was reassuring in a way but went no way towards finding out why we had the corruption in the first place.

Third step was to peer at what I thought were the relevant places in the various logs written by the code as it goes about it lawful business. A slow process, with my having forgotten how much of the code works turning up lots of red herrings. But again, nothing which helped.

Then I thought, was it just finger trouble? The worksheets are unprotected and it would be easy enough to damage them by hitting the wrong key, dropping the newspaper on the key board or whatever.

Next, came the obscure bug in the Excel code base, which has survived despite the battalions of MS staffers assigned to the product. The obscure but hitherto unknown bug which has hit my code, a thought which usually comes to mind when I get a long-life bug. Perhaps a fantasy of fame at Microsoft HQ; the obscure geek working away in his garage who hits the pot of gold. I am sure a good psychiatrist could make something of it; perhaps the chaps at http://www.harleytherapy.co.uk/ who have paid Google to come top of the search for 'psychiatrist' intended to check my spelling and who appear to offer every sort of therapy that one has ever heard of. One can only suppose that their staffers carry multiple qualifications, which one should perhaps inspect before buying, just to be on the safe side.

Then I remembered, or at least thought I remembered, a similar incident in the past. So the fourth step was to peer at the record sheets (using the Windows search feature) and, as it happened, something similar had happened back in May. On that occasion I had just corrected the error and carried on, without bothering to run the thing down. But it did involve, as did the present error, a culling of the offending table fred.

Fifth step was to peer more closely at the corrupted property. Its corrupted value turned out to be precisely dated and so I was able to peer more closely at exactly that bit of all the logs which I had peered at more vaguely before (see step 3). Still nothing.

Getting tired of all this, I go off and do something else, during the course of which all the material I had gathered up was clearly being gone over by the good old subconscious and suddenly the answer popped into my head. The culling had corrupted cod which had resulted in the two rows of fred getting crossed over. Re-calculate cod after cull! 

The problem now solved after the aforesaid 20 hours or so, I can move into witch hunt mode. What mistakes did I make during the detective journey? Was there some better process which might have stopped me releasing the suspect cod after first picking him up?

On the whole I am content and can move on. The system of records and logs in place does provide something, enough for detection to get a grip on. There are enough integrity checks (and no need to copy the one that caught this particular bug all over the place). One does usually get there in the end. No need to add another layer of core cycle burning anything just presently. And no need to invent a new process.

But will I be eating my words after some far more damaging bug in a few days time? The unprotected worksheets?

No comments:

Post a Comment