Every now and again, you write some code and it mostly works. Years ago, I wrote a tool called “ParFilEd” which was initially just a bulk uploader for an Amiga BBS.
It worked fine most of the time, but sometimes it would miss a file or it would put the wrong description up and you might have to redo that file. It never did these things when running in debug mode, which meant it could be either my code, the assembler API for the BBS I’d written, or the BBS itself.
You can usually estimate how long a bug like this is going to take to fix. These are the unremarkable day-to-day hassles of a programmer.
But then there are the super-bugs which defy detection and leave you blind and groping: The memory leaks, stray pointers, miscasts, stack corruptions, compiler bugs, execution ordering anomalies…
Sure, there are diagnostics and remedies for them, but the nature of the beast means those tools only catch most cases, not all of them. The only real cure is prevention, but I’ll get to that later.
Advance warning: this is a blog entry and not some official press release.
I guess with all the changes I’ve made to the strat host specifically, and the host code generally over the years … it’s natural selection. Sooner or later, some bug that thwarted the 3-5 developers originally working on the host was going to loom over me and thwart me too.
I keep going back to my initial TOE code from last year hoping a few weeks away will highlight the problem in neon. I’ve put it through various unit tests. And for 9-15 minutes, it seems to work. Then bam: core gets dumped in some random place, core tries to get dumped but exceeds max size … or box reboots.
The unit testing seems to indicate it’s not my code that’s causing the problem, but massaging my ego does nothing to change the fact that in situ its effect is unhealthy. The second iteration I went through assures me that it’s not my fault…
I’ve taken a much simpler approach this time, dispensing with a lot of the replacement API I was trying to graft in. Most of the facility-based supply system is still in place – but when it comes to trying to find the supply entry for the weapon at the supplying facility, I’ve moved that structure to the Brigade.
But even doing this, I’ve managed to step on a few mines. When I removed the “CountryWeaponList” from the facility structure, Bad Things Happened.
The 4-byte pointer was filling a slot necessary to make a subsequent structure member line up with a differently named field in a second-cousin-twice-removed structure used by the server library.
Now, to be fair, there was a comment. It said
// COUNTRY_WEAPON_LIST CountryWeaponList *cwl;
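For anyone who hasn't been bitten by this kind of hidden layout coupling, here is a minimal sketch of how it can be pinned down at compile time. Every name below (`Facility`, `LibraryView`, `cwlSlot`, `ownerSide`, `side`) is made up for illustration; the real host structures are not shown here, and I'm using a fixed-width `std::uint32_t` to stand in for the 4-byte pointer so the sketch behaves the same on a 64-bit compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the facility structure.
struct Facility {
    std::uint32_t id;
    std::uint32_t cwlSlot;    // occupies the 4 bytes the old pointer used to fill
    std::uint32_t ownerSide;  // the member other code reaches by raw offset
};

// Hypothetical "second cousin" that a library overlays on the same bytes.
struct LibraryView {
    std::uint32_t id;
    std::uint32_t reserved;
    std::uint32_t side;
};

// The same structure with the "unused" slot deleted.
struct FacilityWithoutSlot {
    std::uint32_t id;
    std::uint32_t ownerSide;
};

// Turn the silent assumption into a build failure instead of a core dump.
static_assert(offsetof(Facility, ownerSide) == offsetof(LibraryView, side),
              "Facility::ownerSide must line up with LibraryView::side");
static_assert(sizeof(Facility) == sizeof(LibraryView),
              "Facility and LibraryView must be the same size");

constexpr bool layoutsAgree() {
    return offsetof(Facility, ownerSide) == offsetof(LibraryView, side);
}

constexpr bool layoutBreaksWithoutSlot() {
    return offsetof(FacilityWithoutSlot, ownerSide) != offsetof(LibraryView, side);
}
```

With assertions like these next to the structure, removing the slot stops the build with a readable message rather than shifting every later member by four bytes at runtime. It only helps, of course, if someone knew about the coupling in the first place.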
In a little burst of dumbstruck (and rather naive) hope I built a version of the TOE beta with a re-organized facility structure and fired it up.
Didn’t fix the problem.
The normal coder solution is to throw the problem code out and write it from scratch. I have that urge, but I already did that twice. And throwing the whole strat system out to rewrite … well, certain people would have a field day with that.
It’s like opening your eyes and seeing the pearly gates only to have St. Peter tell you it’s a hallucination and you’re not dead, then glance at his watch and mumble “yet”.
Right now I’m slowly picking my way to what I hope is the bottom of the stack of the spawn/vehicle management system in the chasm between four coders and two servers. I’ve got a replacement stack to drop in once I’ve finally finished doing that.
It’s gotten slower and slower because there are so many structures, functions, etc, to cross-reference; if I had a code monkey, I’d have flagged tons of this stuff for re-encapsulation as C++ so that I could be damned, freaking, bloody-well certain that POSTED_WEAPON::facilitySource isn’t sneakily a mirror for some vital part of the authentication system. *Sigh*
But the thing is… When I’m done, will I be done? Having only verified that the bug wasn’t in my TOE code, but not having yet discovered what causes the problem … Well, I’ve got nothing to form an ETA with. If this pass doesn’t help narrow down the cause, I’ve got no metric by which to gauge what or how much of anything I need to attack. And there’s enough of the old strat system left that the old methodical approach of crippling functionality generally only leads to a total disabling of the process. I can take 3-6 months to rewrite the remaining legacy strat host only to find that the problem is actually in the networking stack or the cluster library or an aged version of one of the utility libraries.
I think a lot of programmers are control-freaks and control-freak-wannabes, and I fall into one of those two categories. On a personal level, this is kicking my ass; you wanna damn me to hell for not having given you TOEs yet? Already there.
We don’t have anyone else in house who has worked on the host code or could easily start without weeks or months of brain dumping. Rickb has occasionally had to pull off a miracle and do something to teulKit, Ramp has on occasion caved in and put together a client-side hook to a server callback. But only Thunder has ever actually even so much as looked at host code.
So there is all the additional back-pressure of all the non-TOE work that is becoming more pressing and urgent on a daily basis too, and any work that becomes critical I have to do under the gun.
We pushed out a minor patch today. Just before we did, NetCode2 stopped accepting new connections. The patch got delayed slightly while I looked into it.
Over 30 minutes, a gnawing, biting terror began to settle on me as all the signs of yet another blind hunt showed up, and in a new system that I thought I could trust.
And then it hit me. We have virtually no management instrumentation of our systems; most of our instrumentation is focused on player behavior. You’ve experienced this in the awful stability of the early game and the lethargic response times to outages — CRS simply didn’t know when the game was up or down because instrumentation isn’t something programmers like to develop.
The sad thing is that I instrument my code, but rarely ever need to use that instrumentation, so it’s easy to forget that stuff is instrumented. And NetCode2 has particularly good instrumentation.
At a glance I was able to see that NetCode2 was still working – bits were still being exchanged. But around 10:18 CDT it stopped accepting or even seeing new connections, and it was affecting all of the servers across 6 physical boxes.
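To give a flavor of the kind of instrumentation that makes that glance possible, here is a rough sketch. It is not NetCode2's actual instrumentation; `ConnectionStats`, `Snapshot`, and `acceptsLookStalled` are names invented for this example, and the "stalled" rule is deliberately simplistic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counters a network layer might bump as it works.
struct ConnectionStats {
    std::atomic<std::uint64_t> bytesExchanged{0};
    std::atomic<std::uint64_t> connectionsAccepted{0};

    void onBytes(std::uint64_t n) { bytesExchanged += n; }
    void onAccept() { ++connectionsAccepted; }
};

// Plain-value copy of the counters at one moment in time.
struct Snapshot {
    std::uint64_t bytesExchanged;
    std::uint64_t connectionsAccepted;
};

Snapshot snapshot(const ConnectionStats& stats) {
    return Snapshot{stats.bytesExchanged.load(), stats.connectionsAccepted.load()};
}

// Traffic is still moving, yet nothing new was accepted between the snapshots.
bool acceptsLookStalled(const Snapshot& earlier, const Snapshot& later) {
    return later.bytesExchanged > earlier.bytesExchanged &&
           later.connectionsAccepted == earlier.connectionsAccepted;
}
```

Two snapshots taken a few minutes apart are enough to tell "the whole thing is dead" apart from "existing connections are fine but nobody new is getting in", which is the distinction that mattered here.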
They didn’t all stop at the same moment; rather, there were several seconds’ difference that very much matched the sequence in which the processes came online. A glance at some other data told me that they had come online, together, some 25 days ago.
On a hunch, I checked the time the servers had started. May 5th 14:42 CDT. Less than 25 days. In fact, 24.85 days earlier. They’d been running for exactly 2^31 milliseconds when they stopped accepting connections.
Something in the connection code is using signed, 32-bit time values (one bit gets used for the sign, hence ^31) so that active connections aren’t affected but new connections are no longer accepted. If the value was unsigned, the problem might have occurred after 49.7 days – the same period with which an idle Windows NT box would crash (for this exact reason).
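The arithmetic is easy to check, and the failure mode is easy to reproduce in miniature. The sketch below is only an illustration: I haven't shown the actual connection code, so `naivePollDue` is a hypothetical example of the kind of comparison that breaks, and `wrapSafePollDue` is one common way to write it so the wrap doesn't matter.

```cpp
#include <cassert>
#include <cstdint>

constexpr double kMsPerDay = 24.0 * 60.0 * 60.0 * 1000.0;

// 2^31 ms is roughly 24.86 days; 2^32 ms is roughly 49.71 days.
constexpr double daysUntilSignedWrap()   { return 2147483648.0 / kMsPerDay; }
constexpr double daysUntilUnsignedWrap() { return 4294967296.0 / kMsPerDay; }

// Squeeze an uptime in milliseconds into a signed 32-bit tick, as a careless
// API might. On two's-complement targets this goes negative at 2^31 ms
// (the conversion is implementation-defined before C++20).
std::int32_t toSignedTick(std::uint64_t uptimeMs) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(uptimeMs));
}

// Hypothetical fragile check: compares absolute signed tick values.
// Once "now" wraps negative, it stays below any deadline set before the wrap.
bool naivePollDue(std::int32_t nowMs, std::int32_t nextPollAtMs) {
    return nowMs >= nextPollAtMs;
}

// Wrap-tolerant check: unsigned subtraction is modular, so the elapsed time
// comes out right as long as the real gap is under 2^32 ms.
bool wrapSafePollDue(std::uint32_t nowMs, std::uint32_t lastPollMs,
                     std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - lastPollMs) >= intervalMs;
}
```

Feed both checks a "last poll" from one second before the 2^31 mark and a "now" from one second after it: the naive version never fires again, while the unsigned-difference version sees 2000 ms elapsed and carries on.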
When you spend so much time working in the dark, as I’ve been forced to do with the TOE crash, you can forget how easy it can be to trap these kinds of problems indirectly.
Instrumentation goes hand-in-hand with testing and automation as both something crucial to project success and health – and something that programmers hate doing.
The moral of the story is for dev leads to never let their coders write black boxes; for producers to always require that their programmers not just write the code but also provide a test that validates the code’s compliance with its spec; and for investors to require that the product – not its developers – produce meaningful metrics.
Programmers tend to share more of a contractor mentality than they like to accept – they don’t like to consider it their responsibility to make sure that you get what you need rather than what you think you’re asking for.
Specifically, coders generally don’t like being responsible for making sure their stuff isn’t going to be broken when someone else uses it in some way they hadn’t anticipated (coder for ‘doesn’t work right/properly’).
That’s OK – as long as they turn that from laziness to hubris and you and they agree to delegate that responsibility to an automated 3rd party.
In good software development, you don’t ask if the framerate has improved, the software tells you.
I’m not sure where I’d start instrumenting the strat host if this pass encounters the same problems – but I finally see a light at the end of the tunnel once again, and honestly I was beginning to despair.