Oh look, a bear trap

Every now and again, you write some code and it mostly works. Years ago, I wrote a tool called “ParFilEd” which was initially just a bulk uploader for an Amiga BBS.

It worked fine most of the time, but sometimes it would miss a file or it would put the wrong description up and you might have to redo that file. It never did these things when running in debug mode, which meant it could be either my code, the assembler API for the BBS I’d written, or the BBS itself.

You can usually estimate how long a bug like this is going to take to fix. These are the unremarkable day-to-day hassles of a programmer.

But then there are the super-bugs which defy detection and leave you blind and groping: The memory leaks, stray pointers, miscasts, stack corruptions, compiler bugs, execution ordering anomalies…

Sure, there are diagnostics and remedies for them, but the nature of the beast means those tools only catch most cases, not all of them. The only real cure is prevention, but I’ll get to that later.

Advance warning: this is a blog entry and not some official press release.

I guess with all the changes I’ve made to the strat host specifically, and the host code generally over the years … it’s natural selection. Sooner or later, some bug that thwarted the 3-5 developers originally working on the host was going to loom over me and thwart me too.

I keep going back to my initial TOE code from last year hoping a few weeks away will highlight the problem in neon. I’ve put it through various unit tests. And for 9-15 minutes, it seems to work. Then bam: core gets dumped in some random place, core tries to get dumped but exceeds max size … or box reboots.

The unit testing seems to indicate it’s not my code that’s causing the problem, but massaging my ego doesn’t change the fact that in-situ its effect is unhealthy. The second iteration I went through assures me that it’s not my fault…

I’ve taken a much simpler approach this time, dispensing with a lot of the replacement API I was trying to graft in. Most of the facility based supply system is still in place – but when it comes to trying to find the supply entry for the weapon at the supplying facility, I’ve moved that structure to the Brigade.

But even doing this, I’ve managed to step on a few mines. When I removed the “CountryWeaponList” from the facility structure, Bad Things Happened.

The 4-byte pointer was filling a slot necessary to make a subsequent structure member line up with a differently named field in a second-cousin-twice-removed structure used by the server library.

Now, to be fair, there was a comment. It said

// COUNTRY_WEAPON_LIST 

CountryWeaponList *cwl;
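
For what it’s worth, a hidden layout dependency like this can be pinned down at compile time instead of with a comment. What follows is only a minimal sketch: Facility, ServerLibFacilityView and all of their fields are hypothetical stand-ins I made up for illustration, since the real host and server library structures aren’t shown here.

```cpp
#include <cstddef>   // offsetof
#include <cstdint>

// Hypothetical stand-ins; none of these layouts come from the real host code.
struct CountryWeaponList;  // opaque, only ever used through a pointer

struct Facility {
    std::uint32_t      id;
    CountryWeaponList* cwl;           // looks removable, but it holds a slot open
    std::uint32_t      ownerCountry;  // must line up with ServerLibFacilityView::country
};

// Assumed shape of the "second cousin" structure the server library reads.
struct ServerLibFacilityView {
    std::uint32_t id;
    void*         reserved;
    std::uint32_t country;
};

// If someone deletes Facility::cwl, the build breaks here instead of the
// server falling over at runtime.
static_assert(offsetof(Facility, ownerCountry) == offsetof(ServerLibFacilityView, country),
              "Facility::ownerCountry must overlay ServerLibFacilityView::country");
static_assert(sizeof(Facility) == sizeof(ServerLibFacilityView),
              "Facility and ServerLibFacilityView must stay the same size");
```

The point of the sketch is that the assumption lives next to the structure and is checked by the compiler, so the comment no longer has to carry the whole warning on its own.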

In a little burst of dumbstruck (and rather naive) hope I built a version of the TOE beta with a re-organized facility structure and fired it up.

Didn’t fix the problem.

The normal coder solution is to throw the problem code out and write it from scratch. I have that urge, but I already did that twice. And throwing the whole strat system out to rewrite … well, certain people would have a field day with that.

It’s like opening your eyes and seeing the pearly gates only to have St. Peter tell you it’s a hallucination and you’re not dead, then glance at his watch and mumble “yet”.

Right now I’m slowly picking my way to what I hope is the bottom of the stack of the spawn/vehicle management system in the chasm between four coders and two servers. I’ve got a replacement stack to drop in once I’ve finally finished doing that.

It’s gotten slower and slower because there are so many structures, functions, etc, to cross-reference; if I had a code monkey, I’d have flagged tons of this stuff for re-encapsulation as C++ so that I could be damned, freaking, bloody-well certain that POSTED_WEAPON::facilitySource isn’t sneakily a mirror for some vital part of the authentication system. *Sigh*

But the thing is… When I’m done, will I be done? Having only verified that the bug wasn’t in my TOE code, but not having yet discovered what causes the problem … Well, I’ve got nothing to form an ETA with. If this pass doesn’t help narrow down the cause, I’ve got no metric by which to gauge what or how much of anything I need to attack. And there’s enough of the old strat system left that the old methodical approach of crippling functionality generally only leads to totally disabling the process. I could take 3-6 months to rewrite the remaining legacy strat host only to find that the problem is actually in the networking stack or the cluster library or an aged version of one of the utility libraries.

I think a lot of programmers are control-freaks and control-freak-wannabes, and I fall into one of those two categories. On a personal level, this is kicking my ass; you wanna damn me to hell for not having given you TOEs yet? Already there.

We don’t have anyone else in house who has worked on the host code or could easily start without weeks or months of brain dumping. Rickb has occasionally had to pull off a miracle and do something to teulKit, Ramp has on occasion caved in and put together a client-side hook to a server callback. But only Thunder has ever actually even so much as looked at host code.

So there is all the additional back-pressure of all the non-TOE work that is becoming more pressing and urgent on a daily basis too, and any work that becomes critical I have to do under the gun.

We pushed out a minor patch today. Just before we did, NetCode2 stopped accepting new connections. The patch got delayed slightly while I looked into it.

Over 30 minutes, a gnawing, biting terror began to settle on me as all the signs of yet another blind hunt showed up, and in a new system that I thought I could trust.

And then it hit me. We have virtually no management instrumentation of our systems; most of our instrumentation is focused on player behavior. You’ve experienced this in the awful stability of the early game and the lethargic response times to outages — CRS simply didn’t know when the game was up or down because instrumentation isn’t something programmers like to develop.

The sad thing is that I instrument my code, but rarely ever need to use that instrumentation, so it’s easy to forget that stuff is instrumented. And NetCode2 has particularly good instrumentation.

At a glance I was able to see that NetCode2 was still working – bits were still being exchanged. But around 10:18 CDT it stopped accepting or even seeing new connections, and it was affecting all of the servers across 6 physical boxes.

They didn’t all stop at the same moment; rather, they stopped several seconds apart, in an order that very much matched the sequence in which the processes came online. A glance at some other data told me that they had come online, together, some 25 days ago.

*jingle jingle*

On a hunch, I checked the time the servers had started. May 5th, 14:42 CDT. Less than 25 days. In fact, 24.85 days earlier. They’d been running for exactly 2^31 milliseconds when they stopped accepting connections.

Something in the connection code is using signed, 32-bit time values (one bit gets used for the sign, hence 2^31) in such a way that active connections aren’t affected but new connections are no longer accepted. If the value was unsigned, the problem might have occurred after 49.7 days – the same period after which an idle Windows NT box would crash (for this exact reason).
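
The arithmetic is easy to check. Here’s a rough sketch of it, plus one way a signed 32-bit millisecond clock can make new events invisible. To be clear, I’m not claiming this is the actual NetCode2 code path: every function name here is hypothetical and the “is this newer?” comparison is just an assumed example of the kind of check that breaks.

```cpp
#include <cstdint>

constexpr std::int64_t kMsPerDay = 24LL * 60 * 60 * 1000;  // 86,400,000

// Days until a millisecond counter with `bits` usable bits runs out.
constexpr double daysUntilWrap(int bits) {
    return static_cast<double>(1LL << bits) / static_cast<double>(kMsPerDay);
}
// daysUntilWrap(31) is about 24.86 (signed), daysUntilWrap(32) about 49.71 (unsigned).

// What a signed 32-bit variable ends up holding for a given uptime.
// (Modular conversion: guaranteed since C++20, and what mainstream
// compilers did before that.)
inline std::int32_t clockAsSigned32(std::uint64_t uptimeMs) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(uptimeMs));
}

// Fragile (assumed example): compares absolute signed timestamps. Once the
// clock goes negative, nothing is ever "newer" than the last pre-wrap stamp.
inline bool isNewerNaive(std::int32_t candidateMs, std::int32_t lastSeenMs) {
    return candidateMs > lastSeenMs;
}

// Wrap-safe alternative: only ever look at unsigned differences, which stay
// correct across the wrap as long as the real gap is under about 49.7 days.
inline std::uint32_t elapsedMs(std::uint32_t nowMs, std::uint32_t thenMs) {
    return nowMs - thenMs;  // unsigned subtraction wraps modulo 2^32
}
```

With the naive comparison, a timestamp taken at 2^31 ms of uptime reads as a large negative number, so it never compares as newer than one taken a millisecond earlier. Anything already established keeps working; anything that needs a “newer than” check quietly stops.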

When you spend so much time working in the dark, as I’ve been forced to do with the TOE crash, you can forget how easy it can be to trap these kinds of problems indirectly.

Instrumentation goes hand-in-hand with testing and automation as both something crucial to project success and health – and something that programmers hate doing.

The moral of the story is for dev leads to never let their coders write black boxes; for producers to always require that their programmers be responsible not just for writing the code but for providing a test that validates the code’s compliance with its spec; and for investors to require that the product – not its developers – produce meaningful metrics.

Programmers tend to share more of the contractor mentality than they like to accept – they don’t like to consider it their responsibility to make sure that you get what you need rather than what you think you’re asking for.

Specifically, coders generally don’t like being responsible for making sure their stuff isn’t going to be broken when someone else uses it in some way they hadn’t anticipated (coder for ‘doesn’t work right/properly’).

That’s OK – as long as they turn that from laziness to hubris and you and they agree to delegate that responsibility to an automated 3rd party.

In good software development, you don’t ask if the framerate has improved, the software tells you.
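
To make that a bit more concrete, here’s a minimal sketch of the kind of counters that let a listener say “traffic is flowing but I haven’t accepted anyone in a while” without anyone having to ask. None of this is NetCode2’s real instrumentation; ListenerStats, recordAccept, recordTraffic, healthLine and the quiet-time threshold are all names and choices I’m assuming for the example.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical counters for one listening process.
struct ListenerStats {
    std::atomic<std::uint64_t> bytesExchanged{0};
    std::atomic<std::uint64_t> connectionsAccepted{0};
    std::atomic<std::int64_t>  lastAcceptMs{0};  // 64-bit ms, so no 24.85-day surprise
};

// Monotonic clock in milliseconds, kept as a 64-bit value.
inline std::int64_t steadyNowMs() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

inline void recordAccept(ListenerStats& s, std::int64_t nowMs) {
    s.connectionsAccepted.fetch_add(1, std::memory_order_relaxed);
    s.lastAcceptMs.store(nowMs, std::memory_order_relaxed);
}

inline void recordTraffic(ListenerStats& s, std::uint64_t bytes) {
    s.bytesExchanged.fetch_add(bytes, std::memory_order_relaxed);
}

// One-line status a monitoring job can poll: flags the listener as stalled
// when no connection has been accepted for longer than quietLimitMs.
inline std::string healthLine(const ListenerStats& s, std::int64_t nowMs,
                              std::int64_t quietLimitMs) {
    const std::int64_t quietMs = nowMs - s.lastAcceptMs.load(std::memory_order_relaxed);
    std::string line =
        "accepted=" + std::to_string(s.connectionsAccepted.load(std::memory_order_relaxed)) +
        " bytes=" + std::to_string(s.bytesExchanged.load(std::memory_order_relaxed)) +
        " quiet_ms=" + std::to_string(quietMs);
    line += (quietMs > quietLimitMs) ? " status=STALLED" : " status=OK";
    return line;
}
```

In real use the accept loop would call recordAccept(stats, steadyNowMs()) and something outside the process would poll healthLine. The useful part is that “bytes still moving, accepts stopped” shows up as two different numbers, which is exactly the distinction that mattered here.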

I’m not sure where I’d start instrumenting the strat host if this pass encounters the same problems – but I finally see a light at the end of the tunnel once again, and honestly I was beginning to despair.

 

6 Comments

bizarre, obtuse, and *hopeful*? Wow.

This is from The Fixx to you KFS:

KFSONE TOE Theme Song:

The deception with tact
Just what are you trying to say
You’ve got a blank face, which irritates
Communicate, pull out your party piece
You see dimensions in two
State your case with black or white
But when one little cross
Leads to shots, grit your teeth
You run for cover so discreet
Why don’t they

Do what they say, say what you mean
Oh well, one thing leads to another
You told me something wrong
I know I listen too long
But then one thing leads to another

The impression that you sell
Passes in and out like a scent
But the long face that you see
Comes from living close to your fears
If this is up, then I’m up
But you’re running out of sight
You’ve seen your name on the walls
And when one little bump
Leads to shock miss a beat
You run for cover and there’s heat
Why don’t they

Do what they say, say what they mean
One thing leads to another
You told me something wrong
I know I listen too long
But then one thing leads to another
Yeah, yeah, yeah

One thing leads to another

Then it’s easy to believe
Somebody’s been lying to me
But when the wrong word goes in the right ear
I know you’ve been lying to me
It’s getting rough, off the cuff
I’ve got to say enough’s enough
Bigger the harder he falls
But when the wrong antidote
Is like a bulge on the throat
You run for cover in the heat
Why don’t they

Do what they say, say what they mean
One thing leads to another
You told me something wrong
I know I listen too long
But then one thing leads to another
Yeah, yeah

One thing leads to another…

Nice read KFSone. Last year I was working on the Test Automation team here. We were refactoring the old code to – well, it was crap and needed serious redesign. I included a smack-ton of instrumentation in the refactor and it has allowed the team to rapidly track down issues, whereas before they relied on the hunt-and-peck method.

BTW: I have that book you linked to signed by the author. Worst choice I ever made was to leave that team for more money. I was the lead developer there and it was refreshing to be able to design systems in a proper way from the start (or restart in the case of the refactor). Now, I am in development hell working with a guy who refuses my every effort to do things right, to make it better. Grrr.

hence: the reason they paid you more money. :)

Try getting code rewritten because it is bad, but the developer that wrote it all is now the development manager.

I’d rewrite the whole damn host code from scratch. Of course I would be coding living under a bridge with the rest of the team but at least it would work. Sadly no one would care anymore :(.

It’s like opening your eyes and seeing the pearly gates only to have St. Peter tell you be it’s an halucination and you’re not dead, then glance at his watch and mumble “yet”.

Personally, the image I had was St. Peter going “Now where did I put my keys…” – the wonderful world of everything working is right there, but some stupid little thing you can’t find is keeping you out, and you don’t even know where to start looking for it.

“opening your eyes … pearly gates” :- real bummer, you’re dead; the current code
“it’s an halucination” :- reprieve, rewriting some big part of it
“yet” :- axe waiting to fall at any moment, finding that the bug was in some other part of the system
