Yesterday the ticket to fall onto my desk was an odd one, rumors of the game client crashing on the 30th or 62nd despawn. Indeed, we had several confirmations. Martini provided me with a callstack from a crash which gave me an idea of where the problem was – but about as specific as knowing that ‘Bob will be in the Bahamas, Tuesday’.
The crash is deep inside our networking API, teulKit. The code hasn’t changed in a very long time – it’s robust and (mostly) stable. So what had broken it? Well, infact, it seems like it’s just been broken for the last 4-5 years but only became noticeable when I reigned in a rather silly memory over-allocation a while back.
I could tell early on what the problem was: despawning effectively spaws your connection to the cell host for one with the map host. Doing this involves a little sleight of hand – it is actually a new connection, but once it is set up we close the old one and drop the new one into its place.
However, for some reason, it gets an identifier out of range. A clear case of “if i > max” where “if i >= max” was required.
Now just to find i.
My crash break-point was in a file of 7000+ lines of code littered with similar function names wrapped between #ifdef Mac/#ifdef Linux/#ifdef Windows checks. By the end of Wednesday I was confused as hell.
One minute I was looking at a function building select() descriptor lists, and the next minute I was looking at a call to poll() instead. Then a minute later I am looking at a function called _pollDescriptors() which uses select lists. What the heck?
Attacking it again this morning it suddenly became clear: some of these functions were being completely ignored.
signed integer function()
bit of work ;
expensive work ;
// Hi – I’m just here to scare and confuse you
work critical to the description of the function ;
more expensive work ;
return 0 ;
A half hour spring cleaning, 5000 lines of code removed, and it suddenly starts to make sense. Plus, the Mac’s network code is now the same BSD-based code used by the servers rather than … well, whatever it was before.
Certainly cheered me up but it didn’t solve my bug. The increased clarity of vision did allow me to zip straight to it though and, 15 minutes later, I had the culprit locked down. As I suspected,
if ( teulIndex > MAX_ONE_THING + MAX_ANOTHER )
return INVALID ;
Change the “>” to a “>=”, retest a bunch of times, it’s fixed. And I get a few extra FPS on my Mac (it may help the Windows version too since there were large chunks of shared code/data that didn’t need to be).
Having fixed it, I automatically want to quickly try to understand how the mistake was made. And in this case, it was the combinations of MAXs. A “teulIndex”, it turns out, has a value that is contextually dependent on what you think it means.
For instance, if you think that a teulIndex of 12 is a local index, then it may actually have a value of 15, or 9, or -1! But if you think it is a “real index” then it may actually have a value of 22, or 2500 or -1. But, if you’re sure it’s not either, then it probably has a value of 12.
This confusion is an insert. At one point the teulKit API had to support multiple processes talking to the end user through a single outfacing connection. Inside this was called “interfaces” which has variously been misunderstood since to be network interfaces.
Why didn’t I remove this last huge chunk of it. I spent a while anaylzing the 2702 line source file. It didn’t call anything outside itself and there were only a few calls into it. One initialized the management system and another kept it ticking. But what did it actually do? The only active calls into it were to store a value that is then periodically retrieved, this during the processing of each incoming packet. The value is the index of the connection talking to us, and the only other query is to ask what index that was.
In short – it’s sole purpose in life was to store a single 32 bit number, and now and again report back on what that value was. All of these calls are from one other source module.
// 2072 lines of code
int interfaceIndex = teulQueryInterfaceCurrentIndex() ;
sendSomeData(interfaceIndex, data) ;
static int s_contextIndex = -1 ;
s_contextIndex = index ;
sendSomeData(s_contextIndex, data) ;
The painful swelling of the bad kind in my head just got replaced by so much warm fuzzy…