2702 lines of code to create … a variable

Yesterday the ticket to fall onto my desk was an odd one, rumors of the game client crashing on the 30th or 62nd despawn. Indeed, we had several confirmations. Martini provided me with a callstack from a crash which gave me an idea of where the problem was – but about as specific  as knowing that ‘Bob will be in the Bahamas, Tuesday’.

The crash is deep inside our networking API, teulKit. The code hasn’t changed in a very long time – it’s robust and (mostly) stable. So what had broken it? Well, infact, it seems like it’s just been broken for the last 4-5 years but only became noticeable when I reigned in a rather silly memory over-allocation a while back.

I could tell early on what the problem was: despawning effectively spaws your connection to the cell host for one with the map host. Doing this involves a little sleight of hand – it is actually a new connection, but once it is set up we close the old one and drop the new one into its place.

However, for some reason, it gets an identifier out of range. A clear case of “if i > max” where “if i >= max” was required.

Now just to find i.

My crash break-point was in a file of 7000+ lines of code littered with similar function names wrapped between #ifdef Mac/#ifdef Linux/#ifdef Windows checks. By the end of Wednesday I was confused as hell.

One minute I was looking at a function building select() descriptor lists, and the next minute I was looking at a call to poll() instead. Then a minute later I am looking at a function called _pollDescriptors() which uses select lists. What the heck?

Attacking it again this morning it suddenly became clear: some of these functions were being completely ignored.

#ifdef Macintosh
signed integer function()
{
  bit of work ;
  expensive work ;
#ifdef Windows
   // Hi – I’m just here to scare and confuse you 
   work critical to the description of the function ;

#endif
  more expensive work ;
  return 0 ;
}

A half hour spring cleaning, 5000 lines of code removed, and it suddenly starts to make sense. Plus, the Mac’s network code is now the same BSD-based code used by the servers rather than … well, whatever it was before.

Certainly cheered me up but it didn’t solve my bug. The increased clarity of vision did allow me to zip straight to it though and, 15 minutes later, I had the culprit locked down. As I suspected,

  if ( teulIndex > MAX_ONE_THING + MAX_ANOTHER )
    return INVALID ; 

Change the “>” to a “>=”, retest a bunch of times, it’s fixed. And I get a few extra FPS on my Mac (it may help the Windows version too since there were large chunks of shared code/data that didn’t need to be).

Having fixed it, I automatically want to quickly try to understand how the mistake was made. And in this case, it was the combinations of MAXs. A “teulIndex”, it turns out, has a value that is contextually dependent on what you think it means.

For instance, if you think that a teulIndex of 12 is a local index, then it may actually have a value of 15, or 9, or -1! But if you think it is a “real index” then it may actually have a value of 22, or 2500 or -1. But, if you’re sure it’s not either, then it probably has a value of 12.

This confusion is an insert. At one point the teulKit API had to support multiple processes talking to the end user through a single outfacing connection. Inside this was called “interfaces” which has variously been misunderstood since to be network interfaces.

Why didn’t I remove this last huge chunk of it. I spent a while anaylzing the 2702 line source file. It didn’t call anything outside itself and there were only a few calls into it. One initialized the management system and another kept it ticking. But what did it actually do? The only active calls into it were to store a value that is then periodically retrieved, this during the processing of each incoming packet. The value is the index of the connection talking to us, and the only other query is to ask what index that was.

In short – it’s sole purpose in life was to store a single 32 bit number, and now and again report back on what that value was. All of these calls are from one other source module.

Before:

File 1:
// 2072 lines of code

File 2:
 // …
 teulSetInterfaceCurrentIndex(index) ;
 // …
  int interfaceIndex = teulQueryInterfaceCurrentIndex() ;
  sendSomeData(interfaceIndex, data) ;

File 1:
// Deleted

File 2:
static int s_contextIndex = -1 ;
// …
  s_contextIndex = index ;
// …
 sendSomeData(s_contextIndex, data) ;

The painful swelling of the bad kind in my head just got replaced by so much warm fuzzy…

7 Comments

At this rate you going to get the code down to 10 lines. ;)

10 LET n = 1
20 LET n = n + 1
30 IF n = 100000000000000000000000000 GOTO 50
40 GOTO 20
50 SPANG n

Ah the life of a programmer, always so much fun.

Long live KFS1! All infantry players are grateful. This was like the most annoying bug ever.

Well it was a VERY good find by you Elegance.

Can’t beat a totaly repeatable bug.

Nice work on fixing it Oli.

!S

May I ask, who the hell wrote that code? :D
Well gj and what is your next ticket? :)

Drave

Trackbacks and Pingbacks

[…] mass of changes and fixes to the network stack (teulKit) over the last year are really paying off. So far, the network connection status log is […]

Leave a Reply

Name and email address are required. Your email address will not be published.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

You may use these HTML tags and attributes:

<a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <pre> <q cite=""> <s> <strike> <strong> 

%d bloggers like this: