How’d that ever work?

When we [tried] to roll out an update a few weeks ago, I ran into an extremely bizzare problem. When the new host processes went live, with a very minimal set of changes, suddenly things started going horribly wrong. Suddenly the host started crashing in places that clearly hadn’t been changed or weren’t even in a process that had had any changes.

Yet somehow, code which, at first glance couldn’t possibly crash, had started crashing.

Cutting to the chase, one of the weirdest parts was in the Chat Grid. Each grid entry maintains an active “neighborhood” list. I chose to waste memory over CPU. This neighborhood list, for ease of use, includes the entry itself.

And when the entry goes away, it removes itself from all of its neighbors. Including itself.

Therein lies the problem.

NEIGHBORHOOD::iterator it ;
for ( it = neighbors().begin() ; it != neighbors.end() ; ++it )
{
  (*it)->neighbors().erase(this);
}

The “self” entry is *usually* the first entry in a given neighbors() list. Erasing an entry from a list makes any iterators invalid. Which means that, usually, the first thing we do is remove the first entry in the list we are for()ing over.

This code should never have worked. It needs to manually remove the self entry first:

neighbors().erase(this);
for ( it = neighbors.begin() ; it != neighbors.end() ; ++it )

What I don’t get is how this code survived the torture tests I put it thru when I first wrote it, how its survived the test harnesses I use every dev cycle, and then suddenly just started crashing when we changed something unrelated in another host.

12 Comments

Simplest answer: it wasn’t actually checked in yet?

;)

The fixed code still seems somewhat backwards to me, but maybe I’m not understanding the structure properly. Wouldn’t “neighbors().erase(this);” cause the rest of the list to get lost, or is that method removing then reassembling the list?

I think I get why you can’t “remove the others then remove self”, though – the “remove the others” wouldn’t have the smarts to recognize itself to skip it.

As Bill Clinton once said; “It all depends it what your definition of (this) is.”

…@/

Or was it ever being used, would be another question. Welcome to the world of ERP’s. Where a change in one module in explicitly affects another. I just call it the butterfly effect.

Also just another example of something that could go wrong with TOE’s and why you have to take it slow and get it right. The joys of a production environment.

The code you worry about = the code you think about and trace in your mind, and setup traps to watch.

The code you *know* you havn’t touched, you do nothing to monitor or test.

Whatever reason it broke live, it was trying to tell you it was going to break. But if you don’t have a speedometer, you don’t know whether your test car is going too fast. Until you have the speedometer running at the race track, and the engine fries.

Lesson Learned #22: You must always watch the areas you are not watching – for there is where all the real trouble comes from.

It was being used *all* the time. Its in the code that gets called every time a grid node went away. I have a little test bench that specifically creates every possible combination of neighborhood list for a random selection of groups – it takes upto to 8 hours to run, and it has run. It didn’t crash before, it crashed with the new compile of the code (and no, I haven’t changed compilers or compiler libraries). It was being called, on average, 1983 times an hour.

My only guess is that it happened to be a “lucky strike” in terms of whatever random piece of memory it used to land when it self-corrupted. Cause it doesn’t work no more :)

Been there and done that. I am amazed at the number of times I have said the words: “Well, that should have NEVER worked!”

Ahh, STL, how I miss thee ..

Is that a type of linked list?

BTW, I haven’t done c++ programing for about 12 years now. I was messing around with DevC++ at work the other day, did something happen to the standard libraries? iostream.h not supported anymore?

There should be <iostream> and <iostream.h> – the main difference being when you include it as “iostream” everything goes into a namespace so you have to refer to, e.g., “std::cout”, whereas if you use the .h version it will be exported so you can just refer to it as, e.g., “cout”.

It should be in there – until fairly recently I used DevC++ and it had it. If you’re talking about the little warning, it’s just that the “.h” has been dropped for the “new” standard library – just instead of , for example. That and the namespace bit are the only changes I know of…

Gah, it didn’t show up. Stupid HTML…

[iostream] instead of [iostream.h], is what should be at that part where the sentence makes zero sense. Sorry about that.

Leave a Reply

Name and email address are required. Your email address will not be published.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

You may use these HTML tags and attributes:

<a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <pre> <q cite=""> <s> <strike> <strong> 

%d bloggers like this: