When we tried to roll out an update a few weeks ago, I ran into an extremely bizarre problem. When the new host processes went live, with a very minimal set of changes, things started going horribly wrong. The host started crashing in places that clearly hadn’t been changed, or that weren’t even in a process that had had any changes.
Yet somehow, code which, at first glance, couldn’t possibly crash had started crashing.
Cutting to the chase, one of the weirdest parts was in the Chat Grid. Each grid entry maintains an active “neighborhood” list. I chose to waste memory over CPU. This neighborhood list, for ease of use, includes the entry itself.
And when the entry goes away, it removes itself from each of its neighbors’ lists. Including its own.
Therein lies the problem.
NEIGHBORHOOD::iterator it ;
for ( it = neighbors().begin() ; it != neighbors().end() ; ++it )
The “self” entry is *usually* the first entry in a given neighbors() list. Erasing an entry from a list invalidates any iterator pointing at that entry. Which means that, usually, the first thing we do is erase the very entry our iterator is sitting on, in the list we are for()ing over, and the ++it that follows is undefined behavior.
This code should never have worked. It needs to manually remove the self entry first:
for ( it = neighbors().begin() ; it != neighbors().end() ; ++it )
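To make the "remove self first" idea concrete, here is a minimal sketch of how it might look. Only the names NEIGHBORHOOD and neighbors() come from the code above. The Entry struct, the m_neighbors member, the detach() function, and the choice of std::list of pointers are all assumptions made for illustration, not the actual Chat Grid code.

```cpp
#include <list>

// Hypothetical stand-ins for the real grid types; only the names
// NEIGHBORHOOD and neighbors() appear in the original code.
struct Entry;
typedef std::list<Entry*> NEIGHBORHOOD;

struct Entry {
    NEIGHBORHOOD m_neighbors;   // assumed to include a pointer to this entry

    NEIGHBORHOOD& neighbors() { return m_neighbors; }

    // Hypothetical teardown: remove this entry from every list that holds it.
    void detach() {
        // Step 1: drop the self entry first, so the loop below never
        // erases from the list it is walking.
        neighbors().remove(this);

        // Step 2: every remaining entry is some other grid entry. Erasing
        // from *their* lists does not touch our list or our iterator.
        NEIGHBORHOOD::iterator it;
        for (it = neighbors().begin(); it != neighbors().end(); ++it) {
            (*it)->neighbors().remove(this);
        }

        // Step 3: nobody refers to us any more, so forget our neighbors too.
        neighbors().clear();
    }
};
```

In this sketch the loop only ever modifies other entries' lists, so the iterator over our own list stays valid for the whole walk. If the real container is not a std::list, the same ordering should still apply, but the removal calls would need to change.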
What I don’t get is how the original loop survived the torture tests I put it through when I first wrote it, how it’s survived the test harnesses I use every dev cycle, and then suddenly just started crashing when we changed something unrelated in another host.