3 years ago, we had an issue with the auth server. It would run fine for a while, and then the database would just stop responding.
I was told it “happened” now and again. 30 minutes later it “happened” again, and repeated itself every 30-90 minutes for the next 6 hours.
We couldn’t find a single cause, so we solved the problems we were seeing right there. We were all set to upgrade the SQL server. But then the problem stopped happening. We spent another couple of days looking for a cause but never found one.
2006, Thanksgiving. Ramp is on a river boat with his family and nowhere near the ‘net unless ‘gators have bluetooth. Killer is in Houston. Gophur is out of town and out of booze, Doc is swimming in poosville (sewage back-up issue). Beep goes the pager. Beep, beep, beep, beep, beep, beep. “Uh” goes the host guy.
Absolutely nothing seems untoward. For 15 minutes I look at log files, messages. I can’t find a hint of a whiff of anything wrong: people are spawning, capturing, dying, respawning. There are even people logging into the game.
“beep, beep, beep, beep, BEEP” goes the pager.
Finally, I notice that one of the auth processes is still checking on the same customer it has been checking every time I’ve looked. This guy is either logging in like a freak or something’s up. I restart auth, whoosh. Everything goes green.
Well, it had been running for 7 months. Processes get tired.
Long story medium-length: after a 3-year hiatus, our problem is back. With help from Ramp, we did the database upgrade that’s been waiting for 3 years; we ran table checks, we ran hardware checks, we ran software checks. After 3 outages and a couple of aspirin, things stabilized again last night. And I was just dozing off this afternoon when BEEEP. And then it just kept doing it all afternoon.
Finally, Ramp and I investigate further. There’s absolutely no indication of a problem. Things work totally normally, but this one query just doesn’t ever come back. Nor does it time out.
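A query that neither returns nor times out is the kind of thing a client-side deadline can at least turn into a visible error. What follows is only a minimal sketch of that idea, not what our auth code does: it uses Python's built-in `sqlite3` as a stand-in for the real database, and `run_with_deadline` and its parameters are made-up names for illustration.

```python
import sqlite3
import threading


def run_with_deadline(conn, sql, params=(), timeout_s=5.0):
    """Run a query, aborting it if it is still going after timeout_s seconds.

    Hypothetical helper: assumes conn was opened with check_same_thread=False
    so the timer thread is allowed to touch it.
    """
    fired = threading.Event()

    def _abort():
        # Runs on the timer thread once the deadline has passed.
        fired.set()
        conn.interrupt()  # asks SQLite to abort whatever is executing

    timer = threading.Timer(timeout_s, _abort)
    timer.start()
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.OperationalError as exc:
        if fired.is_set():
            # The failure was caused by our own interrupt, so report a timeout.
            raise TimeoutError(f"query exceeded {timeout_s}s") from exc
        raise  # some other database error: pass it through untouched
    finally:
        timer.cancel()
```

With something like this in place, a stuck lookup shows up as a `TimeoutError` in the logs rather than a process quietly checking the same customer forever. Most real database drivers have their own timeout settings, which would be the first thing to reach for.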
After much greying and pulling of hair, I notice a rather strange coincidence. 20030910 – when the problem went away 3 years ago – a new table was created, and all of the auth connection logs were copied into it and the primary log cleared out.
Now, 3 years and many hundreds of millions of connections later, there’s rather a lot of data in there.
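The 2003 fix amounted to a one-off archive: copy the connection logs somewhere else and empty the live table. Done on a schedule, the same move might look roughly like the sketch below. It is an illustration under assumptions, not our actual schema or tooling: `sqlite3` stands in for the real database, and the table names `auth_log` and `auth_log_archive`, the columns, and the function name are all hypothetical.

```python
import sqlite3


def archive_old_logs(conn, cutoff):
    """Move auth_log rows with ts < cutoff into auth_log_archive.

    Returns the number of rows removed from the live table.
    Assumes ts is stored as an ISO date string, so string comparison
    orders rows by date.
    """
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "CREATE TABLE IF NOT EXISTS auth_log_archive "
            "(id INTEGER PRIMARY KEY, account TEXT, ts TEXT)"
        )
        # Copy the old rows first, then delete them from the live table.
        conn.execute(
            "INSERT INTO auth_log_archive "
            "SELECT id, account, ts FROM auth_log WHERE ts < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM auth_log WHERE ts < ?", (cutoff,)
        ).rowcount
    return moved
```

Run regularly with a rolling cutoff, this keeps the table the auth processes actually query small, while the history stays available in the archive.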
As it happens, Ramp recently built me a replacement auth box so that I can build an up-to-date auth host contemporary to the current game engine. It just got reprioritized to the top of the list.