So I’ve spent most of my weekend doing this silly directory-flattening thing, mostly because there would be no great peril if I got distracted by something important. Given the server crashes of the last two weekends, I wanted to try to catch it happening this week, so I stopped by each of the server processes every 15 minutes, looking for signs of a problem.
Finally, around 11pm (+/- 30 minutes), I step away from the computer (well, for more than 5 minutes), and when I come back 30 minutes later I find the server has gone into its crazy not-quite-dead-yet loop. Alas, when it does this it responds just enough to fool all of our monitoring systems and the other servers into thinking it’s alive, but not enough to do anyone any good. Meanwhile it writes “Oops, there’s a problem” to a log file so fast that, in the 20 minutes between the loop starting and my coming back to check, all four 80MB log files had filled up with “Oops, there’s a problem”.
Fortunately, one of the servers still had some interesting breadcrumbs for me that might allow me to investigate this properly tomorrow. I have a nasty hunch I might have to break out Valgrind and do some rigorous memory testing.