While I’ve been trying to dig the TOE systems out of a Mariana Trench of functionality, I still have other host responsibilities to tend to. In part, I’m still overcoming the lack of tools, and the fear of them, that existed here when I started.
In the last few months, I’ve been replacing the inter-host network with Netcode2. If a light bulb just went on, “oh, so that’s what the problem was”: no, the problem wasn’t Netcode2. Netcode2 has been the solution; the problem was that we didn’t have the resources to maintain Netcode1.
Netcode1 is written in C, based on TCP, and in general tries to be skeletal. Over my years here, I’ve built bits of framework to make using Netcode1 easier — classes that abstracted away much of the otherwise tedious processes involved in, for example, registering an RPC call and the functions that handled its sending, receipt and data marshalling.
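To make the idea concrete, here is a minimal sketch in C of that kind of registration helper. Everything here is hypothetical (the names, the table layout, the handler signatures are mine, not the actual framework’s); the point it illustrates is bundling an RPC’s id, its handler and its marshalling routine into a single registration call, so the routing boilerplate lives in one place.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch, not the real Netcode1 framework: one call
   registers everything needed to route an RPC, instead of wiring up
   each piece by hand. */

typedef void   (*rpc_handler_fn)(const void *payload, size_t len);
typedef size_t (*rpc_marshal_fn)(void *out, size_t cap, const void *args);

typedef struct {
    uint16_t        id;
    const char     *name;
    rpc_handler_fn  on_receive;  /* called when this RPC arrives */
    rpc_marshal_fn  marshal;     /* packs arguments for sending */
} rpc_entry;

#define MAX_RPCS 256
static rpc_entry g_rpc_table[MAX_RPCS];
static size_t    g_rpc_count;

/* Register an RPC and the functions that handle it. */
int rpc_register(uint16_t id, const char *name,
                 rpc_handler_fn on_receive, rpc_marshal_fn marshal)
{
    if (g_rpc_count >= MAX_RPCS)
        return -1;
    g_rpc_table[g_rpc_count++] = (rpc_entry){ id, name, on_receive, marshal };
    return 0;
}

/* On receipt, look the id up and hand the payload to its handler. */
int rpc_dispatch(uint16_t id, const void *payload, size_t len)
{
    for (size_t i = 0; i < g_rpc_count; i++) {
        if (g_rpc_table[i].id == id) {
            g_rpc_table[i].on_receive(payload, len);
            return 0;
        }
    }
    return -1; /* unknown RPC */
}
```

A class wrapper over something like this is what turns “register an id, a send path, a receive path and a marshaller, and keep them all in sync” into one line at the call site.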
As I’ve been doing this, I’ve had an opportunity to begin reinstrumenting some of the systems, dealing with issues where the old logging system would kill a server process by repeatedly logging an error hundreds, thousands or even tens of thousands of times a second.
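The usual fix for that failure mode is to throttle repeats rather than write every one. Here’s a small illustrative sketch in C (names and structure are my own invention, not the actual logging system): allow at most a few emissions of a message per second, count the rest, and report the tally when the window rolls over.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative sketch of per-second log throttling: a tight error
   loop can no longer drown the process in log I/O. */

typedef struct {
    time_t   window;      /* which second this state covers */
    unsigned emitted;     /* lines actually written this second */
    unsigned suppressed;  /* lines dropped this second */
} log_throttle;

/* Returns 1 if the message was written, 0 if it was suppressed. */
int log_throttled(log_throttle *t, unsigned max_per_sec,
                  time_t now, const char *msg)
{
    if (now != t->window) {
        if (t->suppressed)
            fprintf(stderr, "(suppressed %u repeats)\n", t->suppressed);
        t->window = now;
        t->emitted = 0;
        t->suppressed = 0;
    }
    if (t->emitted < max_per_sec) {
        t->emitted++;
        fprintf(stderr, "%s\n", msg);
        return 1;
    }
    t->suppressed++;
    return 0;
}
```

One `log_throttle` per logging site is enough; the server stays alive, and the “suppressed N repeats” line still tells you something pathological happened.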
I’ve also been introducing a system of performance/event counters. We used to have custom, specific tables for this kind of thing, or we had to trawl the logs. But a lot of the information going into the log files was redundant: helpful when debugging someone’s connection or a very specific issue, but no use for gauging the overall health, performance and status of the servers.
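The shape of such a counter system can be sketched very simply. This is a hypothetical illustration in C, not the actual implementation: a flat table of named counters that hot paths bump cheaply, and that a periodic task dumps as timestamped lines ready for graphing.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Illustrative sketch of named performance/event counters. */

typedef struct {
    const char   *name;
    unsigned long value;
} perf_counter;

#define MAX_COUNTERS 64
static perf_counter g_counters[MAX_COUNTERS];
static size_t       g_counter_count;

/* Find a counter by name, creating it on first use. */
perf_counter *counter_find(const char *name)
{
    for (size_t i = 0; i < g_counter_count; i++)
        if (strcmp(g_counters[i].name, name) == 0)
            return &g_counters[i];
    if (g_counter_count >= MAX_COUNTERS)
        return NULL;
    g_counters[g_counter_count] = (perf_counter){ name, 0 };
    return &g_counters[g_counter_count++];
}

/* Cheap to call from hot paths. */
void counter_add(const char *name, unsigned long n)
{
    perf_counter *c = counter_find(name);
    if (c)
        c->value += n;
}

/* Dump everything as "timestamp name value" lines for the grapher. */
void counter_dump(FILE *out, long timestamp)
{
    for (size_t i = 0; i < g_counter_count; i++)
        fprintf(out, "%ld %s %lu\n", timestamp, g_counters[i].name,
                g_counters[i].value);
}
```

Sampling the dump at a fixed interval gives you a time series per counter without writing a log line per event, which is exactly the health/trend view the raw logs couldn’t provide.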
At the same time, I stumbled across JPGraph. I’ve already begun severing my ties with Roxen and its diagram tag, and we have an Apache server internally for our wiki and trac. Trying to coerce your data into something <diagram> will use is just a pain; doing it in PHP with JPGraph is a walk in the park, allowing me to go from zero to graphage in no time. In particular, JPGraph understands timestamps.
In the past, we have found or detected problems through monitoring, of which we have quite a lot. However, the monitoring is largely passive, and until something actually breaks where we can see it, the data often goes unwatched.
This is already earning its keep: finally, I can observe bad trends forming ahead of time. It has already highlighted a leak we otherwise wouldn’t have detected (we don’t monitor the necessary resource, doh). The leak is going to mean scheduling server downtime, but we have data we can use to project how long it will take to become a problem, so we can schedule downtime to restart the servers well in advance.
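The projection itself is just a linear extrapolation. A minimal sketch in C, with invented numbers purely for illustration: given two samples of the leaking resource, estimate the growth rate and divide the remaining headroom by it.

```c
/* Given two samples (v0 at time t0, v1 at time t1) of a leaking
   resource, project how many seconds until it reaches `limit`.
   Assumes roughly linear growth; returns -1 if it isn't growing. */
double seconds_until_limit(double v0, double t0,
                           double v1, double t1, double limit)
{
    double rate = (v1 - v0) / (t1 - t0);  /* units leaked per second */
    if (rate <= 0.0)
        return -1.0;
    return (limit - v1) / rate;           /* seconds remaining from t1 */
}
```

With real data you’d fit over many samples rather than two, but even this crude version turns “we have a leak” into “we have roughly N hours before it matters”, which is what you need to schedule a restart in advance.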
Of course, arguably we’re looking at an incomplete picture with the dataset we have, and it’ll take some time to develop meaningful sets of data and trends to work with. At some point we need to sit down, predict some of those trends based on how we think things should work, and see whether we observe anything approximately like that.
It has shed some light on parts of the system that were pretty obscured, too. I may have found why the firebases never seem to save properly between host restarts: my suspicion is that they only save state after the server starts, so when we next restart the server, it reloads that early state instead of the state it was in when the servers came down.
Lastly, it’s been pretty useful in helping me verify the conversion of our inter-host backbone to Netcode2, which is looking pretty healthy in the beta cluster.