Oh, wait a second – I think that’s the light at the end of the tunnel!
Yes, I’m pleased to announce – although Rafter is going to flay me alive – that Battleground Europe v1.31 is likely to release within the next Two Weeks!
I think this is going to be a fun weekend: 99% of the CTDs turned out to have a single cause, so once Ramp fixed those, 1.31 shifted into its final gear. There was also a connectivity issue between two host processes that was causing almost all of the host issues, so now that's fixed…
But the fun is playing 1.31, albeit on the Open Beta server. Having some really great battles there. I think even Toto had fun today.
I’ve spent most of the last couple of months struggling away with the symptoms of that connectivity issue, mostly in the bowels of the cell host code. By last weekend, my head was on the verge of exploding from the fact that every angle I took just broke things differently, as has always tended to be the case with the affected cell host code.
Bad enough that, last Sunday – and for recreation – I rewrote the update system.
Oh – that’s not going into 1.31, that’d delay it another 6 months easily. Rather, I coded it up and did some preliminary testing on it.
It’s using similar concepts to area chat’s “grid system”. The current system uses almost all of its CPU time selecting whom to send updates to, building a vis list for them, and then culling and polishing that list. The actual sending of the list is trivial.
At every juncture, the code turned out to be resistant to parallelization without significant work, and the last thing I wanted to do at the end of a dev cycle was … rewrite a major system.
It’s actually a pretty simple bean-counting operation. It’s only complex because it was written in an over-complicated way, back when fewer 3rd-party tools (such as the STL) were commonly and reliably available.
Sadly, it’s also written to self-optimize for circa-2001 hardware. When you put it on modern hardware, the code kicks the CPU’s ass in every conceivable way.
My approach migrates the overheads to suit modern architecture.
For example: the spaces in the grid will probably be about 1-2km on a side. Under normal circumstances, no vehicle is going to traverse these spaces more than once every few updates. So it is relatively safe to transfer work from the per-vehicle observation of the world to management of the spaces.
Each space in the grid self-associates with its neighbors when it becomes active (populated) and disassociates itself when it becomes inactive (empty).
While it’s unlikely a player will transition between cells on every update, it’s not impossible: a player running in an ordinal direction right on the boundary may “wobble” between cells from one update to another.
So the work of recovering inactive cells is deferred to a periodic “garbage collection” task.
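A minimal sketch of the scheme above – the names and structure here are my own illustration, not the actual host code: spaces self-associate with active neighbors when they become populated, and reclaiming empty spaces is left to the periodic garbage-collection task rather than done per-update.

```cpp
#include <set>
#include <vector>

// Hypothetical sketch: each active space keeps a set of its active
// neighbors; emptied spaces are only reclaimed by a periodic GC pass.
struct Space {
    int population = 0;
    std::set<Space*> neighbors;   // currently-active neighboring spaces
    bool active = false;

    // Called when the space becomes populated: self-associate both ways
    // with any adjacent spaces that are already active.
    void activate(std::vector<Space*>& adjacent) {
        active = true;
        for (Space* s : adjacent) {
            if (s->active) {
                neighbors.insert(s);
                s->neighbors.insert(this);
            }
        }
    }

    // Called by the periodic GC task, not on every update, so a player
    // "wobbling" across a boundary doesn't thrash associations.
    void deactivate() {
        active = false;
        for (Space* s : neighbors)
            s->neighbors.erase(this);
        neighbors.clear();
    }
};
```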
I also found that for every vehicle on every person’s vis list, the host does an “is it alive?” check. Not doing the check can leave ghost vehicles around, and it had all kinds of other nasty side effects too. So I relocated that work to a management step, reducing the work done per-update-per-observed-vehicle by a small amount that adds up over the entire loop (0.01ms per loop, or 1ms of CPU time per second).
By writing from the top-down, from scratch, and by judicious use of const and restrict markup, I can also be confident that it is reasonably thread safe – which means I’ll be able to execute steps in parallel.
One of the more annoying aspects of the old code was its re-use of a “holder” concept. Throughout most of the code, a xxx_HOLDER is used for linking an entry within a list: it has a prev, a next and an item pointer. In the old cell/grid system it got padded out with a number of similar(ish) variants, all of which assumed they were exactly the same as this-or-that.
The main flavor was the “GRID_HOLDER”, which contains prev, next and vehicle pointers. This is used for maintaining vis lists. When the host gets around to building your next update, it checks to see if its own copy of the vehicle ID matches the one at the vehicle pointer. If it doesn’t, then the vehicle has been freed!
This structure also contained several other fields, designed to help track what state you had last seen the vehicle in. Unfortunately, all of that information gets overwritten when the holder is first accessed. Since this is a rabbit hole guaranteed to have an angry badger at the bottom, the rest of the host code just works around the lack of delta information :(
I was able to discard the rest of the HOLDER structures and use Standard Template Library (STL) containers for managing the data sets instead. This provides much faster access to elements of the structures at the cost of extra management overhead – however, that overhead is safely delegated away from list generation.
The remaining holder, I renamed “Impression”: it captures an impression of the vehicle when last observed, so that it can provide delta information the host can use to reduce the amount of data that has to be sent.
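The Impression idea might look something like this sketch – the field names and layout are illustrative assumptions, not the real host structures: capture the state as last sent to this viewer, so the next update can be encoded as a difference, omitting unchanged fields.

```cpp
#include <cstdint>

// Illustrative vehicle state; the real update payload has more fields.
struct VehicleState {
    std::uint32_t id;
    std::int32_t  x, y, z;      // world position
    std::uint16_t heading;
};

// An "Impression": what this viewer last saw of the vehicle.
struct Impression {
    VehicleState lastSent;

    // Fills 'delta' with the per-field differences and returns true if
    // anything changed, so the encoder can skip unchanged vehicles and
    // omit unchanged fields.
    bool diff(const VehicleState& now, VehicleState& delta) const {
        delta.id      = now.id;
        delta.x       = now.x - lastSent.x;
        delta.y       = now.y - lastSent.y;
        delta.z       = now.z - lastSent.z;
        delta.heading = static_cast<std::uint16_t>(now.heading - lastSent.heading);
        return delta.x || delta.y || delta.z || delta.heading;
    }
};
```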
By far, the biggest crime of the old update system is in-place sorting. Everything it does uses linked lists that are sorted in-place.
The loop starts by iterating over every vehicle in the game, including those that aren’t on this host – so it can do its “is it alive?” check.
Optimization #1: stay on task, only iterate over the active vehicles for this loop.
For each vehicle, it checks to see if it is ready for another update, assigns it a priority based on time since last update, and then in-place sorts it onto a linked list.
Optimization #2: take the weight and a pointer to the vehicle, and stuff them into a simple array.
Optimization #3: when the list is built, if it is larger than the max updates allowed per iteration, sort it using a fast, parallel sort algorithm.
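Optimizations #2 and #3 together could be sketched like this – names are hypothetical, and I’ve used `std::partial_sort` as a stand-in since only the top entries matter (the post mentions a fast, parallel sort, which could be swapped in here):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vehicle { int id; };

// Optimization #2: (weight, vehicle) pairs in a flat array instead of an
// in-place-sorted linked list.
struct Candidate {
    float    weight;            // priority, e.g. time since last update
    Vehicle* vehicle;
};

// Optimization #3: only sort when the list exceeds the per-iteration cap,
// and only fully order the front of the array, highest weight first.
void selectForUpdate(std::vector<Candidate>& list, std::size_t maxUpdates) {
    auto byWeightDesc = [](const Candidate& a, const Candidate& b) {
        return a.weight > b.weight;
    };
    if (list.size() > maxUpdates) {
        std::partial_sort(list.begin(), list.begin() + maxUpdates, list.end(),
                          byWeightDesc);
        list.resize(maxUpdates);   // drop everything past the cap
    } else {
        std::sort(list.begin(), list.end(), byWeightDesc);
    }
}
```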
Once we have the list of vehicles to update, the host starts iterating over each vehicle and doing a series of checks on the vehicle’s data and status (sending EWS info to strat, sending area chat info to the chat host, etc).
Optimization #4: none of these tasks needs doing every update, and none is done more than once per second; move them to the once-a-second “is it alive?” loop that checks for ghosts/timed-out vehicles.
Now we get to the “World Update” phase. This is where we determine which vehicles the player can see and any strat updates they may need.
Every vehicle in the game world has a flag, “inUpdate”, which tells the code whether or not it is in the update currently being generated (thread-unsafe, and causes writes to ~1000 pages of memory).
Then we check the player’s previous vis list: First, the HOLDERS are all removed from the player’s list and moved to a temporary (again, in-place sorted) list. We then extract each HOLDER from the temporary list and check to see if it’s still valid.
Next we perform a subset of the main vis-list rules: Is this the vehicle I’m riding (so I have to see it)? Is this my vehicle (if I’m multi-crewed)? Is this vehicle still in range?
Based on this, a score is assigned, and then the vehicle is placed onto a “biasing” list that tries to further prioritize vehicles based on the viewer and the subject.
Then we go through the vis list and set each vehicle’s inUpdate flag to TRUE.
Any vehicles that were deemed no-longer visible during this process have their HOLDERs shuffled onto a different (global) list so that we can tell the client which vehicles to stop drawing.
Optimization #5: Since there is a hard-limit to the number of vehicles you can see, there’s also a hard-limit to the number you might stop seeing in any one update. So add a simple array of vehicle IDs to the world update structure and simply add each removed vehicle to the end of the list and put the HOLDER directly onto the free list.
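Optimization #5 might be sketched like this – the size and names are illustrative, not the real structures: because the vis limit bounds how many vehicles can vanish in one update, a fixed array with a count suffices, with no allocation and no list shuffling.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxVis = 128;   // hypothetical vis-list limit

struct WorldUpdate {
    // IDs of vehicles the client should stop drawing this update.
    std::array<std::uint32_t, kMaxVis> removed{};
    std::size_t removedCount = 0;

    void addRemoval(std::uint32_t vehicleId) {
        // Can never overflow: you can't lose more vehicles in one update
        // than you could possibly have been seeing.
        removed[removedCount++] = vehicleId;
    }
};
```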
Optimization #6: Check the Impression list to see if a vehicle occurs; guaranteed to be faster than writing 1000 pages of memory every time. Also makes the operation thread-safe, so that the cost can later be amortized by running several iterations at once.
Optimization #7: Don’t in-place sort. The vis list is now an unsorted std::vector of Impressions. If we remove vehicles, I use the “tail-swap trick”:
std::vector<int> foo ;
std::vector<int>::iterator it = std::find( foo.begin(), foo.end(), 4 ) ;
if ( it != foo.end() )
{
// tail-swap trick: overwrite the removed element with the last one
*it = foo.back() ;
foo.pop_back() ;
}
The cost of this is the copy of foo.back() into *it plus the pop_back() that decrements foo’s size counter.
The host now has to look up where the player’s vehicle is. I’m really not sure why this information isn’t kept. But every time an update is sent, the host has to spend nearly 5000 CPU cycles figuring out where in the world you are.
Optimization #8: figure out where I am when I move into a grid space; the cost is about the same, but the frequency is vastly reduced.
Knowing where you are is only half the story. The host now has to calculate which other spaces might be in your vis range. It uses a fairly efficient hash based system that spirals out from your current grid location and places the address of each space into a list. (This flattening of the spiral is called the “GRID_STRIP”).
Then, the code iterates through the list and looks up the space matching the coordinates. For all this effort, what we get is a pointer to a holder. The holder has a prev and next, which provide the internal chaining of the grid itself (the whole grid is one huge, pre-allocated doubly linked list).
It also has a pointer to another type of holder, which contains a prev-and-next for iterating over the contents of the space, and a pointer to a vehicle (or a vehicle holder – I was never entirely clear which it actually was; it was always safest to dereference its vehicle pointer and know you definitely had a vehicle at that point).
Optimization #9: use an unordered_map to hold the grid: this does away with constraints checking, and create a grid-per-instance so we won’t have to check every vehicle we might see for being in the same instance.
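A minimal sketch of Optimization #9 – the key layout is my assumption, not the real code: pack the two grid coordinates into one 64-bit key and let an unordered_map handle the lookup, so only populated spaces consume entries and there’s no bounds checking.

```cpp
#include <cstdint>
#include <unordered_map>

struct Space { int population = 0; };

// Pack (gx, gy) into a single hashable key: x in the high 32 bits,
// y in the low 32 bits.
using GridKey = std::uint64_t;

inline GridKey makeKey(std::uint32_t gx, std::uint32_t gy) {
    return (static_cast<GridKey>(gx) << 32) | gy;
}

// One Grid per instance, so vehicles in other instances never even
// appear in these lookups.
using Grid = std::unordered_map<GridKey, Space>;
```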
Optimization #10: give each element in the grid a concept of a “neighborhood” – that is, the list of nearby spaces that are active (not empty). When a space becomes active, register it with its neighbors. When it becomes inactive, unregister it. Thus the “what’s nearby” evaluation is done at a significantly decreased frequency. Per-update CPU saving: ~60,000 cpu cycles.
This is partly because the “spiral” system used by the old system is absolutely guaranteed to conflict with CPU caching. The old code is almost designed to hurt a modern CPU :(
Optimization #11: Use arrays (std::vectors) where possible so that data is spatially coherent in memory, improving cache performance.
The host now goes through the vehicles in each grid space it looked up. It allocates a new HOLDER, copies vehicle information into it, and then performs a variety of tests to see if the vehicle is suitable, if it is already in the update list, if the vehicle has timed out and should be kicked, etc.
See Optimization #4: we do fewer of these checks per-viewer and instead do them once per vehicle at the appropriate frequency.
See Optimization #6: here we actually have to pay an increased price, checking all the entries in the current vis-list for a matching vehicle ID. Depending on further testing, I may actually maintain an unordered set of in-list IDs to reduce the cost of this in high-density areas where there are a lot of vehicles you wind up not seeing.
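That side-index idea might look like this sketch (hypothetical names): alongside the Impression array, keep an unordered_set of the vehicle IDs currently on the vis list, so “is this vehicle already visible?” becomes an O(1) hash probe instead of a linear scan – worthwhile in dense areas with many rejected candidates.

```cpp
#include <cstdint>
#include <unordered_set>

// Membership index maintained alongside the vis list: must be updated
// whenever an Impression is added or removed.
struct VisIndex {
    std::unordered_set<std::uint32_t> ids;

    bool contains(std::uint32_t id) const { return ids.count(id) != 0; }
    void add(std::uint32_t id)            { ids.insert(id); }
    void remove(std::uint32_t id)         { ids.erase(id); }
};
```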
Each vehicle that is a candidate for being seen gets placed onto a biasing list. This involves looking through my vehicle’s biasing preferences: How many friendlies do I want to allow for? How many troops? How many boats? Etc.
Each addition to the biasing lists performs an in-place sort. If a list grows too long, the last element is removed. If it was on our previous vis list, then it gets added to the remove list.
Optimization #12: don’t allocate holders … yet.
Optimization #13: take pointers to the previous vis list and new vehicles being considered for addition, and store them unsorted.
Optimization #14: only if the resulting list is bigger than the player’s vis list should we apply biasing.
Optimization #15: consider each rule in my biasing list once, allowing me to:
Optimization #16: produce a simple biasing operator specific to each rule that can be applied with extreme efficiency to all vehicles.
Optimization #17: instead of writing the weights to the vehicle/holder, store them along with the pointer.
Optimization #18: pre-sort the list based on their general vis-desirability weight, then apply the bias-rules rule-at-a-time to the list, transferring the bias-favored vehicles onto the waiting vis-list, using an incremental “bias number” index to assign them weight.
Optimization #19: once all the biased vehicles have been transferred to the vis list, we can tell if there is room left for the remaining vehicles. If there is, re-sort the remaining “to be biased” pointer/weight list, and copy as many entries as will fit. Remaining entries are checked for transfer to the removal list.
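Optimizations #15–#19 might fit together like this sketch – rules, quotas and names are all illustrative assumptions: pre-sort the candidates once by general weight, sweep each bias rule over the list exactly once transferring its favored vehicles, then fill any remaining room from the already-sorted leftovers.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Vehicle { int id; bool friendly; bool troop; };

struct Weighted {
    float          weight;    // general vis-desirability
    const Vehicle* vehicle;
};

// One predicate + quota per bias rule (the real rules are things like
// "how many friendlies / troops / boats do I allow for?").
struct BiasRule {
    std::function<bool(const Vehicle&)> matches;
    std::size_t quota;
};

std::vector<const Vehicle*> applyBias(std::vector<Weighted> candidates,
                                      const std::vector<BiasRule>& rules,
                                      std::size_t visLimit) {
    // Pre-sort once by the general weight (Optimization #18).
    std::sort(candidates.begin(), candidates.end(),
              [](const Weighted& a, const Weighted& b) { return a.weight > b.weight; });
    std::vector<const Vehicle*> visList;
    // Apply each rule in one pass (Optimizations #15/#16); transfer order
    // onto the vis list acts as the incremental "bias number".
    for (const BiasRule& rule : rules) {
        std::size_t taken = 0;
        for (Weighted& w : candidates) {
            if (visList.size() >= visLimit) return visList;
            if (w.vehicle && taken < rule.quota && rule.matches(*w.vehicle)) {
                visList.push_back(w.vehicle);
                w.vehicle = nullptr;      // consumed; skipped by later rules
                ++taken;
            }
        }
    }
    // Room left over goes to the best remaining candidates (Optimization #19);
    // they are already in weight order.
    for (const Weighted& w : candidates) {
        if (visList.size() >= visLimit) break;
        if (w.vehicle) visList.push_back(w.vehicle);
    }
    return visList;
}
```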
The old system has to merge all of the bias lists together, which is again done with an in-place sorting linked-list insertion and potential trim. Garbage collection has to be performed since we are using globals out of the wazoo.
Optimization #20: our list is already good to go :)
We now have a full vis list for the player to see.
Except, we don’t send updates for the entire vis list; because we don’t have deltas, we can’t quite fit enough data into the old 522-byte packets to describe everything. As it is, each update is between 58 and 73 bytes, averaging around 65 bytes. And we have a bunch of other data we’re going to stuff into this packet, including strat data (building states).
To work around this and squeeze more updates into each packet, we use 3 different levels of update. For players further away, we send lower-resolution data, and omit some data entirely on some updates to squeeze more into each packet.
So we create a new HOLDER list. We go through the vis list and, based on weights we calculated before, create a new HOLDER, copy the holder information across, and calculate a weight based on how long since this person last sent you an update. Then we use another sorted-list insertion into the “level” list. If a holder pops off the end of the list, we put it back onto the free list.
Once we’re done, this means we have 3 holder lists that form the list of vehicles we’re going to update at different fidelity levels.
This is essentially just a re-sort of the list, but a very expensive one.
Optimization #21: use the same pointer/weight trick as used for biasing.
Optimization #22: since we no longer support dialup with its 576-byte MTU, use larger packets (1400 bytes including TCP/IP headers etc), allowing us to send more vehicles at higher fidelity per packet. This saves having to store/calculate the alternate fidelities of data.
Optimization #23: calculate the “resend” priority back when we were either checking the previous vis list or adding a new Impression.
We’re almost there!
The host now goes through each of the selected vehicles and calculates how much data will be needed to describe it. It uses this to determine how many it will actually write.
Now it goes through them again and writes them into the pending update packet. After each one, it checks to see if there is enough space left – because the calculations have sometimes been wrong in the past. Sigh. And because deltas (if they were working) would change the actual bytes used.
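That defensive write loop could be sketched like this – capacity and layout are illustrative: even after estimating sizes, each write re-checks the remaining space, because past estimation bugs (and working deltas, eventually) can change the actual bytes produced.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kPacketCapacity = 1400;   // larger post-dialup packets

struct UpdatePacket {
    std::uint8_t data[kPacketCapacity];
    std::size_t  used = 0;

    // Append one encoded vehicle update, re-checking space first -- the
    // size estimate can differ from what the encoder actually produced.
    bool tryWrite(const std::uint8_t* bytes, std::size_t len) {
        if (used + len > kPacketCapacity)
            return false;                        // packet is full; stop writing
        std::memcpy(data + used, bytes, len);
        used += len;
        return true;
    }
};
```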
Now our world update is just about ready. To finish off, we want to cram in any strat data. When we supported dialup, this was crucial. The world update packets had to be extremely efficient bandwidth wise. Reducing the number of packets sent ensured smoother traffic flow (in theory).
So the host must keep a record of strat data that hasn’t been sent to this client yet. This used to be done in a separate process. Both the cell and the strat-updater processes ran on the same machine and talked to the player through a proxy that also ran on the same machine. This added an extra “stack” (layer of communications) that increased the cost of every message sent and received (roughly double+). I removed that quite a long time ago and merged the two processes. A net improvement in overall host performance.
However… For every update that has to be sent to a player, this strat updater grid has to be checked for any events the player has yet to see.
Optimization #24: re-use the new grid system to queue events to players and dispatch them immediately; when a player changes spaces, send them the latest state for that space.
Optimization #25: since event-changes are comparatively rare, track a much larger radius for players and send them updates for spaces they have previously visited. Trickle them states for further-out cells separately.
Optimization #26: sending removals and strat data separately, after the update loop is complete, increases the “freshness” of vehicle data in all levels of cache and of the update arriving at the client machine.
The two next big steps of this process are to saturate this new system with unit test hooks (I’m going to write a unit test module that simply #includes the main code to avoid making it unreadable) and then to rewrite the code for generating the updates.
Our current method squeezes data for troops, tanks, trucks, guns, aircraft and ships into the same data structures. I’m working with Ramp to create vehicle-specific updates. In a sense, this is a de-optimization, since it means different vehicles run different code paths. But it also means that the data can be treated more specifically, allowing less data to be sent most of the time, meaning more useful data is sent per packet, which means fewer vehicles competing for packet space.
That may potentially allow us to separate “vis limit” from “bandwidth” – so players with kick-ass bandwidth can get more and more updates per second (while seeing the same number of vehicles). That will pan out as an optimization since the same vehicle will have less list reprocessing.
We’ve also been discussing ways to more closely tie the client/host data through all this to produce much better 3rd person representations.
Lastly… A key assumption in my approach is that all cell<->client updates will be sent via NetCode2 and not with our old network API. This offers a ton of advantages, not least of which is that the NC2 method uses 30% less CPU, is far more cache friendly, and is threaded.
That is the change that opens the door to dynamic mission origins and targets (doing all your mission stuff while spawned, being able to respawn after dying without having to go back to the map, etc) for another release down the line…
I’m really hoping we can put this together very quickly for a really nice 1.32 update, if not a 1.31 point release.