The new grid: Sexy and short.

Yesterday I got the new cell host’s Grid System code integrated, up and running for the first time. Slight frame-of-reference whoopsie on the first run: I’d commented out some frame-of-reference code, so if you weren’t at 0,0 you couldn’t see anyone :)

I’m stuck for words describing the elation this has induced in me and why.

I want to leap, gushing, into what it’s already allowed me to do, the immediate benefits and gains I can see as the cell host’s godfather. But I’m also aware that what I write here ends up in the forums and would likely get blown out of proportion :) So I’ll start with some obvious tech-head droolery.

$ for file in host/hostGrid.cpp updt/updtPackers.cpp
> do
>   wc -l pn-{1.32,1.33-newgrid}/WW2/src/${file}
> done

 1379 pn-1.32/WW2/src/host/hostGrid.cpp
  557 pn-1.33-newgrid/WW2/src/host/hostGrid.cpp
...
  2364 pn-1.32/WW2/src/updt/updtPackers.cpp
  1315 pn-1.33-newgrid/WW2/src/updt/updtPackers.cpp

And to give perspective…

1.32 host/hostGrid.cpp comments: 145
1.33 host/hostGrid.cpp comments: 275
1.32 updt/updtPackers.cpp comments: 163
1.33 updt/updtPackers.cpp comments: 209

How to describe what I’ve done… Hmm.

This particular project tackles the dissemination of “updates” to clients.

Old system:

* Store large amounts of data from events and then marshal it per-player into a singular “update packet” at various intervals, leading to a packet called the “World Update”.

* Avoid or defer managerial tasks until the data needs to be accessed.

Pros: A single, highly condensed packet reduces the chances of fragmentation and/or packet loss on low-bandwidth connections.

Cons: Significant memory pressure: data for any given player’s world update ends up being sparsely distributed across memory. Since the system’s development, CPU:memory performance has changed in favor of processing over storage. As a result, the process of building a world update becomes heavily dependent on throughput of the system’s memory bus.

Cons: Incredibly thread-unsafe and inherently single-threaded, meaning that while any singular world update is being generated, nothing else can be happening.

Cons: The space available in a single 576-byte packet (minus 55 bytes of message overhead -> 521 bytes of data payload per world update) limits the number of avatar updates that can be supplied per update.

New system:

* Dispatch event messages as early as possible,

* Perform managerial tasks as early as possible within the context of the events that predicate them.

By “event messages” I mean stuff that was previously in the world update: game-time sync, despawn timer, strat events and vehicle removal messages.

All of these rapidly ate into the space available for actual player updates.

But sticking to an implementation perspective, they also diluted the application’s focus and incurred extra memory pressure.

For example, while generating each world update message, the code had to search for any strat events (ownership changes, building states) that would be appropriate to you. That requires accessing entirely separate areas of memory, well away from the primary focus of vehicle data, and executing significantly different code paths, reducing or annihilating potential cache efficiency.

By “managerial tasks” I mean things such as figuring out where on the game map you are, and what entities are thus in your immediate zone of interest, as well as really critical stuff such as figuring out which vehicles/connections need to be timed out (e.g. a vehicle that hasn’t sent any updates for 30s and needs to be booted back to the map screen, if the client hasn’t actually crashed).

The deferral of managerial tasks meant further divergence from the core process of marshaling and dispatching the vehicle updates, and imposed extra caveats on downstream code.

Worst of all, all of these components come together to reduce the determinacy of the task at hand. In order to generate your world update, all of this information has to be staged so that the system can figure out how much space it has to work with.

Having selected 24 vehicles to send updates for, the system may then be confronted with insufficient space to send that many updates.

Step 1, already covered in previous posts, was to drop the deferral of strat events and instead broadcast them immediately to players. Player-experienced benefit: an independent and continual flow of higher-level strat updates (town ownerships etc); no more having to click on a town 20km away to see if it still has EWS or an AO.

The original system actually reserved space for up to 4 events in every World Update packet. Since you are receiving roughly 10 world updates a second, that’s 40 event slots per second. Even on the busiest of nights, with the server packed to capacity, you’d be very unlikely to average 1 strat update every 5 seconds over a 30-minute play session. So that’s 199 out of every 200 reservations wasted every 5 seconds… Duh.

Plus the server was having to look for strat events on a per-player, per-update basis. In a 5-second period on a cell with 1200 players, that’s 58,800 needless searches. And they weren’t exactly trivial searches, because they were intermingled with the whole general per-update process, so each one would have required that the strat stuff be brought out of L2 or L3 cache for the purposes of the search… That’s a lot of idled CPU cycles :( Nearly half a GHz of CPU time spent finding nothing every 5 seconds. Plus the “return to focus” time for the CPU to go back to looking at vehicles in cache after each strat search.

Step 2, manage the grid as necessary.

The old system had its own “Vehicle Grid” or “Cell Grid” or “Grid Cell System” or whatever it was called. It was convoluted and complicated by lots of variants of structures used for accessing the data, lots of data duplication, pointers to things and … chaos.

Credit where it’s due, the implementations were very efficient for 1999. But in the face of modern CPU architectures, it was really awful. And it was implemented in hand-crafted C, so every nuance was there to be understood and maintained and worked with.

The new system uses STL and boost constructs to deliver efficiency without having to implement the minute details. This also vastly simplifies it and allows for better tuning and optimization.

But best of all, it takes into account that transactions at the grid level – that is, a player moving between slots in the grid – happen very rarely. They’re therefore worth dealing with as they arise.

A really fast moving player may elicit one grid transaction every, say, 5 seconds, under the right circumstances. With the right configuration, the average player is more likely to elicit a grid transaction on spawn and another on despawn, and perhaps one additional transaction somewhere during their play session.

The old system essentially performed a grid transaction not only per player per update, but also per subject vehicle that was then considered for vis-list feasibility.

Even if everyone on the server decided to vibrate across grid cell boundaries to piss the new system off, it would still be performing tens of thousands fewer grid transactions per update cycle…

So the grid transaction process itself can afford to be more expensive. To that end, the grid system I wrote actually expends quite a lot of extra effort in ensuring that the update loop will have nice, handy and very efficient access to the data it needs to get cracking on your next update packet with minimal fuss.
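To give a flavour of the shape of it – and this is just an illustrative sketch with invented names, not the actual cell host code – the core idea is cells keyed by coordinate, each carrying a ready-made occupant list that only gets touched when someone actually crosses a cell boundary:

#include <cstddef>
#include <cstdint>
#include <algorithm>
#include <functional>
#include <unordered_map>
#include <vector>

// Invented, simplified stand-ins for the real structures.
using VehicleId = uint32_t;

struct GridKey {
    int32_t x, y;                                  // grid-cell coordinates
    bool operator==(const GridKey& o) const { return x == o.x && y == o.y; }
};

struct GridKeyHash {
    std::size_t operator()(const GridKey& k) const {
        return std::hash<uint64_t>()((uint64_t(uint32_t(k.x)) << 32) | uint32_t(k.y));
    }
};

class Grid {
public:
    // Called only when a vehicle actually crosses a cell boundary, i.e. the rare
    // "grid transaction". It can afford extra work here, so the update loop
    // later gets a ready-made occupant list with no searching.
    void onCellChange(VehicleId v, GridKey oldCell, GridKey newCell) {
        auto& occ = cells_[oldCell];
        occ.erase(std::remove(occ.begin(), occ.end(), v), occ.end());
        cells_[newCell].push_back(v);
    }

    // The per-update loop just reads the pre-built list for a cell.
    const std::vector<VehicleId>& occupants(GridKey cell) const {
        static const std::vector<VehicleId> empty;
        auto it = cells_.find(cell);
        return it == cells_.end() ? empty : it->second;
    }

private:
    std::unordered_map<GridKey, std::vector<VehicleId>, GridKeyHash> cells_;
};

The real thing carries a lot more per-cell data than that, but the point is the asymmetry: pay at transaction time, read for free at update time.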

It also allows…

Step 3, use the grid to direct localized events (building states, AI) immediately to players

If a building is destroyed at X,Y, a single lookup of the grid system provides an immediate list of all the players within listening range of the event. A little extra overhead in the grid transactions translates that into a very handy “broadcast index”, essentially equivalent to a chat channel, allowing for really efficient dissemination of those incoming events with minimal impact on everything else going on on the server.
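As a purely illustrative sketch (invented names again, not the real code), the broadcast index is little more than a cell-to-listeners map maintained by those same grid transactions:

#include <cstdint>
#include <unordered_map>
#include <vector>

using PlayerId  = uint32_t;
using CellIndex = uint64_t;                        // packed cell coordinate

struct BroadcastIndex {
    // Maintained as players move between cells, i.e. during the rare grid transactions.
    std::unordered_map<CellIndex, std::vector<PlayerId>> listeners;

    // One lookup yields everyone within listening range of an event in this cell.
    template <typename SendFn>
    void broadcast(CellIndex cell, SendFn send) const {
        auto it = listeners.find(cell);
        if (it == listeners.end())
            return;
        for (PlayerId p : it->second)
            send(p);                               // e.g. queue the building-state message for p
    }
};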

Step 4, move events out of band of the update

When a player spawns, send them a game time sync straight away, and when the game time supervisor’s sync message is received by the cell hosts, send it straight on down.

Yeah, it’s trivial and it’s noddy … But game time syncs happen every N seconds. So it’s kinda stupid having the update loop check for a time sync on every update – 1200 players at 10 times a second is 12,000 checks, or 1,800,000 CPU cycles lost to branch misses every second. And that’s not allowing for the downstream checks that were required for the packet shaping that could potentially result.
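In sketch form (invented names, with the actual message send stubbed out), the change is just going from a per-update poll to event-driven sends:

#include <cstdint>
#include <vector>

using GameTime = uint32_t;
struct Player { int connection; };

// Stand-in for queuing a discrete time-sync message to one player.
void sendTimeSync(const Player& p, GameTime t) { (void)p; (void)t; }

// On spawn: send the sync straight away, once.
void onPlayerSpawn(const Player& p, GameTime now) {
    sendTimeSync(p, now);
}

// When the game time supervisor's sync arrives at the cell host: pass it straight on down.
void onSupervisorSync(const std::vector<Player>& players, GameTime t) {
    for (const Player& p : players)
        sendTimeSync(p, t);
}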

Another packet shaper was the list of vehicles that were leaving your vis list. Again, overall these were rare. For 99/100 updates you received there’d be no removals. Well, unless you were over your vis limit in which case you might get some “swapping” artificially elevating them, but then part of your problem at that point would be the reduction in actual updates caused by the removal list eating up precious World Update payload space…

So I moved all of these things out of band, to their own discrete messages.

A possible downside to this is an increase in the quantity of packets, but I can address this in the near-term future at the networking level by enabling packet aggregation until the update loop is completed, and then leaving the OS/network stack to handle the actual delivery of the packets to the wire on their own.
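Roughly, the aggregation would look something like this – an invented-name sketch rather than our actual network stack – with the flush happening once per update loop:

#include <cstddef>
#include <cstdint>
#include <vector>

struct Message { std::vector<uint8_t> bytes; };

class OutboundBatcher {
public:
    explicit OutboundBatcher(std::size_t maxPayload) : maxPayload_(maxPayload) {}

    // Out-of-band messages queue up here during the update loop.
    void queue(Message m) { pending_.push_back(std::move(m)); }

    // Called once at the end of the update loop: pack as many queued messages
    // as fit into each wire packet, then hand each packet to the socket layer.
    template <typename SendPacketFn>
    void flush(SendPacketFn sendPacket) {
        std::vector<uint8_t> packet;
        for (const Message& m : pending_) {
            if (!packet.empty() && packet.size() + m.bytes.size() > maxPayload_) {
                sendPacket(packet);
                packet.clear();
            }
            packet.insert(packet.end(), m.bytes.begin(), m.bytes.end());
        }
        if (!packet.empty())
            sendPacket(packet);
        pending_.clear();
    }

private:
    std::size_t maxPayload_;
    std::vector<Message> pending_;
};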

Step 5, do away with signaling bits in the World Update

Having stripped the World Update down to pretty much just vehicle updates, all that was left to get the new system working was to go through the “Update Packer”.

The need for conditional bits became moot: previously a flag was sent to indicate whether or not there would be this, that or the other, and lastly whether or not there’d be any vehicles. If so, then there’d also need to be some packet-shaping information describing how the vehicles would be expressed.

Since it’s only going to be vehicles, whether we use a bit or a byte for this is more-or-less immaterial – compared to the overhead of sending a packet, either is trivial. So we can simply send the vehicle count.
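In sketch form (invented names, not the shipped packer), the header work collapses to something like this:

#include <cstdint>
#include <vector>

// Stand-in for a fully marshaled per-vehicle update.
struct VehicleUpdate { std::vector<uint8_t> payload; };

void packWorldUpdate(std::vector<uint8_t>& out, const std::vector<VehicleUpdate>& vehicles)
{
    // One byte: just the vehicle count, instead of a series of "is there X?" flag bits.
    out.push_back(static_cast<uint8_t>(vehicles.size()));
    for (const VehicleUpdate& v : vehicles)
        out.insert(out.end(), v.payload.begin(), v.payload.end());
}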

This radically changed the look of the update packer, leading on to

Step 6, rip out as much of the packet-shaping stuff as possible and concentrate on filling available space rather than figuring out how much space I had to work in.

This project would have been done a week sooner if I hadn’t allowed myself to get a little too carried away at my first attempt to integrate it, where I pretty much ripped out the entire update mechanism with the intent of rewriting that too. While that would really deliver player-tangible changes, it would also take a lot more time than I expected just because it has tie-ins with the other coders that would slow it down a lot, and they’re all tight for time right now.

But with all of the out-of-band shaping options removed, the packer routine suddenly turned from illegible conditional code into a core of fairly manageable stuff.

Most importantly, it became clear that the previous system used a two-pass mechanism leading to a plethora of issues that have likely caused all kinds of bugs and problems thru the game’s history.

You see, the “update system” – the system that determines your vis list and what you need updates for – finishes at the hand-over of the list to the packer system, assuming the dispatch of the data it has marshaled to be a fait accompli. But, due to the packet shaping, there were all kinds of edge cases where stuff might not have room to get sent, and because the system was trying to stuff in everything but the kitchen sink, it was usually the vehicle updates that suffered. It all makes clinical, technical sense; it just doesn’t make practical sense in the context of delivering a solid gameplay experience :(

Hindsight is 20/20, and I have the luxury of having had years to watch the flow of these packets and learn the trends that shape these kinds of decisions. The reservation of space for strat updates was sensible, since it meant that when you spawned in you would get a fair balance between initial strat updates and initial vehicle updates.

But that requirement only existed because it hadn’t yet been determined that this was an edge case and that, by and large, the relative frequencies of data populating the packet should indicate that strat updates weren’t appropriate content for the packet.

The old code started, on the host, by going thru the vehicle list it had been handed to generate updates for, for a given viewer, analyzing various fields (i.e. fetching from memory in a highly sparse/random-access fashion) to build a set of masks for the actual update, and from that calculating the approximate amount of space required by the resulting update.

From this it could then figure out where it would have to stop, and then proceed to go through the shortened list of vehicles to re-fetch the data and produce the final update payload.

Here I stumbled on a corker: As the first pass went thru a given viewer’s candidate list, it copied various values and timestamps, such as the current firing state, into the viewer’s update list.

After this was done, it then performed the “will there be room for this?” check. If not, the first pass would stop.

Which means that if you were *about* to be sent an update for an SMG that had just stopped firing, but there wasn’t enough room left for that update, well, your “last firing state” value for him had already been updated…

Pretty sure this has been the cause of all kinds of things well beyond the obvious stuck-firing bug.

With all of the signaling bits removed, I was able to consolidate the two passes into a single pass. This amounts to a major performance boost to the update loop, ensuring far more timely delivery of updates to players and efficient processing by the cell host.
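The essence of the single-pass version – again an illustrative sketch with invented names rather than the real packer – is that the “will it fit?” check happens before any per-viewer state is committed:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Candidate {
    uint32_t vehicleId;
    bool     firing;                               // the state we'd be telling the viewer about
    std::vector<uint8_t> encoded;                  // the marshaled update bytes
};

struct ViewerState {
    // Per-viewer bookkeeping, e.g. the last firing state we actually told this viewer about.
    std::vector<std::pair<uint32_t, bool>> lastFiring;
    void remember(uint32_t id, bool firing) { lastFiring.emplace_back(id, firing); }
};

std::size_t packUpdates(std::vector<uint8_t>& out, std::size_t spaceLeft,
                        const std::vector<Candidate>& candidates, ViewerState& viewer)
{
    std::size_t packed = 0;
    for (const Candidate& c : candidates) {
        if (c.encoded.size() > spaceLeft)
            break;                                 // out of room: stop before touching any state
        out.insert(out.end(), c.encoded.begin(), c.encoded.end());
        spaceLeft -= c.encoded.size();
        viewer.remember(c.vehicleId, c.firing);    // commit state only for what actually went out
        ++packed;
    }
    return packed;
}

In the old two-pass arrangement, the equivalent of remember() ran in the first pass, before the room check, which is exactly how the stale firing state crept in.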

I was also able to find several data duplications. For example, a player riding another vehicle gains 64 bits of extra position information that I’m still not sure we need to send; worse still, it got sent twice, I think because someone trying to fix a different issue at some point found it very hard to track exactly what was sent where, the code being spread out so illegibly by all the conditional code and the disparate client/server variants of things.

The final product is a significantly streamlined, minutely improved, much less buggy, and far more aggressive update cycle.

CAUTION: The following describes internal test conditions and not live play conditions. You should not read it as “omg gameplay pwnzored”, because it does not accurately reflect improvements that will be made to gameplay over internet conditions.

When I ran the 1.32 binaries on our internal dev cluster, I saw a 15-20ms ping time with 200 infantry moving around, firing, etc., in test conditions.

When I ran the 1.33 binaries on our internal dev cluster, playing back the same test, the result was noticeably smoother, handling a variety of edge cases more cleanly, and the ping was down to 1-2ms.

On the live server, this might translate to a single-digit ms ping reduction, regardless of what your ping time is – 70ms or 700ms. If you are reading the above as a percentage change, you’re gonna be gnashing your teeth when it goes live.

Plus, in both tests, I used a playback rather than actual live players. The old code is well within performance tolerances, so it is not a factor in negative gameplay experiences such as lag etc. It is the transformation of the data that is more significant; the increase in performance is a gap to be filled. Having freed up so much CPU, I can now put it to better use in tuning the data that actually gets sent – such as ensuring less redundant data goes out (right now every vehicle update always includes the vehicle’s position, taking up some 60-odd bits, even if the vehicle hasn’t moved).
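Purely as a speculative sketch of that kind of trimming – none of this is implemented, and the names and layout are invented – it could be as simple as spending one flag byte (or bit) to say “position unchanged”:

#include <cstdint>
#include <vector>

struct Position {
    int32_t x, y, z;
    bool operator==(const Position& o) const { return x == o.x && y == o.y && z == o.z; }
};

static void appendPosition(std::vector<uint8_t>& out, const Position& p) {
    const int32_t vals[3] = { p.x, p.y, p.z };
    for (int32_t v : vals)
        for (int i = 0; i < 4; ++i)
            out.push_back(uint8_t((uint32_t(v) >> (8 * i)) & 0xFF));
}

void packVehiclePosition(std::vector<uint8_t>& out, const Position& current, Position& lastSent)
{
    if (current == lastSent) {
        out.push_back(0);                          // flag: position omitted, client keeps what it has
    } else {
        out.push_back(1);                          // flag: position follows
        appendPosition(out, current);
        lastSent = current;
    }
    // ...the rest of the vehicle update would follow here...
}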

Even with the very minimal changes I made to the first test version of the code on Thursday, the improvements were really nice to watch on the dev cluster. We won’t know how well they pan out under internet conditions until I’ve finished this work and we’ve tried it in open beta.

I also freed up enough space that we can comfortably squeeze in quite a few more updates: we haven’t done this yet, and there are some considerations we have to make before it can be done, such as how to deal with running out of space. Our network stack requires us to work within fixed packet size constraints, and giving the update packer the capability to spill over into a second packet will not be entirely trivial. Not terribly difficult, but not trivial, and it brings several extra considerations. But for now, you get at least as many updates as you are supposed to be getting with the old code, which will translate to more than you actually get in practice, which means smoother gameplay.

These changes have opened several doorways to seriously improving the feel of infantry play, but that’s work for another dev cycle I suspect. From a developer’s perspective, the really good news is that … there’s nothing remotely intimidating about having to revisit the update code right now :)

16 Comments

I spent my week struggling with improving the performance of some UDP client/server code. Both the client and server are Windows apps. I’m polling select and my best ping time is twice my update rate, or about 30ms. I was wondering what method you use to read/write your sockets that you can get 1ms ping times?

One more question, you said your max packet size is 576 (which I find is the maximum guaranteed size for a packet not to be split.) I read John Carmack saying that you can do 1200 pretty safely today. Have you considered larger packets or doing MTU measuring for individual connections?

I was wondering what method you use to read/write your sockets that you can get 1ms ping times?

Asynchronous poll/recv. Grab the data hot off the wire and process it for receipt while the rest of the client is doing the rendering or whatever. You could then use either mutexes or something like zeromq to forward the ready data to the application.
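Something along these lines, in POSIX terms – a rough sketch rather than our actual client code – with a dedicated receive thread handing datagrams to the game thread through a mutex-guarded queue (zeromq or a lock-free queue would do the same job):

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

std::mutex                       g_rxMutex;
std::queue<std::vector<uint8_t>> g_rxQueue;
std::atomic<bool>                g_running{true};

// Run on its own thread, e.g. std::thread rx(receiveLoop, sock);
void receiveLoop(int sock)
{
    pollfd pfd;
    pfd.fd = sock;
    pfd.events = POLLIN;
    uint8_t buf[2048];
    while (g_running) {
        pfd.revents = 0;
        if (poll(&pfd, 1, 100) <= 0 || !(pfd.revents & POLLIN))
            continue;                              // 100ms timeout so shutdown is noticed
        ssize_t n = recvfrom(sock, buf, sizeof buf, 0, nullptr, nullptr);
        if (n <= 0)
            continue;
        std::lock_guard<std::mutex> lock(g_rxMutex);
        g_rxQueue.emplace(buf, buf + n);           // hand the datagram to the game thread
    }
}

// Called from the game/render thread: drain whatever has arrived, without blocking.
std::vector<std::vector<uint8_t>> drainReceived()
{
    std::vector<std::vector<uint8_t>> out;
    std::lock_guard<std::mutex> lock(g_rxMutex);
    while (!g_rxQueue.empty()) {
        out.push_back(std::move(g_rxQueue.front()));
        g_rxQueue.pop();
    }
    return out;
}

Because the receive thread blocks in poll() rather than spinning, there’s no need for an explicit sleep, and the data is already queued by the time the render loop next looks for it.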

MTU measuring is unreliable for long-term connections unless you plan to do it continuously, which is a PITA and would generally result in an increase in CTHLs in a gaming context.

Some time ago I added the capability for our network stack to do packets above 576 bytes, and since 1.30 (if memory serves) I changed it so that low-vis connections have a 1200-byte max while medium and high have a 1400-byte max (it was 1440 for a while but so many people seem to have a 1400-byte limit that I just decided to lose the extra 40 bytes).

Until now it’s just been too complex to really address the payload content to make aggressive use of the extra capacity.

There’s also an issue about where the payload generation sits in the code stack. Right now it remains below the allocation of the payload space, so the packet size limit is a hard ceiling on how much of the marshaled data can go out; there’s no mechanism for saying “I need another packet” when this approach is taken.

The refactoring that was involved in the above means that it’s now practical for me to populate more than one packet if the payload absolutely requires it.

So you have a separate thread that is doing select/recvfrom with a sleep to keep it from eating all cpu cycles? I’ve been looking at IO Completion ports tonight, I have to think about it some more.

I would have thought packet routes would be relatively stable so MTU measuring would at least give you a good guideline. For me, I’m using 1200 byte packets, so finding out that a client can only do 576 would probably make it worthwhile at least doing an initial test.

In the past, it’s seemed from a user perspective that server/network code improvements generally have been “spent” on new game features and functions.

We’ve been told since the early days that multicrewing (i.e 2 up) is a stretch and polycrewing (i.e. 3+ up) is out of the question. While it’s never publicly been made clear what the real issues are, I think it’s been implied that they have to do with the resulting server workload to keep multiple clients closely synced. Could improvements like those you’re making to the server/network system potentially contribute…if CRS were to decide to “spend” them that way…toward making polycrewing practical?

jwilly “a player riding another vehicle gains 64 bits of extra position information I’m still not sure we need to send”

what he meant to say is poly crewing in 1.34.

I don’t know where that’s ever even been hinted at, JW.

At first glance I read the title as, “The new girl: sexy and short.”

WoT Attack … ;o)

Sounds like you had fun dissecting the old grid…. glad to hear you finally had a chance for a complete rewrite. I hope things will work out as planned. The inf guys will be more than happy to hear that most of their CQB problems could be a thing of the past.

Caution, highly interpretative and from memory: When will you get a chance to combine the rewritten grid with the new player object (talking about the unified representation structure for all host processes in the server cluster) to get us real polycrewing, inflight spawn and make all kill credit/fire bugs a thing of the past?

That sounds brilliant. Very exciting news in terms of the changes. Nice of you to put the test results below that re-affirmed what i was THINKING it may do directly.

Of course the live server will differ in results and effects, but it sounds like a huge leap forward and will explain why I’m sometimes seeing heavier-than-expected CPU work for what’s really going on, due to the number and order of required updates.

Sorting them out will do nothing but help the player experience and also open up options to you as a programmer I would guess.

Great news look forward to seeing it go into test at some point.

just thought of a question. so, if i’m reading correctly it seems all this ‘could’ lead to increased update rate such that ghosting infantry would be lessened.

if that much is correct then in the same vein 1 way collisions in the sky could also be reduced yes?

or am i completely off here?

From my limited testing in the open beta, the whole game has a very snappy feel to it. Nothing like the plodding in 1.31 or 1.32. Threw a grenade with 1 account, immediately switched to the other account to observe it in flight. I swear the lag felt smaller than the 300-340ms roundtrip latency I have from the UK.

The new grid isn’t in beta yet… I’m working on some other items at the moment, and we didn’t want my changes to confuse the TPR stuff.

hmm so the readme is not exactly accurate then.

KFS1, any chance the new grid system will make it into 1.33? Or is 1.34 more likely? Would be a shame, cause for me it is a more interesting change than the whole 3rd person eye candy.
