Productive day

I was late in to work this morning; I slept poorly and I think my stomach was making pancakes. The sort of portentous start that gives rise to the simple wish that tomorrow isn’t going to be “undo day”.

So I was quite impressed when 7pm rolled around and I finished off my “updates” email to the production list only to find it was some 50 lines long. I had some extremely useful discussions with Rickb, consoled Latham on agonies of having to ban people when what you live for is trying to make a game for them to play, fixed some hugely annoying gotchas in the web tools, finished the reorganization of supply/resupply into a far more sensible single system rather than 3 separate systems (eliminating a couple of spaghetti loops), got several Lua scripts caught up, and made some headway on the old “fire-death” issues.

I’d been averse to using the resupply system to “move” equipment when a brigade moves, but over the weekend I had a chance to run some metrics on it and came to the conclusion that it should generate a relatively small load on the ticket system, since Gophur expressly wanted to keep the number of deliveries small: 16 deliveries over, say, 30 minutes, with a 10-minute starting delay.

Don’t quote me on those figures, they are speculative.

Lastly, I managed to finish my switch over of the first systems to the cell hosts “chat grid” – that is the code that detects which CP you are near and tells the chat host where people are for area chat (which is in theory now redundant, I could just send the area chat messages to the cell hosts). The code has been working for a while; it just wasn’t authoritative for the information, which doubled the work load. The data was reaching the chat host but seemed to be complete garbage. I couldn’t figure it out.

I finally took the knife to it and wrote a series of debug functions to try and analyze the data’s transfer, arrival and reception. And then, just as I was about to test it, I finally saw the mistake:

writeDataBits(update, sizeof(update));


consoled Latham on agonies of having to ban people when what you live for is trying to make a game for them to play

If you’d be more trigger-happy with the ban-hammer and instead made a game for the people that don’t need banning, I might even come back to play. ;)

I like a good bit of drama, is this the guy who had 2 accounts and both accounts were in opposite HCs? I think that is damn funny :)

I would go for “Does ‘WriteDataBits’ really expect the size in bytes?”, but that’s too easy I think. It would depend on how the second parameter is used in that function.

Sres, is that true about the two accounts in opposite HCs? :)

I’ll guess, but keep in mind my C sucks :)

If update is a structure:
writeDataBits(update, sizeof(update) * 8);
If update is a string:
writeDataBits(update, sizeof(update) * 8 * 8);

Erh, if update is an array-of-char string, sizeof(update) * 8 * 8 will return 8 times the size of the string. If it is a pointer-to-array-of-char string, that will return 256 or 512, regardless of how long the string actually is.
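To make the array-versus-pointer distinction concrete, here is a minimal sketch. The names updateArray, updatePointer and stringBytes are made up for illustration and are not from the actual code; the point is only that sizeof sees the whole array in one case and just the pointer in the other, while strlen is what tracks the string's real length.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for "update"; neither name comes from the real code.
static const char updateArray[] = "hello";        // array of char: 5 letters + '\0'
static const char* updatePointer = updateArray;   // pointer to those same bytes

// sizeof on an array counts every element, including the terminator.
constexpr std::size_t arrayBytes = sizeof(updateArray);

// sizeof on a pointer is the size of the pointer itself (commonly 4 or 8),
// no matter how long the string it points at happens to be.
constexpr std::size_t pointerBytes = sizeof(updatePointer);

// strlen walks the string at run time and stops at the terminator.
inline std::size_t stringBytes(const char* s) { return std::strlen(s); }
```

So multiplying sizeof by anything only makes sense when the thing being measured really is the object you intend to send.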

We actually use a wrapper macro, since we dislike random constant numbers in the code

#include <limits.h>
#if CHAR_BIT != 8
# error "Cannot compile on system with CHAR_BIT != 8"
#endif
// Shift by 3 provides fast multiply and divide by 8
# define BYTES_TO_BITS(n) ( (n) << 3 )
# define BITS_TO_BYTES(n) ( (n) >> 3 )

I should probably add BIT_SIZEOF(n).
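For what it's worth, one possible shape for that macro is sketched below. This is a guess at what BIT_SIZEOF might look like, not the version that ended up in the codebase; it uses CHAR_BIT from the standard headers rather than a shift, and GridUpdate is a hypothetical struct used only to have something to measure.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// CHAR_BIT is the number of bits per byte on the target platform.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical: size of an object or type in bits rather than bytes.
#define BIT_SIZEOF(x) (sizeof(x) * CHAR_BIT)

// Hypothetical payload, only here to have something to measure.
struct GridUpdate {
    std::uint32_t playerId;
    std::uint32_t cpId;
};
```

With that in place a call site reads writeDataBits(pointer, BIT_SIZEOF(*pointer)), which says what unit is being passed.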

And yes – the error was that sizeof returns a size in bytes, which to a function that works with bits, is a bit confusing :)
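A toy reproduction of the mistake, for anyone following along. The BitStream type here is an assumption standing in for the real bit-oriented writer, whose actual signature and behaviour are not shown in the post, and GridUpdate is a made-up 8-byte payload. The sketch only demonstrates how passing a byte count where a bit count is expected truncates the data to an eighth of its size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a bit-oriented writer (assumption: the real one differs).
struct BitStream {
    std::vector<std::uint8_t> bytes;

    // Appends bitCount bits of data, rounded up to whole bytes for brevity.
    void writeDataBits(const void* data, std::size_t bitCount) {
        const std::size_t byteCount = (bitCount + 7) / 8;
        const auto* p = static_cast<const std::uint8_t*>(data);
        bytes.insert(bytes.end(), p, p + byteCount);
    }
};

// Hypothetical 8-byte payload.
struct GridUpdate {
    std::uint32_t playerId;
    std::uint32_t cpId;
};
static_assert(sizeof(GridUpdate) == 8, "sketch assumes no padding");

// The bug: sizeof yields 8 (bytes), which the writer reads as 8 bits.
inline std::size_t bytesSentBuggy(const GridUpdate& u) {
    BitStream s;
    s.writeDataBits(&u, sizeof(u));
    return s.bytes.size();
}

// The fix: convert bytes to bits before handing the size over.
inline std::size_t bytesSentFixed(const GridUpdate& u) {
    BitStream s;
    s.writeDataBits(&u, sizeof(u) * 8);
    return s.bytes.size();
}
```

In this toy version the buggy call puts one byte on the wire instead of eight, so the receiver would be unpacking mostly whatever was already sitting in its buffer.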

This code looks so 20th Century. You also need to account for machine/transmission dependencies. You probably need to stick to network-centric sizing and byte ordering, such as htonl or ntohl, etc. If your data is text then you also need to consider the text encoding, it’s a utf world now. If you are talking about ‘strings’ then you need to include the encoding, so that you have a function like LengthOfBytesUsingEncoding(data, encoding). Remember it’s a utf world. Trying to figure out the ‘size’ of a string is not a trivial task. The specific context is critical.

Have a nice day.

Not when the transfer is between hosts since all the hosts innately run on the same hardware.

The ability to write near-raw objects is a huge performance advantage. It puts a trivial number of operations between generating the data and putting it on the wire; for the receiving host it means that the reception overhead is little more than the actual IO/IP stack overhead, which is mostly off-CPU in the hardware we use.

That means we can quite comfortably saturate the gigabit backbone between hosts with messages without taxing the CPU.

But it’s not about messages per second, it’s about the turnaround time of any individual given message. The CPU on the cell hosts, for instance, is primarily there to crunch vehicle updates.

A hundred or two extra cpu cycles per message transmission on each end might not seem a great deal, but it rapidly adds up.

If you have idle cpu cycles and a saturated backbone then maybe you should think more about the content and frequency of the transmissions or bump the backbone. (yeah, I can be a wisea** sometimes). Optimizations are never optimum…
I should think that you would have learned that by now. Ouch. It’s why the stacks/standards were put there in the first place so that days could be productive. :)

Have an interesting day. s!

Did you just recently complete a mid-level management seminar?

– cpu cyclé

I didn’t say we have a saturated backbone – we’re not even close. But it wouldn’t hurt us if we did.

There’s no reason to either deconstruct/reconstruct the object, or expend the overhead of marking it up for reflection/etc. Generalization is never optimum.

Sending a 40 byte message to another host costs a total of 164 cpu cycles. Receiving it costs 118. The rest is offloaded to hardware. I’ve benchmarked 750Mb/s of 40-600 byte message packets of meaningful data between two hosts with minimal CPU impact. The cost of sending more bytes is tiny.

Whereas if you use a component-based system, you’re going to add 15-25% per byte.

template <typename T> void writeData(const T& outgoing) { writeDataBits(&outgoing, BYTES_TO_BITS(sizeof(outgoing))); }
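A slightly fuller sketch of the same idea, with a couple of compile-time guards added. Everything around the template is assumed for illustration: the global g_wire buffer and this writeDataBits are stand-ins so the snippet compiles by itself, and GridUpdate is a hypothetical payload. The guards are an addition to the one-liner, not a description of the real code.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Stand-in sink so the sketch is self-contained (not the real transport).
static std::vector<std::uint8_t> g_wire;
static std::size_t g_bitsWritten = 0;

inline void writeDataBits(const void* data, std::size_t bitCount) {
    const auto* p = static_cast<const std::uint8_t*>(data);
    g_wire.insert(g_wire.end(), p, p + bitCount / CHAR_BIT);
    g_bitsWritten += bitCount;
}

// The template works out the size itself, so a call site cannot pass the
// wrong unit, and the guards reject types that are unsafe to send raw.
template <typename T>
void writeData(const T& outgoing) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw writes need a trivially copyable type");
    static_assert(!std::is_pointer<T>::value,
                  "pass the object, not a pointer to it");
    writeDataBits(&outgoing, sizeof(T) * CHAR_BIT);
}

// Hypothetical payload.
struct GridUpdate {
    std::uint32_t playerId;
    std::uint32_t cpId;
};
```

The pointer guard is aimed at exactly the array-versus-pointer confusion discussed earlier in the thread: writeData(ptr) fails to compile instead of quietly sending 4 or 8 bytes of address.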

FYI: Like any scalable network architecture, the challenge in developing this kind of system is the optimization. The code itself is all relatively trivial message passing.

The C#/Java solution is to apply more CPUs to the problem – but that in itself defeats scalability.

There are a lot of projects out there that have insane price tags because they are well developed, high-performance, high-scalability Java/C# projects which could have had higher-performance, higher-scalability and a lower price tag if they’d been developed well instead of well developed.

They’re both great application languages. But even as Java gets faster, it continues its evolution into an academic fru-fru language.

Last year a friend of mine forwarded me details of a $2bn project he’s been part of. They have an immense cluster of servers providing the back end. The system is really elegant. It’s basically a point-of-sale system.

He went to great lengths to describe how it includes built-in functionality to provide them with an easy way to trace faults, naming objects, etc, etc. In essence, at the point of any fault, they can reverse engineer the entire lifetime of the components.

The document included defect tracking stats across the lifetime of the project. 48% of defects occurred in “lifetime tracking” assets.

Approximately 20% of CPU resources are devoted to lifetime tracking, and 25% to data encapsulation (although, some portion of that is lifetime tracking, imho).

I spent a couple of hours over the following weekend sketching out a fairly simple design:
– Two Oracle clusters (for fault tolerance) instead of 3 (their current implementation uses all three, so they don’t have any fault tolerance) and 1/15th the number of servers
– Stored procedures to replace about 30% of their code
– A simple, C-based, minimalistic network transport (reliable UDP)
– A simple encapsulation wrapper for all messages to identify message origin
– STL based object pooling to reduce memory contention with single-provider single-consumer mutex-free locking
– “core”s to provide stack access if any runtime debugging ever becomes necessary
– A simple logging system to provide minimal object tracking
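As an illustration of the single-provider, single-consumer, mutex-free item in that list, here is a minimal ring buffer sketch using std::atomic. It is only one way to read that bullet and is not the design that was actually sketched out; the SpscRing name and its interface are invented for this example. It is safe only when exactly one thread calls push and exactly one thread calls pop.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring. One slot is always left
// empty so that "full" and "empty" can be told apart without a counter.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2, "need at least one usable slot");

public:
    // Producer thread only. Returns false when the ring is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;
        }
        slots_[head] = value;
        // Release publishes the slot write before the new head is visible.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;
        }
        out = slots_[tail];
        // Release hands the slot back to the producer after the read.
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

A pool built this way could hand recycled objects (or their indices) from the thread that frees them to the thread that allocates them without either side ever taking a lock.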

They rebuilt major parts of the system in C++ and dumped the whole “lifetime tracking” which was essentially encumbering the customer with runtime, on-the-fly, self-diagnostics and gave them a high level of abstraction freedom while developing. It took them 4 months and they were able to deliver ahead of schedule and under budget, which got them all huge bonuses (and me a tidy little xmas present).

But this system was simple, there was no justification for near-scripting-level abstraction especially not at the overhead costs.

I’m not saying that the C++ code was easier to write than Java – but by improving the atomicity of individual atoms, they had to distribute less and were able to execute more atoms per CPU. Drawing everything in simplified the system as a whole and eliminated lots of clutter.

Kevin’s motto used to be “optimization is never optimum” too.

Hold on there fella, I’m not talking about whether or not to use Java or C++, or my CPU is bigger than your CPU, although the benchmarking does warm my heart. :) I’m talking about a day spent debugging the transmission of text from 1 node to another. A function that may have been working just fine for years then BAM! something’s broken and you are not sure where. “The relatively trivial message-passing” just broke. I’m emphasizing that even determining the number of bits to transmit is not, apparently, a trivial exercise.

I mentioned the old saw “optimization is never optimum” because it recognizes that the target that you are optimizing is not stationary. That simple ASCII text that you may have been transmitting for years worked fine until someone tried to send UTF16 or whatever and the # of characters was no longer equal to the number of bytes.
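A small sketch of that characters-versus-bytes gap. The helper names utf8CodePoints and utf16Bytes are invented for this example, and the UTF-8 counter assumes the input is already valid UTF-8; it is meant to show the mismatch, not to be a complete encoding library.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8 input.
inline std::size_t utf8CodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Bytes needed to transmit a UTF-16 string: each code unit is wider than
// one byte, so size() on its own is not the byte count.
inline std::size_t utf16Bytes(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}
```

For example, "café" encoded as UTF-8 is four characters but five bytes, and "hello" in UTF-16 is five code units but ten bytes on the usual platforms where char16_t is two bytes wide.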

If total fault avoidance/tolerance would be too expensive for your app maybe just improved fault detection/isolation would make a more productive day. Just lessons that have been paid in my time.


NetCode1 doesn’t lack constraints, it’s actually a fairly good system, and it has a number of features that actually make it more robust than most other message passing protocols I’ve seen (bi-directional packers so you only write the code once, for instance)

Unfortunately, there are constraints on how much I can explain about NetCode1 :) It would benefit in terms of robustness from a C++ encapsulation to expunge it of lots of duplicated functionality and overly-long functions that can be hard to diagnose.

But runtime efficiency was significantly more important when it was developed than maintainability. And I say that as a maintainer.

NetCode2 has to have a vaguely similar API to NC1 in order to make transplantation cost-feasible; however it’s also C++ from top to bottom so it does provide a number of robustness abstractions.

Efficiency is still king, however. The trick is to reverse the urge to use compactness to provide efficiency. Instead of writing

writeDataBits(pointer, size<<3) ;

be more explicit. Treat optimizations like you do subroutines, and create a macro/template/function for them. Let the compiler inline them if that makes for more efficient code.

writeDataBits(pointer, BIT_SIZEOF(*pointer)) ;

or better still – the template I created for exactly this purpose.

This specific segment of code calls for writeDataBits during development. Ordinarily, each transportable object defines a serialize() routine which packs and unpacks it to a datastream. However, that would have made the object incompatible with NetCode1 connections so for initial testing I had to go the more direct route.
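To give a flavour of what a single routine that both packs and unpacks can look like, here is a hedged sketch. None of this is NetCode1 or NetCode2 code: DataStream, its Mode enum, the field() helper and the GridUpdate layout are all invented for illustration. The idea is simply that the object lists its fields once and the stream decides which direction the bytes flow.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stream that runs in one of two directions.
class DataStream {
public:
    enum class Mode { Pack, Unpack };

    explicit DataStream(Mode mode) : mode_(mode) {}
    DataStream(Mode mode, std::vector<std::uint8_t> bytes)
        : mode_(mode), bytes_(std::move(bytes)) {}

    // Writes value when packing, overwrites value when unpacking.
    // Returns false if an unpack would run past the end of the data.
    template <typename T>
    bool field(T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "fields must be trivially copyable");
        if (mode_ == Mode::Pack) {
            const auto* p = reinterpret_cast<const std::uint8_t*>(&value);
            bytes_.insert(bytes_.end(), p, p + sizeof(T));
            return true;
        }
        if (cursor_ + sizeof(T) > bytes_.size()) {
            return false;
        }
        std::memcpy(&value, bytes_.data() + cursor_, sizeof(T));
        cursor_ += sizeof(T);
        return true;
    }

    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    Mode mode_;
    std::vector<std::uint8_t> bytes_;
    std::size_t cursor_ = 0;
};

// Hypothetical transportable object: the field list is written once and
// serves for both sending and receiving.
struct GridUpdate {
    std::uint32_t playerId = 0;
    std::uint16_t cpId = 0;

    bool serialize(DataStream& s) {
        return s.field(playerId) && s.field(cpId);
    }
};
```

Because the same serialize() runs on both ends, the sender and receiver cannot drift out of step on field order or width, which is the robustness benefit of writing the packing code only once.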

And in this particular case, having a fancy object serializer is definitely a non-starter: the host has to write up to 1500 of these things in one go; having them already lined up ready for transport in their natural state, with minimal to-wire overhead, means that the transport itself costs very little, and all of the remaining cost is in the processing on the chat host, which can process 1500 grid updates in 39,104 cpu cycles, which seems to take about 6μs to actually execute.

That’s a fairly long time for a single operation, but breaking it up into 10 events upped the total wall clock time spent executing the instructions to ~7.5μs.

Once the mechanism was working, I switched from using writeDataBits to using object.serialize() and made that call writeData() (the byte-oriented version; writeDataBits provides some extra diagnostic features).

Total CPU clock time spent “sending” 1500 updates to the chat host in one shot (basically, offloading it to network hardware) – under 1μs. Total time “receiving” and processing it on the chat host: under 8μs. That’s 32μs per second handling updates from all four full cell hosts if *everyone* was generating one every second (and they don’t – typically the rate averages out at 9-10 seconds per update per player).
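The arithmetic behind those figures, written out as a small sketch. The constant and function names are made up for this example, and the second function assumes the processing cost scales linearly with the number of updates, which the measurements above do not actually establish.

```cpp
// Upper bound quoted above for receiving and processing one full batch.
constexpr double kReceiveMicrosPerBatch = 8.0;

// Number of full cell hosts feeding the chat host.
constexpr int kCellHosts = 4;

// Worst case: every player on every host sends one update each second.
constexpr double worstCaseMicrosPerSecond =
    kReceiveMicrosPerBatch * kCellHosts;

// Assumption: cost scales linearly with update rate, so a player who
// updates once every N seconds contributes 1/N of the worst-case cost.
constexpr double typicalMicrosPerSecond(double secondsPerUpdate) {
    return worstCaseMicrosPerSecond / secondsPerUpdate;
}
```

Under that linear assumption, the typical 9 to 10 seconds per update per player works out to somewhere around 3.2 to 3.6μs of chat host time per second.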

Not sure what my Akismet filter has against you – sorry it took so long for me to spot & unspam your comment :)
