parallelization

Rewriting the grid.

Two of the core components of our game server systems are the poorly named “World Grid” and “Update System” (the names reflect what the systems do, but not their overall role in the game server cluster). The combination of these two is alternately called “The Grid” and “The Update System”.

In short: they divide the world up into small parcels of virtual space, track the players moving around in those spaces, and generate data packets describing movements between one player and another.
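To make that concrete, here is a minimal sketch of that kind of grid in C++. It is not the actual World Grid code: the names (`Grid`, `CellFor`, `CellId`, `PlayerId`, `kCellSize`) and the 100-unit parcel size are all invented for illustration, and a real system would track far more than a 2D position.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Hypothetical names throughout: this is not the real World Grid.
typedef std::pair<int32_t, int32_t> CellId ;
typedef uint32_t PlayerId ;

static const double kCellSize = 100.0 ;   // assumed parcel width in world units

// Map a world position onto the parcel that contains it.
inline CellId CellFor(double x, double y)
{
    return CellId(static_cast<int32_t>(std::floor(x / kCellSize)),
                  static_cast<int32_t>(std::floor(y / kCellSize))) ;
}

class Grid
{
public:
    // Record a player's position; returns true if they entered a new parcel.
    bool Move(PlayerId player, double x, double y)
    {
        const CellId cell = CellFor(x, y) ;
        std::map<PlayerId, CellId>::iterator it = m_where.find(player) ;
        if ( it != m_where.end() )
        {
            if ( it->second == cell )
                return false ;              // still in the same parcel
            m_cells[it->second].erase(player) ;
            it->second = cell ;
        }
        else
        {
            m_where[player] = cell ;        // first time we have seen them
        }
        m_cells[cell].insert(player) ;
        return true ;
    }

    // Players sharing a parcel: the candidates for movement updates.
    const std::set<PlayerId>& PlayersIn(const CellId& cell)
    {
        return m_cells[cell] ;
    }

private:
    std::map<PlayerId, CellId> m_where ;
    std::map<CellId, std::set<PlayerId> > m_cells ;
} ;
```

Only players who share (or neighbour) a parcel need to hear about each other, which is what keeps the number of generated packets manageable.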

Written by coders who were not happy to cooperate, hodged and podged over 10 years, sprinkled with genuine pixie dust(*); neither component lends itself to maintenance, study or peace of mind.

They were developed for 1999 hardware and lovingly hand-optimized to run with magnificent efficiency on an old Pentium III 800. But you would really have to go out of your way to make them less hostile to a modern CPU like the dual quad-core 3GHz Xeons our servers currently run on.

For 7+ years, they have been the bane of my existence.

Multiple cores, why not CPUs?

Historically, multi-CPU motherboards were generally graded “server” (i.e. expensive). Now that multi-core is pretty much de facto, you’d expect to be seeing multi-CPU motherboards in the desktop/workstation grade.

But most dual-CPU motherboards still seem to be labelled “server”. The pricing is coming down, although it is still significantly higher than for single-CPU motherboards.

My hunch is that if desktop performance consumers spent $500 on a motherboard, $2000 on a pair of i7 Extremes and $500-$1500 on sufficient cooling, they’d be kinda upset to find performance about the same as a single CPU, and possibly even slower in many cases.

Uncovering why would damage Intel’s and AMD’s calm, and perhaps blow the lid on the shameful state of current-gen multicore CPUs. No lengthy explanation this time; let’s just say that I view the “Core i” gen CPUs as an alpha/beta market test.

ZeroMQ and scalability

The elegance of ZeroMQ messaging is that it provides easy scalability. The API for 0MQ sockets is the same whether you are doing in-process (inter thread) communication, inter-process communication, peer-to-peer network communication or multicast communication.

As long as you are putting data and not pointers in your messages (*cough* Async::Worker *cough*), you can convert your code from in-process communication to cross-network with a one-line change:


    // Create the socket: same code in either case.
    zmq::socket_t outSocket(zmqContext, ZMQ_REP) ;

    ...

    // In-process (inter-thread) endpoint; note the transport is spelled "inproc".
    outSocket.connect("inproc://my-connection") ;
    // becomes
    outSocket.connect("tcp://192.168.0.65:2342") ;

and voila – your application is networking instead of talking to itself.
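The “data and not pointers” condition is what makes that swap safe: a pointer means nothing to another process or machine. As a rough sketch of what a self-contained payload looks like, the snippet below copies a plain struct into a byte buffer and rebuilds it from the bytes alone. `MoveUpdate`, `Pack` and `Unpack` are hypothetical names, not part of ZeroMQ or Async::Worker, and the code assumes both ends share the same struct layout and byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical payload: plain data only, no pointers or references.
struct MoveUpdate
{
    uint32_t playerId ;
    float    x, y ;
} ;

// Copy the struct's bytes into a buffer that could be handed to a message.
inline std::vector<unsigned char> Pack(const MoveUpdate& update)
{
    std::vector<unsigned char> bytes(sizeof(update)) ;
    std::memcpy(&bytes[0], &update, sizeof(update)) ;
    return bytes ;
}

// Rebuild the struct on the receiving side from the bytes alone.
// Returns false if the buffer is not exactly one MoveUpdate long.
inline bool Unpack(const std::vector<unsigned char>& bytes, MoveUpdate& out)
{
    if ( bytes.size() != sizeof(out) )
        return false ;
    std::memcpy(&out, &bytes[0], sizeof(out)) ;
    return true ;
}
```

Because the receiver needs nothing but the bytes, it does not matter whether they arrived from another thread or another machine.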

The ZeroMQ folks also provide three external utilities that make it a doddle to scale your application across multiple machines.

Async::Worker: Parallelism with ZeroMQ

I’ve put the source to my Async::Worker system, documentation and examples here.

From the examples, how to offload batches of work for processing in parallel:

    int main(int argc, const char* const argv[])
    {
        static const size_t NumberOfElements = 20000000 ;
        static const size_t GroupSize = 8192 ;
        Numbers numbers ;
        numbers.resize(NumberOfElements) ;
        for ( size_t i = 0 ; i < NumberOfElements ; ++i )
        {
            numbers[i] = (rand() & 65535) + 1 ;
        }

        uint64_t parallelResult = 0 ;

        // Dispatch groups of numbers to workers.
        Numbers::iterator it = numbers.begin() ;
        do
        {
            // Clamp the group to what is left; advancing an iterator past end() is undefined.
            Numbers::iterator end = it + std::min<size_t>(GroupSize, numbers.end() - it) ;
            Async::Queue(new CrunchNumbersRange(it, end, &parallelResult)) ;
            it = end ;
        }
        while ( it != numbers.end() ) ;

        // Wait for all the results, calling Result() on each
        // returned object to produce a total.
        Async::GetResults() ;

        printf("Done. Calculated sum as %llu.\n", (unsigned long long)parallelResult) ;

        return 0 ;
    }
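For readers without the Async::Worker source to hand, the same split, dispatch and collect shape can be sketched with nothing but the C++11 standard library. To be clear, this swaps in `std::async` and is not how Async::Worker is implemented; `SumRange` and `ParallelSum` are names made up for this sketch. It also starts one task per group, which a proper worker pool would avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

typedef std::vector<uint64_t> Numbers ;

// Sum one group of numbers; runs on a thread started by std::async.
inline uint64_t SumRange(Numbers::const_iterator begin, Numbers::const_iterator end)
{
    return std::accumulate(begin, end, uint64_t(0)) ;
}

// Split the numbers into groups, sum each group asynchronously, then combine.
inline uint64_t ParallelSum(const Numbers& numbers, size_t groupSize)
{
    assert(groupSize > 0) ;
    std::vector<std::future<uint64_t> > pending ;

    // Dispatch groups of numbers as separate tasks.
    Numbers::const_iterator it = numbers.begin() ;
    while ( it != numbers.end() )
    {
        const size_t remaining = static_cast<size_t>(numbers.end() - it) ;
        Numbers::const_iterator end = it + std::min(groupSize, remaining) ;
        pending.push_back(std::async(std::launch::async, SumRange, it, end)) ;
        it = end ;
    }

    // Wait for every task and fold the partial sums into a total.
    uint64_t total = 0 ;
    for ( size_t i = 0 ; i < pending.size() ; ++i )
        total += pending[i].get() ;
    return total ;
}
```

Each task returns its partial sum rather than writing to a shared total, so no locking is needed while the results are combined.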

More on ZeroMQ

ZeroMQ is the messaging infrastructure I mentioned a little while back.

I’ve had a little opportunity to dabble with it now and, I have to say, I’ve taken to it. The interface is really nice and lean. It’s “core standard” too – it looks like sockets, it plays like sockets. It plays nicely with real sockets. The O/S can schedule around it like sockets – which is a huge boon on just about every OS running today.

And its frugality and minimalism help it achieve impressive performance: one of my (-O0) unit tests manages to pump an incredible 65,000 messages from one thread and back to the original thread in under 1 millisecond, running on a virtual Ubuntu 10.04 on a physical Core 2 Duo.

ZeroMQ

Ahwulf sent me a link to an interesting little message-passing library, ZeroMQ (0MQ). Lean, mean, frugal.

There are a host of language wrappers (C, C++, Lua, Perl, Python, Ruby, .NET etc) which makes it pretty handy for interop. A quick C++ example:

The problem with multiple CPU cores…

Computers are based on sequences of 1s and 0s; bits. By chaining these together, you can form a vocabulary of instructions from a sort of tree. E.g. the first bit is either ‘close’ (0) or ‘open’ (1), and the second bit is either ‘gate’ (0) or ‘door’ (1). So, 00 is ‘close gate’, 10 is ‘open gate’ and 11 is ‘open door’.
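That two-bit vocabulary can be written out as a tiny decoder. This is purely an illustration of the example above, with `Decode` being a made-up name; it also fills in the one combination the text skips, 01, which by the same rules reads ‘close door’.

```cpp
#include <cassert>
#include <string>

// Decode a two-bit "instruction": the first (high) bit picks the verb,
// the second (low) bit picks the noun, as in the example in the text.
inline std::string Decode(unsigned instruction)
{
    assert(instruction < 4) ;   // only two bits are meaningful
    const char* verb = (instruction & 2u) ? "open" : "close" ;
    const char* noun = (instruction & 1u) ? "door" : "gate" ;
    return std::string(verb) + " " + noun ;
}
```

A real CPU does the same kind of lookup in hardware, just with more bits and a much larger vocabulary.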

CPUs use fixed-size sequences of bits like this to represent their internal vocabulary; the result is called the instruction set. These machine instructions are usually incredibly simplistic, such as “add” or “divide”.

Typical computer programs are long sequences of these machine instructions which use math and algebra to achieve more complex goals. This is called “machine code”.

Very few programmers still work in machine code; we tend to work in more elaborate languages which allow us to express many machine code instructions with a single line of text, and in a slightly less mind-bending way. This is “program code”.

Think of it this way: bits are the letters of the alphabet; machine code instructions are the vocabulary or words of the language. Computer languages are the grammar, dialect and syntax one uses in order to communicate an idea via the computer.

At first, CPUs got faster, so the time each word took to process got shorter, making programs run faster.

Then that stopped and the manufacturers started slapping in more CPU cores.

But more CPUs do not equal more speed. In fact, they suffer from chefs-in-the-kitchen syndrome…

Technology Review Agrees

Multicore Processors create software headaches.

I guess my recent post was timely :)

Threading Building Blocks intro

I put together a short video walkthru of creating a skeleton Visual Studio 2008 project with Intel’s Threading Building Blocks. It introduces the basics of using Threading Building Blocks and compares the performance of serial and parallel sort.
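For a taste of the serial-versus-parallel comparison without installing anything, here is a rough standard-library sketch. It is not TBB and not the code from the video: TBB’s own `tbb::parallel_sort` splits the work far more cleverly, whereas this hypothetical `TwoThreadSort` just sorts each half on its own thread and merges the results.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Plain serial sort of a sub-range; used by both threads below.
inline void SortRange(std::vector<int>::iterator begin, std::vector<int>::iterator end)
{
    std::sort(begin, end) ;
}

// Sort each half on its own thread, then merge the two sorted halves.
inline void TwoThreadSort(std::vector<int>& values)
{
    std::vector<int>::iterator middle = values.begin() + values.size() / 2 ;
    std::thread lower(SortRange, values.begin(), middle) ;   // lower half
    SortRange(middle, values.end()) ;                        // upper half, on this thread
    lower.join() ;
    std::inplace_merge(values.begin(), middle, values.end()) ;
}
```

The two halves never overlap, so the threads need no locking; the merge at the end is the serial part that limits how much faster this can ever be.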