Going Parallel

I’ve been looking at OpenMP and Thread Building Blocks. I’ve actually managed to apply OpenMP to one part of the client, which seemed to reduce the “chug” as you move at high speed between large areas of terrain but didn’t otherwise impact performance (maybe at spawning, I dunno), and to a few parts of the cell host. There I reorganized the order in which the host prepares vehicles for their world update; the initial part is quite thread-safe and boils down to a simple culling algorithm. On the live servers that should significantly reduce the time the world-update dispatch takes, meaning that the data sent to clients is “fresher” by 5-10 ms under busy load.

My initial understanding of both systems was fairly hazy – I’m trying to run before I can walk, but then the documentation for both starts at a sprint.

The worst mistake I made was misunderstanding their focus, which is algorithmic parallelization.

OpenMP is not well suited to creating worker tasks. For instance, if you try to do:

static void _realMain() ;

int main(int argc, char* argv[])
{
  #pragma omp parallel
  #pragma omp master
  _realMain() ;
}

static void _realMain()
{
  /* do some stuff */
  #pragma omp for
  for ( ... )
  {
    /* do some more stuff */
  }

  while ( running )
  {
    #pragma omp task
    commitDatabaseStuff() ;  // Can do this in the background.

    updateNetwork() ;

    std::vector<Players*> consumers = GetUpdateConsumers() ;
    #pragma omp for schedule(static, 1)
    for ( auto it = consumers.begin() ; it != consumers.end() ; ++it )
    {
      if ( (*it)->MeetsSomeCriteria() == false )
        *it = NULL ;            // Culled: skipped by the serial pass below.
      else
        (*it)->DoSomeWork() ;
    }

    // Generate and dispatch updates to the remaining fools.
    // Done serially on the master thread.
    for ( auto it = consumers.begin() ; it != consumers.end() ; ++it )
    {
      if ( (*it) != NULL )
      {
        (*it)->GenerateUpdate() ;
        if ( !(*it)->UsingNetcode2() || !(*it)->DispatchNetcode2Update() )
          (*it)->DispatchUpdate() ;
        /* Do some more work */
      }
    }
  }
}
The code will work, but you will find some of your CPUs/cores spinning on 100% usage.

Why? Because they are in a spinlock waiting for work.
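If you do want to keep that structure, most OpenMP runtimes at least let you trade the spin for a sleep via environment variables. A hedged sketch: OMP_WAIT_POLICY is the standard OpenMP 3.0 knob, KMP_BLOCKTIME is specific to Intel’s runtime, and the server binary name here is made up.

```shell
export OMP_WAIT_POLICY=PASSIVE   # standard: let idle threads sleep, not spin
export KMP_BLOCKTIME=0           # Intel runtime only: yield immediately when idle
# ./cellhost                     # (hypothetical server binary)
```

This doesn’t make OpenMP a good task system, but it stops the idle team from pegging your cores.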

I’d looked briefly at Intel’s Thread Building Blocks before, but it seemed to require the Intel Compiler, and we aren’t using the Intel Compiler on Mac or Linux yet.

Then it fully registered on me that TBB is Open Source and gaining fairly wide support.

Sure enough, several Linuxes have packages for installing TBB, and when I used the Ubuntu GCC 4.4 rather than my own roll (which took for freaking ever to build!) #include <tbb/tbb.h> worked without further ado.

But TBB’s documentation is kinda chunky. So I ordered the O’Reilly book in PDF format.

Disappointing :( For a large part it is an O’Reilly-formatted version of the free, online documentation PDFs. And it’s out of date, so it’s possible that the differences between it and the online docs are purely that the online docs are more recent… I’d guess that 80-90% of the book is covered, more accurately, by the online docs. And the online docs are kinda hectic.

So my current level of understanding of both is limited to optimizing CPU-intensive loops and algorithms. The “Threading” in Thread Building Blocks leads me to believe that somewhere in there ought to be some kind of potential for managing, well, threads: worker pools, etc. But thus far I’ve been unable to find any documentation, code or examples that address the typical use of these in long-duration applications (such as servers): the sparse, periodic hand-off of resource-specific, asynchronous workloads. TBB still seems to suffer from “give me work or I’ll try idling” (and ‘try idling’ = CPU consumption).

So I’m waiting to see if my post on the TBB forums gets any replies: maybe there is already some existing pattern for sparse workload dispatching, which works alongside algorithmic parallelism when there are workers available.


“Then it fully registered on me that TBB is Open Source”

wtf?? They are selling their open-source package without saying it’s also open-source? Or do they just conveniently pack it in with the compiler and not actually sell it?

There are some differences to the commercial version. But you get the commercial version anyway when you buy ICC or Parallel Studio…

See “Intel: Milking the cow at both ends”, mate.

So yeah, when you buy a TBB COM license, you’re buying support and access to the commercial development releases. If you don’t have the COM license, you get access to the “Commercial Aligned” OSS releases (i.e. not development ones, in theory, although in practice they seem to release both).

And when you buy their compiler products, most of them include a TBB COM license.

That doesn’t explain why they have a separate “Buy TBB” option. I guess we might want a TBB COM license if we wanted support for using TBB with GCC on our Linux systems, without buying an ICC license. But given the cost, I think I’d just go ahead and ask Playnet to buy a Linux ICC license and be done with it :)

“See “Intel: Milking the cow at both ends”, mate.” Indeed.

Which makes me think of the Intel/AMD lawsuit: since the dispatch code generated by icc apparently (at least until recently) didn’t use optimized paths for AMD processors, have you benchmarked the code on an AMD machine? Maybe all the AMD guys will raise hell if their fps go down when the Intel ones go up…

We’ve been using ICC on the client for about 2 years.

The ICC’s baseline code path is significantly better than that generated by Visual Studio.

The AMD stuff only affects alternate code paths: Windows /Qax, Linux -ax.

That is – it would make the AMD CPUs run the default code path rather than branching into, say, an SSE2-optimized alternate code path.

But if you built the code with an SSE2 baseline (-xSSE2) then both Intel and AMD cpus would run the SSE2 code.
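Concretely, the -x / -ax distinction looks like this (Linux ICC 11.x syntax; the source file name is hypothetical):

```shell
# -x sets the *baseline* ISA: every CPU that runs the binary, Intel or AMD,
# executes the SSE2 code (and the binary won't run on pre-SSE2 CPUs at all).
icc -xSSE2 -c update.cpp

# -ax adds an *alternate* SSE2 path on top of a generic baseline; the runtime
# dispatcher historically routed only Intel CPUs onto the alternate path.
icc -axSSE2 -c update.cpp
```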

In 11.x, at least, they have two options:

– Build alternate code paths,
– Build Intel specific optimization paths.

Your TBB forum post sounds like a case for tasks, task_scheduler and some concurrent collection of database handles. It may not be that much nicer than manually creating pthreads, though.

Agreed, except that as I introduce TBB functionality elsewhere in the code, I expect it would integrate better. Plus I’m hoping that someone else has already implemented something like this that I could leverage. If not, I think that the templates I threw into the post will probably be a good basis for a generic resource-oriented system.

Trackbacks and Pingbacks

[…] intel, parallelism, threading The answer to my question about Thread Building Blocks’ usefulness for non-algorithmic parallelization (i.e., in a word, threading ;) was tucked away in a Change Log (it seems that TBB could use a […]
