The problem with multiple CPU cores…

Computers are based on sequences of 1s and 0s: bits. By chaining these together, you can form a vocabulary of instructions from a sort of tree. For example, the first bit is either ‘close’ (0) or ‘open’ (1), and the second bit is either ‘gate’ (0) or ‘door’ (1). So 00 is ‘close gate’, 10 is ‘open gate’ and 11 is ‘open door’.
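To make that toy vocabulary concrete, here’s a minimal sketch in C++ (purely illustrative; no real instruction set is anywhere near this small) that decodes all four possible two-bit instructions:

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>
#include <string>

// Decode the toy two-bit vocabulary described above:
// the high bit picks the verb (0 = close, 1 = open),
// the low bit picks the object (0 = gate, 1 = door).
std::string decode(std::uint8_t instruction) {
    std::string verb   = (instruction & 0b10) ? "open" : "close";
    std::string object = (instruction & 0b01) ? "door" : "gate";
    return verb + " " + object;
}

int main() {
    for (std::uint8_t op = 0; op < 4; ++op)
        std::cout << std::bitset<2>(op) << " -> " << decode(op) << '\n';
}
```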

CPUs use fixed-size sequences of bits to represent their internal vocabulary in exactly this way; the result is called the instruction set. These machine instructions are usually incredibly simple, such as “add” or “divide”.

Typical computer programs are long sequences of these machine instructions which use math and algebra to achieve more complex goals. This is called “machine code”.

Very few programmers still work in machine code; we tend to work in more elaborate languages which allow us to express many machine code instructions with a single line of text, and in a slightly less mind-bending way. This is “program code”.
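As a rough illustration of the difference (the assembly in the comment is a generic sketch, not the actual output of any particular compiler or CPU), a single line of program code hides several machine instructions:

```cpp
// One line of program code...
int total_cost(int price, int quantity, int tax) {
    return price * quantity + tax;
}
// ...might become several machine code instructions, something along
// the lines of:
//
//   mul  r0, r0, r1   ; price * quantity
//   add  r0, r0, r2   ; ... + tax
//   ret               ; hand the result back to the caller
```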

Think of it this way: bits are letters of the alphabet; machine code instructions are the vocabulary or words of the language. Computer languages are the grammar, dialect and syntax one uses in order to communicate an idea via the computer.

At first, CPUs got faster, so the time each word took to process got shorter, making programs run faster.

Then that stopped and the manufacturers started slapping in more CPU cores.

But more CPUs do not equal more speed. In fact, the approach suffers from chefs-in-the-kitchen syndrome…

When you start a program on a modern, multi-tasking operating system, it seems like that program is running quite independently of any outside interference.

In actuality, your computer is only really running one single program – the operating system. It loads your program’s code into memory, points the CPU at the start of the program, and lets the CPU run those instructions for a little bit. After a brief period of time (typically a few milliseconds), the CPU jumps back into the operating system code.

This allows the operating system to ensure that all the programs it has started, and things like disk activity etc, get a fair shot at the CPU.
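As a cartoon of that juggling act (a real scheduler also weighs priorities, I/O waits, per-core run queues and much more, so treat this C++ sketch purely as the shape of the idea):

```cpp
#include <deque>
#include <iostream>
#include <string>

// A cartoon of round-robin time-slicing: run a program for one slice,
// then put it at the back of the queue and pick the next one.
struct Program {
    std::string name;
    int slices_left;   // how many time slices of work remain
};

int main() {
    std::deque<Program> run_queue = {
        {"editor", 2}, {"browser", 3}, {"mail checker", 1}
    };

    while (!run_queue.empty()) {
        Program current = run_queue.front();   // pick whoever is next
        run_queue.pop_front();

        std::cout << "running " << current.name << " for one time slice\n";

        if (--current.slices_left > 0)
            run_queue.push_back(current);      // not finished: back of the line
    }
}
```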

All of this operating system overhead is done in terms of program code.

If programs could just offload themselves – or chunks of their work – to another CPU/core, that would mess with the operating system’s scheduling. Only the operating system can assign work to specific CPUs/cores; if anything else could, the offloaded code would effectively be going rogue, given the way current architectures work.

You really don’t want a system where a piece of code can usurp the operating system and take control of the CPU for itself.

To offload work, a program must summon the operating system and ask it to create a thread, which will be transferred to and executed on a different CPU core (hopefully; if none is available, the operating system will just add it to the scheduling queue).
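In program code the request looks deceptively small. Here’s a minimal C++11 sketch; on most platforms that one std::thread line ends up going through something like pthread_create and, beneath that, the kernel itself:

```cpp
#include <iostream>
#include <thread>

// The work we want to offload to another core.
void offloaded_work() {
    std::cout << "hello from whichever core the OS chose\n";
}

int main() {
    // One innocuous-looking line of program code, but underneath it the
    // runtime asks the operating system to create, schedule and eventually
    // tear down a whole new thread.
    std::thread worker(offloaded_work);
    worker.join();   // wait for the operating system to run it to completion
}
```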

The work involved in this is significant. It’s not a few handy machine instructions. It’s more like several pages of long paragraphs.

That sets an entry barrier to the types of work that are worth doing in parallel. Anything under a few thousand CPU instructions isn’t really worth it, which results in the vast majority of code going unparallelized: your CPU cores sit there idle.
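You can get a feel for that threshold with a crude measurement like the sketch below (the numbers vary wildly between machines and operating systems, so treat it as a demonstration of the gap rather than a benchmark): it times a tiny task done inline against the same task handed to a freshly created thread.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

volatile long sink = 0;   // volatile so the tiny workload isn't optimised away

void tiny_task() {
    for (int i = 0; i < 1000; ++i)
        sink = sink + i;  // roughly "a few thousand instructions" of work
}

int main() {
    using namespace std::chrono;

    auto t0 = steady_clock::now();
    tiny_task();                         // do the work inline
    auto t1 = steady_clock::now();

    std::thread worker(tiny_task);       // hand the same work to a new thread
    worker.join();
    auto t2 = steady_clock::now();

    std::cout << "inline:   " << duration_cast<nanoseconds>(t1 - t0).count() << " ns\n"
              << "threaded: " << duration_cast<nanoseconds>(t2 - t1).count() << " ns\n";
    // Expect the threaded version to be dramatically slower: the cost of
    // creating and joining the thread dwarfs the work itself.
}
```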

In fact, if you watch your CPU cores on most modern operating systems, even when the system is multitasking, almost all the work continues to happen on the primary core, because most of the tasks going on just aren’t heavyweight enough to cross that threshold.

It’s a bit like the difference between “Dear Santa, I’d like a bike this year, luv Oli” and having to file a requisition form for it. Asking for code to be executed independently of your main thread of work requires paragraphs, pages even, of machine instructions, no matter how convenient it is to express in your programming language.

The only way around this obstacle to better, more efficient use of multiple CPU cores is for one or more of the manufacturers to give operating systems hardware support (i.e. on-chip machine instructions) for these facilities.

Perhaps, for desktop systems etc, they might want to look at some kind of FPGA layer on their chips so that the scheduling systems of the OS/kernel can be converted into pseudo-machine code.

7 Comments

Nice explanation Oli.

Think even our sixth-form ICT lot might understand it! ;)

So basically it’s like the government: each request has to go through channels and then be delegated.

Tuure Laurinolli May 10, 2010 at 4:35 pm

Indeed it’s difficult to split operations that are already “small” across cores.

How about larger, more independent tasks; don’t you have any of those? For example, handling tasks related to each player in a separate thread and letting the OS schedule those. That specifically is maybe unrealistic, but derivatives of it aren’t.

Well, I wasn’t specifically talking about WWIIOL there, but programming in general; you have to find those larger tasks in order to make use of multiple cores. And if your program’s workload doesn’t involve lots of large, CPU-greedy payloads, like a game client or a physics simulation etc, you can fall into variations of the same delegation trap.

As I said, all the paraphernalia of concurrency – signals, mutexes and the like – still goes through the operating system at the program-code level; there’s no hardware support for any of it. Windows actually seems to have the edge on the performance of its parallelization tools at the moment, although its kernel is bulkier, which drags its overall performance down. Linux offers software “futex”es, which are basically spinlocks, so if used badly they consume a lot of CPU.
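To illustrate the spinning-versus-sleeping distinction (just a sketch in C++11 terms, not the futex implementation and nothing WWIIOL-specific): a naive spinlock burns CPU while it waits, whereas a mutex can ask the kernel to put the waiting thread to sleep.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// A naive spinlock: a waiting thread just burns CPU until the flag clears.
std::atomic_flag spin = ATOMIC_FLAG_INIT;
void spin_lock()   { while (spin.test_and_set(std::memory_order_acquire)) { /* spin, eating CPU */ } }
void spin_unlock() { spin.clear(std::memory_order_release); }

// A std::mutex, by contrast, can ask the kernel to put a waiting thread to
// sleep (on Linux this typically sits on top of the futex facility).
std::mutex mtx;

long spin_counter = 0, mutex_counter = 0;

void spin_worker() {
    for (int i = 0; i < 100000; ++i) { spin_lock(); ++spin_counter; spin_unlock(); }
}

void mutex_worker() {
    for (int i = 0; i < 100000; ++i) { std::lock_guard<std::mutex> guard(mtx); ++mutex_counter; }
}

int main() {
    std::thread a(spin_worker), b(spin_worker);    // contend on the spinlock
    std::thread c(mutex_worker), d(mutex_worker);  // contend on the mutex
    a.join(); b.join(); c.join(); d.join();
}
```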

WWIIOL uses worker tasks to offload work that is definitely asynchronous. But in order to support a large amount of user concurrency, almost all of our workloads are very atomic in the first place – generally well below the threshold at which current parallelization pays off.

Many other server systems struggle with the same predicament: consider mail and web servers. Most of their processing time is spent shuffling IO, so they don’t really get to take much advantage of multiple cores directly; they generally obtain their multi-core efficiency by forking the entire process, which creates its own parallelization conundrums and performance drains.
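For reference, the “fork the entire process” pattern looks roughly like this bare-bones POSIX sketch (the port number and every other detail are arbitrary, and error handling is omitted):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Each connection gets a full copy of the process, which the OS is then
// free to run on another core.
int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(8080);               // arbitrary example port
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 16);

    for (;;) {
        int client = accept(listener, nullptr, nullptr);

        if (fork() == 0) {          // child process handles this connection
            close(listener);
            // ... read requests from 'client' and write replies here ...
            close(client);
            _exit(0);
        }

        close(client);              // parent just goes back to accepting
        while (waitpid(-1, nullptr, WNOHANG) > 0) {}   // reap finished children
    }
}
```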

What we really need for mundane programming efficiency – for everything from word processors to web browsers to desktop OS performance – is to move the concept of a thread into the CPU through some kind of ASIC core/layer, which would allow the operating system to reduce its workload and would also allow software to use a machine instruction to request parallelization of small, non-atomic workloads.

I guess my concept was that a quad-core machine would be putting processes on different cores so that no single core would have too much load on it. For example, right now my task manager is showing 56 processes running. Now, I don’t necessarily think the operating system would split things up so that there are 14 processes on each core, but I would think (or hope!) that maybe most of the OS is on one core, the anti-virus stuff is on another core, and Internet Explorer, which I’m currently using to post this message, would be on yet another core.

To me the biggest advantage of a multi-core system is that when you load a large program like WWIIOL, it gets its own core, and all the other stuff gets moved to the other cores. Or am I just living in Fantasyland?

I think the overhead results in OSes being lazy about scheduling tasks across CPUs until load really calls for it; otherwise, parallelized applications would run into resource shortages: say, your video compressor coming along and kicking up 8 threads expecting a fairly even load distribution.

But again, that just raises the question: why the heck isn’t there a machine instruction for trans-coring, etc.?

Remember: the absence of CPU instructions or extensions means that the operating system still has to use multiple sequences of CPU instructions to determine the right configuration, which denies the potential for real-time micro-parallelization that would benefit overall general compute performance.

I have another theory on “chefs syndrome”: it simply does not exist. Even after all these years of advancement, the plethora of cores is still stuck at what silicon allows for. The companies manufacturing these processors understand this; it is simply a capitalist market. Unless you can benchmark every consecutive technology released thus far, chances are that either the software is faulty at the instruction and abstraction level or the hardware itself is faulty. I take dibs on it being the hardware, both the processors and the manufacturers, and it has little to do with dropping costs and low-spec eco-friendly designs.
