Code Quiz: Efficiency

For a given problem, which of the following fragments of code are more efficient and why?

1. Constrain a number to the range 0-31

Fragment A:
constrained = number % 32 ;

Fragment B:
constrained = number & 31 ;

2. Concat two strings and pass them to a function:

Fragment A:
std::string message = “Hello ” ;
message += “world” ;
function(message) ;

Fragment B:
std::string message = “Hello ” ;
function(message + “world”) ;

3. Array/vector iteration over 10 integer values

Fragment A:
int sum(int* array, int size)
{
  int sum = 0 ;
  for ( int i = 0 ; i < size ; ++i )
  {
    sum += array[i] ;
  }
  return sum ;
}

Fragment B:
typedef std::vector<int> ARRAY ;
int sum(ARRAY& array)
{
  int sum = 0 ;
  for ( ARRAY::iterator it = array.begin() ; it != array.end() ; ++it )
  {
    sum += *it ;
  }
  return sum ;
}

Fragment C:
typedef std::vector<int> ARRAY ;
int sum(ARRAY& array)
{
  int sum = 0 ;
  for ( UINT32 i = array.size(); i > 0 ;  )
  {
    sum += array[–i] ;
  }
  return sum ;
}

Fragment D:
typedef std::vector<int> ARRAY ;
int sum(ARRAY& array)
{
  int sum = 0 ;
  for ( UINT32 i = 0 ; i < array.size() ; ++i )
  {
    sum += array[i] ;
  }
  return sum ;
}

Fragment E:
typedef std::vector<int> ARRAY ;
int sum(ARRAY& array)
{
  int sum = 0 ;
  ARRAY::iterator it = array.begin() ;
  for ( UINT32 i = array.size ; i > 0 ; ++it, –i )
  {
    sum += *it ;
  }
  return sum ;
}

19 Comments

I really need to retire and focus on learning to program :)

1.

AND is fast, and even though modulo is just another instruction, and it thus all depends on the CPU, I doubt it would be *faster* than AND.

2.

This is so C++-specific that I can’t tell what the difference really is. Is the character-array-storage for stack-allocated strings allocated on stack too?

In that case B might be faster by a relevan amount, since it would save one copying of the string, since the concatenation would be evaluated as a parameter and stored directly in the new stack frame, instead of being stored in the current stack frame, and then copied to the new one upon call of function.

3.

A, all the others involve method calls hidden behind operators for range checking on every iteration, plus possibly other overhead related to the structure.

1.
Fragment A requires dividing two numbers and finding the remainder. Fragment B requires finding the result of a bitwise AND on two numbers. The bitwise operation should be much faster than the divide. Fragment B should be more efficient.

True if the code is not compiled with optimizations, Tazle, what if it’s compiled with optimizations?

Compiler optimization in which case?

For the first, I would say the compiler either converts the modulo into an AND, since the right side is constant, or doesn’t and it will be the same as without optimizations.

For the second, A might become faster with optimizations, if the compiler notices that message is actually constant (in which case there would only be copying message from some pool to stack at runtime). B could probably be optimized in the same fashion, but I’m not sure if that’s done in practice, or if the language specification has something in it that forbids such optimization.

For the third, I think the first would still be fastest. The method calls for checking the length in B and D cannot be optimized away(except perhaps in code without threads), because the length might change at any time.

The C and E fragments should have roughly the same speed as A though, if the compiler inlines dereference and increment operators of the iterator.

Then again, the code in A would probably still be the easiest to vectorize, if that kind of optimizations are used.

Bah, this is why i don’t worry about optimizations of speed :).

Stick to the simple things in life, “For loops, If/Then, Do/While and the occasional Array”, which i’am told i shouldn’t use ;).

Then spend the next six days fixing all the bugs that arise from it :D

Hehe, yes, I was mostly thinking of the 3rd one, Tazle sorry :)

A lot of people misread the STL documentation and are then surprized when some vector operations turn out not to be as efficient (and actually far less efficient) than simple array operations. The only way to make a variable-length vector array access like that as efficient as the array equivalent is


int sum(const ARRAY& array)
{
 int sum = 0 ;
 const size_t size = array.size() ;
 for ( int i = 0 ; i < size ; ++i )
 {
  sum += array[i] ;
 }
 return sum ;
}

It *is* as fast as Fragment A when compiled with optimizations, and possibly even faster (depending on the calling convention used and registers in play).

I’m not so sure about using a for_each function, because the code and optimization varies.

And yes, that’s sort of cheating because it wasn’t an option, but I didn’t want to give it away in the quiz itself :)

1) B, because AND operations are the fastest operation possible, although you might have a compiler that will optimize % 32.

2) B, because the line ‘message += “world” ;’ causes the message to be called concatenated then stored back in memory, then recalled again for the function. While B will keep the concatenated string in the CPU register.

3) E,

A sucks because you have to pass in size and it references by array[i] which requires you to find the proper memory location starting from the array pointer.
B sucks because it calls end() and begin() each time you iterate through the loop
C sucks because it doesnt work ‘sum += array[–i] ;’ should be ‘sum += array[–-i] ;’ . Also indexes by array indead instead of iterating the pointer.
D sucsk because size() is called each time you iterate as well as not indexing by pointer.
E wins because it iterates the pointer and doesn’t call a function to get array information on each iteration.

==========
Posted this before I read the comments. How’d I do?

Regarding Q2: (hello world)

This is far more tricky than it looks :)

Fragment A is often considered the more efficient, because its cheaper to copy a few extra bytes into a string buffer than the create a new string from scratch, which is what Fragment B does silently.

The twist, though, is that in FragA the string buffer for message is only big enough to house “Hello”. When you add “world” it probably has to reallocate the memory, copy “Hello” into the new string and then concat “world” – the same workload as FragB has.

Of course, there is another element – but not a factor – that in Fragment B, both “message” and the temporary, concatenated string co-exist so heap, memory or stack usage will be increased – thus FragB is definitely less preferable if recursion may occur.

gn: 3C is actually correct code, the wordpress formatter just eats the first decrement symbol.

With the correct limit operation (i != 0 perhaps) it may actually be the most efficient, in some environments and with optimization, because it would produce very branch-predictor friendly code, and testing for zero is often more efficient than testing for a limit value, meaning you only ever access the limit value one – and that’s in the assignment. (Note that I only aimed for equal efficiency to the Frag A in my previous example)

B is very close to being efficient, what it lacks is const cast on the ARRAY& argument, which would provide most compilers with a hint that begin() and end() are constant, and so only need be accessed once. You could assert this (rather than randomly hoping for it) by modifying it to be:

int sum(const ARRAY& array)
{
 int sum = 0 ;
 ARRAY::const_iterator it = array.begin() ;
 const ARRAY::const_iterator end = array.end() ;
 for ( /* no inits */ ; it != end ; ++it )
 {
 sum += *it ;
 }
 return sum ;
}

The std::vector&[i]s aren’t as expensive as you think, either; in a debug build they’ll be very expensive but provide the advantage of bounds checking. In an optimized release build they usually do no bounds checking at all, so the final code is nothing more than loading the address of the vectors buffer into a register and adding the relevant offset.

The most efficient form of this loop ought to be:

int sum(const int* begin, const int* end)
{
 int sum = 0 ;
 while ( begin < end )
  sum += *(begin++) ;
 return sum ;
}

Some compiler scenarios don’t correctly optimize begin++ though, and you wind up needing to provide the compiler with additional assitance in recognizing the array striding:

int sum(const int* begin, const int* end)
{
 int sum = 0 ;
 static const int INT_STRIDE = sizeof(int) ;
 while ( begin <= end )
 {
 sum += *begin ;
 begin = (int*)((char*)begin + INT_STRIDE) ;
 }
 return sum ;
}

Using <= means you can still call the function with sum(&array.begin(), &array.end());

kfs: hehe figured C was a something like that, I just felt like being a wise ass :)

I’m looking over your two example best case code samples…
The first one I understand. The two parameters are pointers to the begining and end of the array. So when you iterate begin it will eventually equal then pass end, at which point you exit.

The second one eludes me:
Why do you have begin cast to a char pointer then casted back to int?

Some other questions: what does consts’ing params do for optimizaiton? What about static consts for INT_STRIDE?

This is why I don’t code for a living. :) Then again, most of the developers that I have to work with never think about this kind of stuff too. It’s let the compiler do the optimization or let the database interepret it and convert it into a more efficient query.

gn: because if you say “begin += 4” it will increment by the address of 4 integers. However, several compilers only see the optimization potential after various loop striding and unrolling optimizations have already been covered.

In most cases the first version of the while loop is sufficient, but in some compilers (notably VC6 and I forget which point gcc 3 picked up this optimization but it was late, along with a number of others) its best to be explicit about it if you want the extra CPU cycles out of it.

The more entries and more often you’re dealing with this code, the more relevant it becomes when saving 1000 cpu cycles a second means servicing an extra query/second :)

I understand that this is a workaround to old compiler optimization. However, I dont see why you casted to char* then back to int* ?

Basic pointer stuff.

KFS1: because if you say “begin += 4″ it will increment by the address of 4 integers.

“begin++” will increment begin by sizeof(*begin) .. ie sizeof(int) bytes in memory .. ie 4 bytes.

But, the point is it doesn’t optimize correctly so I have to explicitly tell the compiler how to advance the pointer.

“begin += sizeof(int)” or “begin += 4” will increment begin by 4*sizeof(int) bytes in memory (i.e. 4 integers, assuming int is 4 bytes).

So since what I want to do is increment it by 4 bytes, I cast it a byte-sized pointer type, but since I want to use it as an int* pointer, I have to cast it back to an int* before assigning it.

ok I see how you’ve written it now. Let me see if I understand this right…

begin = (int*)((char*)begin + INT_STRIDE) ;

will increment begin, an 8 bit pointer, by 4

and

sum += *(begin++) ;

will increment begin, an 64 bit pointer, by 1

Now, the question I have is where is the optimization in 8 bit pointers incremented by 4 versions 64 bit pointers by 1?

So basically, what is the optimization?

Has to do with loop unrolling and instruction pipelining. An example of an unrolled loop is:


for ( i = 0 ; i &lt 16 ; i += 4) {
 sum += array[i] ;
 sum += array[i+1] ;
 sum += array[i+2] ;
 sum += array[i+3] ;
}

But some older – but still current – compilers don’t realize they can perform this optimization with the while … sum += *(begin++) syntax.

I know that doesn’t look like it should be more efficient, and that’s not the perfect example (wiki loop unrolling) but it creates a pipelineable code framework that is pretty efficient where the equivalent simple for loop would actually be executed more slowly.

Ok I think I get it, thanks for posting. I’ll go lookup loop unrolling.

I’ve read the wiki and I understand what you are doing now. The loop unrolled will cycle through the loop less, cause less conditional statements to be executed which allows better pre-processing, yes?

Next question (grin): what does const’ing your function parameters do for optimizaiton?

It’s more a case of ensuring it will perform certain optimizations. GCC prior to 3.23 and VS 6 and 7 won’t always correctly unroll loops or organize code as efficiently: immutable variables allow certain register and code-ordering optimizations. The const “hint” basically ensures that the compiler sees all of those opportunities.

I’ll see if I can dig up the thing I wrote a few years ago about how consting one-time variables, e.g.


int i = strlen(foo);
if ( i > blah )
foo[i] = 0;

as part of a larger function will generate different code on some compilers than


const int i = strlen(foo);
if ( i > blah )
foo[i] = 0;

(this only applies to C++ where you can declare variables at any point)

Doesn’t matter with VS2k5 or GCC 4.

Leave a Reply

Name and email address are required. Your email address will not be published.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

You may use these HTML tags and attributes:

<a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <pre> <q cite=""> <s> <strike> <strong> 

%d bloggers like this: