Be mindful of your wide strings, young padawan

So, you think wchar_t is a type now: the wide character type. Visual Studio and GCC both support it and, aside from the fact that the ISO C/C++ standards Microsoft follows seem doomed to make cross-platform development more difficult, they have fairly consistent manipulation functions.

But the devil is in the details.

The GCC implementation of wchar_t on Linux and Mac is a 4-byte value. Microsoft’s definition of wchar_t is an unsigned short, i.e. a 2-byte value. The Microsoft implementation is pretty handy and, at the moment, is probably my own preference. Plain char strings have a trick of their own, though, because of the way UTF-8 uses high bits to determine character length: if the first bit of the first byte is 0, you can assume the first character is one byte long, containing an unsigned character value in the range 0x00 to 0x7f, perfect for the standard 7-bit ASCII codepage. Every byte of a multi-byte UTF-8 sequence has a 1 in its highest bit.

The beauty of this is that all those standard C string functions will work quite happily with UTF-8 strings: although they will misunderstand what the non-ASCII bytes mean, they can tell when they have reached the terminating zero byte without problems.
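To make the high-bit trick concrete, here is a minimal sketch, assuming the string is well-formed UTF-8 (the byte-oriented Unicode encoding) held in ordinary chars. The helper name `utf8_length` is made up for illustration. It counts characters by skipping continuation bytes, which always have the bit pattern 10xxxxxx, while plain `strlen` keeps counting bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a UTF-8 string.
   Assumes the input is well-formed UTF-8; it does no validation. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx. Every other byte
           (plain ASCII or a lead byte) starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string `"h\xC3\xA9llo"` (an e-acute encoded as two bytes), `strlen` reports 6 bytes while `utf8_length` reports 5 characters. Both stop at the same terminating zero byte, because no byte of a multi-byte sequence is ever zero.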

Take a quick look at this simple code segment:

    // Measure the size of a wide character.
    static const unsigned int WIDE_SIZE = sizeof(wchar_t) ;
    wprintf(L"wchar_t size = %u\n", WIDE_SIZE) ;

    // "pling" is a Unicode string with some ASCII characters in it.
    wchar_t pling[] = L"a 'wide' string" ;

    // Let's look at pling as though it /were/ regular
    // chars and see what it looks like.
    const unsigned char* ptr = (const unsigned char*)pling ;
    // Have to use != instead of LESS THAN or blog gets confused.
    for ( unsigned int i = 0 ; i != WIDE_SIZE ; ++i, ++ptr )
        wprintf(L"byte %u of pling has value %u\n", i, (unsigned int)*ptr) ;

    // For our last trick, let's see what "wide printf" does if we
    // ask it to print our string through "%s".
    wprintf(L"pling via %%s: %s\n", pling) ;

Under Windows the output is like this:

  wchar_t size = 2
  byte 0 of pling has value 97
  byte 1 of pling has value 0
  pling via %s: a 'wide' string

But under GCC on Linux or Mac it looks like this:

  wchar_t size = 4
  byte 0 of pling has value 97
  byte 1 of pling has value 0
  byte 2 of pling has value 0
  byte 3 of pling has value 0
  pling via %s: a

Houston, we have a problem. Compare the last line of each output.

Microsoft’s wprintf treats “%s” as a wide string: in its wide printf family, “%s” expects a wchar_t*, so the whole string prints. GNU’s wprintf follows the C standard, where “%s” means a char* string, so it does what you told it to do and goes ahead and processes the argument as individual char bytes. And the second byte of the string is a zero.

To get GNU wprintf to do the right thing, you need to use a modifier: “l” (lowercase L). The good news is that this is portable:

    wprintf(L"pling via %%ls: %ls\n", pling) ;


  pling via %ls: a 'wide' string

Across the board.
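The same modifier applies anywhere in the wide printf family. As a rough sketch, assuming a C99-style swprintf that takes a buffer size (older Microsoft runtimes shipped a variant without that argument), a wide string can be formatted into a buffer like this. The helper name `format_chat_line` and the "chat: " prefix are invented for the example.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: writes "chat: <msg>" into out.
   Returns the number of wide characters written, or a negative
   value if the result does not fit in out_len characters. */
int format_chat_line(wchar_t *out, size_t out_len, const wchar_t *msg)
{
    /* %ls asks for a wide string explicitly. Under the C standard,
       a bare %s here would mean a char* multibyte string instead. */
    return swprintf(out, out_len, L"chat: %ls", msg);
}
```

Called with a 64-element buffer and `L"a 'wide' string"`, this leaves `L"chat: a 'wide' string"` in the buffer and returns 21.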

Ramp has done a pretty sterling job of localizing the client, but the Mac chat buffer was only ever printing single characters. When Ramp finally eliminated other causes and bounced it back to me as “works on my box”, GDB let me down by failing to support wchars so it would be a pain in the rear to see what was being passed around.

So, I found the last point in the call stack before the strings get passed to the UI and … my ongoing love of adjectives (and prefixes/modifiers/designators) came to the rescue… Seeing the arguments as type const wchar_t* it was my great pleasure to type “%ls” without even thinking about it and then notice that the other printf’s were using “%s” instead. (I guess I’ve read about %s and %ls at some point or seen examples of wprintf using %ls)

For someone who was, for a number of years, an out-and-out Perl hack, I have definitely become fond of strict typing and casting. When my integer isn’t supposed to be signed, I specify unsigned (in database schemas too); I declare my const’s and deal with the warnings and errors. Even when I’m writing a trivial SQL query I’ll write SELECT p.callsign FROM wwiionline.wwii_player AS p WHERE p.playerid = 1234; because it almost always saves me typing an hour, week or month later…

And between that, my space-semicolon and a few other stylistic quirks, my code is fairly identifiable, even to the point where I recently had someone “finding” some of my code in someone else’s project:

I couldn’t make heads or tails of most of the code but this one segment leapt out. It looked like some code snippets of yours I had seen. It was the most readable part of his code. Steven has modified it a bit and extended it a lot but his changes all copy the style of the original code. That was my anchor to understanding the rest of it.

When we were doing the interview, code styles came up and I mentioned this, saying how the section name looked like code from this guy called “kfs1”. Steven looked a bit puzzled and said “no, I paid a guy called Oliver to write it for me”. My first task was to reindent all the code kfs style!

Hmm. I’m actually probably too lazy to do that. :)


Interesting. I compiled your little code example on my PC (running Linux) and for some reason only the text from the printf statements was output, not the text from the final wprintf statement. I couldn’t work it out until I came across this quote in a Google search:
“You cannot use printf() and wprintf() on the same output stream”

Sure enough, if I comment out the printf statements, the wprintf works. I wonder why you didn’t have the same problem with GCC?

Actually, I had the exact same problem: it worked under Visual Studio but not Linux. I fixed it, but then I ran into a formatting issue where the LESS-THAN in the for loop caused WordPress to randomly refuse to see a [/blockquote] terminator, and somewhere in all the pasting and shuffling I lost my wprintf changes :)

Fixed again now, thanks :)
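The restriction quoted in the comment above comes from stream orientation in standard C: a freshly opened stream has no orientation, the first narrow or wide I/O call fixes it as byte-oriented or wide-oriented, and the two families are not meant to be mixed on the same stream afterwards. Here is a minimal sketch of checking that with `fwide`, which only queries when passed a mode of 0. The helper names `stream_orientation` and `print_wide_if_possible` are invented for illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reports a stream's orientation as
   1 (wide), -1 (byte) or 0 (not decided yet). */
int stream_orientation(FILE *stream)
{
    int mode = fwide(stream, 0);   /* mode 0 queries without changing anything */
    return (mode > 0) - (mode < 0);
}

/* Hypothetical helper: prints a wide string followed by a newline,
   but only if the stream has not already been used for narrow output.
   Returns the number of wide characters written, or -1 if the
   stream is byte-oriented. */
int print_wide_if_possible(FILE *stream, const wchar_t *text)
{
    if (stream_orientation(stream) < 0)
        return -1;
    return fwprintf(stream, L"%ls\n", text);
}
```

If `printf` has already written to stdout, `stream_orientation(stdout)` comes back as -1. That is the situation where glibc's `wprintf` produces no output, whereas the Microsoft runtime is more forgiving.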
