Unicode literals and GCC

Meet my Chinese friend, yuè: 龠

Can you see him, or do you just see a funny box or question mark? You should be seeing a small version of this neat little fella on the right if you have the right fonts etc. installed.

Yuè is my Unicode/UTF guinea pig. He is U+9FA0, which is the single code unit 0x9FA0 in UTF-16, and in UTF-8 he is E9 BE A0.
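
The UTF-8 bytes can be double-checked by doing the encoding by hand. Here is a minimal sketch; `utf8_encode` is a name made up for this example, not a library function. For U+9FA0 it produces E9 BE A0, the same bytes as the `\xe9\xbe\xa0` escape used in the printf code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling `utf8_encode(0x9FA0, buf)` returns 3 and leaves E9, BE, A0 in `buf`.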

I can’t get him to print :(

Well, I can get him to print, but not when I want and need to. The following little bit of code works fine:

printf("Meet yuè: 龠.\n") ;
printf("And I can print him this way too: \xe9\xbe\xa0.\n") ;

But that’s because they’re both embedding the 3-byte UTF-8 character sequence. The following do not work:

wprintf(L"Meet yuè: 龠.\n") ;
wprintf(L"And I can print him this way too: \xe9\xbe\xa0.\n") ;

OK, I can understand why the second one doesn't work: each \x escape becomes its own wide character, so the 3-byte sequence is treated as 3 separate wide characters.

So how do you hardcode Unicode literals? The following doesn’t work either:

wchar_t a = L'龠', b = L'\u9fa0', c = L'\x9fa0', d = L'\xe9bea0' ;

wprintf(L"a = %lc or %c", a, a) ; wprintf(L"\n") ;
wprintf(L"b = %lc or %c", b, b) ; wprintf(L"\n") ;
wprintf(L"c = %lc or %c", c, c) ; wprintf(L"\n") ;
wprintf(L"d = %lc or %c", d, d) ; wprintf(L"\n") ;

Or the simple "wchar_t x = 0x9fa0".

A little experimentation reveals that L’龠’ stores the UTF-16 value (0x9FA0). This is good… If I could figure out how the hell to wprint that and produce 龠.
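
What the compiler stores for each kind of escape can be checked without printing anything. Here is a small sketch, with function names made up for illustration. It uses `\u9fa0` rather than the raw glyph so the result doesn't depend on how the source file is encoded.

```c
#include <stddef.h>
#include <wchar.h>

/* Each \x escape in a wide literal becomes one whole wide character,
   so the three UTF-8 bytes turn into three wide characters. */
size_t escaped_bytes_length(void)
{
    return wcslen(L"\xe9\xbe\xa0");
}

/* A universal character name is a single wide character. */
size_t ucn_length(void)
{
    return wcslen(L"\u9fa0");
}

/* The value stored for that character is the code point itself. */
unsigned long ucn_value(void)
{
    return (unsigned long)L'\u9fa0';
}
```

`escaped_bytes_length()` gives 3, `ucn_length()` gives 1, and `ucn_value()` gives 0x9FA0. So the literal itself holds the right value; the problem is on the output side.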

The reason for all this … Chinese game names will be limited to the roman alphabet, numerals and the characters defined in GB2312. This allows me to compress each character down to either 6 or 14 bits, so the 64-bit storage we have available can hold 4 Chinese characters plus one roman (4 × 14 + 6 = 62 bits), 10 roman characters (10 × 6 = 60 bits), or some mix thereof.

The fact that I can’t wprint L’龠’ means I’m going to have to figure out how to switch between the two sets of encodings.

Categories: Coding
  1. Horse
    August 26, 2008 at 8:53 pm | #1

    Firefox 2 on XP, fairly vanilla, and Yue’s glyph displays fine, both in the body of the blog post and in lines 01 & 06 of the first code chunk and line 01 of the 2nd chunk.

  2. August 26, 2008 at 9:36 pm | #2

    same with Firefox 3. Can see everything just fine.

    I’m running into issues doing something similar.

    Actually getting the UTF-8 characters saved to the database. I think it might be client encoding options that are sticking me. Since it isn’t urgent, I haven’t fully dug into my issue.

  3. OldZeke
    August 26, 2008 at 9:39 pm | #3

    I see a little chinese house lookin dealy yuè:
    Vista builders copy runnin IE.

  4. OldZeke
    August 26, 2008 at 9:42 pm | #4

    Humm he don’t copy paste back on this page will paste to notepad. 龠
    And now he did…I think

  5. easting
    August 26, 2008 at 10:44 pm | #5

    Looks fine to me, Vista 32 bidnez and IE.

  6. X15
    August 26, 2008 at 11:57 pm | #6

    Have you tried the same thing on different compilers and systems? Unicode support is far from universal…

  7. X15
    August 27, 2008 at 12:16 am | #7

    Also, this newsgroup thread should help shed some light on things (or confuse them further):

    http://groups.google.com/group/comp.lang.c++/browse_thread/thread/591dd9ef557dd824/782c6174e9fd7f22

  8. August 27, 2008 at 12:58 am | #8

    Thanks for all the responses as to being able to see our friend Yuè, I wasn’t too worried though :)

    Different results on Windows, of course, where it works as you would expect. I’ve complimented MS/Windows/Visual Studio on their handling of Unicode and UTF-8: it’s only to be expected that the Open Source solution is … probably academically correct regardless of how totally impractical that is in everyday use.

    But the host doesn’t run on Windows, and the Mac uses GCC to compile, so I need to find solutions for them.

  9. Tuure Laurinolli
    August 27, 2008 at 2:41 am | #9

    The fgetwc/fputwc/fwide/wprintf/fwprintf/wctomb/mbtowc/etc. family of functions indeed doesn’t seem to contain a way to specify anything about the byte encoding. How useless.

  10. August 27, 2008 at 5:24 am | #10

    What’s bizarre is that they only seem to be useful for printing ASCII characters. I’m clearly not understanding how to express Unicode characters in a wchar_t* string.

    We already have conversion routines in our UI-layer so I just need to export that so I can share it with the host, I think.

  11. memoi
    August 27, 2008 at 12:51 pm | #11

    Works fine on Safari running on Vista 64.

  12. Midas
    October 14, 2009 at 2:39 pm | #12

    I assume you are coding in C. You have to add the following code to your program and set LANG (or LC_ALL / LC_CTYPE) to a locale with an appropriate encoding (e.g. en_GB.UTF-8):

    #include <locale.h>
    setlocale(LC_CTYPE, "");

  13. October 16, 2009 at 2:47 pm | #13

    Ah – yeah, that makes it work. Cool.
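
For anyone after the whole thing, here is a minimal sketch of the fix from comment #12. It is not the exact code from the project. The function names and the `YUE_DEMO_MAIN` macro are made up for this example, and it assumes the environment names a UTF-8 locale such as en_GB.UTF-8.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert yuè (U+9FA0) to the current locale's multibyte encoding.
   Returns the byte count, or (size_t)-1 if the locale cannot represent it. */
size_t yue_to_multibyte(char *out, size_t cap)
{
    return wcstombs(out, L"\u9fa0", cap);
}

/* Print yuè through the wide-character API.
   A negative return value means the output failed. */
int print_yue(void)
{
    return wprintf(L"Meet yu\u00e8: %lc.\n", (wint_t)L'\u9fa0');
}

#ifdef YUE_DEMO_MAIN  /* hypothetical macro: build with -DYUE_DEMO_MAIN */
int main(void)
{
    /* Without this call the program stays in the "C" locale, whatever
       LANG says, and wide output has no encoding to convert to. */
    if (setlocale(LC_CTYPE, "") == NULL) {
        fputs("the locale named by the environment is not available\n", stderr);
        return 1;
    }
    return print_yue() < 0 ? 1 : 0;
}
#endif
```

Something like `gcc -DYUE_DEMO_MAIN yue.c && LANG=en_GB.UTF-8 ./a.out` should build and run it. The error message goes to stderr on purpose: stdout is used only for wide output, so narrow and wide calls are never mixed on one stream.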

  14. Penguin
    February 7, 2011 at 3:28 pm | #14

    kfsone, can you post the full working code please. I am stuck with a similar problem and would love to see a working solution. Ta!
