Unicode literals and GCC

Meet my Chinese friend, yuè: 龠

Can you see him, or do you just see a funny box or question mark? You should be seeing a small version of this neat little fella on the right if you have the right fonts etc installed.

Yuè is my Unicode/UTF guinea pig. He is code point U+9FA0 (0x9FA0 as a single UTF-16 unit) and in UTF-8 he is E9 BE A0.

I can’t get him to print :(

Well, I can get him to print, but not when I want and need to. The following little bit of code works fine:

printf("Meet yuè: 龠.\n") ;
printf("And I can print him this way too: \xe9\xbe\xa0.\n") ;

But that’s because they’re both embedding the 3-byte UTF-8 character sequence. The following do not work:

wprintf(L"Meet yuè: 龠.\n") ;
wprintf(L"And I can print him this way too: \xe9\xbe\xa0.\n") ;

OK, I can understand why they don’t work: the 3-byte character sequence is being treated as 3 separate wide characters.

So how do you hardcode Unicode literals? The following doesn’t work either:

wchar_t a = L'龠', b = L'\u9fa0', c = L'\x9fa0', d = L'\xe9bea0' ;

wprintf(L"a = %lc or %c", a, a) ; wprintf(L"\n") ;
wprintf(L"b = %lc or %c", b, b) ; wprintf(L"\n") ;
wprintf(L"c = %lc or %c", c, c) ; wprintf(L"\n") ;
wprintf(L"d = %lc or %c", d, d) ; wprintf(L"\n") ;

Or the simple “wchar_t x = 0x9fa0”.

A little experimentation reveals that L'龠' stores the UTF-16 value (0x9FA0). This is good… If I could figure out how the hell to wprint that and produce 龠.

The reason for all this … Chinese game names will be limited to the roman alphabet, numerals and the characters defined in GB2312. This allows me to compress each character down to either 6 or 14 bits, allowing 4 Chinese characters plus one roman, 10 roman characters – or some mix thereof – in the 64-bit storage we have available.

The fact that I can’t wprint L'龠' means I’m going to have to figure out how to switch between the two sets of encodings.

14 Comments

Firefox 2 on XP, fairly vanilla, and Yue’s glyph displays fine, both in the body of the blog post and in lines 01 & 06 of the first code chunk and line 01 of the 2nd chunk.

Same with Firefox 3. Can see everything just fine.

I’m running into issues doing something similar.

Actually getting the UTF-8 characters saved to the database. I think it might be client encoding options that are sticking me. Since it isn’t urgent, I haven’t fully dug into my issue.

I see a little chinese house lookin dealy yuè:
Vista builders copy runnin IE.

Humm he don’t copy paste back on this page will paste to notepad. 龠
And now he did…I think

Looks fine to me, Vista 32 bidnez and IE.

Have you tried the same thing on different compilers and systems? Unicode support is far from universal…

Also, this newsgroup thread should help shed some light on things (or confuse them further):

http://groups.google.com/group/comp.lang.c++/browse_thread/thread/591dd9ef557dd824/782c6174e9fd7f22

Thanks for all the responses as to being able to see our friend Yuè. I wasn’t too worried though :)

Different results on Windows, of course, where it works as you would expect. I’ve complimented MS/Windows/Visual Studio on their handling of Unicode and UTF-8: it’s only to be expected that the Open Source solution is … probably academically correct regardless of how totally impractical that is in everyday use.

But the host doesn’t run on Windows, and the Mac uses GCC to compile, so I need to find solutions for them.

The fgetwc/fputwc/fwide/wprintf/fwprintf/wctomb/mbtowc/etc. family of functions indeed doesn’t seem to contain a way to specify anything about the byte encoding. How useless.

What’s bizarre is that they only seem to be useful for printing ASCII characters. I’m clearly not understanding how to express Unicode characters in a wchar_t* string.

We already have conversion routines in our UI-layer so I just need to export that so I can share it with the host, I think.

Works fine on Safari running on Vista 64.

I assume you are coding in C. You have to add the following code to your program and set LANG (or LC_ALL / LC_CTYPE) in the environment to a locale with an appropriate encoding (e.g. en_GB.UTF-8):

#include <locale.h>
setlocale(LC_CTYPE, "");

Ah – yeah, that makes it work. Cool.

kfsone, can you post the full working code, please? I am stuck with a similar problem and would love to see a working solution. Ta!
