Unicode literals and GCC
Meet my Chinese friend, yuè: 龠
Can you see him, or do you just see a funny box or question mark? You should be seeing a small version of this neat little fella on the right if you have the right fonts etc. installed.
Yuè is my Unicode/UTF guinea pig. His code point is U+9FA0, which is the single code unit 0x9FA0 in UTF-16, and in UTF-8 he is E9 BE A0.
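For anyone who wants to check the byte values by hand, here is a minimal sketch of the standard UTF-8 bit layout. The function name utf8_encode is my own for illustration, not something from a library. Fed U+9FA0 it produces E9 BE A0, which matches the \xe9\xbe\xa0 escape used in the code further down.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns the byte count,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx plus three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x9FA0, buf) fills buf with the three bytes and returns 3.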
I can’t get him to print :(
Well, I can get him to print, but not when I want and need to. The following little bit of code works fine:
printf("Meet yuè: 龠.\n") ;
printf("And I can print him this way too: \xe9\xbe\xa0.\n") ;
But that’s because they’re both embedding the 3-byte UTF-8 character sequence. The following do not work:
wprintf(L"Meet yuè: 龠.\n") ;
wprintf(L"And I can print him this way too: \xe9\xbe\xa0.\n") ;
OK, I can understand why they don’t work: the 3-byte character sequence is being treated as 3 separate wide characters.
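A small sketch makes the three-wide-characters effect visible. The helper names are made up for illustration. Each \x escape in a wide literal becomes one wchar_t, while a \u universal character name stands for the code point itself. I use \u here rather than the raw character so the result does not depend on how the compiler reads the source file's encoding.

```c
#include <stddef.h>
#include <wchar.h>

/* Three \x escapes in a wide literal give three wide characters,
   one per escape; they are not decoded as a UTF-8 sequence. */
size_t escaped_bytes_length(void)
{
    return wcslen(L"\xe9\xbe\xa0");
}

/* A universal character name gives a single wide character. */
size_t ucn_length(void)
{
    return wcslen(L"\u9fa0");
}

/* The value stored for that single wide character. */
wchar_t ucn_value(void)
{
    return L"\u9fa0"[0];
}
```

With GCC, escaped_bytes_length() is 3, ucn_length() is 1, and ucn_value() is 0x9FA0.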
So how do you hardcode Unicode literals? The following doesn’t work either:
wchar_t a = L'龠', b = L'\u9fa0', c = L'\x9fa0', d = L'\xe9bea0' ;
wprintf(L"a = %lc or %c", a, a) ; wprintf(L"\n") ;
wprintf(L"b = %lc or %c", b, b) ; wprintf(L"\n") ;
wprintf(L"c = %lc or %c", c, c) ; wprintf(L"\n") ;
wprintf(L"d = %lc or %c", d, d) ; wprintf(L"\n") ;
Or the simple "wchar_t x = 0x9fa0".
A little experimentation reveals that L'龠' stores the code point value (0x9FA0, the same as its UTF-16 code unit). This is good... if only I could figure out how the hell to wprint that and produce 龠.
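One likely culprit, offered as a sketch rather than a verified fix for this exact setup: wprintf converts each wide character to bytes using the current locale, and a C program starts in the "C" locale, which cannot represent 龠. Calling setlocale(LC_ALL, "") first adopts the environment's locale, which I am assuming is a UTF-8 one. A second thing to watch is stream orientation: once printf has written to stdout the stream is byte-oriented, and the C standard does not allow wprintf on it afterwards. The helper names below are hypothetical.

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Convert one wide character to the multibyte encoding of the named
   locale. Returns the number of bytes written to out, or (size_t)-1
   if the locale is unavailable or cannot represent the character. */
size_t wide_to_multibyte(const char *locale_name, wchar_t wc,
                         char out[MB_LEN_MAX])
{
    mbstate_t state;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return (size_t)-1;
    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    return wcrtomb(out, wc, &state);
}

/* Adopt the environment's locale, then print with wprintf only.
   Returns wprintf's result, which is negative on failure.
   Assumes nothing has used printf on stdout beforehand. */
int print_yue(void)
{
    setlocale(LC_ALL, "");
    return wprintf(L"Meet yu\u00e8: %lc.\n", (wint_t)0x9FA0);
}
```

In the "C" locale wide_to_multibyte reports failure for 0x9FA0, while in a UTF-8 locale it yields the three bytes E9 BE A0. That is the same conversion wprintf relies on for %lc.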
The reason for all this: Chinese game names will be limited to the roman alphabet, numerals and the characters defined in GB2312. This allows me to compress each character down to either 6 or 14 bits, allowing 4 Chinese characters plus one roman (4 × 14 + 6 = 62 bits), 10 roman characters (60 bits), or some mix thereof, in the 64-bit storage we have available.
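The bit budget is easy to sanity-check in code. This is only a sketch of the arithmetic, with made-up names, and it assumes the 6-bit and 14-bit codes are the whole cost, i.e. that telling the two widths apart needs no extra bits beyond the codes themselves.

```c
#include <stdbool.h>

enum { ROMAN_BITS = 6, CHINESE_BITS = 14, STORAGE_BITS = 64 };

/* Bits needed for a name with the given character counts. */
unsigned name_bits(unsigned roman, unsigned chinese)
{
    return roman * ROMAN_BITS + chinese * CHINESE_BITS;
}

/* True if such a name fits in the 64-bit storage. */
bool name_fits(unsigned roman, unsigned chinese)
{
    return name_bits(roman, chinese) <= STORAGE_BITS;
}
```

So name_fits(1, 4) and name_fits(10, 0) both hold, while a fifth Chinese character or an eleventh roman one pushes the name past 64 bits.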
The fact that I can’t wprint L'龠' means I’m going to have to figure out how to switch between the two sets of encodings.