[Proj] Unicode

Glynn Clements glynn at gclements.plus.com
Mon Jun 8 19:02:04 EST 2009


Gerald I. Evenden wrote:

> 	...
> >     printf("%ls\n", L"Schöne Grüße");
> 
> for my edification I grabbed a portion of the above string and:
> 
> gie at charon:~$ echo 'L"Schöne Grüße");' >foo
> gie at charon:~$ m foo
> L"Schöne Grüße");
> gie at charon:~$ hd foo
> 00000000  4c 22 53 63 68 c3 b6 6e  65 20 47 72 c3 bc c3 9f  |L"Sch..ne Gr....|
> 00000010  65 22 29 3b 0a                                    |e");.|
> 00000015
> gie at charon:~$
> 
> I see that the "normal" text is taking up 1 byte per character, and when hitting 
> a funky character it escapes with c3 and a code.  So it seems that when 
> everything is in ASCII we are in normal byte mode, and when an extended 
> character comes along it is handled with a two-byte sequence.
> 
> Fair enough.  This *is not* the impression I got from various previous
> descriptions, as the 16-bit aspect kept coming up and made one think that
> the whole damn string was in 16-bit code.

On Unix, wide characters (wchar_t) are normally 32 bits. But you don't
normally use that for storage or interchange (apart from anything else,
you have endian-ness issues).
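
To make the endian-ness point concrete: the same 32-bit code point
comes out as a different byte sequence depending on byte order, so any
file or wire format built on wide characters has to pick one. A minimal
sketch follows; the helper names put_be32 and put_le32 are invented for
this illustration and are not part of any library.

```c
#include <stdint.h>

/* Store a 32-bit code point most-significant byte first (big-endian). */
void put_be32(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* Store the same code point least-significant byte first (little-endian). */
void put_le32(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)cp;
    out[1] = (unsigned char)(cp >> 8);
    out[2] = (unsigned char)(cp >> 16);
    out[3] = (unsigned char)(cp >> 24);
}
```

For o-umlaut (U+00F6), put_be32 yields the bytes 00 00 00 f6 and
put_le32 yields f6 00 00 00, whereas the UTF-8 form (c3 b6) is the same
on every machine.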

But some forms of text processing are inconvenient on multi-byte
representations; e.g. you can't iterate over a char[] processing each
element independently. So it's quite common to convert to wide
characters for processing.
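
As a rough sketch of that conversion, here is a hand-rolled UTF-8
decoder that expands a byte string into one 32-bit value per character.
In real code you would more likely call mbstowcs() after setlocale();
this version is only meant to show what happens to the bytes. The name
utf8_to_utf32 is made up for the example, and the decoder is
deliberately simplified (it does not reject overlong forms or
surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into 32-bit code points.
 * Writes at most `cap` code points to `out` and returns the number
 * written, or (size_t)-1 on a malformed or truncated sequence. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;  /* stray continuation or invalid lead byte */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;  /* also catches a premature NUL */
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        out[n++] = cp;
    }
    return n;
}
```

Fed the 15 bytes of the string from the hex dump above, this produces
12 elements, and each element can then be processed on its own; the
pair c3 b6 becomes the single value 0xF6.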

OTOH, life is still much simpler with the ISO-8859-* encodings where
one byte is one character, which is one reason why they're still
widely used.

> As an aside, I dropped the string into vim and it displayed it properly.  
> Alas, how does one enter this stuff without dropping into a character map 
> display and wearing your mouse out with drag-and-drop?

There are various options.

Most X keyboard layouts configure AltGr plus the punctuation keys on
the RHS of the keyboard as "dead" accents, so e.g. AltGr+semicolon
then e gives eacute; "xmodmap -pk | grep dead_" should list the
combinations.

But I have Shift+AltGr configured as a "Compose" key[1], which allows
mnemonic sequences, e.g. Shift+AltGr then e then single-quote gives
eacute, Shift+AltGr then o then c gives the copyright symbol, etc. 
This would be tedious if you need to use accented characters a lot,
but it's adequate (and easier than remembering all of the dead keys)
for occasional non-ASCII characters.

[1] xmodmap -e 'keycode 113 = ISO_Level3_Shift Multi_key'

-- 
Glynn Clements <glynn at gclements.plus.com>

