[Proj] More Unicode etc.
glynn at gclements.plus.com
Wed Jun 10 12:35:41 EST 2009
Gerald I. Evenden wrote:
> > I have included a simple text file decode in different (Windows) formats.
> > "Test file: abcdöäå<CR><LF<CR><LF>"
> This email brings up another issue: what do all the various mail handlers in
> use by member of this list see: what do you see in the above "test file"
I see what you would expect: a, b, c, d, odiaeresis, adiaeresis, aring
(ö, ä, å), followed by "<CR><LF<CR><LF>".
The actual message body contains:
> "Test file: abcd=F6=E4=E5<CR><LF<CR><LF>"
Given the headers:
Content-Type: text/plain; charset="iso-8859-1"
all mail clients should show the string the same way.
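As a rough illustration of what a mail client does with that body, the
two steps (undo the quoted-printable transfer encoding, then interpret
the bytes using the declared charset) can be sketched in Python. The
variable names are only for illustration; the quopri module is part of
the standard library.

```python
import quopri

# The message body as transmitted: 7-bit ASCII, quoted-printable.
raw_body = b'"Test file: abcd=F6=E4=E5<CR><LF<CR><LF>"'

# Step 1: undo the transfer encoding to recover the original bytes.
decoded_bytes = quopri.decodestring(raw_body)

# Step 2: interpret those bytes using the charset named in the
# Content-Type header.
text = decoded_bytes.decode("iso-8859-1")

print(text)
```

Because both steps are driven by the headers rather than by local
settings, every client that honours them arrives at the same
characters.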
If the non-ASCII characters were embedded literally as raw 8-bit bytes
(rather than using quoted-printable encoding), with no charset
specified, each client would have to guess the encoding, so the
results might vary.
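To make the ambiguity concrete, here is a small Python sketch that
decodes the same three bytes under a few different charsets. The
particular charsets are just examples that a client might fall back
to, not a claim about what any specific mail program does.

```python
# The raw bytes for the test string, with no charset attached.
raw = b"abcd\xf6\xe4\xe5"

# Each single-byte charset maps the same bytes to different characters.
candidates = ("iso-8859-1", "cp437", "koi8-r")
results = {name: raw.decode(name) for name in candidates}
for name, decoded in results.items():
    print(name, decoded)

# The same bytes are not a valid UTF-8 sequence at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid utf-8:", utf8_ok)
```

Only the ISO-8859-1 interpretation gives the characters the sender
intended; the others give unrelated characters or fail to decode.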
> On my kmail screen I see a lower case o and a with diaeresis and a lower case
> a with ring above. The M$ newline sequence is expanded as <> bracket CR-LF.
> Is this what everybody sees on their mail screen?
It should be.
> I see this sequence wherever I drag and drop them on my system but from this
> thread I am getting the impression that this is not a universal condition.
> And I would not have given this factor further thought and assume this was a
> universal condition.
> I guess it is not.
Problems arise when you have a sequence of bytes with no encoding
specified, and you need to interpret those bytes as text. Email
messages normally specify an encoding via the Content-Type header.
Transfer of text between X applications normally uses a specific
encoding, either ISO-2022 (historical) or UTF-8 (more recent).
For terminal I/O, the program (or library) will normally use the
LC_CTYPE locale setting to determine the terminal's encoding.
[There's no guarantee that this setting will be accurate, but it's
usually the best that you can manage.]
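For reference, a minimal Python sketch of querying the encoding
implied by LC_CTYPE. This assumes a Unix-like system, since
locale.nl_langinfo is not available on Windows; the function name
terminal_encoding is hypothetical.

```python
import locale

def terminal_encoding():
    """Return the codeset implied by the LC_CTYPE locale setting."""
    try:
        # Adopt LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale the system does not support;
        # the current locale is left unchanged.
        pass
    return locale.nl_langinfo(locale.CODESET)

print(terminal_encoding())
```

The value reflects only what the environment claims; nothing here
checks that the terminal really uses that encoding.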
Glynn Clements <glynn at gclements.plus.com>