[Proj] More Unicode etc.

Wed Jun 10 12:35:41 EST 2009

Gerald I. Evenden wrote:

> > I have included a simple text file decode in different (Windows) formats.
> >
> > "Test file: abcdöäå<CR><LF<CR><LF>"
> 
> This email brings up another issue: what do the various mail handlers in 
> use by members of this list show? What do you see in the above "test file" 
> line?

I see what you would expect: a,b,c,d,odiaeresis,adiaeresis,aring
followed by "<CR><LF<CR><LF>".

The actual message body contains:

	> "Test file: abcd=F6=E4=E5<CR><LF<CR><LF>"

Given the headers:

	MIME-Version: 1.0
	Content-Type: text/plain; charset="iso-8859-1"
	Content-Transfer-Encoding: quoted-printable

all mail clients should show the string the same way.
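
As a rough illustration of what a mail client does with those headers, here
is a minimal Python sketch using the standard-library email package. The
message text is a cut-down stand-in assembled from the headers and body shown
above, not the original mail, and the variable names are only illustrative.

```python
import email

# Cut-down stand-in for the message: the three headers plus the encoded body.
raw_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Test file: abcd=F6=E4=E5\n"
)

msg = email.message_from_string(raw_message)

# Step 1: undo the Content-Transfer-Encoding (=F6 becomes the byte 0xF6, etc.).
octets = msg.get_payload(decode=True)

# Step 2: interpret those bytes using the charset named in Content-Type.
charset = msg.get_content_charset()
text = octets.decode(charset)

print(charset)
print(text)
```

Both steps are driven entirely by the headers, which is why the result should
not depend on the client or on the reader's locale.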

If the non-ASCII characters were embedded literally (rather than using
quoted-printable encoding), without any encoding specified, the
results might vary.
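
To see why, here is a small Python sketch that takes the same three raw bytes
and decodes them under a few candidate encodings. The choice of candidates is
arbitrary and only for illustration; a real client would be guessing from its
own defaults.

```python
# The bytes that follow "abcd" once the quoted-printable layer is removed.
octets = b"abcd\xf6\xe4\xe5"

# Candidate encodings a program might assume when none is specified.
candidates = ("iso-8859-1", "cp437", "utf-8")

results = {}
for name in candidates:
    try:
        results[name] = octets.decode(name)
    except UnicodeDecodeError:
        # The byte sequence is not valid in this encoding at all.
        results[name] = None

for name, value in results.items():
    print(name, repr(value))
```

The same bytes yield different characters under ISO-8859-1 and CP437, and are
not valid UTF-8 at all, so without a declared encoding the reader's software
has no reliable way to pick the right one.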

> On my kmail screen I see a lower case o and a with diaeresis and a lower case 
> a with ring above.  The M$ newline sequence is expanded as <> bracket CR-LF.  
> Is this what everybody sees on their mail screen?

It should be.

> I see this sequence wherever I drag and drop it on my system, but from this 
> thread I am getting the impression that this is not a universal condition.  
> Otherwise I would not have given this factor further thought and would have 
> assumed it was universal.
> 
> I guess it is not.

Problems arise when you have a sequence of bytes with no encoding
specified, and you need to interpret those bytes as text. Email
messages normally specify an encoding via the Content-Type header. 
Transfer of text between X applications normally uses a specific
encoding, either ISO-2022 (historical) or UTF-8 (more recent).

For terminal I/O, the program (or library) will normally use the
LC_CTYPE locale setting to determine the terminal's encoding.

[There's no guarantee that this setting will be accurate, but it's
usually the best that you can manage.]
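
A minimal Python sketch of that approach, assuming the program simply trusts
the environment: it adopts the user's LC_CTYPE setting, asks the locale module
which encoding that implies, and encodes output accordingly. The sample string
and the fallback behaviour are illustrative choices, not a recommendation.

```python
import locale

# Adopt the user's locale for character handling (LC_CTYPE, or LC_ALL/LANG).
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    # The environment names a locale this system does not have;
    # stay with the default "C" locale.
    pass

# Ask which encoding the current locale implies for text I/O.
encoding = locale.getpreferredencoding(False)

text = "abcd\u00f6\u00e4\u00e5"

# Encode for the terminal; characters the encoding lacks become "?".
octets = text.encode(encoding, errors="replace")

print(encoding)
print(octets)
```

If the locale setting does not match what the terminal actually expects, the
bytes produced here will still be displayed incorrectly, which is the caveat
noted above.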

-- 
Glynn Clements <glynn at gclements.plus.com>