Wish List Another Beginning

Nigel Clarke pmmail@rpglink.com
Thu, 16 Dec 1999 06:47:01 -0500 (EST)


On Thu, 16 Dec 1999 04:37:56 +0200, Cristian Secara wrote:

snip
>Few days ago I wrote a message in german language, using german
>national characters (ä, ö, ü, ß).
>Signing (and sending) that message, I found by chance that my (already
>sent) message had all national characters corrupted, something like
>=84, =81, =E1 instead.
>Further testings revealed that the only way to keep the characters
>unchanged, was PMMail -> Properties -> Encoding format = Quoted
>printable AND 'Do not perform character set translation' box left
>unchecked (off ? I suspect this switch acts inverse). This was the only
>valid version, out of four possible.
>Corruption was also not similar on all tests, it depends on the above
>settings combination.
>
>That message and all tests with PMMail -> Properties -> Default
>character set = ISO 8859-2 (Latin 2).
>Without signing, national characters were not corrupted in any way.
>

I identified this problem in 1997 and, with Mats Dufberg, wrote a small 
monograph on it, which I attach. For a long time it was available by 
mail from me, and I passed it on to the SouthSide team when I took a 
less active role in PMMail support.

Nigel

PMMail 1.9x, Encoding and ISO-8859 character sets Setup Information
Version 1.5                                              19 June 1997

Compiled by :-
Nigel Clarke <nclarke@bda-hp.bda.nasa.gov> 
and Mats Dufberg <mats.dufberg@abc.se>.

PMMail Settings
===============
PMMail ->Settings ->General ->Locale Settings ->Default Character Set

The Default Character Set as installed is US ASCII. Western European 
language users should change this setting to ISO 8859-1 (Latin 1) in 
most cases. This is also normally used by Scandinavian users in 
preference to ISO 8859-10. See below for more information about 
character sets. If you use the ISO 8859-1 character set you must 
ensure that your primary code page setting in the config.sys file is
850 (for example CODEPAGE=850).

PMMail ->Settings ->General ->Locale Settings ->Encoding Format

Make sure that this option is set to Quoted-Printable rather than 8 Bit
unless you have a specific instruction to set 8 Bit here. Many US mail 
gateways will mangle your message if you don't. See below for more 
information.
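
As a rough illustration of why Quoted-Printable is the safer setting, the sketch below mimics a gateway that clears the eighth bit of every octet. That behaviour is an assumption made for the example (the text above only says the message gets mangled), and strip_high_bit is a hypothetical helper, not part of PMMail.

```python
import quopri

def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit-only gateway (assumed behaviour): clear bit 8 of each octet."""
    return bytes(octet & 0x7F for octet in data)

# "Grüße" as raw ISO 8859-1 octets, as sent with the 8 Bit setting.
raw_8bit = "Grüße".encode("iso-8859-1")

# The same text as sent with the Quoted-Printable setting: plain ASCII only.
raw_qp = quopri.encodestring(raw_8bit)

mangled_8bit = strip_high_bit(raw_8bit)   # high octets are damaged
mangled_qp = strip_high_bit(raw_qp)       # nothing to damage

print(mangled_8bit)
print(mangled_qp == raw_qp)
```

The 8 bit body loses its accented letters for good, while the Quoted-Printable body passes through unchanged and can still be decoded by the receiving mail reader.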

PMMail ->Settings ->General ->Locale Settings 
                         ->Do NOT Perform Character Set Translation

Note that no help is available in PMMail 1.92 should you search on 
'Character Set Translation'.

In the opinion of a well informed Scandinavian PMMail user (Mats) 
this should never be set to on (the box should never be checked) as it 
produces mail not in accordance with the MIME standard (RFC2045). See 
below for more information.

Checking your PMMail Setup
==========================
You can use the services of the mime test service at 
<mime-test@relay.surfnet.nl> by sending a message with a Subject
of "iso-8859-1" (no quotes).

The service should reply with a message containing the following (or similar).

"This is an example of a text/plain; charset=iso-8859-1 message.

Norwegian characters:

Æ (ae ligature), Ø (o slash), Å (a ring); lowercase: æ ø å

The complete ISO 8859-1 character set:

 32:   ! " # $ % & ' ( ) * + , - . / 
 48: 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
 64: @ A B C D E F G H I J K L M N O
 80: P Q R S T U V W X Y Z [ \ ] ^ _
 96: ` a b c d e f g h i j k l m n o
112: p q r s t u v w x y z { | } ~  
160:   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ®  
176: ° ± ² ³ ´   ¶ ·   ¹ º » ¼ ½ ¾ ¿
192: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
208: Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
224: à á â ã ä å æ ç è é ê ë ì í î ï
240: ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
(Thanks to Harald Alvestrand of Uninett)

SURFnet                                              EH'95"

If you have complete support for ISO-8859-1 all cells of the table 
should contain one printable character (16 cells per row) with the 
following exceptions. 32 and 160 are space characters, regular and 
non-breaking, respectively. 127 has no printable character. 

PMMail does not seem to be able to display codes 175, 181 and 184.
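
If the test service is unavailable, a comparable reference table can be generated locally. The following is only a sketch in Python, not something taken from the SURFnet service; charset_table is a hypothetical helper, and cells with no visible glyph (127, the non-breaking space at 160 and the soft hyphen at 173) are shown as blanks.

```python
def charset_table(encoding: str = "iso-8859-1") -> list[str]:
    """Build 16-cell rows for codes 32-127 and 160-255 in the given character set."""
    rows = []
    for start in list(range(32, 128, 16)) + list(range(160, 256, 16)):
        cells = []
        for code in range(start, start + 16):
            try:
                ch = bytes([code]).decode(encoding)
            except UnicodeDecodeError:
                ch = " "  # code not assigned in this character set
            # Show control and invisible characters as a blank cell.
            cells.append(ch if ch.isprintable() else " ")
        rows.append(f"{start:3d}: " + " ".join(cells))
    return rows

for row in charset_table():
    print(row)
```

Passing another name, for example "iso-8859-2", gives the corresponding table for that character set, which can be compared cell by cell with what the mail reader displays.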

More locations for testing your MIME setup
==========================================
From Part 3 of the MIME FAQ
"11.3) Where can I get some sample MIME messages?

Here are two sources:

ftp://thumper.bellcore.com/pub/nsb/samples/
http://www-dsed.llnl.gov/documents/tests/email.html

Here're more sources:

    [ Patrik Faltstrom <paf@bunyip.com> 13-Dec-1994 ]

    At 12:55 AM 12/11/94, Richard Willis wrote:
    >Could someone tell me what the address of the person in Sweden
    >is who kindly provided a set of MIME-conformancy tests via
    >listserver...
    
    My address is paf@bunyip.com, and the address of the listserver
    is mimeback@bunyip.com. Send the command (actually the name of the
    file you want) as the subject in the message. Start with the command
    "HELP".
---------------------------------------------------------
These diagrams should help clarify some of the possible options open to
users in setting up support for non US and non English language writing
PMMail users.

Char. Set:    = Character Set setting in mail header.
(Where iso-8859-1 is used any of the iso-8859 options can be 
substituted).
Bit Range:    = Actual bit range of characters used in the body of the
message.
Encoding:     = Encoding required by RFC 2045.
(Base64 is an acceptable alternative to Q-P).
CTE:          = Content-Transfer-Encoding setting in mail header.

Typical path for mail from English writing US based users
=========================================================

/--------------------------\                  /---------------------\
| Char. set: ASCII         | <-> Internet <-> |                     |
| Bit range: 7 bit         |                  | Any mail reader     |
| Encoding:  not necessary |                  | can read the mail   |
| CTE:       7bit          |                  |                     |
\--------------------------/                  \---------------------/
This equates to the following PMMail Settings
=============================================
Default Character Set ->US ASCII
Encoding Format ->Q-P

and generates the outgoing PMMail or MIME mailer header
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Typical path for mail from non native English writing users
===========================================================

1) The PMMail user is connected to a traditional mail 
(SMTP) server that cannot (reliably) handle high octets
(This case is usually found in the US)

/-------------------------------\
| MIME compliant mailer (PMMail)|
| Char. set: iso 8859-1         |
| Bit range: 8 bit              | --> SMTP server -->
| Encoding:  Quoted-Printable   |
| CTE:       Quoted-printable   |
\-------------------------------/

              /---------------------------------------\
    SMTP      | The mail can be read by any MIME      |
--> server    | compliant mail reader (if it supports |
    POP   --> | iso 8859-1). Receiving program will   |
    server    | decode quoted-printable to correct    |
              | characters in iso 8859-1.             |
              \---------------------------------------/

This equates to the following PMMail Settings
=============================================
Default Character Set ->ISO-8859-1
Encoding Format ->Q-P

and generates the outgoing PMMail or MIME mailer header
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

2) The PMMail user is connected to a modern mail (SMTP)
server that can reliably handle high octets.
(This case is most commonly found in European countries)

/-------------------------------\
| MIME compliant mailer (PMMail)|
| Char. set: iso 8859-1         |
| Bit range: 8 bit              | --->
| Encoding:  not necessary      |     
| CTE:       8bit               |     
\-------------------------------/

     /------------------------------------\
---> | ESMTP server with 8BITMIME support |
     | The server will convert to quoted- | --> (E)SMTP -->
     | printable if next (E)SMTP server   |
     | doesn't support 8BITMIME.          |
     \------------------------------------/

(Note: The conversion usually generates a line in the mail header.
Example:-
X-Mime-Autoconverted: from 8bit to quoted-printable by aaa.bbb.ccc)

              /---------------------------------------\
    SMTP      | The mail can be read by any MIME      |
--> server    | compliant mail reader (if it supports |
    POP   --> | iso 8859-1). Receiving program will   |
    server    | decode any quoted-printable to correct|
              | characters in iso 8859-1.             |
              \---------------------------------------/

This equates to the following PMMail Settings
=============================================
Default Character Set ->ISO-8859-1
Encoding Format ->8 Bit

and generates the outgoing PMMail or MIME mailer header
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit

Note on non MIME compliant receiving mail readers. If
the receiving mail reader can handle 8 bit mail, and
no encoding has been done the program can correctly
display the mail. We cannot, however, guarantee that
the mail does not take a different route that will
trigger encoding.

3) The PMMail user is connected to a mail (SMTP) server 
that can reliably handle high octets locally (non-ESMTP).
(This case is usually found in Europe where older or non-MIME
mail systems with 8 bit non English language support are found)

The situation is like number 2, but the difference is
that if the mail takes a route via a traditional SMTP
server, there won't be any encoding but all high octets
will be corrupted.
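
The header pairs listed above for the US ASCII case and for cases 1 and 2 can be reproduced outside PMMail for comparison. The sketch below uses Python's standard email package rather than PMMail itself; the helper name build and the sample bodies are assumptions made for the illustration.

```python
from email.message import EmailMessage

def build(body: str, charset: str, cte: str) -> EmailMessage:
    """Create a text/plain message with an explicit charset and transfer encoding."""
    msg = EmailMessage()
    msg.set_content(body, charset=charset, cte=cte)
    return msg

# US ASCII path, case 1 (Quoted-Printable) and case 2 (8 Bit).
us_ascii = build("Hello", "us-ascii", "7bit")
case1_qp = build("Grüße", "iso-8859-1", "quoted-printable")
case2_8bit = build("Grüße", "iso-8859-1", "8bit")

for msg in (us_ascii, case1_qp, case2_8bit):
    print("Content-Type:", msg["Content-Type"])
    print("Content-Transfer-Encoding:", msg["Content-Transfer-Encoding"])
    print()
```

In the Quoted-Printable case the stored body contains only ASCII (the high octets appear as =FC and =DF); in the 8 Bit case the high octets are sent as they are, which is why that setting depends on every server along the route handling them.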

--------------------------------------------------------------
The material that follows goes into the subject of code pages, encoding
and character sets in considerable detail and can be considered for
reference only.

Character Sets.
===============
US ASCII was designed to transmit US English characters and as such uses
binary sequences, that translate to the decimal numbers 0 through 127, to
represent those characters.

The ASCII set consists of some control characters, a space character (value
32) and the following printable characters with values between 33 and 126.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

In order to represent the characters used in languages other than US 
English IBM and Microsoft developed the concept of code pages containing
character sets that use the numbers above 127 to represent different
letters. Different code pages use these values to produce the
appropriate characters on screen in conjunction with the correct country
setting. Common code pages are 850 as a generic International code page
and 437 for US English. 

ISO 8859 (International Organisation for Standardisation standard 8859)
lists character sets corresponding to the appropriate national
languages. These are normally registered via ECMA (European Computer
Manufacturers Association) with ISO. 

The character sets available in PMMail 1.92 are:-

US ASCII                 Used in the United States
ISO 8859-1   (Latin 1)   Used for Western Europe and Latin America
ISO 8859-2   (Latin 2)   Used for Eastern Europe
ISO 8859-3   (Latin 3)   Used for Southern Europe
ISO 8859-4   (Latin 4)   Used for Scandinavia (also 8859-1 and 10)
ISO 8859-5   (Cyrillic)
ISO 8859-6   (Arabic)    
ISO 8859-7   (Modern Greek)  
ISO 8859-8   (Hebrew)   
ISO 8859-9   (Latin 5)   Used for Turkey  (also 8859-3)
ISO 8859-10  (Latin 6)   Used for Scandinavia (also 8859-1 and 4)
KOI8-R       (Russian)

Now if you are writing in Arabic, Hebrew or Greek the choices are 
obvious. I believe that KOI8-R is the preferred character set for use on
a Russian computer. Complete support for Arabic, Greek and Hebrew (as
well as Thai and DBCS pages) is only found in those versions of Warp 
that are specifically designed for these countries.

ISO 8859-1 (Latin 1) supports the following languages:
Afrikaans, Albanian, Catalan, Danish, Dutch, English, Faeroese, 
Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian,
Portuguese, Spanish and Swedish. It doesn't cover Welsh (and possibly
Breton).

ISO 8859-2 (Latin 2) supports the following languages:
Albanian, Croat, Czech, German, Hungarian, Polish, Romanian, Slovak and
Slovenian.

ISO 8859-3 (Latin 3) supports the following languages:
Esperanto, Galician, Maltese and Turkish.

ISO 8859-4 (Latin 4) supports the following languages:
Estonian, Latvian and Lithuanian.
It is an incomplete precursor of the ISO 8859-10 (Latin 6) set.

ISO 8859-5 (Cyrillic) supports the following languages:
Bulgarian, Byelorussian, Macedonian, Serbian and Ukrainian.

ISO 8859-9 (Latin 5) replaces the rarely used Icelandic letters from 
ISO 8859-1 (Latin 1) with Turkish letters.

ISO 8859-10 (Latin 6) adds the last letters from Greenlandic and Lapp 
which were missing in ISO 8859-4 (Latin 4) and thereby covers all
Scandinavia.

Code Page Setting's Effect on Displayed Characters
==================================================
So far our experience is limited to code pages 850 and 437. We welcome 
experience with other code page settings, and use of other character sets
than US-ASCII and ISO-8859-1.

If 850 primary code page is selected almost all ISO 8859-1 characters 
are displayed correctly (characters with decimal codes 175, 181, 184 are
not displayed).

If 850 is primary code page PMMail does not seem to be able to handle 
ISO 8859-2, -3, -4 and -6 correctly. The OS/2 help on the CodePage
command shows that 852 is the correct page for Latin 2 (ISO 8859-2), 857
for Latin 3 (ISO 8859-3), 921 or 922 for Latin 4 (ISO 8859-4).
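
To show why the code page and the declared character set have to be translated into one another, the sketch below compares the octets that code page 850 and ISO 8859-1 assign to the same letters. It uses Python's built-in cp850 and iso-8859-1 codecs; the helper qp_hex is hypothetical and only formats octets the way Quoted-Printable writes them. The untranslated values are consistent with the 84, 81 and E1 (hex) reported in the quoted message at the top, but that link is an inference, not something confirmed by the PMMail developers.

```python
def qp_hex(data: bytes) -> str:
    """Format each octet the way Quoted-Printable writes it, e.g. =E4."""
    return "".join(f"={octet:02X}" for octet in data)

text = "äüß"

# Octets as held on an OS/2 system running with code page 850.
on_screen = text.encode("cp850")

# Octets that should go into a message labelled iso-8859-1.
on_wire = text.encode("iso-8859-1")

# Character set translation: reinterpret the code page octets, then re-encode.
translated = on_screen.decode("cp850").encode("iso-8859-1")

print("code page 850:", qp_hex(on_screen))
print("ISO 8859-1:   ", qp_hex(on_wire))
print("translated:   ", qp_hex(translated))
```

If the translation step is skipped, the code page octets go out under an ISO 8859 label and the recipient sees the wrong characters.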

------------------------------------------------------------------
Encoding Format - Quoted Printable
==================================

Quoting from RFC 2045
   "The Quoted-Printable encoding is intended to represent data that
   largely consists of octets that correspond to printable characters in
   the US-ASCII character set.  It encodes the data in such a way that
   the resulting octets are unlikely to be modified by mail transport.
   If the data being encoded are mostly US-ASCII text, the encoded form
   of the data remains largely recognizable by humans.  A body which is
   entirely US-ASCII may also be encoded in Quoted-Printable to ensure
   the integrity of the data should the message pass through a
   character-translating, and/or line-wrapping gateway."

NOTE: PMMail only uses Quoted-Printable encoding if characters with a 
numeric value greater than 127 are present in the message.

and also

"NOTE: The quoted-printable encoding represents something of a
   compromise between readability and reliability in transport.  Bodies
   encoded with the quoted-printable encoding will work reliably over
   most mail gateways, but may not work perfectly over a few gateways,
   notably those involving translation into EBCDIC.  A higher level of
   confidence is offered by the base64 Content-Transfer-Encoding." 

Mats Dufberg <mats.dufberg@abc.se> wrote:

"If your mail is quoted-printable encoded and the receiving
party has an MIME compliant mail reader (and support for
the character set ISO 8859-1) the encoded characters will
be translated back into their original shape. That is,
you won't even notice the encoding.

If the receiving party does not have an MIME compliant
mail reader, the characters will be presented in their
encoded form which means "=FC=DC=E4=C4=F6=D6" for "üÜäÄöÖ"
(the German Umlauts).

Transliterating them into ue, ae and oe has nothing to do
with MIME, but could be done automatically in PMMail
with some REXX program."
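
The round trip Mats describes, and the optional transliteration, can be sketched as follows. This is written in Python with the standard quopri module rather than in REXX, and the mapping table UMLAUT_MAP is an assumption for the example (including the choice of "ss" for ß); it is not a PMMail feature.

```python
import quopri

# Hypothetical transliteration table for readers without ISO 8859-1 support.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

original = "üÜäÄöÖ"

# What travels in the message body, and what a non-MIME reader displays.
encoded = quopri.encodestring(original.encode("iso-8859-1"))

# What a MIME compliant reader shows after decoding.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")

# Plain ASCII fallback, independent of MIME.
transliterated = decoded.translate(UMLAUT_MAP)

print(encoded.decode("ascii"))
print(decoded)
print(transliterated)
```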
------------------------------------------------------------------
Character Set Translation
=========================
Mats Dufberg <mats.dufberg@abc.se> wrote:

"The behavior of PMMail 1.91 is a little bit strange when it comes
to the character set translation setting. You'll actually have
three choices, "8bit", "quoted-printable" and "no translation".
When "no translation" is selected "8bit" or "quoted-printable"
gives the same result.

Let's say you'll compose an email with high octets with ISO 8859-1
as the selected character set. The result will be the following:

CHOICE  CHARSET    ENCODING     Characters    High       Legal?
        value      value        encoded?      octets?

8bit    iso-8859-1 8bit         no            yes         YES
QP      iso-8859-1 quoted-pr.   yes           no          YES
no tr.  us-ascii   7bit         no            yes         NO

That is, when "no translation" is selected the mail header
says "all characters are ascii and 7bit" even if the mail
body contains high octets.

By selecting "no translation" PMMail produces email that are
NOT compliant with the MIME protocol. I hope it's a bug, not a
feature. :-) As it is it should not be used."

RFC2045 Section 6.2 states:-
 "The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as
   anything other than "binary" allowed."
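
The "Legal?" column of Mats' table can be expressed as a small check on the header labels and the octets actually sent. This is a simplified sketch: label_is_legal is a hypothetical helper, it looks only at high octets, and it ignores the other RFC 2045 requirements such as line length.

```python
def label_is_legal(charset: str, cte: str, body: bytes) -> bool:
    """Return True if the labels are consistent with the octets in the body."""
    has_high_octets = any(octet > 127 for octet in body)
    if not has_high_octets:
        return True  # pure 7 bit data may carry any of these labels
    # High octets need an 8bit (or binary) label and a non-ASCII character set.
    return cte.lower() in ("8bit", "binary") and charset.lower() != "us-ascii"

high = "Grüße".encode("iso-8859-1")

# (choice, charset value, encoding value, body as transmitted)
rows = [
    ("8bit", "iso-8859-1", "8bit", high),
    ("QP", "iso-8859-1", "quoted-printable", b"Gr=FC=DFe"),
    ("no tr.", "us-ascii", "7bit", high),
]

results = {choice: label_is_legal(cs, cte, body) for choice, cs, cte, body in rows}
print(results)
```

Only the "no translation" row fails, because its body still contains high octets while the header claims us-ascii and 7bit.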
------------------------------------------------------------------
More Information on Character Sets and Electronic Mail
======================================================
The following URL is very informative on the problems with 8 bit characters
and mail although it was specifically written for UNIX users.
<http://www.ioc.ee/home/tarvi/mime_pem/FAQ-ISO-8859-1.html> 

You can find out more about ISO 8859 character sets at :- 
<http://www.isoc.org:8080/codage/iso8859/jeuxiso.en.htm> in English
<http://www.isoc.org:8080/codage/iso8859/jeuxiso.fr.htm> in French

And also another site from Mats for European users:-
<http://www.uni-passau.de/~ramsch/iso8859-1.html>

The Online RFC web site contains RFC 1345 which defines various 
character sets. 
<http://info.internet.isi.edu/1s/in-notes/rfc/files>
The RFC's are also available for European users from:-
<http://ftp.sunet.se/pub/Internet-documents/rfc/>

Other character sets you may come across are:-
     VISCII 8 bit Latin and Vietnamese
     ISO-2022-JP Latin and Japanese
     ISO-2022-KR Latin and Korean
     UNICODE-1-1 Unicode
     UNICODE-1-1-UTF-7 Mail-safe Unicode
     ISO-2022-JP-2 Multilingual

As of PMMail 1.92 there is no support for these character sets.
---------------------------------------------------------------
Definitions 
===========

RFC 2045 defines 7bit Data as:

   " "7bit data" refers to data that is all represented as relatively
   short lines with 998 octets or less between CRLF line separation
   sequences [RFC-821].  No octets with decimal values greater than 127
   are allowed and neither are NULs (octets with decimal value 0).  CR
   (decimal value 13) and LF (decimal value 10) octets only occur as
   part of CRLF line separation sequences."

and 8bit Data as:

   " "8bit data" refers to data that is all represented as relatively
   short lines with 998 octets or less between CRLF line separation
   sequences [RFC-821]), but octets with decimal values greater than 127
   may be used.  As with "7bit data" CR and LF octets only occur as part
   of CRLF line separation sequences and no NULs are allowed."
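
The two definitions translate fairly directly into a test on a message body. The sketch below is one possible reading of them in Python; classify_body is a hypothetical helper, and anything meeting neither definition is reported as "binary", the term RFC 2045 uses for arbitrary octet sequences.

```python
def classify_body(data: bytes) -> str:
    """Classify a body as '7bit', '8bit' or 'binary' using the RFC 2045 wording above."""
    if b"\x00" in data:
        return "binary"  # NULs are not allowed in 7bit or 8bit data
    for line in data.split(b"\r\n"):
        # Lines are limited to 998 octets; CR and LF may only appear as CRLF.
        if len(line) > 998 or b"\r" in line or b"\n" in line:
            return "binary"
    if any(octet > 127 for octet in data):
        return "8bit"
    return "7bit"

print(classify_body(b"Hello\r\n"))
print(classify_body("Grüße\r\n".encode("iso-8859-1")))
print(classify_body(b"bare\nlinefeed"))
```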

--
Nigel J. Clarke