Ticket #1879 (new defect)

Opened 5 years ago

Last modified 20 months ago

mutt: Wrong ISO2022 -> locale charset conversion

Reported by: Marco d'Itri <md@…>
Owned by: mutt-dev
Priority: minor
Milestone:
Component: charset
Version:
Keywords:
Cc: 249626@…

Description (last modified by brendan)

Package: mutt
Version: 1.5.6-1
Severity: important

[NOTE: this bug report has been submitted to the debian BTS as Bug#249626.
Please Cc all your replies to 249626@bugs.debian.org .]

From: Ambrose Li <a.c.li@ieee.org>
Subject: mutt: Wrong ISO2022 -> locale charset conversion
Date: Tue, 18 May 2004 11:20:07 -0400

Suppose mutt is to display some kanji on a Big5 terminal (either in the
header or in the message body), say

   <k1><k2><k3><k4><k5>

where each <k_n> is a kanji.

Now further suppose that <k3> and <k4> do not exist in Big5.

Often, mutt will display unexpected output, for example

   <k1><k2>?<k6>?<k5>

where <k6> is a random kanji, apparently unrelated to <k3>, <k4>, or
any of their EUC-JP, Shift_JIS, or UTF-8 forms.

The expected behaviour is either to display ???? for <k3><k4> or to
display some kanji equivalent to <k3><k4>; mutt currently does neither.


-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.4.23-ow1
Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5

Versions of packages mutt depends on:
ii  libc6                       2.3.2.ds1-12 GNU C Library: Shared libraries an
ii  libgnutls10                 1.0.4-3      GNU TLS library - runtime library
ii  libidn11                    0.4.1-1      GNU libidn library, implementation
ii  libncursesw5                5.4-3        Shared libraries for terminal hand
ii  libsasl2                    2.1.18-4     Authentication abstraction library
ii  postfix [mail-transport-age 2.0.19-1     A high-performance mail transport 

-- no debconf information



Attachments

1st-example-kanjis-2022-JP (21 bytes) - added by Alain Bench <veronatif@…> 5 years ago.

Change History

Changed 5 years ago by Alain Bench <veronatif@…>

Hello Ambrose,

On Tuesday, May 18, 2004 at 11:20:07 AM -0400, Ambrose C. Li wrote:

> mutt is to display some kanji on a Big5 terminal (either in the header
> or in the message body), say <k1><k2><k3><k4><k5> where each <k_n> is
> a kanji. [...] <k3> and <k4> do not exist in Big5.

   Could you please provide us the original ISO-2022-?? sequence of
bytes for the five Kanjis? I lack a Big5 terminal, but can simulate with
"iconv -f iso-2022-?? -t big5".


> mutt will display unexpected output, for example <k1><k2>?<k6>?<k5>
> where <k6> is a random kanji

   You can look at thread "display of CP-1258" in mutt-dev archives
beginning at <20040202231445.GA1685@free.fr> to study if the "eaten
chars" effect applies to your case.


Bye!	Alain.
-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Changed 5 years ago by Ambrose Li <a.c.li@…>

Hi,

On Fri, May 21, 2004 at 11:46:27PM +0200, Alain Bench wrote:
> 
>     Could you please provide us the original ISO-2022-?? sequence of
> bytes for the five Kanjis? I lack a Big5 terminal, but can simulate with
> "iconv -f iso-2022-?? -t big5".

One example is this string in the body of an email in ISO-2022-JP:

  0000000 033   $   B   F   |   K   \   J   *   M   }   3   X   2   q  ;
  0000020   o 033   (   B

It comes out from mutt on a Big5 terminal as

  ¤é¥»ª«²z?·Y?»x
  (a4 e9 a5 bb aa ab b2 7a 3f b7 59 3f bb 78)

The correct rendering should be

  ¤é¥»ª«²z????»x
  (a4 e9 a5 bb aa ab b2 7a 3f 3f 3f 3f bb 78)

(I would prefer ¤é¥»ª«²z¾ÇÂø»x, but I know that is impossible.)
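
The expected rendering above can be sketched in Python (not mutt's C code), assuming Python's built-in `iso2022_jp` and `big5` codecs agree with iconv for these characters. The helper name `to_big5_with_marks` is hypothetical. It converts one whole character at a time and writes two question marks for each kanji that has no Big5 mapping, which gives ???? for two such kanji.

```python
def to_big5_with_marks(raw: bytes, source_charset: str = "iso2022_jp") -> bytes:
    """Convert to Big5 character by character (hypothetical helper)."""
    out = bytearray()
    for ch in raw.decode(source_charset):
        try:
            out += ch.encode("big5")
        except UnicodeEncodeError:
            # Assumption: every unconvertible character is a two-byte
            # kanji, so it is replaced by two question marks.
            out += b"??"
    return bytes(out)


# The message body from the od dump above, escape sequences included.
body = b"\x1b$BF|K\\J*M}3X2q;o\x1b(B"
print(to_big5_with_marks(body).hex(" "))
```

Because the loop walks decoded characters rather than raw bytes, a failed conversion can never leave the converter in the middle of a kanji.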

Another example, from the Subject header, is

  Subject: =?ISO-2022-JP?B?GyRCIVobKEJQT1NUTUFOIFByZXNz?=
   =?ISO-2022-JP?B?GyRCIVshIUAkMyYkR09DQmokTiFYJTclJyVVJTohISVGITwlViVrGyhC?=
   =?ISO-2022-JP?B?GyRCIVkkQyRGMj8hKRsoQg==?=

It comes out from mutt on a Big5 terminal as

 ¡iPOSTMAN Press¡j¡@¥@¬É?°¨½ÏÃǵgѥů????¦¹¡D¸³Ã¢ÉK?¡z?¨°°ó«ø?

  (0000000 a1 69 50 4f 53 54 4d 41 4e 20 50 72 65 73 73 a1
   0000020 6a a1 40 a5 40 ac c9 3f b0 a8 bd cf c3 c7 b5 67
   0000040 d1 a5 c5 af 3f 3f 3f 3f a6 b9 a1 44 b8 b3 c3 a2
   0000060 c9 4b 3f a1 7a 3f a8 b0 b0 f3 ab f8 3f)

which I would expect to be rendered

 ¡iPOSTMAN Press¡j¡@¥@¬É??¸ÜÃD??¡y????????¡@????????¡z????¦ó¡H

  (0000000 a1 69 50 4f 53 54 4d 41 4e 20 50 72 65 73 73 a1
   0000020 6a a1 40 a5 40 ac c9 3f 3f b8 dc c3 44 3f 3f a1
   0000040 79 3f 3f 3f 3f 3f 3f 3f 3f a1 40 3f 3f 3f 3f 3f
   0000060 3f 3f 3f a1 7a 3f 3f 3f 3f a6 f3 a1 48)
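
To see which characters of this Subject header cannot be shown on a Big5 terminal, the RFC 2047 encoded words can be decoded first. This is a rough sketch using Python's standard `email.header` module; the helper `big5_or_none` is a hypothetical name, and Python's `big5` codec is only assumed to match the iconv table in use on the reporter's system.

```python
from email.header import decode_header, make_header

# The Subject value from the report, with its folded continuation lines.
subject = (
    "=?ISO-2022-JP?B?GyRCIVobKEJQT1NUTUFOIFByZXNz?=\n"
    " =?ISO-2022-JP?B?GyRCIVshIUAkMyYkR09DQmokTiFYJTclJyVVJTohISVGITwlViVrGyhC?=\n"
    " =?ISO-2022-JP?B?GyRCIVkkQyRGMj8hKRsoQg==?="
)

# Decode the encoded words and join them into one Unicode string.
text = str(make_header(decode_header(subject)))


def big5_or_none(ch: str):
    """Return the Big5 bytes for one character, or None if unmapped."""
    try:
        return ch.encode("big5")
    except UnicodeEncodeError:
        return None


missing = [ch for ch in text if big5_or_none(ch) is None]
print(text)
print("no Big5 mapping:", " ".join(f"U+{ord(ch):04X}" for ch in missing))
```

Each character listed as having no Big5 mapping is one that should appear as ?? in the expected rendering.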


Best regards,
-- 
Ambrose LI Cheuk-Wing  <a.c.li@ieee.org>

http://ada.dhs.org/~acli/

Changed 5 years ago by Ambrose Li <a.c.li@…>

Hello Alain,

On Fri, May 21, 2004 at 11:46:27PM +0200, Alain Bench wrote:

>     You can look at thread "display of CP-1258" in mutt-dev
> archives beginning at <20040202231445.GA1685@free.fr> to study
> if the "eaten chars" effect applies to your case.

Just from reading the thread (without compiling and running the test
programs), this seems to be a different problem. In the
ISO-2022-JP -> Big5 case, characters following an untranslatable
character are not turned into ?'s. Rather, where ??'s should appear for
certain characters, random kanji replace some of the ?'s in impossible
positions (the random kanji straddle actual kanji boundaries).

Best regards,
-- 
Ambrose LI Cheuk-Wing  <a.c.li@ieee.org>

http://ada.dhs.org/~acli/

Changed 5 years ago by Alain Bench <veronatif@…>

1st-example-kanjis-2022-JP

Changed 5 years ago by Alain Bench <veronatif@…>

On Saturday, May 22, 2004 at 3:52:41 PM -0400, Ambrose C. Li wrote:

> one example is this string in the body of an email in ISO-2022-JP:
>    0000000 033   $   B   F   |   K   \   J   *   M   }   3   X   2   q   ;
>    0000020   o 033   (   B

   Unescaped, the 2022-JP string "F|K\J*M}3X2q;o" is seven kanji: U+65E5,
U+672C, U+7269, U+7406, U+5B66, U+4F1A, and U+8A8C.

   Two of them, #5 and #6, are not convertible to Big5: U+5B66 and
U+4F1A. Unescaped, they are "3X2q" in 2022-JP.

| $ echo -ne "\e\$B3X2q\e(B" |iconv -c -f iso-2022-jp -t big5 |hex
| [nothing]


> It comes out from mutt on a Big5 terminal as (a4 e9 a5 bb aa ab b2 7a
> 3f b7 59 3f bb 78)

   The two missing kanji are wrongly rendered as "? U+614D ?". The
stray U+614D (coded B7 59 in Big5, or "X2" in 2022) comes from converting
the 2nd byte of the 5th kanji together with the 1st byte of the 6th, as
coded in 2022.

Unicode	| 2022	| Big5	|
--------+-------+-------+
U+5B66	| 3X	| n/a	|
U+4F1A	| 2q	| n/a	|
--------+-------+-------+
U+614D	| X2	| B7 59	|
--------+-------+-------+

| $ echo -ne "\e\$BX2\e(B" |iconv -f iso-2022-jp -t big5 |hex
| B7 59

   The second question mark, after U+614D, is probably the result of
failing to iconv "q;" (U+9942, nonexistent in Big5).


> The correct rendering should be (a4 e9 a5 bb aa ab b2 7a 3f 3f 3f 3f
> bb 78)

   Yes, 4 question marks. And we saw that iconv -c behaves correctly
(it ignores the two unconvertible 2022 kanji, i.e. 4 bytes). Using Edmund's
test program from the CP-1258 thread:

| test("\e$BF|K\\J*M}3X2q;o\e(B", "ISO-2022-JP", "BIG5");
| Converting from ISO-2022-JP to BIG5
| iconv returned -1
| Read 11 bytes and wrote 8 bytes

   Seems good to me... So it's not the "eaten chars" syndrome, but
probably mutt advancing by 1 byte on a conversion failure when it should
advance by 1 character. Edmund?
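
The hypothesis above can be checked with a small simulation. This is a Python sketch under stated assumptions, not mutt's actual conversion code: the function `convert` and its `step_on_failure` parameter are hypothetical, and Python's `iso2022_jp` and `big5` codecs stand in for iconv. It walks the unescaped JIS bytes pair by pair and, when a pair has no Big5 mapping, skips either one byte (the suspected bug) or two bytes (one whole kanji).

```python
def convert(jis: bytes, step_on_failure: int) -> bytes:
    """Convert unescaped two-byte JIS X 0208 codes to Big5.

    On failure, emit one '?' per skipped byte and advance by
    step_on_failure bytes (hypothetical model of the suspected bug).
    """
    out = bytearray()
    i = 0
    while i + 1 < len(jis):
        pair = jis[i:i + 2]
        try:
            # Wrap the pair in escape sequences so the codec reads it as kanji.
            ch = (b"\x1b$B" + pair + b"\x1b(B").decode("iso2022_jp")
            out += ch.encode("big5")
            i += 2
        except UnicodeError:  # covers both decode and encode failures
            out += b"?" * step_on_failure
            i += step_on_failure
    return bytes(out)


jis = b"F|K\\J*M}3X2q;o"  # the seven kanji without escape sequences
print("advance 1 byte:     ", convert(jis, 1).hex(" "))
print("advance 1 character:", convert(jis, 2).hex(" "))
```

If the codecs behave like iconv here, skipping one byte makes the converter read "X2" as a kanji, which is where the stray B7 59 would come from.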


Bye!	Alain.
-- 
Give your computer's unused idle processor cycles to a scientific goal:
The Folding@home project at <URL:http://folding.stanford.edu/>.

Changed 20 months ago by brendan

  • component changed from mutt to charset
  • description modified