Ticket #2956 (new defect)

Opened 10 months ago

Last modified 10 months ago

Recipient address broken if containing Š character (UTF-8 code: 0xc5 0xA0)

Reported by: phr
Assigned to: mutt-dev
Priority: major
Milestone: 1.6
Component: charset
Version: 1.5.16
Keywords:
Cc:

Description

When trying to send email to a recipient whose name contains the Š character, mutt corrupts the recipient's full name by omitting Š and another character.

For example, the recipient's full name Šorman gets corrupted to oman.

There was already a bug describing mutt's strange behaviour when 0xA0 is present in a string, and indeed the UTF-8 representation of Š is 0xC5 0xA0.

Attachments

part0001.pgp (196 bytes) - added by Kyle Wheeler on 2007-09-14 08:29:41.
Added by email2trac
part0001.2.pgp (196 bytes) - added by Kyle Wheeler on 2007-09-17 07:37:51.
Added by email2trac
part0001.3.pgp (196 bytes) - added by Kyle Wheeler on 2007-09-18 07:25:25.
Added by email2trac
part0001.4.pgp (196 bytes) - added by Kyle Wheeler on 2007-09-18 07:28:41.
Added by email2trac

Change History

(in reply to: ↑ description ) 2007-09-14 05:00:55 changed by pdmef

  • owner changed from mutt-dev to pdmef.
  • status changed from new to assigned.
  • version changed from 1.4 to 1.5.16.
  • component changed from mutt to charset.
  • milestone set to 1.6.

Replying to phr:

> When trying to send email to a recipient whose name contains the Š character, mutt corrupts the recipient's full name by omitting Š and another character.
>
> For example, the recipient's full name Šorman gets corrupted to oman.
>
> There was already a bug describing mutt's strange behaviour when 0xA0 is present in a string, and indeed the UTF-8 representation of Š is 0xC5 0xA0.

What are your locale settings?

I can confirm that a recipient like this cannot be handled even with hg tip.

However, a quick test showed that this is not in general the case with multibyte characters.

Just a wild guess: 0xA0 is the non-breaking space (at least in latin1), so this may be due to improper multibyte handling, e.g. by simply using isspace() or something like that.

For example, this code

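#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void) {
  /* setlocale() takes a category such as LC_CTYPE as its first argument */
  if (!setlocale(LC_CTYPE, "C")) return 1;
  /* isspace() only guarantees nonzero for true, so normalize to 0/1 */
  printf("%d\n", isspace(' ') != 0);
  printf("%d\n", isspace(0xA0) != 0);
  if (!setlocale(LC_CTYPE, "de_DE.ISO8859-1")) return 1;
  printf("%d\n", isspace(' ') != 0);
  printf("%d\n", isspace(0xA0) != 0);
  return 0;
}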

correctly gives 1, 0, 1 and 1 on OS X.

2007-09-14 05:01:32 changed by pdmef

  • owner changed from pdmef to mutt-dev.
  • status changed from assigned to new.

2007-09-14 06:17:38 changed by phr

My locale is en_US.UTF-8 on FreeBSD-stable

Indeed isspace(0xA0) returns 1 with UTF-8 locales.

So it's most probably the true reason for this problem.

(follow-up: ↓ 8 ) 2007-09-14 08:05:36 changed by Vincent Lefevre

On 2007-09-14 13:17:39 -0000, Mutt wrote:
>  My locale is en_US.UTF-8 on FreeBSD-stable
> 
>  Indeed isspace(0xA0) returns 1 with UTF-8 locales.

AFAIK, isspace() doesn't have any meaning in multibyte locales.

Mutt uses isspace() and isprint() at various places. I don't think
this is correct. Either Mutt needs to know what is a space or what
is printable on ASCII strings, in which case it should use its own
functions, or it needs to know such information on wide characters
(as this is what should be used on non-ASCII strings), in which
case it should use iswspace() and so on.
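
To illustrate the wide-character approach suggested above, here is a minimal sketch, not mutt code. The helper name rstrip_mb() is made up for this example, and it assumes the caller has already selected the desired LC_CTYPE locale with setlocale(). It decodes whole characters with mbrtowc() and classifies them with iswspace(), so a byte such as 0xA0 inside a multibyte sequence is never tested on its own.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not a mutt function): removes trailing whitespace
 * from s in place, classifying whole characters of the current LC_CTYPE
 * locale instead of individual bytes. */
void rstrip_mb(char *s)
{
  mbstate_t st;
  size_t len, i = 0, keep = 0;

  assert(s != NULL);
  len = strlen(s);
  memset(&st, 0, sizeof st);

  while (i < len) {
    wchar_t wc;
    size_t n = mbrtowc(&wc, s + i, len - i, &st);

    if (n == (size_t)-1 || n == (size_t)-2) {
      /* invalid or truncated sequence: keep the byte and restart decoding */
      memset(&st, 0, sizeof st);
      i++;
      keep = i;
      continue;
    }
    if (n == 0)  /* embedded NUL; cannot happen before len, stay defensive */
      break;
    i += n;
    if (!iswspace((wint_t)wc))
      keep = i;  /* end of the last non-space character seen so far */
  }
  s[keep] = '\0';
}
```

In the default "C" locale this behaves like a plain ASCII trim. With a UTF-8 locale active, Š (0xC5 0xA0) would be decoded as a single character before being classified.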

2007-09-14 08:10:14 changed by Vincent Lefevre

And Mutt has problems if the subject contains such a character too.
See the generated Subject header:

Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
        =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
        =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)

2007-09-14 08:29:41 changed by Kyle Wheeler

  • attachment part0001.pgp added.

Added by email2trac

2007-09-14 08:29:41 changed by Kyle Wheeler

On Friday, September 14 at 03:10 PM, quoth Mutt:
> And Mutt has problems if the subject contains such a character too.
> See the generated Subject header:
>
> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)

That's broken in more ways than just the space. It should NEVER send 
strings encoded as "UTF-8//TRANSLIT". I think your $send_charset 
setting might be invalid (it should not contain //TRANSLIT).

~Kyle

(follow-up: ↓ 9 ) 2007-09-14 08:49:27 changed by phr

A quick fix for FreeBSD seems to be to replace all occurrences of isspace like this:

#include <wctype.h>
#undef isspace
#define isspace(c) iswspace(btowc(c))

See: http://mail.python.org/pipermail/python-checkins/2004-August/042343.html

Contrary to what the comment there says, it's still needed for FreeBSD 6 and 7.

(in reply to: ↑ 4 ) 2007-09-14 08:57:20 changed by pdmef

Replying to Vincent Lefevre:

> Mutt uses isspace() and isprint() at various places. I don't think this is correct. Either Mutt needs to know what is a space or what is printable on ASCII strings, in which case it should use its own functions, or it needs to know such information on wide characters (as this is what should be used on non-ASCII strings), in which case it should use iswspace() and so on.

Wanting to make mutt know about it seems wrong to me, as e.g. some random single byte locale could use 0xA0 as non-breaking-space and some random other could choose not to. I think this should be done by the C library and not by mutt.

Deciding at runtime, at the location of the caller, whether the input is single-byte or multibyte also seems wrong to me, as that likely means writing duplicate code.

I think the only practical solution is to fix the places where single byte locale functions are used and input may be multibyte, e.g. isspace().

Going with wchar_t instead of char would IMHO increase the memory requirements quite a lot so I don't know whether a consistent use is the way to go.

(in reply to: ↑ 7 ; follow-up: ↓ 10 ) 2007-09-14 09:02:14 changed by pdmef

Replying to phr:

> A quick fix for FreeBSD seems to be to replace all occurrences of isspace like this:
>
> #define isspace(c) iswspace(btowc(c))

I don't think this is right. The problem is that 0xC5 and 0xA0 are two bytes and functions like isspace() likely get called for 0xC5 and 0xA0 while looping byte-wise over some input (to e.g. find word boundaries).
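
The byte-wise problem can be made concrete with a little arithmetic. The sketch below (helper names are hypothetical, not taken from mutt) shows that 0xA0 has the bit pattern of a UTF-8 continuation byte and that 0xC5 0xA0 decodes to U+0160 (Š), while the non-breaking space U+00A0 itself is encoded as 0xC2 0xA0. A loop that inspects one byte at a time sees the same 0xA0 in both cases.

```c
#include <assert.h>

/* Nonzero if b has the form 10xxxxxx, i.e. it continues a UTF-8 sequence. */
int utf8_is_continuation(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes do not form such a sequence. */
long utf8_decode2(unsigned char lead, unsigned char cont)
{
  long cp;

  if ((lead & 0xE0) != 0xC0 || !utf8_is_continuation(cont))
    return -1;
  cp = ((long)(lead & 0x1F) << 6) | (cont & 0x3F);
  return cp >= 0x80 ? cp : -1;  /* reject overlong encodings */
}
```

For instance, utf8_decode2(0xC5, 0xA0) yields 0x160 and utf8_decode2(0xC2, 0xA0) yields 0xA0, so only the second sequence is an actual NBSP.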

(in reply to: ↑ 9 ) 2007-09-14 10:05:55 changed by phr

Replying to pdmef:

> I don't think this is right. The problem is that 0xC5 and 0xA0 are two bytes and functions like isspace() likely get called for 0xC5 and 0xA0 while looping byte-wise over some input (to e.g. find word boundaries).

Sure, it's not the right solution. That's why I called it a "quick fix"; it was meant more as a check of whether isspace() is the real problem.

2007-09-16 04:31:05 changed by Vincent Lefevre

On 2007-09-14 10:29:34 -0500, Kyle Wheeler wrote:
> On Friday, September 14 at 03:10 PM, quoth Mutt:
>> And Mutt has problems if the subject contains such a character too.
>> See the generated Subject header:
>>
>> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
>>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
>>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)
>
> That's broken in more ways than just the space. It should NEVER send 
> strings encoded as "UTF-8//TRANSLIT". I think your $send_charset setting 
> might be invalid (it should not contain //TRANSLIT).

I do not set $send_charset. My $charset contains //TRANSLIT, but this
one is correct.

2007-09-16 04:36:12 changed by Vincent Lefevre

On 2007-09-14 15:57:21 -0000, Mutt wrote:
> #2956: Recipient address broken if containing Š character (UTF-8 code: 0xc5 0xA0)
> 
> Comment (by pdmef):
> 
>  Replying to [comment:4 Vincent Lefevre]:
> 
>  > Mutt uses isspace() and isprint() at various places. I don't think
>  > this is correct. Either Mutt needs to know what is a space or what
>  > is printable on ASCII strings, in which case it should use its own
>  > functions, or it needs to know such information on wide characters
>  > (as this is what should be used on non-ASCII strings), in which
>  > case it should use iswspace() and so on.
> 
>  Wanting to make mutt know about it seems wrong to me, as e.g. some random
>  single byte locale could use 0xA0 as non-breaking-space and some random
>  other could choose not to. I think this should be done by the C library
>  and not by mutt.

That's precisely why I suggested iswspace(), but...

>  Going with wchar_t instead of char would IMHO increase the memory
>  requirements quite a lot so I don't know whether a consistent use
>  is the way to go.

Yes, that's another problem.

2007-09-16 10:00:30 changed by David Champion

> >> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
> >>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
> >>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)
> 
> I do not set $send_charset. My $charset contains //TRANSLIT, but this
> one is correct.

I thought that //TRANSLIT was a libiconv extension, not defined by
spec.  It's definitely not supported by all iconv implementations, so
those wouldn't be able to parse this encoding string (regardless of
correctness).

2007-09-17 01:38:52 changed by Vincent Lefevre

On 2007-09-16 11:59:33 -0500, David Champion wrote:
> > >> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
> > >>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
> > >>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)
> > 
> > I do not set $send_charset. My $charset contains //TRANSLIT, but this
> > one is correct.
> 
> I thought that //TRANSLIT was a libiconv extension, not defined by
> spec.  It's definitely not supported by all iconv implementations, so
> those wouldn't be able to parse this encoding string (regardless of
> correctness).

The //TRANSLIT is supported by the libiconv implementations I use here.
So, that's fine. But this doesn't explain why Mutt uses //TRANSLIT to
generate this particular header. It is related to the isspace bug, as
this problem doesn't occur with the Euro symbol for instance.

If I try to send a message with the subject "test Š" under UTF-8
locales, I get:

Subject: test =?UTF-8//TRANSLIT?Q?=C5?=

Note that the =A0 is missing. But in any case, Mutt shouldn't generate
invalid UTF-8 sequences (like here).
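
One way to catch such a header before it is encoded would be to validate the byte string first. The following is a minimal sketch of a UTF-8 validity check. utf8_is_valid() is a hypothetical name, not an existing mutt function, and nothing in this ticket says mutt performs such a check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s[0..len) is well-formed UTF-8, else 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
  size_t i = 0;

  assert(s != NULL || len == 0);
  while (i < len) {
    unsigned char b = s[i];
    size_t n;          /* number of continuation bytes expected */
    unsigned long cp;  /* code point being assembled */

    if (b < 0x80)                { n = 0; cp = b; }
    else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
    else return 0;     /* stray continuation byte or invalid lead byte */

    if (n > len - i - 1)
      return 0;        /* sequence is cut off, e.g. a lone 0xC5 */
    for (size_t k = 1; k <= n; k++) {
      if ((s[i + k] & 0xC0) != 0x80)
        return 0;
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    /* reject overlong forms, surrogates and values above U+10FFFF */
    if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
        (n == 3 && cp < 0x10000))
      return 0;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
      return 0;
    i += n + 1;
  }
  return 1;
}
```

With this check, the bytes of "test Š" (ending in 0xC5 0xA0) pass, while the truncated form ending in a lone 0xC5 is rejected.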

2007-09-17 07:37:50 changed by Kyle Wheeler

On Monday, September 17 at 10:38 AM, quoth Vincent Lefevre:
> On 2007-09-16 11:59:33 -0500, David Champion wrote:
>>>>> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
>>>>>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
>>>>>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)
>>> 
>>> I do not set $send_charset. My $charset contains //TRANSLIT, but this 
>>> one is correct.
>> 
>> I thought that //TRANSLIT was a libiconv extension, not defined by 
>> spec.  It's definitely not supported by all iconv implementations, so 
>> those wouldn't be able to parse this encoding string (regardless of 
>> correctness).
>
> The //TRANSLIT is supported by the libiconv implementations I use here. 

The fact that it's supported by the software that you use doesn't make 
it a valid encoding to send over the internet. We have standards for a 
reason! None of the encoding names that are valid for use in email 
include //TRANSLIT. That encoding name is invalid. Just because iconv 
accepts it doesn't make it valid. If mutt is generating this, then 
something is wrong... perhaps mutt should recognize and strip out 
//TRANSLIT strings?

> So, that's fine. But this doesn't explain why Mutt uses //TRANSLIT to 
> generate this particular header. It is related to the isspace bug, as 
> this problem doesn't occur with the Euro symbol for instance.

Wait, so, your mutt generates an encoding that doesn't contain 
//TRANSLIT if it's encoding a Euro symbol? That really is quite 
strange.

> But in any case, Mutt shouldn't generate invalid UTF-8 sequences 
> (like here).

Agreed.

~Kyle

2007-09-17 07:37:51 changed by Kyle Wheeler

  • attachment part0001.2.pgp added.

Added by email2trac

2007-09-17 08:10:35 changed by Rocco Rutte

[ could somebody please fix trac to not mangle folded subjects? :) ]

Hi,

* Kyle Wheeler [07-09-17 09:39:02 -0500] wrote:
> On Monday, September 17 at 10:38 AM, quoth Vincent Lefevre:
>> On 2007-09-16 11:59:33 -0500, David Champion wrote:
>>>>>> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
>>>>>>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
>>>>>>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)

>>>> I do not set $send_charset. My $charset contains //TRANSLIT, but this one 
>>>> is correct.

>>> I thought that //TRANSLIT was a libiconv extension, not defined by spec.  
>>> It's definitely not supported by all iconv implementations, so those 
>>> wouldn't be able to parse this encoding string (regardless of 
>>> correctness).

>> The //TRANSLIT is supported by the libiconv implementations I use here. 

> The fact that it's supported by the software that you use doesn't make it a 
> valid encoding to send over the internet. We have standards for a reason!

I'm absolutely sure he knows this and supports your position strongly. 
That's not the point... :)

>If mutt is generating this, then something is 
> wrong... perhaps mutt should recognize and strip out //TRANSLIT strings?

Yes, but that's another issue (it already has a table of valid character 
set names, it just needs to match them for $send_charset).

> Wait, so, your mutt generates an encoding that doesn't contain //TRANSLIT if 
> it's encoding a Euro symbol? That really is quite strange.

Yes and no. First, this one really is ISSPACE() related I think. After
editing a message when composing it, mutt removes trailing spaces in
mutt_read_rfc822_line().

As a consequence, 1) you can't write messages with a subject having a
trailing space (nor any other header, btw) (really, not even in us-ascii
unless you change it in the compose menu) and 2) it's the reason why 0xc5
0xa0 can't be encoded properly: the 0xa0 likely is removed so that mutt
only needs to encode 0xc5.

So technically it invalidates the UTF-8 itself and afterwards more or 
less encodes the broken subject, I think.

   bye, Rocco
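
The effect described above can be reproduced in isolation. This is a minimal sketch, not mutt's actual code: rstrip_bytes() is a hypothetical byte-wise trim, and isspace_with_nbsp() is a stand-in for an isspace() that reports the byte 0xA0 as whitespace, as was reported in this ticket for UTF-8 locales on FreeBSD.

```c
#include <assert.h>
#include <string.h>

/* Stand-in (assumption) for a locale isspace() that treats 0xA0 as space. */
int isspace_with_nbsp(int c)
{
  return c == ' ' || (c >= '\t' && c <= '\r') || c == 0xA0;
}

/* ASCII-only classification: space plus the control characters 9..13. */
int isspace_ascii(int c)
{
  return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical byte-wise trim: removes trailing bytes for which is_space()
 * returns nonzero. It knows nothing about multibyte characters. */
void rstrip_bytes(char *s, int (*is_space)(int))
{
  size_t len;

  assert(s != NULL && is_space != NULL);
  len = strlen(s);
  while (len > 0 && is_space((unsigned char)s[len - 1]))
    len--;
  s[len] = '\0';
}
```

Trimming "test Š" (bytes ending in 0xC5 0xA0) with isspace_with_nbsp() leaves a dangling 0xC5, which is no longer valid UTF-8. With isspace_ascii() the string is left untouched.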

2007-09-17 10:23:35 changed by phr

> I do not set $send_charset. My $charset contains //TRANSLIT, but this one is correct.

You can simply omit the //TRANSLIT. Any charset is fully convertible into UTF-8, so //TRANSLIT is useless here.

> So technically it invalidates the UTF-8 itself and afterwards more or less encodes the broken subject, I think.

Yes, that's exactly what's happening. When I "fix" isspace not to recognize 0xA0, the subject Test Š is encoded and sent correctly as Test =?utf-8?B?xaA=?=
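
The encoded word above can be checked by hand: the two bytes 0xC5 0xA0 are the bit string 110001 011010 0000, which Base64 maps to "xaA" plus one "=" of padding. The sketch below builds such an RFC 2047 "B" encoded word. rfc2047_b_word() is a hypothetical helper, not mutt's rfc2047.c, and it ignores the 75-character limit on encoded words.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char B64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: writes "=?charset?B?<base64>?=" into out.
 * Returns 0 on success, -1 if out is too small. */
int rfc2047_b_word(const char *charset, const unsigned char *in, size_t len,
                   char *out, size_t outsz)
{
  /* "=?" + charset + "?B?" + base64 + "?=" + NUL */
  size_t need = 2 + strlen(charset) + 3 + 4 * ((len + 2) / 3) + 2 + 1;
  size_t i;
  char *p = out;

  assert(charset != NULL && out != NULL);
  if (outsz < need)
    return -1;
  p += sprintf(p, "=?%s?B?", charset);
  for (i = 0; i < len; i += 3) {
    /* pack up to three input bytes into 24 bits */
    unsigned long v = (unsigned long)in[i] << 16;
    if (i + 1 < len) v |= (unsigned long)in[i + 1] << 8;
    if (i + 2 < len) v |= in[i + 2];
    *p++ = B64[(v >> 18) & 63];
    *p++ = B64[(v >> 12) & 63];
    *p++ = (i + 1 < len) ? B64[(v >> 6) & 63] : '=';
    *p++ = (i + 2 < len) ? B64[v & 63] : '=';
  }
  strcpy(p, "?=");
  return 0;
}
```

Called with the charset "utf-8" and the bytes 0xC5 0xA0, it produces =?utf-8?B?xaA=?=, matching the header quoted above.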

2007-09-17 17:11:47 changed by Vincent Lefevre

On 2007-09-17 17:09:55 +0200, Rocco Rutte wrote:
> * Kyle Wheeler [07-09-17 09:39:02 -0500] wrote:
>> If mutt is generating this, then something is wrong... perhaps mutt
>> should recognize and strip out //TRANSLIT strings?
>
> Yes, but that's another issue (it already has a table of valid character 
> set names, it just needs to match them for $send_charset).

Mutt doesn't need to recognize //TRANSLIT strings. The fact that I have
a //TRANSLIT string in my $charset should not have any effect on what
Mutt sends. $charset is only a terminal-related variable:

  Character set your terminal uses to display and enter textual data.

[...]
> So technically it invalidates the UTF-8 itself and afterwards more
> or less encodes the broken subject, I think.

Yes, but this doesn't explain why Mutt generates something with
//TRANSLIT in the headers.

2007-09-17 17:24:22 changed by Vincent Lefevre

On 2007-09-17 17:23:35 -0000, Mutt wrote:
> Comment (by phr):
>  > I do not set $send_charset. My $charset contains //TRANSLIT, but
>  > this one is correct.
> 
>  You can simply omit the //TRANSLIT. Any charset is fully
>  convertible into UTF-8, so //TRANSLIT is useless here.

Of course, but //TRANSLIT is generated automatically by my config
scripts: I do not always use UTF-8 locales, and in other character
sets, //TRANSLIT is necessary. I don't want to bloat my config
scripts just because of a bug in Mutt (and the isspace bug would
still be there anyway, so the workaround wouldn't be complete).

>  Yes, that's exactly what's happening. When I "fix" isspace not to
>  recognize 0xA0, the subject Test Š is encoded and sent correctly as
>  Test =?utf-8?B?xaA=?=

But your "fix" still cannot recognize U+00A0 as a space in UTF-8
locales.

2007-09-18 00:13:34 changed by phr

> Mutt doesn't need to recognize //TRANSLIT strings. The fact that I have a //TRANSLIT string in my $charset should not have any effect on what Mutt sends. $charset is only a terminal-related variable.

Mutt tries to convert your terminal input in $charset into each of the charsets specified in $send_charset. If it fails with the first one, it tries the second, etc., but if all of them fail, mutt uses your terminal's $charset.

Normally this works fine, since the last item in $send_charset is utf-8 and everything should be convertible into UTF-8. But due to the isspace() bug, the corrupted string is invalid even in UTF-8, so mutt thinks none of the charsets in $send_charset is suitable and uses your $charset with //TRANSLIT.

Thus we really need to fix the isspace() problem, and IMHO not only in 1.6: some simple fix is also needed for 1.4 and 1.5. In mbyte.c we already have:

int iswspace (wint_t wc)
{
  if (Charset_is_utf8 || charset_is_ja)
    return (9 <= wc && wc <= 13) || wc == 32;
  else
    return (0 <= wc && wc < 256) ? isspace (wc) : 0;
}

So I think the easiest solution for 1.4 and 1.5 would be to write a local isspace() function the same way, and for 1.6 to consider the proper solution. Personally I don't care if 0xA0 isn't recognized as a space; probably no one uses NBSP to delimit several email addresses within a recipient list.

However, not being able to type UTF-8 characters which contain 0xA0 is a major problem.

(follow-up: ↓ 24 ) 2007-09-18 01:17:42 changed by Vincent Lefevre

On 2007-09-18 07:13:35 -0000, Mutt wrote:
>  Mutt tries to convert your terminal input in $charset into each of
>  charsets specified in $send_charset. If it fails with the first one, it
>  tries the second etc, but if all of them fail, mutt uses your terminal's
>  $charset.

This last feature is a bug: this isn't documented and this isn't what
the user wants in general (if the user wants to use the terminal's
$charset, he can include it explicitly in $send_charset). There are
better solutions such as using a replacement character or returning
an error so that the user can fix the header.

>  Normally this works fine, since the last item in $send_charset is utf-8
>  and everything should be convertible into UTF-8.

Yes, except that even if the isspace() bug is fixed, the user may
configure $send_charset without utf-8 in it, or the user may still
generate invalid sequences for some reason.

>  Personally I don't care if 0xA0 wouldn't be recognized as space - probably
>  no one uses NBSP to delimit several email addresses within recipient list.

I don't think this should be regarded as correct anyway. But Mutt
should be consistent, and treat NBSP in the same way, whatever the
charmap is. Otherwise this may confuse those who use several locales.
I think that

  return (9 <= wc && wc <= 13) || wc == 32;

would be sufficient in general, except if Mutt can also use a charmap
not based on ASCII (but I suppose that it would always be EBCDIC in
this case).

2007-09-18 01:58:44 changed by Rocco Rutte

Hi,

* Mutt [07-09-18 07:13:35 -0000] wrote:

> > Mutt doesn't need to recognize //TRANSLIT strings. The fact that I have
> > a //TRANSLIT string in my $charset should not have any effect on what
> > Mutt sends. $charset is only a terminal-related variable.

> Mutt tries to convert your terminal input in $charset into each of
> charsets specified in $send_charset. If it fails with the first one, it
> tries the second etc, but if all of them fail, mutt uses your terminal's
> $charset.

Some code to illustrate from rfc2047.c:

   /* Choose target charset. */
   tocode = fromcode;
   if (icode)
   {
     if ((tocode1 = mutt_choose_charset (icode, charsets, u, ulen, 0, 0)))
       tocode = tocode1;
     else
       ret = 2, icode = 0;
   }

In case no item from $send_charset matches, mutt_choose_charset() fails 
and thus tocode remains $charset.

Maybe the docs should clearly state this.
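
The fallback behaviour discussed here can be summarized in a few lines. This is a simplified sketch and not the code from rfc2047.c: pick_send_charset(), the can_convert_fn callback and the two demo predicates are hypothetical names introduced only for illustration. A real implementation would decide convertibility with iconv().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: nonzero if text can be converted to charset. */
typedef int (*can_convert_fn)(const char *charset, const char *text);

/* Demo predicates (assumptions for illustration only). */
int accept_all(const char *charset, const char *text)
{
  (void)charset; (void)text;
  return 1;
}

int accept_none(const char *charset, const char *text)
{
  (void)charset; (void)text;
  return 0;  /* models input that is invalid in every candidate charset */
}

/* Returns the first entry of send_charsets that text converts to,
 * or local_charset (the $charset fallback) if none of them fits. */
const char *pick_send_charset(const char *const *send_charsets, size_t n,
                              const char *local_charset, const char *text,
                              can_convert_fn can_convert)
{
  size_t i;

  assert(local_charset != NULL && text != NULL && can_convert != NULL);
  for (i = 0; i < n; i++)
    if (can_convert(send_charsets[i], text))
      return send_charsets[i];
  return local_charset;
}
```

With accept_none(), which models the corrupted string in this ticket, the function returns the local charset unchanged, including any //TRANSLIT suffix it carries.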

> Thus we really need to fix the isspace() problem - and IMHO not only in
> 1.6 but some simple fix is also needed for 1.4 and 1.5. In mbyte.c we
> already have:

> int iswspace (wint_t wc)
> {
>   if (Charset_is_utf8 || charset_is_ja)
>     return (9 <= wc && wc <= 13) || wc == 32;
>   else
>     return (0 <= wc && wc < 256) ? isspace (wc) : 0;

These are only used if your system lacks wide character functions or 
you told configure to ignore them.

> So I think the easiest solution for 1.4 and 1.5 would be to write local
> isspace() function the same way - and for 1.6 consider the proper
> solution.

Do you mean something like:

   #define isspace(c)    ((c) == ' ' || (c) == '\t' || ...)

I see the need for fixing this quickly, but you really have to make sure
nothing else breaks. So I'd prefer some analysis of exactly which places
are using isspace() and shouldn't.

> Personally I don't care if 0xA0 wouldn't be recognized as space - probably
> no one uses NBSP to delimit several email addresses within recipient list.

I didn't look it up, but I even think it's wrong to accept NBSP in such
places. For mutt in general the only interesting places needing proper
NBSP recognition are those which may break lines (e.g. the f=f handler,
header folding in the pager, etc.), IMHO.

   bye, Rocco

2007-09-18 02:03:06 changed by Rocco Rutte

Hi,

* Vincent Lefevre [07-09-18 02:11:02 +0200] wrote:
>On 2007-09-17 17:09:55 +0200, Rocco Rutte wrote:
>> * Kyle Wheeler [07-09-17 09:39:02 -0500] wrote:

>>> If mutt is generating this, then something is wrong... perhaps mutt
>>> should recognize and strip out //TRANSLIT strings?

>> Yes, but that's another issue (it already has a table of valid character 
>> set names, it just needs to match them for $send_charset).

>Mutt doesn't need to recognize //TRANSLIT strings. The fact that I have
>a //TRANSLIT string in my $charset should not have any effect on what
>Mutt sends. $charset is only a terminal-related variable:

>  Character set your terminal uses to display and enter textual data.

See the other mail, $charset is a fallback for $send_charset. So yes, I 
think mutt should make sure that at least the charset chosen for sending 
something is valid. This way it's still perfectly legal to have a 
$charset value only valid with one's local iconv() implementation.
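
As a rough sketch of the "strip it before sending" idea raised earlier in this thread, the helper below removes an iconv-style "//" suffix from a charset name. charset_without_suffix() is a hypothetical name, not part of mutt, and the sketch does not check whether the remaining name is an officially assigned charset.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies name into out without any "//..." suffix
 * such as "//TRANSLIT". Returns 0 on success, -1 if out is too small. */
int charset_without_suffix(const char *name, char *out, size_t outsz)
{
  const char *slash;
  size_t n;

  assert(name != NULL && out != NULL);
  slash = strstr(name, "//");
  n = slash ? (size_t)(slash - name) : strlen(name);
  if (n + 1 > outsz)
    return -1;
  memcpy(out, name, n);
  out[n] = '\0';
  return 0;
}
```

Applied to "UTF-8//TRANSLIT" this yields "UTF-8", so the local $charset could keep its suffix for the local iconv() while the header would carry only the bare name.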

On the other hand, the question is what to do when all charsets in 
$send_charset fail and $charset does not match an officially assigned 
charset name, like in your case.

The options are 1) go ahead with a charset possibly not supported on the 
receiving side or 2) go ahead with a possible broken encoding in a valid 
charset.

I'm somewhat tending to prefer 2).

   bye, Rocco

(in reply to: ↑ 21 ) 2007-09-18 03:03:27 changed by pdmef

Replying to Vincent Lefevre:

> On 2007-09-18 07:13:35 -0000, Mutt wrote:
>
>> Mutt tries to convert your terminal input in $charset into each of the charsets specified in $send_charset. If it fails with the first one, it tries the second etc, but if all of them fail, mutt uses your terminal's $charset.
>
> This last feature is a bug: this isn't documented and this isn't what the user wants in general (if the user wants to use the terminal's $charset, he can include it explicitly in $send_charset). There are better solutions such as using a replacement character or returning an error so that the user can fix the header.

I'd say lacking documentation is the bug (which I've fixed for now). For me it makes sense to use $charset as fallback because the text already is/was encoded in it, hence mutt can assume it can be converted exactly into it.

However, mutt has to make sure that text fed into it can be properly represented in your $charset, where IMHO replacement chars and other strategies may come into play. It doesn't sound right to me to keep invalid data in a session and use some techniques to "validate" it only for outgoing data; it should be "valid" in your local session too, so that there's no need to "validate" it for outgoing data.

2007-09-18 06:09:35 changed by Vincent Lefevre

On 2007-09-18 11:02:59 +0200, Rocco Rutte wrote:
> On the other hand, the question is what to do when all charsets in 
> $send_charset fail and $charset does not match an officially assigned 
> charset name, like in your case.
>
> The options are 1) go ahead with a charset possibly not supported on
> the receiving side or 2) go ahead with a possible broken encoding in
> a valid charset.
>
> I'm somewhat tending to prefer 2).

Me too. At least one can hope that some characters only won't be
correctly displayed (instead of the whole header field). If the
broken encoding can be detected by Mutt, then a replacement
character should be used.

2007-09-18 06:17:34 changed by Vincent Lefevre

On 2007-09-18 10:03:27 -0000, Mutt wrote:
>  I'd say lacking documentation is the bug (which I've fixed for now). For
>  me it makes sense to use $charset as fallback because the text already
>  is/was encoded in it, hence mutt can assume it can be converted exactly
>  into it.

It doesn't necessarily make sense as the $charset may be completely
local to the machine (e.g. 'x-my-charset'). I think that trying to
convert the local charset to the last item of $send_charset, which
should be the most general charset (e.g. utf-8), makes more sense.
I think it is important to let the user control the fallback.

2007-09-18 07:25:24 changed by Kyle Wheeler

On Tuesday, September 18 at 09:03 AM, quoth Mutt:
> On the other hand, the question is what to do when all charsets in 
> $send_charset fail and $charset does not match an officially 
> assigned charset name, like in your case.
>
> The options are 1) go ahead with a charset possibly not supported on the 
> receiving side or 2) go ahead with a possible broken encoding in a valid 
> charset.

There's also option 3) refuse to send the mail, and display an error 
message. I prefer this option. I don't like the idea of mutt being a 
potential source of bad email.

~Kyle

2007-09-18 07:25:25 changed by Kyle Wheeler

  • attachment part0001.3.pgp added.

Added by email2trac

2007-09-18 07:28:40 changed by Kyle Wheeler

On Sunday, September 16 at 01:30 PM, quoth Vincent Lefevre:
>On 2007-09-14 10:29:34 -0500, Kyle Wheeler wrote:
>> On Friday, September 14 at 03:10 PM, quoth Mutt:
>>> And Mutt has problems if the subject contains such a character too.
>>> See the generated Subject header:
>>>
>>> Subject: Re: [Mutt] #2956: =?UTF-8//TRANSLIT?Q?Reci?=
>>>         =?UTF-8//TRANSLIT?Q?pient_address_broken_if_containing_?=
>>>         =?UTF-8//TRANSLIT?Q?=C5?= character (UTF-8 code: 0xc5 0xA0)
>>
>> That's broken in more ways than just the space. It should NEVER send 
>> strings encoded as "UTF-8//TRANSLIT". I think your $send_charset setting 
>> might be invalid (it should not contain //TRANSLIT).
>
>I do not set $send_charset. My $charset contains //TRANSLIT, but this
>one is correct.

It just occurred to me... given that your $charset is UTF-8, what's 
the purpose of //TRANSLIT? All possible characters are displayable in 
UTF-8, so //TRANSLIT shouldn't actually be providing any benefit (that 
I'm aware of). Is there something I'm missing?

~Kyle

2007-09-18 07:28:41 changed by Kyle Wheeler

  • attachment part0001.4.pgp added.

Added by email2trac

2007-09-18 08:23:40 changed by Vincent Lefevre

On 2007-09-18 09:30:14 -0500, Kyle Wheeler wrote:
> It just occurred to me... given that your $charset is UTF-8, what's the 
> purpose of //TRANSLIT? All possible characters are displayable in UTF-8, so 
> //TRANSLIT shouldn't actually be providing any benefit (that I'm aware of). 
> Is there something I'm missing?

This is because I use:

set charset=`codeset 2> /dev/null || locale charmap`//TRANSLIT

The charmap is not always UTF-8, that's why I need //TRANSLIT.

(And FYI, codeset is a small program I wrote for a buggy OS that
doesn't understand "locale charmap".)

2007-09-18 23:26:14 changed by Rocco Rutte

Hi,

* Kyle Wheeler [07-09-18 09:27:06 -0500] wrote:
> On Tuesday, September 18 at 09:03 AM, quoth Mutt:

>> On the other hand, the question is what to do when all charsets in 
>> $send_charset fail and $charset does not match an officially assigned 
>> charset name, like in your case.

>> The options are 1) go ahead with a charset possibly not supported on the 
>> receiving side or 2) go ahead with a possible broken encoding in a valid 
>> charset.

> There's also option 3) refuse to send the mail, and display an error 
> message. I prefer this option. I don't like the idea of mutt being a 
> potential source of bad email.

With messages created from scratch locally that could be an option (or 
at least issue a warning). But given you received a mail with a broken 
encoding already, 3) would mean you couldn't reply to it unless you 
manually fix the encoding.

I didn't look at the source yet to see if a prompt would be possible to 
choose between 2) and 3) though (for mailx mode, you wouldn't want to 
interrupt with a prompt).

   bye, Rocco

2007-09-18 23:44:46 changed by Rocco Rutte

Hi,

* Vincent Lefevre [07-09-18 15:17:30 +0200] wrote:
>On 2007-09-18 10:03:27 -0000, Mutt wrote:

>>  I'd say lacking documentation is the bug (which I've fixed for now). For
>>  me it makes sense to use $charset as fallback because the text already
>>  is/was encoded in it, hence mutt can assume it can be converted exactly
>>  into it.

>It doesn't necessarily make sense as the $charset may be completely
>local to the machine (e.g. 'x-my-charset'). I think that trying to
>convert the local charset to the last item of $send_charset, which
>should be the most general charset (e.g. utf-8), makes more sense.

In theory I agree. But $send_charset is user configurable and doesn't 
have to contain utf-8; it could even be empty. And even then, even with 
utf-8 (as in your case), conversion may fail not because the last item 
isn't generic enough but because the input is invalid.

Even in that case mutt has to do something.

Without another conversion test, you could still let it default to utf-8 
instead of $charset though.

>I think it is important to let the user control the fallback.

I don't think that makes lots of sense since it's kind of 
micro-optimization, IMHO. Because at that point, no charset did fit and 
mutt is likely going to send out broken content anyway, so by letting 
the user control it you only give him the control in what specific way 
it's broken, not if it's broken at all.

For the case that all conversions failed because $send_charset is 
wrongly configured and the input is valid, $charset is the best choice, 
so I think it's really only about the case of broken input.
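
To make the behaviour under discussion concrete, here is a minimal sketch
of the selection logic as described in this thread: try each entry of
$send_charset in order and fall back to $charset if none fits. This is an
illustration in Python, not mutt's actual C code; the function and
parameter names are hypothetical.

```python
def choose_charset(text, send_charset, fallback):
    """Return the first charset in send_charset that can encode text.

    send_charset plays the role of $send_charset (an ordered list) and
    fallback the role of $charset. Both names are assumptions made for
    this sketch.
    """
    for name in send_charset:
        try:
            text.encode(name)
            return name
        except (UnicodeEncodeError, LookupError):
            # Either the text does not fit, or the name is not a known codec.
            continue
    # Nothing fit (or the list was empty): use the fallback unchecked.
    return fallback
```

With an empty list the fallback is returned immediately, which is the
case where the choice of fallback matters most.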

   bye, Rocco

2007-09-20 03:30:02 changed by Vincent Lefevre

On 2007-09-19 08:26:04 +0200, Rocco Rutte wrote:
> * Kyle Wheeler [07-09-18 09:27:06 -0500] wrote:
>> There's also option 3) refuse to send the mail, and display an error 
>> message. I prefer this option. I don't like the idea of mutt being a 
>> potential source of bad email.
>
> With messages created from scratch locally that could be an option
> (or at least issue a warning). But given you received a mail with a
> broken encoding already, 3) would mean you couldn't reply to it
> unless you manually fix the encoding.

Mutt should fix the encoding for the user. In particular when starting
the editor (for a reply), Mutt should make sure that everything is
valid in the specified encoding.
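
As a rough illustration of such a validity check (a Python sketch, not
mutt code; the helper name is hypothetical), the offset of the first
invalid byte can be found like this:

```python
def first_invalid_offset(raw, charset):
    """Return the offset of the first byte that is invalid in charset,
    or None if raw decodes cleanly. Hypothetical helper for illustration."""
    try:
        raw.decode(charset)
    except UnicodeDecodeError as err:
        return err.start
    return None
```

For example, the bytes 0xC5 0xA0 followed by "orman" are valid UTF-8,
while a lone 0xC5 followed by "orman" is not.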

2007-09-20 03:43:19 changed by Vincent Lefevre

On 2007-09-19 08:44:38 +0200, Rocco Rutte wrote:
> * Vincent Lefevre [07-09-18 15:17:30 +0200] wrote:
>> It doesn't necessarily make sense as the $charset may be completely
>> local to the machine (e.g. 'x-my-charset'). I think that trying to
>> convert the local charset to the last item of $send_charset, which
>> should be the most general charset (e.g. utf-8), makes more sense.
>
> In theory I agree. But $send_charset is user configurable and
> doesn't have to contain utf-8; it could even be empty. And even
> then, even with utf-8 (as in your case), conversion may fail not
> because the last item isn't generic enough but because the input is
> invalid.
>
> Even in that case mutt has to do something.

Conversion may also fail for $charset. So, Mutt has to do something in
this case too, and I think that Mutt should use replacement characters.
In any case, there should be a way to avoid $charset being used as a
fallback for $send_charset, as this doesn't always make sense.

>> I think it is important to let the user control the fallback.
>
> I don't think that makes lots of sense since it's kind of
> micro-optimization, IMHO. Because at that point, no charset did fit
> and mutt is likely going to send out broken content anyway, so by
> letting the user control it you only give him the control in what
> specific way it's broken, not if it's broken at all.

Sending a subject encoded in UTF-8 with some replacement characters
for invalid sequences that could have occurred is much less broken
than sending a subject using a non-standard charset (leading to a
completely unreadable subject).

> For the case that all conversions failed because $send_charset is
> wrongly configured and the input is valid, $charset is the best
> choice, so I think it's really only about the case of broken input.

I think that utf-8 would be better than $charset as at least one knows
that it is a standard charset (whereas $charset isn't necessarily).
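
A minimal sketch of the kind of fallback I have in mind (Python for
illustration only, not a patch for mutt; the function name is
hypothetical): decode from the local charset, substitute U+FFFD for
invalid sequences, and always emit utf-8.

```python
def to_utf8_with_replacement(raw, local_charset):
    """Convert raw bytes from local_charset to UTF-8.

    Invalid input sequences become U+FFFD (the replacement character),
    so the result is always valid UTF-8. Hypothetical helper; assumes
    local_charset is a codec name that Python knows.
    """
    text = raw.decode(local_charset, errors="replace")
    return text.encode("utf-8")
```

Note that a purely local charset name such as 'x-my-charset' would raise
LookupError in this sketch, so that case still needs separate handling.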