Ticket #3033 (closed defect: fixed)

Opened 9 months ago

Last modified 3 months ago

text_enriched_handler and UTF-8 (and charsets as a whole)

Reported by: iazz Owned by: mutt-dev
Priority: minor Milestone: 1.6
Component: charset Version:
Keywords: patch Cc:

Description

This handler treats each byte as a single character, so it produces incorrect output (bold letters, for example) when UTF-8 is used for display. For the character U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE) with the bold attribute, it outputs "\303\b\303\200\b\200" instead of "\303\200\b\303\200".
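
To make the difference concrete, here is a minimal sketch in C of the two behaviours. It is not Mutt's code: the function names `bold_bytewise` and `bold_charwise` are hypothetical, and it assumes well-formed UTF-8 input and a caller-supplied output buffer of at least 3 * strlen(in) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise overstrike: emits "byte, backspace, byte" for every byte,
 * which splits multibyte UTF-8 sequences apart.
 * Returns the number of bytes written (excluding the terminating NUL). */
static size_t bold_bytewise(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in; in++) {
        out[n++] = *in;
        out[n++] = '\b';
        out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}

/* Character-wise overstrike: emits "char, backspace, char", keeping the
 * lead byte and its continuation bytes (10xxxxxx) together.
 * Returns the number of bytes written (excluding the terminating NUL). */
static size_t bold_charwise(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in) {
        size_t len = 1;

        /* Extend over UTF-8 continuation bytes; stops at NUL as well. */
        while (((unsigned char)in[len] & 0xC0) == 0x80)
            len++;

        memcpy(out + n, in, len);
        n += len;
        out[n++] = '\b';
        memcpy(out + n, in, len);
        n += len;
        in += len;
    }
    out[n] = '\0';
    return n;
}
```

Called with the input "\303\200", `bold_bytewise` yields the six-byte sequence quoted above, while `bold_charwise` yields the expected five bytes.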

Mutt's way of dealing with character sets seems completely broken anyway. It should treat all characters as wchar_t internally and convert them only on input/output. Besides, instead of specifying the displayed charset in the configuration file, shouldn't it rather rely on locale settings?

Attachments

part0001.pgp (196 bytes) - added by Kyle Wheeler 9 months ago.
Added by email2trac
enriched-multibyte.diff (10.2 kB) - added by pdmef 4 months ago.

Change History

  Changed 9 months ago by Moritz Barsnick

>  Besides, instead of specifying the displayed charset in the
>  configuration file, shouldn't it rather rely on locale settings?

This has had me puzzled for quite some time now as well. Even if I set
LC_CTYPE=de_DE.iso885915 (or similar, like "@euro"), I need to tell
mutt about it with "set charset=iso-8859-15". Or am I overlooking
something?

Changed 9 months ago by Kyle Wheeler

Added by email2trac

follow-up: ↓ 3   Changed 9 months ago by Kyle Wheeler

On Monday, February 18 at 09:09 PM, quoth Mutt:
> This has had me puzzled for quite some time now as well. Even if I
> set LC_CTYPE=de_DE.iso885915 (or similar, like "@euro"), I need to
> tell mutt about it with "set charset=iso-8859-15". Or am I
> overlooking something?

Mutt gets its charset from the output of nl_langinfo(CODESET) (if you 
have the langinfo.h header). Otherwise, it simply assumes iso-8859-1. 
So I'd say you should investigate that function. Here's a simple test 
program:

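     #include <locale.h>
     #include <langinfo.h>
     #include <stdio.h>

     int main(void)
     {
         /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG). */
         setlocale(LC_CTYPE, "");
         /* Print the character encoding of that locale. */
         printf("%s\n", nl_langinfo(CODESET));
         return 0;
     }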

See what that C program prints out, and do what you must to make it 
print out the right thing (check the setlocale and nl_langinfo man 
pages).

For what it's worth, my system doesn't recognize de_DE.iso885915. It 
DOES recognize de_DE.ISO8859-15, though. When I run
`locale -a | grep de_DE`, all I see as options are:

     de_DE
     de_DE.ISO8859-1
     de_DE.ISO8859-15
     de_DE.UTF-8

Run that command (`locale -a | grep de_DE`) to see what your system 
supports, and set LC_CTYPE to one of those.
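
If you want to check a specific locale name from code rather than by eye, the following is a minimal sketch. The helper name `codeset_for_locale` is hypothetical. It relies on setlocale() returning NULL for a name the C library does not accept, which is typical but can vary between C libraries.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Try to select `name` for LC_CTYPE and copy the resulting codeset
 * name into buf. Returns 0 on success, -1 if setlocale() rejects
 * the locale name. Note that this changes the process locale. */
static int codeset_for_locale(const char *name, char *buf, size_t buflen)
{
    assert(name != NULL && buf != NULL && buflen > 0);

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;

    snprintf(buf, buflen, "%s", nl_langinfo(CODESET));
    return 0;
}
```

For example, calling it with "de_DE.ISO8859-15" and then with "de_DE.iso885915" shows which of the two spellings your system accepts.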

~Kyle

in reply to: ↑ 2   Changed 9 months ago by iazz

Replying to Kyle Wheeler:

> On Monday, February 18 at 09:09 PM, quoth Mutt:
> > This has had me puzzled for quite some time now as well. Even if I set LC_CTYPE=de_DE.iso885915 (or similar, like "@euro"), I need to tell mutt about it with "set charset=iso-8859-15". Or am I overlooking something?
>
> Mutt gets its charset from the output of nl_langinfo(CODESET) (if you have the langinfo.h header). Otherwise, it simply assumes iso-8859-1.

Okay, very well then. This is indeed what I was talking about. Nevertheless, my point about the fact that Mutt should treat all characters as wide (wchar_t) internally remains.

  Changed 9 months ago by Moritz Barsnick

On Mon, Feb 18, 2008 at 22:19:31 -0000, Mutt wrote:
>  Mutt gets its charset from the output of nl_langinfo(CODESET) (if you
>  have the langinfo.h header). Otherwise, it simply assumes iso-8859-1.
>  So I'd say you should investigate that function.

I investigated my configuration instead. I was actually setting charset
per location, probably because long ago, I thought it was necessary.
(Very old legacy config here, you see. ;->)

Unsetting charset and letting locale do the magic now actually works.
Thanks for letting me revisit this. :) (Now how do I pass my locale
info through SSH?)

>       de_DE
>       de_DE.ISO8859-1
>       de_DE.ISO8859-15
>       de_DE.UTF-8
Different here, but that's where I had my en_US info from.

So, sorry for OT, back to the original problem...

  Changed 6 months ago by pdmef

  • component changed from display to charset
  • milestone set to 1.6

  Changed 4 months ago by pdmef

Attached a patch that bases the text/enriched handler on wchar_t and fgetwc() instead of char and fgetc(). It is largely untested, but it looks good to me and works for the few text/enriched mails I have.
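
As a rough illustration of the approach (not the attached patch itself), the following sketch reads wide characters with fgetwc() and overstrikes each one as a unit. The function name `bold_stream` is hypothetical, and decoding multibyte input correctly assumes that setlocale() has already selected a matching LC_CTYPE.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Copy `in` to `out`, emitting every character as
 * "character, backspace, character" (overstrike bold).
 * Both streams must be unoriented or wide-oriented.
 * Returns the number of characters processed, or -1 on a
 * decoding or I/O error. */
static long bold_stream(FILE *in, FILE *out)
{
    long count = 0;
    wint_t wc;

    assert(in != NULL && out != NULL);

    /* fgetwc() returns one whole character, however many bytes
     * it occupies in the locale's encoding. */
    while ((wc = fgetwc(in)) != WEOF) {
        if (fputwc((wchar_t)wc, out) == WEOF ||
            fputwc(L'\b', out) == WEOF ||
            fputwc((wchar_t)wc, out) == WEOF)
            return -1;
        count++;
    }

    /* WEOF signals both end of file and errors; tell them apart. */
    return ferror(in) ? -1 : count;
}
```

Because the backspace is written between whole characters, a multibyte sequence can no longer be split in the middle, which is the failure described in this ticket.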

  Changed 3 months ago by pdmef

  • keywords patch added

  Changed 3 months ago by brendan

This patch looks ok to me -- go ahead and apply it.

  Changed 3 months ago by Rocco Rutte <pdmef@…>

  • status changed from new to closed
  • resolution set to fixed

(In [3e850c6e43fd]) Make text/enriched handler multibyte aware. Closes #3033.
