Import mbox maps charset=windows-1252 to �

dachary · November 1, 2020, 10:12am

Hello,

When importing the mbox containing this message

it shows like this:

It is likely an encoding problem because it has:

Content-Type: text/plain; charset=windows-1252; format=flowed

Other messages with a non-UTF-8 charset, e.g. iso-8859-1, are imported correctly.

Before I try to find the root of the problem by exploring the sources, starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental rather than in the code base? Does this also happen when a user in mailing list mode sends a reply with this encoding?

Thanks in advance

gerhard · November 1, 2020, 8:39pm

I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?

dachary · November 1, 2020, 8:47pm

Thanks for the quick test: I would not know how to do it myself. Could it be that something is missing in the import container? I would very much like to reproduce what you did and explore from there. If I don’t find anything, I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.

gerhard · November 1, 2020, 8:52pm

You can try it out by running rails console in the container.

dachary · November 1, 2020, 8:58pm

I get the same results as you did, so the problem is not there. I’ll run an import with this mail alone and a new category to verify that this is not a side effect of some kind.

dachary · November 1, 2020, 9:09pm

Here is what I did, on an install of 2.5.4:

unmodified shared/standalone/import/settings.yml
removed shared/standalone/import/data/index.db from the previous import
changed the Message-ID: header
copied windows.txt into shared/standalone/import/data/windows4/windows.mbox
./launcher enter import
ran the import with

root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  
Resetting topic counters


Done (00h 00min 09sec)

Got the same result as above, which you can see here.

dachary · November 1, 2020, 10:05pm

Could it be because the importer does not call Email::Receiver in the same way?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

dachary · November 2, 2020, 11:25am

Or the difference is in how the message is extracted from the mbox file: this is where the code path is different. The above raw_email = File.read("/tmp/windows.mbox") is different from splitting the file with regexps and maybe that’s where things go wrong.

dachary · November 2, 2020, 11:41am

And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.

message.txt (3.7 KB)

When running from the rails console, receiver.raw_email is also different from the original file: it is correctly encoded as UTF-8.

Any idea where this incorrect modification happens?

riking · November 2, 2020, 12:32pm

You may need to add a call to .force_encoding after reading the file to tell Ruby what encoding the email file has.
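A minimal sketch of what that suggestion looks like in plain Ruby, assuming the bytes really are Windows-1252. The sample string is made up for illustration and is not taken from the importer:

```ruby
# "réflexion" as Windows-1252 bytes: é is the single byte 0xE9.
# .b gives a binary (ASCII-8BIT) copy, much like bytes read from disk.
raw = "r\xE9flexion".b

# Tagging the bytes as UTF-8 does not change them, and 0xE9 on its own
# is not a valid UTF-8 sequence.
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)
mislabeled.valid_encoding?   # => false

# Tag the bytes with their real encoding first, then transcode to UTF-8.
converted = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
converted.valid_encoding?    # => true
```

force_encoding only changes the label on the string; encode is the step that actually rewrites the bytes.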

dachary · November 2, 2020, 12:42pm

Sorry if this is a newbie question, but I’m not familiar with the code base. Do you have a suggestion as to where such a change would be beneficial?

dachary · November 2, 2020, 12:56pm

After narrowing down where the undesirable transformation happens, it seems to be here:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'                                                                         
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Indeed, because the content is not UTF-8.
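Both behaviours can be reproduced outside the importer. This is a small sketch with a made-up line and a made-up regexp (not the importer's @split_regex), assuming the line is tagged as UTF-8 while holding Windows-1252 bytes:

```ruby
# A line holding the Windows-1252 byte 0xE9 but tagged as UTF-8,
# which is what a string literal with \x escapes gives in a UTF-8 source file.
line = "r\xE9flexion\n"

valid = line.valid_encoding?   # false: 0xE9 alone is not valid UTF-8

# scrub swaps every invalid byte for U+FFFD, so the original byte is lost.
scrubbed = line.scrub

# Matching a regexp against the unscrubbed line raises instead.
error = begin
  line =~ /^From /
  nil
rescue ArgumentError => e
  e.message                    # "invalid byte sequence in UTF-8"
end
```

So scrub makes the line safe to match, at the cost of replacing the accented character with � if the scrubbed copy is what gets stored.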

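Any idea how this should be resolved? Maybe a first pass over the mail headers only, looking for the charset? There seems to be a chicken-and-egg problem here: the charset is needed to read the message, but the message has to be read to find the charset.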
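As a rough illustration of the "first pass on the headers" idea, here is a sketch that treats the message as raw bytes until the charset is known. sniff_charset is a hypothetical helper, not part of Discourse, and it only looks at the top-level Content-Type header, so multipart messages are out of scope:

```ruby
# Hypothetical helper: return the charset named in the top-level
# Content-Type header, or nil. Works on bytes, so no encoding is assumed.
def sniff_charset(raw_message)
  headers = raw_message.b.split(/\r?\n\r?\n/, 2).first.to_s
  unfolded = headers.gsub(/\r?\n[ \t]+/, " ")   # join folded header lines
  unfolded[/^Content-Type:[^\r\n]*charset="?([^\s;"]+)/i, 1]
end

# Made-up sample message with a Windows-1252 body.
raw = "Subject: test\nContent-Type: text/plain; charset=windows-1252;\n format=flowed\n\nr\xE9flexion\n".b

charset = sniff_charset(raw)
body = raw.split(/\r?\n\r?\n/, 2).last
text = body.force_encoding(Encoding.find(charset)).encode(Encoding::UTF_8)
```

Encoding.find raises ArgumentError for charset names Ruby does not know, so real code would need a fallback.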

dachary · November 2, 2020, 1:15pm

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

Replacing:


    line = line.scrub

    if line =~ @split_regex

with

    if line.scrub =~ @split_regex

seems to be working, but I’m not sure if this is the right way to fix this.
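The reason this works can be shown with a stand-alone sketch: the scrubbed copy is used only for the regexp match, while the untouched line is what gets stored. each_raw_message and SPLIT_REGEX are hypothetical stand-ins here, not the actual indexer code, and the importer's real @split_regex may differ:

```ruby
require "stringio"

# Hypothetical separator: a classic mbox "From " line.
SPLIT_REGEX = /\AFrom \S+ /

# Yields each message with its original bytes. The separator lines
# themselves are dropped in this sketch.
def each_raw_message(io)
  current = nil
  io.each_line do |line|
    if line.scrub =~ SPLIT_REGEX      # scrub a temporary copy for matching only
      yield current.join if current
      current = []
    elsif current
      current << line                 # keep the unscrubbed line
    end
  end
  yield current.join if current
end

# Made-up mbox with one Windows-1252 byte (0xE9) in the first message.
mbox = "From a@example.com Sun Nov  1 10:12:00 2020\nSubject: t\n\nr\xE9flexion\n" \
       "From b@example.com Sun Nov  1 10:13:00 2020\n\nbody\n"

messages = []
each_raw_message(StringIO.new(mbox)) { |m| messages << m }
```

Because the stored lines keep their original bytes, a later step that reads the charset from the headers can still decode them correctly.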

gerhard · November 2, 2020, 1:50pm

Looks like a perfect way of fixing the problem.

