Import mbox maps charset=windows-1252 to �

dachary · November 1, 2020, 10:12am

Bonjour,

When importing the mbox containing this message

it shows like this:

It is likely an encoding problem because it has:

Content-Type: text/plain; charset=windows-1252; format=flowed

Other messages with a non-UTF-8 charset, e.g. iso-8859-1, are imported correctly.

Before I try to figure out the root of the problem by exploring the sources starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental rather than in the code base? Does this also happen when a user in mailing list mode sends a reply with this encoding?

Thanks in advance

gerhard · November 1, 2020, 8:39pm

I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?

dachary · November 1, 2020, 8:47pm

Thanks for the quick test: I would not know how to do it myself. Could it be that something in the import container is missing? I would very much like to reproduce what you did and explore from there. If I don’t find anything I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.

gerhard · November 1, 2020, 8:52pm

You can try it out by running rails console in the container.

dachary · November 1, 2020, 8:58pm

I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.

dachary · November 1, 2020, 9:09pm

Here is what I did, on an install of 2.5.4:

unmodified shared/standalone/import/settings.yml
removed shared/standalone/import/data/index.db from the previous import
changed the Message-ID: header
copied windows.txt into shared/standalone/import/data/windows4/windows.mbox
./launcher enter import
ran the import with

root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  ]  
Resetting topic counters


Done (00h 00min 09sec)

Got the same result as above, which you can see here.

dachary · November 1, 2020, 10:05pm

Could it be because Email::Receiver is not called in the same way by the importer?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

dachary · November 2, 2020, 11:25am

Or the difference may be in how the message is extracted from the mbox file, since that is where the code paths differ. The raw_email = File.read("/tmp/windows.txt") above reads the whole file, whereas the importer splits it with regexps, and maybe that’s where things go wrong.

dachary · November 2, 2020, 11:41am

And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.

message.txt (3.7 KB)

When running from the rails console, receiver.raw_email is also different from the original file: it is correctly encoded as UTF-8.

Any idea where this incorrect modification happens?

riking · November 2, 2020, 12:32pm

You may need to add a call to .force_encoding after reading the file to tell Ruby what encoding the email file has.
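A minimal sketch of what such a call could look like, assuming the bytes are known to be Windows-1252. The sample string is made up for illustration and is not taken from the message in this topic:

```ruby
# Bytes as they would sit on disk in a Windows-1252 file ("réflexion").
raw = "r\xE9flexion".b

# Labelled as UTF-8 the string is invalid: \xE9 announces a multi-byte
# sequence that the following bytes do not complete.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?

# Label the bytes with their real encoding, then transcode to UTF-8.
labelled = raw.dup.force_encoding(Encoding::Windows_1252)
converted = labelled.encode(Encoding::UTF_8)
puts converted.valid_encoding?
puts converted
```

force_encoding only changes the label on the bytes; it is the later encode call that actually rewrites them.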

dachary · November 2, 2020, 12:42pm

Sorry if this is a newbie question but I’m not familiar with the code base. Do you have a suggestion as to where such a change would be beneficial?

dachary · November 2, 2020, 12:56pm

After narrowing down where the undesirable transformation happens, it seems to be here:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'                                                                         
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Because it is not UTF-8, indeed.
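Both behaviours can be reproduced outside the importer. This is a small sketch with a made-up Windows-1252 line that Ruby treats as UTF-8; the regexp is only a stand-in, not the importer's @split_regex:

```ruby
# A hypothetical mbox body line in Windows-1252, labelled as UTF-8.
line = "cette r\xE9flexion\n".dup.force_encoding(Encoding::UTF_8)

# Matching a regexp against invalid UTF-8 raises ArgumentError.
begin
  line =~ /^From /
  match_error = nil
rescue ArgumentError => e
  match_error = e.message
end
puts match_error

# scrub makes the string valid by replacing the bad byte with U+FFFD,
# so the original \xE9 can no longer be recovered from the result.
scrubbed = line.scrub
puts scrubbed.valid_encoding?
puts scrubbed.include?("\uFFFD")
```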

Any idea how this should be resolved? Maybe a first pass on the mail headers only, looking for the charset? There seems to be a chicken-and-egg problem here.
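For what it's worth, a header-only first pass might look roughly like the sketch below. charset_from_headers is a hypothetical helper, not something that exists in Discourse, and it assumes a single-part message whose Content-Type header is not folded across lines:

```ruby
# Hypothetical helper, not part of Discourse: find the charset named in
# the header block without decoding the message first.
def charset_from_headers(raw)
  # Work on a binary copy so the match cannot raise on non-UTF-8 bytes.
  headers = raw.b.split(/\r?\n\r?\n/, 2).first.to_s
  headers[/^Content-Type:.*?charset="?([\w.:-]+)"?/i, 1]
end

raw = "Content-Type: text/plain; charset=windows-1252; format=flowed\n" \
      "\n" \
      "r\xE9flexion\n"

name = charset_from_headers(raw)
puts name

# Encoding.find raises ArgumentError for names Ruby does not know.
encoding = Encoding.find(name)
puts encoding
```

Multipart messages can declare a different charset per part, so a real implementation would have to do more than this.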

dachary · November 2, 2020, 1:15pm

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

Replacing:


    line = line.scrub

    if line =~ @split_regex

with

    if line.scrub =~ @split_regex

seems to be working:

but I’m not sure if this is the right way to fix this.

gerhard · November 2, 2020, 1:50pm

Looks like a perfect way of fixing the problem.
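The idea behind the change can be sketched in isolation: scrub only the copy used for matching, and keep the original bytes for the message. The each_mail method and SPLIT_REGEX below are simplified stand-ins written for this example, not the code from indexer.rb:

```ruby
require "stringio"

# Assumed separator pattern, for illustration only.
SPLIT_REGEX = /^From .+@.+/

# Hypothetical splitter: yields each message of an mbox stream with its
# original bytes intact.
def each_mail(io)
  raw_message = +""
  io.each_line do |line|
    # Scrub a temporary copy for the match; `line` itself is not changed.
    if line.scrub =~ SPLIT_REGEX
      yield raw_message unless raw_message.empty?
      raw_message = +""
    else
      raw_message << line
    end
  end
  yield raw_message unless raw_message.empty?
end

mbox = "From a@example.com Sun Nov  1 10:12:00 2020\n" \
       "Content-Type: text/plain; charset=windows-1252\n\n" \
       "r\xE9flexion\n"

messages = []
each_mail(StringIO.new(mbox)) { |m| messages << m }
puts messages.size
puts messages.first.bytes.include?(0xE9)
```

Because the Windows-1252 byte survives the split, whatever parses the message afterwards can still decode it using the charset from the headers.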

