mbox import: charset=windows-1252 maps to �

dachary · November 1, 2020, 10:12 AM

Bonjour,

When importing the mbox containing this message

it shows like this:

It is likely an encoding problem because it has:

Content-Type: text/plain; charset=windows-1252; format=flowed

Other messages with a non-UTF-8 charset, e.g. iso-8859-1, are imported correctly.

Before I try to find the root cause by exploring the sources, starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental rather than in the code base? Does this also happen when a user in mailing list mode sends a reply with this encoding?

Thanks in advance

gerhard · November 1, 2020, 8:39 PM

I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?

dachary · November 1, 2020, 8:47 PM

Thanks for the quick test: I would not know how to do it myself. Could it be that something in the import container is missing? I would very much like to reproduce what you did and explore from there. If I don’t find anything, I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.

gerhard · November 1, 2020, 8:52 PM

You can try it out by running rails console in the container.

dachary · November 1, 2020, 8:58 PM

I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.

dachary · November 1, 2020, 9:09 PM

Here is what I did, on an install of 2.5.4:

left shared/standalone/import/settings.yml unmodified
removed shared/standalone/import/data/index.db from the previous import
changed the Message-ID: header
copied windows.txt into shared/standalone/import/data/windows4/windows.mbox
ran ./launcher enter import
ran the import with

root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  ]  
Resetting topic counters


Done (00h 00min 09sec)

Got the same result as above, which you can see here.

dachary · November 1, 2020, 10:05 PM

Could it be because Email::Receiver is not called in the same way by the importer?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

dachary · November 2, 2020, 11:25 AM

Or the difference is in how the message is extracted from the mbox file, since this is where the code paths differ. Reading the whole file with raw_email = File.read("/tmp/windows.mbox"), as above, is different from splitting the file with regexps, and maybe that’s where things go wrong.

dachary · November 2, 2020, 11:41 AM

And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.

message.txt (3.7 KB)

When running from the rails console, receiver.raw_email is also different from the original file: it is correctly encoded as UTF-8.

Any idea where this incorrect modification happens?

riking · November 2, 2020, 12:32 PM

You may need to add a call to .force_encoding after reading the file to tell Ruby what encoding the email file has.
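
As a rough illustration of that suggestion (not the importer's actual code), the sketch below reads a file as raw bytes, declares their encoding with force_encoding, and only then transcodes to UTF-8. The helper name read_as_utf8 and the temporary file standing in for the mail are assumptions made for the example.

```ruby
require "tempfile"

# Hypothetical helper, not part of Discourse: read raw bytes, declare their
# real encoding, then transcode to UTF-8.
def read_as_utf8(path, source_encoding)
  raw = File.binread(path)             # ASCII-8BIT: nothing is interpreted yet
  raw.force_encoding(source_encoding)  # relabels the string, bytes are unchanged
  raw.encode(Encoding::UTF_8)          # transcodes, bytes do change here
end

# Temporary file standing in for a windows-1252 encoded mail body.
Tempfile.create(["windows", ".txt"]) do |file|
  file.binmode
  file.write("cette r\xE9flexion\n".b) # 0xE9 is "é" in windows-1252
  file.flush
  puts read_as_utf8(file.path, Encoding::Windows_1252)
end
```

Note that force_encoding only changes the label on the string; it is encode that rewrites the bytes, so the declared source encoding has to be the right one.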

dachary · November 2, 2020, 12:42 PM

Sorry if this is a newbie question, but I’m not familiar with the code base. Do you have a suggestion as to where such a change would be beneficial?

dachary · November 2, 2020, 12:56 PM

After narrowing down where the undesirable transformation happens, it seems to be here:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'                                                                         
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Because it is not UTF-8, indeed.

Any idea how this should be resolved? Maybe a first pass over the mail headers only, looking for the charset? There seems to be a chicken-and-egg problem here.
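
A first pass over the headers could look roughly like the sketch below. It is only an illustration under stated assumptions: declared_charset and body_as_utf8 are hypothetical names, not part of the importer, and only the top-level Content-Type header is inspected, so multipart messages with per-part charsets are not handled.

```ruby
# Hypothetical helper, not part of the importer: find the charset declared in
# the top-level headers, working on raw bytes so no encoding is assumed yet.
def declared_charset(raw_message)
  headers = raw_message.b.split(/\r?\n\r?\n/, 2).first.to_s
  # Unfold continuation lines (a long header may wrap onto an indented line).
  unfolded = headers.gsub(/\r?\n[ \t]+/, " ")
  unfolded[/^Content-Type:.*?charset="?([^\s";]+)/i, 1]
end

# Hypothetical helper: decode the body with the declared charset, falling
# back to UTF-8 when the headers do not name one.
def body_as_utf8(raw_message, fallback: "UTF-8")
  body = raw_message.b.split(/\r?\n\r?\n/, 2)[1].to_s
  charset = declared_charset(raw_message) || fallback
  body.force_encoding(charset).encode("UTF-8", invalid: :replace, undef: :replace)
end

message = "From: someone@example.com\n" \
          "Content-Type: text/plain; charset=windows-1252; format=flowed\n" \
          "\n" \
          "r\xE9flexion\n".b

puts declared_charset(message)
puts body_as_utf8(message)
```

Working on the binary copy sidesteps the chicken-and-egg problem, because header names and charset labels are plain ASCII and can be matched before the body encoding is known.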

dachary · November 2, 2020, 1:15 PM

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

Replacing:


    line = line.scrub

    if line =~ @split_regex

with

    if line.scrub =~ @split_regex

seems to be working:

but I’m not sure if this is the right way to fix this.
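
The effect of this change can be sketched outside the importer. In the simplified example below, SPLIT_REGEX is a hypothetical stand-in for @split_regex and each_mail is a reduced version written for illustration, not the real method from indexer.rb. The point it shows is that scrubbing a temporary copy for the match leaves the original bytes intact for later decoding.

```ruby
# Hypothetical stand-in for the importer's @split_regex.
SPLIT_REGEX = /^From \S+@\S+/

# Simplified sketch of splitting mbox lines into messages.
def each_mail(lines)
  current = +""
  lines.each do |line|
    # Scrub a temporary copy for the match only...
    if line.scrub =~ SPLIT_REGEX
      yield current unless current.empty?
      current = +""
    end
    # ...and keep the original, unscrubbed bytes in the message.
    current << line
  end
  yield current unless current.empty?
end

lines = [
  "From a@example.com Sun Nov  1 10:12:00 2020\n",
  "Content-Type: text/plain; charset=windows-1252\n",
  "\n",
  "r\xE9flexion\n", # windows-1252 byte for "é", invalid as UTF-8
]

each_mail(lines) do |mail|
  puts mail.bytes.include?(0xE9) # the original byte is still there
  puts mail.b.force_encoding("windows-1252").encode("UTF-8")
end
```

With line = line.scrub, the 0xE9 byte would already have been replaced by U+FFFD at this point, so the charset declared in the header could no longer be applied to the body.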

gerhard · November 2, 2020, 1:50 PM

Looks like a perfect way of fixing the problem.

system · December 2, 2020, 2:03 PM

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
