mbox import maps charset=windows-1252 to �

Bonjour,

When importing the mbox containing this message

windows.txt (3.7 KB)

it shows like this:

It is likely an encoding problem because it has:

Content-Type: text/plain; charset=windows-1252; format=flowed

Other messages with non-UTF-8 charset, i.e. iso-8859-1 are imported correctly.

Before I try to find the root of the problem by exploring the sources, starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental rather than in the code base? Does this also happen when a user in mailing list mode sends a reply with this encoding?

Thanks in advance :slight_smile:

I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?

Thanks for the quick test: I would not have known how to do it myself :slight_smile: Could it be that something is missing in the import container? I would very much like to reproduce what you did and explore from there. If I don’t find anything, I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.


You can try it out by running rails console in the container.


I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.

Here is what I did, on an install of 2.5.4:

  • unmodified shared/standalone/import/settings.yml
  • removed shared/standalone/import/data/index.db from the previous import
  • changed the Message-ID: header
  • copied windows.txt to shared/standalone/import/data/windows4/windows.mbox
  • ./launcher enter import
  • ran the import with
root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  ]  
Resetting topic counters


Done (00h 00min 09sec)
  • Got the same result as above, which you can see here.

Could it be because Email::Receiver is not called in the same way by the importer?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

Or the difference is in how the message is extracted from the mbox file, since this is where the code paths differ. The raw_email = File.read("/tmp/windows.txt") above reads the whole file at once, whereas the importer splits the file with regexps, and maybe that’s where things go wrong.

And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.

message.txt (3.7 KB)

When running from the rails console, receiver.raw_email is also different from the original file: it is correctly encoded as UTF-8.

Any idea where this incorrect modification happens?

You may need to add a call to .force_encoding after reading the file to tell Ruby what encoding the email file has.
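
For illustration, here is a minimal sketch of what such a call does, using a one-word sample instead of the real file. The sample string and the variable names are made up for this example and do not come from the importer; where exactly the call would belong in the import script is not something this sketch answers.

```ruby
# In windows-1252, "é" is the single byte 0xE9, which is not valid UTF-8 on its own.
# This binary string stands in for the bytes read from the email file (assumption).
raw = "r\xE9flexion".b

# Tagging the bytes as UTF-8 does not change them, so the string stays invalid.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?      # false

# Telling Ruby the real encoding first, then transcoding, yields proper UTF-8.
converted = raw.dup.force_encoding("windows-1252").encode("UTF-8")
puts converted                    # réflexion
puts converted.valid_encoding?    # true
```

Note that force_encoding only relabels the bytes, while encode is the step that actually converts them.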


Sorry if this is a newbie question, but I’m not familiar with the code base :slight_smile: Do you have a suggestion as to where such a change would be beneficial?

After narrowing down where the undesirable transformation happens, it seems to be here:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

line.scrub is responsible for transforming the content into something that’s different from the original: it replaces every byte sequence that is invalid in UTF-8 with the replacement character �. If it is removed, the regexp fails with:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'                                                                         
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Because it is not UTF-8, indeed :slight_smile:
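
To make the effect concrete, here is a small sketch on a single sample line. The line content and the /^From / pattern are assumptions for illustration only (the real pattern is @split_regex in indexer.rb), and the exact behaviour of the unscrubbed match may depend on the Ruby version.

```ruby
# Sample line: windows-1252 bytes (0xE9 is "é") tagged as UTF-8, which is
# roughly what reading the mbox with the default encoding would produce.
line = "r\xE9flexion\n".b.force_encoding("UTF-8")

# Matching the unscrubbed line is what raised ArgumentError in the trace above.
error = begin
  line =~ /^From /
  nil
rescue ArgumentError => e
  e
end
puts error.inspect

# scrub swaps the invalid byte for U+FFFD, so the line becomes valid UTF-8...
scrubbed = line.scrub
puts scrubbed.valid_encoding?       # true

# ...but the original 0xE9 byte is gone and can no longer be decoded later.
puts scrubbed.bytes.include?(0xE9)  # false
```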

Any idea how this should be resolved? Maybe a first pass over the mail headers only, looking for the charset? There seems to be a :chicken: and :egg: problem here.


https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

Replacing:


    line = line.scrub

    if line =~ @split_regex

with

    if line.scrub =~ @split_regex

seems to be working:

but I’m not sure if this is the right way to fix this.
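
As a rough, self-contained sketch of why this helps: only a scrubbed copy of each line is used for the match, while the untouched bytes are what get stored. The split_mbox helper, the SPLIT_REGEX pattern and the sample message below are hypothetical and much simpler than the real indexer.

```ruby
# Hypothetical stand-in for the importer's @split_regex.
SPLIT_REGEX = /^From \S+/

# Split mbox text into raw messages. Only a scrubbed *copy* of each line is
# matched against the regex; the original bytes are what get collected.
def split_mbox(text)
  messages = []
  current = nil

  text.each_line do |line|
    if line.scrub =~ SPLIT_REGEX
      messages << current.join if current
      current = []                # the "From " separator line itself is dropped
    elsif current
      current << line
    end
  end

  messages << current.join if current
  messages
end

# Made-up message whose body contains the windows-1252 byte for "é".
mbox = "From alice@example.com Mon Jan  1 00:00:00 2020\n" \
       "Content-Type: text/plain; charset=windows-1252\n" \
       "\n" \
       "r\xE9flexion\n"

raw = split_mbox(mbox).first
puts raw.bytes.include?(0xE9)     # true: the original byte survived the split
puts raw.force_encoding("windows-1252").encode("UTF-8")
```

Because the 0xE9 byte is still present after splitting, the declared charset can be applied later when the message is parsed.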


Looks like a perfect way of fixing the problem.

