Import mbox maps charset=windows-1252 to �

Hello,

When importing the mbox containing this message

windows.txt (3.7 KB)

it shows like this:

It is likely an encoding problem because it has:

Content-Type: text/plain; charset=windows-1252; format=flowed

Other messages with a non-UTF-8 charset, e.g. iso-8859-1, are imported correctly.

Before I try to figure out the root of the problem by exploring the sources, starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental and not in the code base? Does this also happen when a user in mailing list mode sends a reply with this encoding?

Thanks in advance :slight_smile:

I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?

Thanks for the quick test, I would not have known how to do it myself :slight_smile: Could it be that something is missing in the import container? I would very much like to reproduce what you did and explore from there. If I don’t find anything, I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.

You can try it out by running rails console in the container.

I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.

Here is what I did, on an install of 2.5.4:

  • unmodified shared/standalone/import/settings.yml
  • removed shared/standalone/import/data/index.db from the previous import
  • changed the Message-ID: header
  • copied windows.txt to shared/standalone/import/data/windows4/windows.mbox
  • ./launcher enter import
  • ran the import with
root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  ]  
Resetting topic counters


Done (00h 00min 09sec)
  • Got the same result as above, which you can see here.

Could it be because the importer does not call Email::Receiver in the same way?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

Or the difference is in how the message is extracted from the mbox file, since this is where the code paths diverge. In the console test above, raw_email = File.read("/tmp/windows.txt") reads the whole file as is, whereas the importer splits the mbox file with regexps, and maybe that’s where things go wrong.

And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.

message.txt (3.7 KB)

When running from the rails console, receiver.raw_email is also different from the original file: it is correctly encoded as UTF-8.

Any idea where this incorrect modification happens?

You may need to add a call to .force_encoding after reading the file to tell Ruby what encoding the email file has.
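
To illustrate what that suggestion means, here is a minimal sketch in plain Ruby, independent of the importer. The `RAW_BYTES` constant and the `to_utf8` helper are made up for the example, and the charset is passed in by hand; where such a call would belong in the import script is not shown here.

```ruby
# Bytes of "réflexion" as they would appear in a windows-1252 body,
# held as a binary string (the way File.binread would return them).
RAW_BYTES = "r\xE9flexion".b

# Hypothetical helper: labels the bytes with their real encoding,
# then transcodes them to UTF-8.
def to_utf8(bytes, charset)
  bytes.dup.force_encoding(charset).encode(Encoding::UTF_8)
end

# Reinterpreted as UTF-8 the bytes are invalid: 0xE9 opens a multi-byte
# sequence that is never completed.
puts RAW_BYTES.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false
puts to_utf8(RAW_BYTES, "windows-1252")                             # réflexion
```

Note that force_encoding only changes the label on the string; it is the following encode call that rewrites the bytes.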

Sorry if this is a newbie question, but I’m not familiar with the code base :slight_smile: Do you have a suggestion as to where such a change would be beneficial?

After narrowing down where the undesirable transformation happens, it seems to be here:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Because it is not UTF-8, indeed :slight_smile:
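
Both behaviours can be reproduced outside the importer. The sketch below uses a made-up line and a made-up pattern (not the importer's @split_regex): matching the raw line raises, while scrubbing makes it matchable but swaps the 0xE9 byte for U+FFFD, so the original character can no longer be recovered.

```ruby
# A line from a windows-1252 mail. Ruby tags the literal as UTF-8,
# so the byte 0xE9 ("é" in windows-1252) is an invalid sequence.
def sample_line
  "r\xE9flexion\n"
end

# True when matching the line against a regexp raises ArgumentError,
# as in the backtrace above.
def match_raises?(line)
  line.match?(/\AFrom /)
  false
rescue ArgumentError
  true
end

puts match_raises?(sample_line)                # true
puts match_raises?(sample_line.scrub)          # false
puts sample_line.scrub.bytes.include?(0xE9)    # false
puts sample_line.scrub == "r\uFFFDflexion\n"   # true
```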

Any idea how this should be resolved? Maybe a first pass over the mail headers only, looking for the charset? There seems to be a :chicken: and :egg: problem here.
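
For what it’s worth, such a header-only pass could look roughly like the sketch below. It is only an illustration: `header_charset` is a hypothetical helper that does not exist in the importer, and it ignores folded headers and multipart messages. It treats the message as binary, so no encoding needs to be known in advance.

```ruby
# Hypothetical helper: returns the charset named in the Content-Type
# header of a raw message, or nil when none is found.
def header_charset(raw_message)
  # Work on a binary copy so that regexps never see "invalid" bytes.
  headers = raw_message.b.split(/\r?\n\r?\n/, 2).first.to_s
  headers[/^Content-Type:[^\n]*charset="?([^";\s]+)/i, 1]
end

raw = "Subject: test\n" \
      "Content-Type: text/plain; charset=windows-1252; format=flowed\n" \
      "\n" \
      "r\xE9flexion\n"

charset = header_charset(raw)
puts charset                  # windows-1252
puts Encoding.find(charset)   # Windows-1252
```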

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox/support/indexer.rb#L163-L165

Replacing:


    line = line.scrub

    if line =~ @split_regex

with

    if line.scrub =~ @split_regex

seems to be working:

but I’m not sure if this is the right way to fix this.
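
To spell out why the change helps, here is a self-contained sketch of the same idea. It is not the importer's code: `each_raw_mail` and `SPLIT_REGEX` are stand-ins invented for the example, and the separator line is simply dropped. The point is that the separator test runs on a scrubbed copy, while the untouched line is what gets stored, so the original bytes survive for later decoding.

```ruby
require "stringio"

# Hypothetical separator pattern; the importer builds its own @split_regex.
SPLIT_REGEX = /^From \S+@\S+/

# Yields each raw message found in io. Only the copy used for matching
# is scrubbed; the stored text keeps the original bytes.
def each_raw_mail(io)
  current = +""
  io.each_line do |line|
    if line.scrub =~ SPLIT_REGEX
      yield current unless current.empty?
      current = +""
    else
      current << line
    end
  end
  yield current unless current.empty?
end

demo_mbox = "From a@example.com Mon Jan  1 00:00:00 2018\n" \
            "Subject: one\n\nr\xE9flexion\n"

each_raw_mail(StringIO.new(demo_mbox)) do |mail|
  # The 0xE9 byte is still there, so the declared charset can decode it.
  puts mail.dup.force_encoding("windows-1252").encode("UTF-8")
end
```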

Looks like a perfect way of fixing the problem.
