Other messages with a non-UTF-8 charset, e.g. iso-8859-1, are imported correctly.
Before I try to figure out the root of the problem by exploring the sources starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea? Could it be environmental and not in the code base? Does this also happen when a user running in mailing list mode sends a reply with this encoding?
I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.
[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?
Thanks for the quick test: I would not know how to do it myself. Could it be that something in the import container is missing? I would very much like to reproduce what you did and explore from there. If I don’t find anything, I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.
I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.
Or the difference is in how the message is extracted from the mbox file: this is where the code path is different. The above raw_email = File.read("/tmp/windows.mbox") is different from splitting the file with regexps and maybe that’s where things go wrong.
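To make the comparison concrete, here is a minimal sketch of splitting an mbox stream while keeping every message as untouched binary data. It is not the indexer's code: each_raw_message is a hypothetical helper, and it assumes messages are separated by lines starting with "From ", which is a simplification.

```ruby
require "stringio"

# Hypothetical helper (not part of the Discourse import scripts).
# Assumes a plain "From " separator line starts each message.
def each_raw_message(io)
  io.binmode                      # read bytes as-is, no encoding assumptions
  message = nil
  io.each_line do |line|
    if line.start_with?("From ")
      yield message if message    # hand over the previous message, if any
      message = "".b              # start a new binary buffer
    elsif message
      message << line             # bytes are appended unchanged
    end
  end
  yield message if message        # last message in the stream
end

# Example input: the first body contains the windows-1252 byte 0xE9 ("é").
mbox = "From a@example.com Mon Jan  1 00:00:00 2018\nSubject: one\n\nr\xE9flexion\n" \
       "From b@example.com Mon Jan  1 00:00:00 2018\nSubject: two\n\nbody\n".b

messages = []
each_raw_message(StringIO.new(mbox)) { |m| messages << m }
```

Because nothing here interprets the bytes as UTF-8, the 0xE9 byte survives the split, and each message could then be passed on in the same form that File.read produced in the quick test.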
And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.
After narrowing down where the undesirable transformation happens, it seems to be here:
line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:
...
1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
Because it is not UTF-8, indeed.
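A small standalone sketch of what I think is going on, using a made-up one-word sample instead of the real mail: the windows-1252 byte for "é" is not valid UTF-8, so scrub swaps it for the replacement character, while decoding it as windows-1252 first keeps the text intact.

```ruby
# "réflexion" encoded as windows-1252: "é" is the single byte 0xE9.
raw_line = "r\xE9flexion".b

# Labelling those bytes as UTF-8 gives an invalid string.
as_utf8 = raw_line.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?   # false

# A regexp match on such a string is what raised in the trace above;
# rescued here so the sketch keeps running either way.
begin
  as_utf8 =~ /^From /
rescue ArgumentError => e
  puts e.message
end

# scrub makes the string valid, but replaces the byte with U+FFFD.
scrubbed = as_utf8.scrub

# Decoding with the right charset first preserves the character.
decoded = raw_line.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
```

So scrub avoids the exception at the price of losing the original character, which matches the altered file written out above.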
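Any idea how this should be resolved? Maybe a first pass on the mail headers only, looking for the charset? There seems to be a chicken-and-egg problem here: the charset is needed to read the message, but the message has to be read to find the charset.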
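One possible shape for that first pass, as a rough sketch only. detect_charset and to_utf8 are hypothetical names, not existing functions in the import scripts, and the sketch assumes a single charset declared in the top-level headers, which will not hold for every multipart message.

```ruby
# Hypothetical helpers, not part of the Discourse import scripts.
CHARSET_PATTERN = /charset\s*=\s*"?([A-Za-z0-9._:\-]+)"?/i

# Look for a charset in the header block only, working on raw bytes so that
# no encoding has to be known in advance.
def detect_charset(raw_bytes)
  headers = raw_bytes.b.split(/\r?\n\r?\n/, 2).first.to_s
  headers[CHARSET_PATTERN, 1]
end

# Decode the message with the declared charset, falling back to UTF-8 when
# the charset is missing or unknown to Ruby.
def to_utf8(raw_bytes)
  encoding = begin
    Encoding.find(detect_charset(raw_bytes) || "UTF-8")
  rescue ArgumentError
    Encoding::UTF_8
  end
  text = raw_bytes.dup.force_encoding(encoding)
  text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace).scrub
end

raw = "From: a@example.com\r\n" \
      "Content-Type: text/plain; charset=\"windows-1252\"\r\n" \
      "\r\n" \
      "r\xE9flexion".b

puts detect_charset(raw)
puts to_utf8(raw)
```

Since mail headers are ASCII, matching them on the raw bytes sidesteps the circular dependency: the regexp never has to treat the body as UTF-8 before the charset is known.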