Other messages with non-UTF-8 charset, i.e. iso-8859-1 are imported correctly.
Before I try to figure out the root of the problem by exploring the sources starting from script/import_scripts/mbox/support/indexer.rb, does anyone have an idea ? Could it be environmental and not in the code base ? Does this also happen when a user running in mailing list mode sends a reply with this encoding ?
I did a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can’t think of a reason why the encoding should be wrong afterwards.
[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y\-a\-il une obligation/raison \(en dehors du coup de maintenannce\) à avoir un même outil pour les 2 fonctionnalités \(interactions vs galerie\) ?
Thanks for the quick test: I would not know how to do it myself Could it be that something in the import container is missing ? I would very much like to reproduce what you did and explore from there. If I don’t find anything I’ll provide instructions to reproduce the problem using the mbox import procedure with an inbox containing just this one mail.
I get the same results as you did so the problem is not there. I’ll run an import with this mail alone and a new category to verify this is not a side effect of some kind.
Or the difference is in how the message is extracted from the mbox file: this is where the code path is different. The above raw_email = File.read("/tmp/windows.mbox") is different from splitting the file with regexps and maybe that’s where things go wrong.
And indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which is different from the original file.
line.scrub is responsible for transforming the content into something that’s different from the original. If removed, the regexp fails with:
...
1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
Because it is not UTF-8, indeed
Any idea how this should be resolved ? Maybe a first pass on the mail headers only looking for the charset ? There seems to be a and problem here.