mbox import maps charset=windows-1252 to �

dachary · November 1, 2020, 10:12 AM

Hi,

When I import an mbox that contains this message, it is displayed like this.

It is probably an encoding problem, because the message contains:

Content-Type: text/plain; charset=windows-1252; format=flowed

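Other messages with non-UTF-8 charsets, such as iso-8859-1, are imported correctly.

Before I dig into the source code, starting from script/import_scripts/mbox/support/indexer.rb, to find the root cause: does anyone have an idea? Could this be caused by the environment rather than by the codebase? And would the same thing happen if a user in mailing list mode sent a reply in this encoding?

Thanks in advance for your help.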

gerhard · November 1, 2020, 8:39 PM

I ran a quick test, and Email::Receiver appears to work fine. It converts the input to UTF-8. I can't think of a reason why the encoding would go wrong after that.

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y-a-il une obligation/raison (en dehors du coup de maintenannce) à avoir un même outil pour les 2 fonctionnalités (interactions vs galerie) ?

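dachary · November 1, 2020, 8:47 PM

Thanks for the quick test. I'm not sure how to do this myself. Is something missing from the import container? I would very much like to reproduce what you did and explore from there. If I find nothing, I'll provide steps to reproduce the problem through the mbox import procedure, using an mbox that contains only this email.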

gerhard · November 1, 2020, 8:52 PM

Try running rails console inside the container.

dachary · November 1, 2020, 8:58 PM

I get the same result as you, so the problem is not there. I'll run an import with only this email and a new category to check whether this is some kind of side effect.

dachary · November 1, 2020, 9:09 PM

Here is what I did on a 2.5.4 installation:

Left shared/standalone/import/settings.yml unmodified
Removed shared/standalone/import/data/index.db from the previous import
Changed the Message-ID: header
Copied windows.txt to shared/standalone/import/data/windows4/windows.mbox
Ran ./launcher enter import
Ran the import with:

root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  ]  
Resetting topic counters


Done (00h 00min 09sec)

I got the same result as above. It can be seen here.

dachary · November 1, 2020, 10:05 PM

Could it be that the importer does not call Email::Receiver in the same way? It calls

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

so perhaps it needs to use the second form.

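dachary · November 2, 2020, 11:25 AM

Or perhaps the difference is in how the messages are extracted from the mbox file. This is where the code paths diverge. The raw_email = File.read("/tmp/windows.mbox") above is not the same as splitting the file with a regular expression, and that is probably where the problem happens.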

dachary · November 2, 2020, 11:41 AM

Indeed: adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which differs from the original.

message.txt (3.7 KB)

Even when run from the Rails console, receiver.raw_email differs from the original file and is correctly encoded as UTF-8.

Any idea where this incorrect change could be happening?

riking · November 2, 2020, 12:32 PM

You may need to call .force_encoding after reading the file, to tell Ruby what the encoding of the mail file is.
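
As a rough illustration of that suggestion (this is not code from the importer), the sketch below shows what force_encoding does to a string holding windows-1252 bytes. The sample text and the variable names are made up for the example.

```ruby
# "é" in windows-1252 is the single byte 0xE9. Build the bytes as a binary
# string, roughly what reading the file in binary mode would give.
raw = "r\xE9flexion".b

# Interpreted as UTF-8, the lone 0xE9 byte is an invalid sequence.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?

# force_encoding only relabels the bytes; encode then transcodes them.
tagged = raw.dup.force_encoding("Windows-1252")
converted = tagged.encode("UTF-8")
puts converted
puts converted.valid_encoding?
```

Note that force_encoding never changes the bytes, so it only helps when the charset passed in is the one the message was really written in.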

dachary · November 2, 2020, 12:42 PM

Sorry for the beginner question, but I'm not familiar with the codebase. Do you have a suggestion for where such a change would be effective?

dachary · November 2, 2020, 12:56 PM

I narrowed down where the unwanted conversion happens, and this appears to be the cause:

github.com/discourse/discourse

script/import_scripts/mbox/support/indexer.rb

main


      
            yield receiver, filename, opts if receiver.present?
          end

line.scrub is what turns the content into something different from the original. If I remove it, the regex fails with an error like this:

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'                                                                         
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

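Which makes sense, because the content is not UTF-8.

Any ideas on how to solve this? Perhaps a first pass is needed that looks only at the mail headers to find the charset. There appear to be similar issues reported here.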
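
To make the "first pass over the headers" idea concrete, here is a minimal sketch. It is not how the importer works: declared_charset is a hypothetical helper, the sample message is invented, and the approach assumes a single-part message whose body is not quoted-printable or base64 encoded.

```ruby
# Hypothetical helper (not part of Discourse): return the charset declared in
# the Content-Type header of a raw single-part message, or nil if none.
def declared_charset(raw_message)
  bytes = raw_message.b                                   # plain bytes, so matching cannot hit invalid UTF-8
  header_block = bytes.split(/\r?\n\r?\n/, 2).first.to_s  # headers end at the first blank line
  unfolded = header_block.gsub(/\r?\n[ \t]+/, " ")        # join folded header lines
  unfolded[/^Content-Type:.*charset="?([^";\s]+)/i, 1]
end

# Invented sample: a windows-1252 body in which "é" is the byte 0xE9.
raw = ("Content-Type: text/plain; charset=windows-1252; format=flowed\r\n" \
       "Subject: test\r\n" \
       "\r\n" \
       "r\xE9flexion\r\n").b

charset = declared_charset(raw)
body = raw.split(/\r?\n\r?\n/, 2).last
body_utf8 = body.force_encoding(charset).encode("UTF-8")

puts charset
puts body_utf8
```

A real message can carry a different charset per MIME part, which is one reason to leave the decoding to the mail parser and keep the raw bytes intact until then.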

dachary · November 2, 2020, 1:15 PM

github.com/discourse/discourse

script/import_scripts/mbox/support/indexer.rb

main


      
            yield receiver, filename, opts if receiver.present?
          end

Before:


    line = line.scrub

    if line =~ @split_regex

After:

    if line.scrub =~ @split_regex

This seems to work:

However, I'm not sure this is the right way to fix it.
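
The difference between the two variants can be reproduced outside the importer. The sketch below is only an illustration: SPLIT_REGEX is a hypothetical stand-in for @split_regex, and it assumes the line that is kept is later handed to the mail parser along with the charset from its headers.

```ruby
SPLIT_REGEX = /^From \S+/ # hypothetical separator pattern, for illustration only

# A line holding a windows-1252 "é" (0xE9), labeled as UTF-8 like a line read
# from the mbox file would be.
line = "r\xE9flexion\n".b.force_encoding("UTF-8")

# Before: the scrubbed copy replaces the line, so 0xE9 becomes U+FFFD for good.
scrubbed = line.scrub
puts scrubbed.bytes.include?(0xE9)

# After: scrub only the copy used for matching; the original bytes survive.
is_separator = !(line.scrub =~ SPLIT_REGEX).nil?
kept = line
puts kept.bytes.include?(0xE9)

# With the bytes intact, the declared charset can still be applied later.
recovered = kept.dup.force_encoding("windows-1252").encode("UTF-8")
puts recovered
```

In other words, scrub is still needed to keep the regex from raising, but its result should only be used for the match, not stored.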

gerhard · November 2, 2020, 1:50 PM

That looks like a perfect way to fix the problem.
