Importing mbox maps charset=windows-1252 to �

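dachary · November 1, 2020, 10:12

Hi,

When importing an mbox file that contains this message:

it is displayed like this:

This is most likely an encoding problem, because the message contains: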

Content-Type: text/plain; charset=windows-1252; format=flowed

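Other messages that use a non-UTF-8 charset (iso-8859-1, for example) are imported correctly.

Before I try to find the root cause by reading the source code (starting with script/import_scripts/mbox/support/indexer.rb), does anyone have an idea? Could this be an environment problem rather than a problem in the codebase? Does the same problem occur when a user in mailing list mode sends a reply in this encoding?

Thanks in advance

gerhard · November 1, 2020, 20:39

I ran a quick test and Email::Receiver seems to work fine. It converts the input to UTF-8. I can't think of a reason why the encoding would break afterwards.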

[1] pry(main)> raw_email = File.read("/tmp/windows.txt");
[2] pry(main)> receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);
[3] pry(main)> body = receiver.select_body;
[4] pry(main)> receiver.mail.charset
=> "windows-1252"
[5] pry(main)> body.first.encoding
=> #<Encoding:UTF-8>
[6] pry(main)> puts body.first;
cette réflexion me fait penser : y-a-il une obligation/raison (en dehors du coup de maintenannce) à avoir un même outil pour les 2 fonctionnalités (interactions vs galerie) ?

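dachary · November 1, 2020, 20:47

Thanks for the quick test: I would not have known how to do that myself. Could something be missing from the import container? I would very much like to reproduce what you did and explore from there. If I find nothing, I will provide instructions to reproduce the problem with the mbox import procedure, using a mailbox that contains a single mail.

gerhard · November 1, 2020, 20:52

You can run rails console inside the container to try it out.

dachary · November 1, 2020, 20:58

I get the same result as you, so the problem is not there. I will run an import with only this mailbox and a new category to verify whether this is some kind of side effect.

dachary · November 1, 2020, 21:09

Here is what I did on a 2.5.4 installation:

Left shared/standalone/import/settings.yml unmodified
Deleted shared/standalone/import/data/index.db left over from the previous import
Changed the Message-ID: header
Copied windows.txt to shared/standalone/import/data/windows4/windows.mbox
Ran ./launcher enter import
Ran the import command: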

root@forum:/var/www/discourse# import_mbox.sh 
The mbox import is starting...

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...

creating index
indexing files in /shared/import/data/windows4
indexing /shared/import/data/windows4/windows.mbox

indexing replies and users

creating categories
        1 / 1 (100.0%)  [8121278 items/min]  
creating users
Skipping 1 already imported users

creating topics and posts
        1 / 1 (100.0%)  [219 items/min]  

Updating topic status

Updating bumped_at on topics

Updating last posted at on users

Updating last seen at on users

Updating first_post_created_at...

Updating user post_count...

Updating user topic_count...

Updating topic users

Updating post timings

Updating featured topic users

Updating featured topics in categories
        9 / 9 (100.0%)  [1562 items/min]  
Resetting topic counters


Done (00h 00min 09sec)

I got the same result as above, which you can see here.

dachary · November 1, 2020, 22:05

Could it be because Email::Receiver is called differently in the importer (see here)?

Email::Receiver.new(row['raw_message'])

instead of

receiver = Email::Receiver.new(raw_email, convert_plaintext: true, skip_trimming: false);

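dachary · November 2, 2020, 11:25

Or the difference lies in how the message is extracted from the mbox file: that is exactly where the code paths diverge. The raw_email = File.read("/tmp/windows.mbox") above is not the same as splitting the file with a regular expression, and the problem may be right there.

dachary · November 2, 2020, 11:41

Indeed, adding File.open('/tmp/message.txt', 'w') { |file| file.write(receiver.raw_email) } after this line produces the following file, which differs from the original.

message.txt (attachment, 3.7 KB)

When run in the Rails console, receiver.raw_email also differs from the original file: it is correctly encoded as UTF-8.

Where does this incorrect modification happen?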

riking · November 2, 2020, 12:32

You probably need to add a call to .force_encoding after reading the file, to tell Ruby how the email file is encoded.
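
For illustration, a minimal sketch of what such a call does in plain Ruby, outside of Discourse. The helper name decode_as and the sample bytes are assumptions made for this example; they are not part of the importer.

```ruby
# Hypothetical helper: reinterpret raw bytes in a known charset,
# then transcode them to UTF-8.
def decode_as(raw_bytes, charset)
  raw_bytes.dup.force_encoding(charset).encode("UTF-8")
end

# "réflexion" as a Windows-1252 mail would store it: é is the single byte 0xE9.
raw = "r\xE9flexion".b # .b returns a copy tagged as binary (ASCII-8BIT)

# Tagged as UTF-8, the lone 0xE9 byte is not a valid sequence.
puts raw.dup.force_encoding("UTF-8").valid_encoding? # false

# Tagged with the real charset first, the text converts cleanly.
puts decode_as(raw, "Windows-1252") # réflexion
```

force_encoding only changes the label on the bytes; it is the following encode call that actually rewrites them as UTF-8.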

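dachary · November 2, 2020, 12:42

Sorry if this is a newbie question, but I am not familiar with the codebase. Where would you suggest making such a change?

dachary · November 2, 2020, 12:56

After narrowing down where the problem happens, it seems to be here:

script/import_scripts/mbox/support/indexer.rb (github.com/discourse/discourse, main)

            yield receiver, filename, opts if receiver.present?
          end

line.scrub is what turns the content into something different from the original. If it is removed, the regular expression fails with: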

...
         1: from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `block in each_mail'
/var/www/discourse/script/import_scripts/mbox/support/indexer.rb:174:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

because the content really is not UTF-8 encoded :slight_smile:
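
Both effects can be reproduced in plain Ruby. This is a standalone sketch with a made-up sample line, not code from the indexer.

```ruby
# A Windows-1252 "é" (byte 0xE9) inside a string that Ruby believes is UTF-8.
line = "r\xE9flexion".b.force_encoding("UTF-8")

puts line.valid_encoding? # false

# scrub swaps each invalid byte for U+FFFD, so the original byte is lost.
scrubbed = line.scrub
puts scrubbed.valid_encoding?    # true
puts scrubbed.include?("\uFFFD") # true

# Matching the unscrubbed line is expected to raise, as in the trace above.
begin
  line =~ /^From /
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Once the scrubbed copy replaces the original line, the 0xE9 byte can no longer be recovered, which would explain the � in the imported post.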

Any ideas for a solution? Maybe a first pass over the mail headers that only looks for the charset? There seems to be a chicken-and-egg problem here.
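
One way such a first pass could look is sketched below. The helpers declared_charset and decode_body are hypothetical, and the sketch ignores multipart messages and Content-Transfer-Encoding, so it only illustrates the idea of reading the charset from the raw bytes before decoding anything.

```ruby
# Hypothetical helper: find the charset declared in the Content-Type header.
# Matching runs on a binary copy, so invalid UTF-8 bytes cannot raise.
def declared_charset(raw_message)
  headers = raw_message.b.split(/\r?\n\r?\n/, 2).first.to_s
  # Take the Content-Type header together with any folded continuation lines.
  content_type = headers[/^Content-Type:.*(?:\r?\n[ \t].*)*/i].to_s
  content_type[/charset="?([^\s";]+)/i, 1]
end

# Hypothetical helper: decode the body using the declared charset,
# falling back to UTF-8 when none is declared.
def decode_body(raw_message)
  charset = declared_charset(raw_message) || "UTF-8"
  body = raw_message.b.split(/\r?\n\r?\n/, 2)[1].to_s
  body.force_encoding(charset).encode("UTF-8", invalid: :replace, undef: :replace)
end

raw = "Content-Type: text/plain; charset=windows-1252; format=flowed\n\nr\xE9flexion\n".b
puts declared_charset(raw) # windows-1252
puts decode_body(raw)      # réflexion
```

An unknown charset name would make force_encoding raise an ArgumentError, so a real implementation would need a fallback for that case.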

dachary · November 2, 2020, 13:15

In the same file, replacing:


    line = line.scrub

    if line =~ @split_regex

with:

    if line.scrub =~ @split_regex

seems to fix the problem.

But I am not sure this is the right fix.
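
The idea behind the change can be shown with a simplified splitter. This is not the Discourse indexer: split_mbox, its default separator regex, and the sample mailbox are assumptions for the example. The scrubbed copy is used only to test for a separator, while the untouched line is what gets stored.

```ruby
require "stringio"

# Hypothetical, simplified mbox splitter. The default regex is illustrative;
# the separator ("From ") lines themselves are dropped here for brevity.
def split_mbox(io, split_regex = /^From /)
  messages = []
  current = +""
  io.each_line do |line|
    if line.scrub =~ split_regex # scrub only the copy used for matching
      messages << current unless current.empty?
      current = +""
    else
      current << line # keep the original bytes untouched
    end
  end
  messages << current unless current.empty?
  messages
end

mbox = "From a@example.com Sun Nov  1 10:12:00 2020\n" \
       "Content-Type: text/plain; charset=windows-1252\n\nr\xE9flexion\n" \
       "From b@example.com Sun Nov  1 10:13:00 2020\n\nok\n"
messages = split_mbox(StringIO.new(mbox.b.force_encoding("UTF-8")))

puts messages.size                       # 2
puts messages.first.bytes.include?(0xE9) # true
```

Because the 0xE9 byte survives the split, whatever parses the message afterwards can still decode it with the charset declared in its headers.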

gerhard · November 2, 2020, 13:50

That looks like the perfect way to fix the problem.
