Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · December 14, 2018, 23:30

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?

pfaffman · December 15, 2018, 13:09

You need to either keep working on the export until it really is UTF-8, or modify the import script to do the conversion. It is a frustrating problem.

Nick_Chomey · November 4, 2021, 14:11

@dpkoch Did you ever figure this out?

pfaffman · November 4, 2021, 14:45

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one format and others in another format. I think that I did some nonsense where I coerced things on a value-by-value basis.
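
A minimal sketch of that kind of value-by-value coercion, assuming the stray rows are Windows-1252 or ISO-8859-1 (an assumption, not something confirmed in this thread). `coerce_to_utf8` and `FALLBACK_ENCODINGS` are hypothetical names, not part of the Discourse import scripts.

```ruby
# Hypothetical helper: coerce a single value to valid UTF-8.
# FALLBACK_ENCODINGS is an assumption; adjust it to whatever the old forum used.
FALLBACK_ENCODINGS = ["Windows-1252", "ISO-8859-1"].freeze

def coerce_to_utf8(value)
  return value if value.nil?

  text = value.dup.force_encoding("UTF-8")
  return text if text.valid_encoding? # already valid UTF-8, leave it alone

  FALLBACK_ENCODINGS.each do |name|
    begin
      # Reinterpret the raw bytes as the legacy encoding, then transcode.
      return value.dup.force_encoding(name).encode("UTF-8")
    rescue EncodingError
      next # some bytes have no mapping in this encoding, try the next one
    end
  end

  # Last resort: keep what is valid and drop the rest.
  text.scrub("")
end
```

Because each value is checked on its own, rows that are already valid UTF-8 pass through untouched and only the broken ones are reinterpreted.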

Nick_Chomey · November 4, 2021, 14:52

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · November 4, 2021, 14:56

Oh. It’s awful. Your best bet, based on a vague recollection of a single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most of the data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on a conversion that worked for all (most?) data.

Here is what I did. Use at your own risk. (This was vBulletin, FWIW.)

  def char_map(raw_original)
    raw = raw_original.dup
    debug = false # (raw.length > 50)

    # Windows-1252 pass: treat the bytes as UTF-8, transcode to Windows-1252
    # (dropping anything that does not map), then reinterpret the resulting
    # bytes as UTF-8 and scrub whatever is still invalid.
    win_encoded = ''
    begin
      win_encoded = raw.force_encoding('utf-8')
                       .encode('Windows-1252', invalid: :replace, undef: :replace, replace: '')
                       .force_encoding('utf-8')
                       .scrub
    rescue StandardError
      puts "\n#{'-' * 50}\nWin1252 failed for \n\n#{raw}\n\n"
    end

    # Use the converted text only if the pass produced something different.
    all = if win_encoded.length > 0 && win_encoded != raw
      (debug ? 'Win1252--' : '') + win_encoded
    else
      raw
    end

    # old_char_map is defined elsewhere (not included in this post).
    old_char_map(all)
  end

Nick_Chomey · November 4, 2021, 15:00

Is that code used within the import script, or on the server/database side?

pfaffman · November 4, 2021, 15:01

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).

Nick_Chomey · November 4, 2021, 15:02

Ok, thanks very much! This should give me a huge head-start in debugging this.

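Nick_Chomey · November 11, 2021, 12:27

We solved this by adding a simple command that encodes the data as UTF-8 when the file is read, e.g. using encode("UTF-8") in lines 76-80 of the vanilla.rb import script.

I’m waiting for the person who provided the command to confirm the exact syntax. I’ll update this post once I have it.

Nick_Chomey · November 13, 2021, 21:59

Here is what he used to fix this, starting at line 76 of vanilla.rb: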

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\\"\"')
  StringIO.new(string)
end
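
One thing to check if you try this: as far as I know, calling encode("UTF-8") on a string that is already tagged as UTF-8 does not transcode or validate anything, so invalid bytes can still reach gsub. Below is a sketch of an alternative using String#scrub (Ruby 2.1+). `clean_dump` is a hypothetical name, and replacing bad bytes with "?" is an assumption that loses the original characters.

```ruby
# Hypothetical cleanup step for the dump text, not the stock vanilla.rb code.
def clean_dump(raw)
  raw.dup
     .force_encoding("UTF-8") # label the bytes as UTF-8 without changing them
     .scrub("?")              # replace each invalid byte sequence so gsub cannot raise
     .gsub("\\N", "")         # the remaining gsub calls from read_file would follow here
end

puts clean_dump("caf\xE9 \\N ok".b)
```

In read_file this would wrap the result of File.read(@vanilla_file) before the rest of the gsub chain runs.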

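Canapin · October 17, 2023, 09:52

A post was split to a new topic: How to edit the import script on a production server?

ddeveloper · October 17, 2023, 12:33

That doesn’t work. Same error.

pfaffman · October 17, 2023, 14:05

You’ll need to search for information about encodings and work out how to repair the broken encoding.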
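
As a starting point for that search, it may help to see which lines of the export are actually broken. Here is a small sketch that lists the lines that are not valid UTF-8. `invalid_utf8_lines` is a hypothetical helper, not part of the import scripts.

```ruby
# Hypothetical diagnostic: return [line_number, preview] pairs for every
# line of the given data that is not valid UTF-8.
def invalid_utf8_lines(data)
  bad = []
  data.b.each_line.with_index(1) do |line, number|
    text = line.force_encoding("UTF-8")
    next if text.valid_encoding?

    # Show the line with the offending bytes replaced by "?".
    bad << [number, text.scrub("?").chomp]
  end
  bad
end

sample = "fine\nbroken caf\xE9\nalso fine\n".b
invalid_utf8_lines(sample).each { |number, preview| puts "#{number}: #{preview}" }
```

For a real export you would pass in File.binread of the porter file. The byte values on the reported lines are usually a good hint about which legacy encoding the data was written in.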

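southpaw · October 17, 2023, 15:02

Hi @ddeveloper,

I went through this process a few months ago (I’m not a developer) and successfully migrated a self-hosted Vanilla forum to self-hosted Discourse. For me, the key was making sure to select “Vanilla 2” as the Source Forum Type in the first dropdown when exporting the data with Vanilla Porter.

I used version 2.6 of Vanilla Porter, available as a zip file here: Vanilla Porter 2.6 RC1 — Vanilla Forums, rather than the version 2.5 linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, once I used the newer Vanilla Porter script with the “Vanilla 2” forum type, I no longer ran into the UTF-8 error.

If these two suggestions don’t help with your import, please share some details about the steps you’ve taken so far and exactly what you’re seeing. Sometimes there are subtle differences between “the same error” that can make a big difference when troubleshooting.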

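ddeveloper · October 17, 2023, 17:06

I have followed the same guide, except for using porter version 2.6. I’ll export the file with version 2.6 and post an update here.

ddeveloper · October 17, 2023, 17:35

OK, I tried porter 2.6 and got the same UTF-8 error.

So far I have been following this guide: Migrate a Vanilla forum to Discourse

Everything went smoothly until this UTF-8 encoding error came up. Some people have solved this problem. I tried their approaches, but they didn’t work for me.

I tried @Nick_Chomey’s solution above, forcing UTF-8 encoding when reading the txt file, but to no avail.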

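southpaw · October 17, 2023, 17:55

Just to be sure, which Source Forum Type did you select in the Vanilla Porter dropdown?

Can you tell us what type of computer you’re using? The instructions for converting a file to UTF-8 encoding differ depending on that.

ddeveloper · October 17, 2023, 18:09

Thank you for taking the time to help a fellow forum user.

I selected “Vanilla 2” as the Source Forum Type.

I have access to both Windows-based and Linux-based machines.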
