Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · December 14, 2018, 11:30 PM

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?

pfaffman · December 15, 2018, 1:09 PM

You need to either keep trying until the export really is UTF-8, or modify the import script to do the conversion itself. It is a frustrating problem.
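Either way, it helps to know where the export stops being valid UTF-8. A minimal sketch, not part of vanilla.rb: `invalid_utf8_lines` is a hypothetical helper, and the limit of 20 lines and the 120-character preview are arbitrary choices. It reads the file as raw bytes and lists the lines that fail a UTF-8 validity check.

```ruby
# Sketch: report which lines of an export file are not valid UTF-8.
# `invalid_utf8_lines` is a hypothetical helper, not part of vanilla.rb.
def invalid_utf8_lines(path, limit: 20)
  bad = []
  # Binary mode: Ruby hands back raw bytes and does not interpret them yet.
  File.open(path, 'rb') do |file|
    file.each_line.with_index(1) do |line, number|
      next if line.dup.force_encoding(Encoding::UTF_8).valid_encoding?

      # `inspect` on a binary string shows offending bytes as \xNN escapes.
      bad << [number, line.inspect[0, 120]]
      break if bad.size >= limit
    end
  end
  bad
end

# Example call (the path is a placeholder):
# invalid_utf8_lines('export.txt').each { |n, preview| puts "#{n}: #{preview}" }
```

Looking at the reported lines can show whether the problem is a handful of stray bytes or a whole file in another encoding.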

Nick_Chomey · November 4, 2021, 2:11 PM

@dpkoch Did you ever figure this out?

pfaffman · November 4, 2021, 2:45 PM

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one format and others in another format. I think that I did some nonsense where I coerced things on a value-by-value basis.
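A value-by-value coercion of that kind might look roughly like the sketch below. It is an illustration, not the code I used: `coerce_to_utf8` is a hypothetical name, and treating everything that is not valid UTF-8 as Windows-1252 is an assumption that has to be checked against the real data.

```ruby
# Sketch: coerce one value to UTF-8. Anything that is not already valid
# UTF-8 is ASSUMED to be Windows-1252; adjust `fallback` for your data.
def coerce_to_utf8(value, fallback: Encoding::Windows_1252)
  text = value.to_s.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Reinterpret the same bytes in the fallback encoding and transcode.
  # Bytes the fallback encoding cannot map are replaced instead of raising.
  text.force_encoding(fallback)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

This handles tables where some rows are UTF-8 and others are not, because each value is checked on its own. It is still a heuristic: bytes written in another encoding that happen to form valid UTF-8 pass through unchanged.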

Nick_Chomey · November 4, 2021, 2:52 PM

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · November 4, 2021, 2:56 PM

Oh. It’s awful. Your best bet, based on a vague recollection of the single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most of the data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on a conversion that worked for all (most?) of the data.

Here is what I did. Use at your own risk. (This was vBulletin, FWIW.)

  def char_map(raw_original)
    raw = raw_original.dup
    debug = false # (raw.length > 50)

    # Treat the bytes as UTF-8, transcode to Windows-1252 (dropping anything
    # invalid or unmappable), then relabel the result as UTF-8 and scrub it.
    win_encoded = ''
    begin
      win_encoded = raw.force_encoding('utf-8').encode("Windows-1252",
                            invalid: :replace, undef: :replace, replace: ""
                           ).force_encoding('utf-8').scrub
    rescue StandardError
      puts "\n#{'-'*50}\nWin1252 failed for \n\n#{raw}\n\n"
    end

    # Use the converted text only if the conversion produced something and
    # changed the input; otherwise fall back to the original string.
    if win_encoded.length > 0 && win_encoded != raw
      all = (debug ? "Win1252--" : '') + win_encoded
    else
      all = raw
    end

    # old_char_map is another helper from the same script (not shown here).
    old_char_map(all)
  end
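For the "try as many conversions as you can" part, a rough way to compare candidates is to count how many sample values each one converts without an error. This is only a sketch: `CANDIDATE_ENCODINGS`, `converts_cleanly?` and `conversion_scores` are hypothetical names, and the candidate list is an assumption.

```ruby
# Sketch: score candidate source encodings against a sample of raw values.
# The candidate list is an assumption; extend it for your own data.
CANDIDATE_ENCODINGS = %w[UTF-8 Windows-1252 ISO-8859-1].freeze

# True when `bytes`, read as `source_encoding`, is valid and can be
# transcoded to UTF-8 without raising.
def converts_cleanly?(bytes, source_encoding)
  text = bytes.dup.force_encoding(source_encoding)
  return false unless text.valid_encoding?

  text.encode(Encoding::UTF_8)
  true
rescue EncodingError
  false
end

# Returns a hash of encoding name => number of samples that converted.
def conversion_scores(samples, candidates = CANDIDATE_ENCODINGS)
  candidates.map do |name|
    [name, samples.count { |bytes| converts_cleanly?(bytes, name) }]
  end.to_h
end
```

A clean conversion only means that nothing raised. ISO-8859-1 accepts every byte, so it always scores 100%, and the converted text still has to be read by a human to see whether it looks right.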

Nick_Chomey · November 4, 2021, 3:00 PM

Is that code used within the import script, or on the server/database side?

pfaffman · November 4, 2021, 3:01 PM

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).

Nick_Chomey · November 4, 2021, 3:02 PM

Ok, thanks very much! This should give me a huge head-start in debugging this.

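Nick_Chomey · November 11, 2021, 12:27 PM

We fixed this by adding a command along the lines of encode("UTF-8") when reading the file, so that it is encoded as UTF-8. It goes in lines 76–80 of the vanilla.rb import script.

I’ll update this once I’ve confirmed the exact syntax with the person who ran it on the command line.

Nick_Chomey · November 13, 2021, 9:59 PM

Here is what he used to fix it, starting at line 76 of vanilla.rb: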

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\""')
  StringIO.new(string)
end

Canapin · October 17, 2023, 9:52 AM

A post was split to a new topic: How to edit an import script on a production server?

ddeveloper · October 17, 2023, 12:33 PM

This doesn’t work. I get the same error.

pfaffman · October 17, 2023, 2:05 PM

You’ll need to Google encodings and figure out how to fix the broken encoding.
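One possible reason the read_file change above makes no difference for some files: as far as I know, force_encoding only relabels the bytes, and encode("UTF-8") on a string that is already tagged UTF-8 does not transcode anything, so invalid bytes survive until a later gsub trips over them. A minimal sketch of the difference, with hypothetical helper names (`relabel_only`, `scrub_to_utf8`); scrubbing throws the bad bytes away, which may or may not be acceptable for your data.

```ruby
# Relabelling and then "encoding" to the same encoding leaves bad bytes alone.
def relabel_only(bytes)
  bytes.dup.force_encoding('UTF-8').encode('UTF-8')
end

# Alternative: drop (or replace) every byte sequence that is not valid UTF-8.
def scrub_to_utf8(bytes, replacement = '')
  bytes.dup.force_encoding('UTF-8').scrub(replacement)
end

sample = "caf\xE9 \\N".b                      # a stray Latin-1 byte plus a \N marker
puts relabel_only(sample).valid_encoding?     # => false
puts scrub_to_utf8(sample).valid_encoding?    # => true
puts scrub_to_utf8(sample).gsub(/\\N/, '')    # the regex gsub now runs without raising
```

If the stray bytes are real characters in another encoding, transcoding from that encoding keeps them, whereas scrubbing removes them.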

southpaw · October 17, 2023, 3:02 PM

Hi @ddeveloper,

I went through this process a few months ago (and I’m not a developer) and managed to migrate a self-hosted Vanilla forum to self-hosted Discourse. The key for me was selecting “Vanilla 2” as the Source Forum Type in the first dropdown menu when running the data export with Vanilla Porter.

I used Vanilla Porter 2.6, downloadable at Vanilla Porter 2.6 RC1 — Vanilla Forums, rather than the 2.5 version linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, the UTF-8 error did not come back once I used the newer Vanilla Porter script with the “Vanilla 2” forum type.

If those two suggestions make no difference to your import, could you share some details about the steps you have taken so far and exactly what you are seeing? In troubleshooting, small differences within “the same error” can matter a lot.

ddeveloper · October 17, 2023, 5:06 PM

I followed the guide, but I used Porter version 2.6. I’ll export the file with version 2.6 and update here.

ddeveloper · October 17, 2023, 5:35 PM

I tried Porter 2.6, but I got the same UTF-8 error.

So far I have been following this guide: Migrate a Vanilla forum to Discourse

Everything went smoothly up to the UTF-8 encoding error. Some people have solved this problem. I tried their fixes, but they did not work for me.

I tried @Nick_Chomey’s solution above. I tried forcing UTF-8 encoding when reading the txt file, but unfortunately it did not work.

southpaw · October 17, 2023, 5:55 PM

Just to check, which Source Forum Type did you select in the Vanilla Porter dropdown menu?

What kind of computer are you using? The steps for converting a file to UTF-8 encoding differ from one computer to another.
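One route that does not depend on the operating system is to do the conversion in Ruby, since Ruby is already needed to run the import script. A minimal sketch, assuming the export is entirely in one known encoding: `convert_file_to_utf8`, both paths and the Windows-1252 default are placeholders, not something taken from the guide.

```ruby
# Sketch: rewrite an export file as UTF-8, given its real source encoding.
# `convert_file_to_utf8` is hypothetical; Windows-1252 is only a guess.
def convert_file_to_utf8(source_path, target_path, source_encoding: 'Windows-1252')
  bytes = File.binread(source_path)            # raw bytes, no interpretation
  text = bytes.force_encoding(source_encoding)
              .encode('UTF-8', invalid: :replace, undef: :replace)
  File.binwrite(target_path, text)             # returns the number of bytes written
end

# Example call (both paths are placeholders):
# convert_file_to_utf8('export.txt', 'export-utf8.txt')
```

This reads the whole file into memory, as the import script itself does, and it will produce wrong characters if the guessed source encoding is not the real one.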

ddeveloper · October 17, 2023, 6:09 PM

Thank you for your help.

I selected “Vanilla 2” as the Source Forum Type.

I have access to both Windows and Linux machines and can use either.
