Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · December 14, 2018, 11:30pm

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?

pfaffman · December 15, 2018, 1:09pm

You either need to keep trying to get it to really be UTF-8 or modify the import script to do it. It is a frustrating problem.
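Before choosing a conversion, it can help to see which lines of the export actually contain invalid bytes. The following is a minimal sketch, not part of the import script: the helper name `invalid_utf8_lines` and the path you pass in are hypothetical.

```ruby
# Hypothetical helper: list the lines of a file that are not valid UTF-8.
# Returns an array of [line_number, inspected_line] pairs.
def invalid_utf8_lines(path)
  bad = []
  # Open in binary mode so Ruby does not try to interpret the bytes up front.
  File.open(path, "rb") do |file|
    file.each_line.with_index(1) do |line, number|
      # Label the raw bytes as UTF-8, then ask whether they form valid UTF-8.
      line.force_encoding(Encoding::UTF_8)
      bad << [number, line.inspect] unless line.valid_encoding?
    end
  end
  bad
end
```

Calling `invalid_utf8_lines("export.txt").first(10)` (with your own file name) shows the first few offending lines with the bad bytes escaped, which is usually enough to guess the original encoding.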

Nick_Chomey · November 4, 2021, 2:11pm

@dpkoch Did you ever figure this out?

pfaffman · November 4, 2021, 2:45pm

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one format and others in another format. I think that I did some nonsense where I coerced things on a value-by-value basis.

Nick_Chomey · November 4, 2021, 2:52pm

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · November 4, 2021, 2:56pm

Oh. It’s awful. Your best bet, based on a vague recollection of a single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on some conversion that worked for all (most?) data.

Here is what I did. Use at your own risk. (This was vBulletin, FWIW).

  def char_map(raw_original)
    raw = raw_original.dup
    debug = false # (raw.length > 50)

    # Encode the text to Windows-1252 bytes, dropping anything that cannot be
    # converted, then reinterpret those bytes as UTF-8 and scrub invalid sequences.
    win_encoded = ''
    begin
      win_encoded = raw.force_encoding('utf-8').encode("Windows-1252",
                            invalid: :replace, undef: :replace, replace: ""
                           ).force_encoding('utf-8').scrub
    rescue StandardError
      puts "\n#{'-'*50}\nWin1252 failed for \n\n#{raw}\n\n"
      win_encoded = ''
    end

    # Keep the converted text only if the conversion produced something different.
    if win_encoded.length > 0 && win_encoded != raw
      all = (debug ? "Win1252--" : '') + win_encoded
    else
      all = raw
    end

    # old_char_map is another helper from the same script; it is not shown in this post.
    old_char_map(all)
  end

Nick_Chomey · November 4, 2021, 3:00pm

That code is used within the import script or on the server/database side?

pfaffman · November 4, 2021, 3:01pm

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).
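As a rough illustration of fixing values one at a time, here is a minimal sketch. It assumes that any value which is not already valid UTF-8 is really Windows-1252, which may not hold for your data. The helper name `to_utf8` and the sample row are hypothetical and not part of any import script.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of a single field value.
# ASSUMPTION: values that are not valid UTF-8 are Windows-1252 encoded.
def to_utf8(value, fallback: Encoding::Windows_1252)
  text = value.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Reinterpret the same bytes in the fallback encoding, then transcode.
  # Bytes that Windows-1252 leaves undefined become "?".
  text.force_encoding(fallback)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

# Example row with one Windows-1252 value and one value that is already UTF-8.
row = { title: "caf\xE9".b, raw: "already fine: café" }
fixed = row.transform_values { |value| to_utf8(value) }
```

Checking validity first means rows that are already UTF-8 pass through untouched, which matters when a table mixes encodings.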

Nick_Chomey · November 4, 2021, 3:02pm

Ok, thanks very much! This should give me a huge head-start in debugging this.

Nick_Chomey · November 11, 2021, 12:27pm

We got this to work by adding a simple command to encode the file as UTF-8 while reading it, using something like encode("UTF-8") on lines 76-80 of the vanilla.rb import script.

I’m just waiting for confirmation on the exact syntax from the guy who did it via command line. I will update this when I have it.

Nick_Chomey · November 13, 2021, 9:59pm

Here is what he used to fix this, starting on line 76 of vanilla.rb

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\""')
  StringIO.new(string)
end
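A possible reason this does not help with every export: as far as I can tell from the Ruby `String#encode` documentation, converting a string to the encoding it is already tagged with is a no-op, so `force_encoding('UTF-8').encode("UTF-8")` can leave invalid bytes in place. The variant below is a sketch only. It replaces invalid bytes with `String#scrub` before the same substitutions run. The function name `read_export`, the `path` argument, and the `"?"` replacement are assumptions for illustration, and replacing bytes loses the original characters.

```ruby
require "stringio"

# Hypothetical standalone variant of read_file.
# ASSUMPTION: it is acceptable to replace invalid bytes with a placeholder.
def read_export(path, replacement: "?")
  string = File.binread(path)
               .force_encoding(Encoding::UTF_8)
               .scrub(replacement)             # replace invalid byte sequences
               .gsub("\\N", "")                # same substitutions as the post above
               .gsub(/\\$\n/m, "\\n")
               .gsub("\\,", ",")
               .gsub(/(?<!\\)\\"/, '""')
               .gsub(/\\\\\\"/, '\\""')
  StringIO.new(string)
end
```

If the source encoding is known, transcoding from it instead of scrubbing keeps the original characters.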

Canapin · October 17, 2023, 9:52am

A post was split to a new topic: How to edit an import script on the production server?

ddeveloper · October 17, 2023, 12:33pm

This doesn’t work. Same error.

pfaffman · October 17, 2023, 2:05pm

You need to Google about encoding and figure out how to fix your broken encoding.

southpaw · October 17, 2023, 3:02pm

Hi @ddeveloper,

I worked through this process just a couple of months ago (and I am not a developer) and managed to successfully migrate a self-hosted Vanilla forum to self-hosted Discourse. One thing that was key for me was making sure, when doing the data export with Vanilla Porter, to select “Vanilla 2” as the Source Forum Type in the first drop-down menu.

I used the Vanilla Porter 2.6 version available for download as a zip file here: Vanilla Porter 2.6 RC1 — Vanilla Forums instead of the 2.5 version linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, I didn’t hit the UTF-8 error again when using the newer Vanilla Porter script and the “Vanilla 2” forum type.

If those two suggestions don’t make a difference for your import, please provide a few details about the steps you’ve taken so far and exactly what you’re seeing. Sometimes there are slight variations to “same error” that can make a big difference when troubleshooting.

ddeveloper · October 17, 2023, 5:06pm

I’ve followed the same guide except for using the porter version 2.6. I will try the export file from version 2.6 and update here.

ddeveloper · October 17, 2023, 5:35pm

Okay, I’ve tried porter 2.6 and it resulted in the same UTF-8 error.

So far, I’ve followed this guide: Migrate a Vanilla forum to Discourse

Everything went well until this UTF-8 encoding error. Some people have resolved this issue. I’ve tried those, and it didn’t work out for me.

I tried @Nick_Chomey’s solution above, forcing UTF-8 encoding while reading the txt file, but in vain: it didn’t work either.

southpaw · October 17, 2023, 5:55pm

Just to be sure, which Source Forum Type did you select in the Vanilla Porter drop down menu?

Could you tell us what kind of computer you are using? Instructions to convert your file to UTF-8 encoding will vary.

ddeveloper · October 17, 2023, 6:09pm

Thank you for taking time to help a fellow discourser out.

I selected “Vanilla 2” in Source Forum Type.

I have access to both Windows and Linux machines and can use either.
