Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · 14 December 2018, 11:30pm

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?
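
For context, the error can be reproduced outside the import script. This is a minimal sketch, assuming the export contains a byte such as 0xE9 ("é" in Windows-1252/ISO-8859-1) that is not valid UTF-8 on its own; the sample string is made up for illustration, and the regex is the same kind used in `read_file`.

```ruby
# A string tagged as UTF-8 that contains a lone 0xE9 byte.
# 0xE9 is "é" in Windows-1252/ISO-8859-1 but is not a complete UTF-8 sequence.
bad = "caf\xE9 menu"

puts bad.encoding          # UTF-8
puts bad.valid_encoding?   # false

# Regex matching refuses to run on a string with a broken encoding.
error_message =
  begin
    bad.gsub(/\\$\n/m, "\\n")
    nil
  rescue ArgumentError => e
    e.message
  end

puts error_message
```

Checking `valid_encoding?` on the file contents before running the import is a quick way to confirm whether a re-export actually changed anything.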

pfaffman · 15 December 2018, 1:09pm

You either need to keep trying until the export really is UTF-8, or modify the import script to do the conversion. It is a frustrating problem.

Nick_Chomey · 4 November 2021, 2:11pm

@dpkoch Did you ever figure this out?

pfaffman · 4 November 2021, 2:45pm

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one format and others in another format. I think that I did some nonsense where I coerced things on a value-by-value basis.
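
A value-by-value coercion of that kind might look like the sketch below. It is not the code from that import: `to_utf8` is a hypothetical helper, and using Windows-1252 as the fallback encoding is an assumption you would need to check against your own data.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of one column value.
def to_utf8(value, fallback: "Windows-1252")
  str = value.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?   # already fine, leave untouched

  # Not valid UTF-8: assume the bytes are in the fallback encoding and transcode.
  str.force_encoding(fallback)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

rows = [
  "plain ascii",
  "caf\u00E9",     # already valid UTF-8
  "caf\xE9".b,     # the same word as a Windows-1252 byte string
]

rows.each { |value| puts to_utf8(value).inspect }
```

Checking validity first matters when rows are mixed: transcoding a value that is already valid UTF-8 as if it were Windows-1252 would turn it into mojibake.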

Nick_Chomey · 4 November 2021, 2:52pm

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · 4 November 2021, 2:56pm

Oh. It’s awful. Your best bet, based on a vague recollection of a single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on some conversion that worked for all (most?) data.

Here is what I did. Use at your own risk. (This was vbulletin, FWIW).

  def char_map(raw_original)
    raw = raw_original.dup.force_encoding('utf-8')
    debug = false # (raw.length > 50)

    # Undo double encoding: map the characters back to Windows-1252 bytes,
    # then reinterpret those bytes as UTF-8 and scrub anything still invalid.
    begin
      win_encoded = raw.encode("Windows-1252",
                               invalid: :replace, undef: :replace, replace: ""
                              ).force_encoding('utf-8').scrub
    rescue => e
      puts "\n#{'-'*50}\nWin1252 failed for \n\n#{raw}\n\n"
      win_encoded = ''
    end

    # Keep the converted text only if the conversion produced something different.
    if win_encoded.length > 0 && win_encoded != raw
      all = (debug ? "Win1252--" : '') + win_encoded
    else
      all = raw
    end

    # old_char_map is a further cleanup helper that is not shown in this post.
    old_char_map(all)
  end
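
To find out which conversion suits your data before committing to one, it can help to decode the same problem value under several candidate source encodings and compare the results by eye. This is a rough sketch; the candidate list, the `decode_candidates` helper, and the sample bytes are all made up for illustration.

```ruby
CANDIDATES = ["Windows-1252", "ISO-8859-1", "ISO-8859-15"].freeze

# Decode the raw bytes as each candidate encoding and transcode to UTF-8.
def decode_candidates(bytes)
  CANDIDATES.to_h do |name|
    decoded =
      begin
        bytes.dup.force_encoding(name).encode("UTF-8")
      rescue EncodingError => e
        "(failed: #{e.class})"
      end
    [name, decoded]
  end
end

# Bytes 0x92 and 0x80 as they might appear in an export.
sample = "It\x92s 5\x80".b
decode_candidates(sample).each do |name, text|
  puts "#{name.ljust(14)} #{text.inspect}"
end
```

A conversion rarely fails outright for single-byte encodings, so the useful signal is usually which output reads correctly, not which one raises.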

Nick_Chomey · 4 November 2021, 3:00pm

Is that code used within the import script or on the server/database side?

pfaffman · 4 November 2021, 3:01pm

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).
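
As a sketch of where such a call could go, the snippet below applies a cleaning function to the title and body of each row before it is handed to the importer. The row layout and the stand-in `char_map` are made up for illustration and do not come from vanilla.rb or the vBulletin script.

```ruby
# Stand-in for the char_map method above; here it only drops invalid bytes.
def char_map(raw)
  raw.dup.force_encoding("UTF-8").scrub("")
end

# Hypothetical shape of rows read from the export file.
rows = [
  { title: "Caf\xE9 hours".b, raw: "Open 9\xE2\x80\x935".b },
]

# Clean both fields before the rows reach the importer.
cleaned = rows.map do |row|
  row.merge(title: char_map(row[:title]), raw: char_map(row[:raw]))
end

cleaned.each { |row| puts row.inspect }
```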

Nick_Chomey · 4 November 2021, 3:02pm

Ok, thanks very much! This should give me a huge head-start in debugging this.

Nick_Chomey · 11 November 2021, 12:27pm

We managed to get it working by adding a simple command to encode the file as UTF-8 while reading it, using something like encode("UTF-8") at lines 76-80 of the vanilla.rb import script.

I’m just waiting for confirmation of the exact syntax from the guy who did it via the command line. I’ll update this when I have it.

Nick_Chomey · 13 November 2021, 9:59pm

Here is what he used to fix this, starting at line 76 of vanilla.rb

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\""')
  StringIO.new(string)
end
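
One caveat for anyone trying this: Ruby documents `String#encode` from an encoding to the same encoding as a no-op, so `.force_encoding('UTF-8').encode("UTF-8")` does not by itself remove invalid bytes. If the error persists, two alternatives are `String#scrub`, which drops or replaces invalid bytes, and transcoding from the file's real encoding. The sketch below assumes the stray bytes are Windows-1252, which you would need to verify against your own data; the sample string is made up.

```ruby
# Sample data: "naïve" with ï stored as the single byte 0xEF.
bytes = "na\xEFve".b

# UTF-8 to UTF-8 is a no-op, so the string is still invalid afterwards.
same = bytes.dup.force_encoding("UTF-8").encode("UTF-8")
puts same.valid_encoding?

# Option 1: discard whatever is not valid UTF-8 (lossy).
scrubbed = bytes.dup.force_encoding("UTF-8").scrub("")

# Option 2: name the real source encoding and transcode (keeps the character).
transcoded = bytes.encode("UTF-8", "Windows-1252")

puts scrubbed
puts transcoded
```

Option 1 always yields a string that `gsub` accepts but silently loses characters; option 2 preserves them only if the guessed source encoding is right.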

Canapin · 17 October 2023, 9:52am

A post was split to a new topic: How to modify an import script on the production server?

ddeveloper · 17 October 2023, 12:33pm

It doesn’t work. Same error.

pfaffman · 17 October 2023, 2:05pm

You need to google the encoding and figure out how to fix your bad encoding.

southpaw · 17 October 2023, 3:02pm

Hi @ddeveloper,

I worked through this process just a couple of months ago (and I’m not a developer) and managed to migrate a self-hosted Vanilla forum to self-hosted Discourse. One thing that was crucial for me was making sure, when exporting the data with Vanilla Porter, to select “Vanilla 2” as the Source Forum Type in the first drop-down menu.

I used version 2.6 of Vanilla Porter, available for download as a zip file here: Vanilla Porter 2.6 RC1 - Vanilla Forums, instead of version 2.5 linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, I no longer hit the UTF-8 error once I used the newer Vanilla Porter script and the “Vanilla 2” forum type.

If these two suggestions make no difference for your import, please provide some details about the steps you have taken so far and what exactly you are seeing. Sometimes there are slight variations on the “same error” that can make a big difference in troubleshooting.

ddeveloper · 17 October 2023, 5:06pm

I followed the same guide except that I used Porter version 2.6. I’ll export the file from version 2.6 and update here.

ddeveloper · 17 October 2023, 5:35pm

Okay, I tried Porter 2.6 and it produced the same UTF-8 error:

So far, I have followed this guide: Migrate a Vanilla forum to Discourse

Everything went fine until this UTF-8 encoding error. Some people have solved this problem. I tried those solutions, but they didn’t work for me.

I tried @Nick_Chomey’s solution above, forcing UTF-8 encoding while reading the txt file, but to my great regret that didn’t work either.

southpaw · 17 October 2023, 5:55pm

Just to be sure, which Source Forum Type did you select in the Vanilla Porter drop-down menu?

Could you tell us what kind of computer you are using? The instructions for converting your file to UTF-8 will vary.
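
Independent of the platform, a short Ruby script can at least show which lines of the export contain invalid bytes, which is useful detail to post. This is a sketch, assuming the export is a line-oriented text file; `invalid_lines` is a hypothetical helper and the temporary file stands in for the real export.

```ruby
require "tempfile"

# Hypothetical helper: list [line_number, line] pairs that are not valid UTF-8.
def invalid_lines(path)
  File.open(path, "rb") do |file|
    file.each_line.with_index(1).filter_map do |line, number|
      [number, line] unless line.dup.force_encoding("UTF-8").valid_encoding?
    end
  end
end

# Demo on a throwaway file; point invalid_lines at the real export instead.
Tempfile.create("export") do |f|
  f.binmode
  f.write("good line\nbad \xE9 line\n".b)
  f.flush
  invalid_lines(f.path).each do |number, line|
    puts "line #{number}: #{line.inspect}"
  end
end
```

The escaped bytes in the output (for example `\xE9`) are a good hint for guessing the original encoding.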

ddeveloper · 17 October 2023, 6:09pm

Thanks for taking the time to help a fellow user.

I selected “Vanilla 2” as the Source Forum Type.

I can use either Windows or Linux machines; I have access to both.
