Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · December 14, 2018, 11:30

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?

pfaffman · December 15, 2018, 1:09

You either need to keep trying to get it to really be UTF-8 or modify the import script to do it. It is a frustrating problem.
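
Whichever route you take, it helps to know where the export stops being valid UTF-8. A minimal Ruby sketch, assuming the Porter export is a plain text file; `invalid_utf8_lines` is a hypothetical helper name and is not part of vanilla.rb:

```ruby
# Hypothetical helper (not part of vanilla.rb): list the lines of a string
# that are not valid UTF-8, so the offending bytes can be inspected.
def invalid_utf8_lines(data)
  bad = []
  # Work on raw bytes so that splitting into lines cannot raise.
  data.dup.force_encoding(Encoding::BINARY).each_line.with_index(1) do |line, number|
    candidate = line.dup.force_encoding(Encoding::UTF_8)
    # inspect shows the raw bytes as \xNN escapes.
    bad << [number, line.inspect] unless candidate.valid_encoding?
  end
  bad
end

# Usage, assuming the export path is passed on the command line:
#   data = File.binread(ARGV[0])
#   invalid_utf8_lines(data).each { |number, bytes| puts "#{number}: #{bytes}" }
```

The byte values it prints are usually a good hint about the real source encoding, which tells you what to convert from.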

Nick_Chomey · November 4, 2021, 2:11

@dpkoch Did you ever figure this out?

pfaffman · November 4, 2021, 2:45

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one format and others in another format. I think that I did some nonsense where I coerced things on a value-by-value basis.
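
Coercing value by value could look something like the sketch below. It is only an illustration: `coerce_to_utf8` is a hypothetical name, and the Windows-1252 fallback is an assumption that should be replaced with whatever legacy encoding the old forum actually used.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of a single value.
# Assumption: anything that is not already valid UTF-8 was written as
# Windows-1252; change `fallback` to match your data.
def coerce_to_utf8(value, fallback: "Windows-1252")
  as_utf8 = value.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Reinterpret the same bytes in the fallback encoding and transcode them.
  value.dup.force_encoding(fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end
```

Because the check runs per value, rows that are already UTF-8 pass through unchanged while rows in the legacy encoding are converted, which is the mixed-format situation described above.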

Nick_Chomey · November 4, 2021, 2:52

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · November 4, 2021, 2:56

Oh. It’s awful. Your best bet, based on a vague recollection of a single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on some conversion that worked for all (most?) data.

Here is what I did. Use at your own risk. (This was vbulletin, FWIW).

  def char_map(raw_original)
    raw = raw_original.dup
    debug = false # (raw.length > 50)

    # Treat the bytes as UTF-8, map them to Windows-1252 (dropping anything
    # that cannot be mapped), then reinterpret the result as UTF-8.
    win_encoded = ''
    begin
      win_encoded = raw.force_encoding('utf-8').encode("Windows-1252",
                            invalid: :replace, undef: :replace, replace: ""
                           ).force_encoding('utf-8').scrub
    rescue => e
      puts "\n#{'-'*50}\nWin1252 failed for \n\n#{raw}\n\n"
      win_encoded = ''
    end

    # Use the converted text only if the conversion produced something different.
    if win_encoded.length > 0 && win_encoded != raw
      all = (debug ? "Win1252--" : '') + win_encoded
    else
      all = raw
    end
    # old_char_map is not shown in this post; it has to be defined elsewhere
    # in the import script.
    old_char_map(all)
  end

Nick_Chomey · November 4, 2021, 3:00

That code is used within the import script or on the server/database side?

pfaffman · November 4, 2021, 3:01

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).

Nick_Chomey · November 4, 2021, 3:02

Ok, thanks very much! This should give me a huge head-start in debugging this.

Nick_Chomey · November 11, 2021, 12:27

We managed to get this working by adding a simple command to encode the file as UTF-8 as it is read, using something like encode("UTF-8") on lines 76 to 80 of the vanilla.rb import script.

I’m just waiting for confirmation of the exact syntax from the person who did it via the command line. I’ll update this when I have it.

Nick_Chomey · November 13, 2021, 9:59

Here is what he used to fix it, starting at line 76 of vanilla.rb:

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\""')
  StringIO.new(string)
end
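
One caveat, as far as I understand Ruby’s String#encode: transcoding a string to the encoding it is already tagged with does nothing unless options such as invalid: :replace are passed, so invalid bytes can survive the chain above and gsub can still raise. A lossy alternative is to scrub the bytes before the first gsub. This is a sketch, not the importer’s own code; `sanitize_export` is a hypothetical name and only the first gsub of read_file is reproduced:

```ruby
# Hypothetical variant of the cleanup step. It takes the file contents as a
# string so it can be tried outside the importer.
def sanitize_export(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  # String#scrub replaces each invalid byte sequence with the given string.
  # The affected characters are lost, so check what gets replaced.
  text = text.scrub("?") unless text.valid_encoding?
  # With valid UTF-8 the substitutions from read_file no longer raise.
  text.gsub("\\N", "")
end

# Inside read_file this would be used roughly as:
#   string = sanitize_export(File.binread(@vanilla_file))
```

Scrubbing gets the import running but turns every unreadable character into a question mark, so converting from the real source encoding is preferable when that encoding is known.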

Canapin · October 17, 2023, 9:52

A post was split to a new topic: How to modify an import script on the production server?

ddeveloper · October 17, 2023, 12:33

That doesn’t work. Same error.

pfaffman · October 17, 2023, 2:05

You need to google encoding and figure out how to fix your broken encoding.

southpaw · October 17, 2023, 3:02

Hi @ddeveloper,

I worked through this process a few months ago (and I’m not a developer) and successfully migrated a self-hosted Vanilla forum to self-hosted Discourse. One thing that was essential for me was making sure, when exporting the data with Vanilla Porter, to select “Vanilla 2” as the Source Forum Type in the first dropdown menu.

I used version 2.6 of Vanilla Porter, available to download as a zip file here: Vanilla Porter 2.6 RC1 - Vanilla Forums, instead of the version 2.5 linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, I no longer ran into the UTF-8 error when using the newer Vanilla Porter script and the “Vanilla 2” forum type.

If these two suggestions make no difference for your import, please provide some details about the steps you have taken so far and what exactly you are seeing. Sometimes there are slight variations on the “same error” that can make a big difference when troubleshooting.

ddeveloper · October 17, 2023, 5:06

I followed the same guide, except for using version 2.6 of Porter. I’ll export the file with version 2.6 and post an update here.

ddeveloper · October 17, 2023, 5:35

Okay, I tried Porter 2.6 and it resulted in the same UTF-8 error.

So far, I have followed this guide: Migrate a Vanilla forum to Discourse

Everything went fine until this UTF-8 encoding error. Some people have solved this problem. I tried their solutions, but they didn’t work for me.

I tried @Nick_Chomey’s solution above, forcing UTF-8 encoding when reading the txt file, but to my great disappointment, that didn’t work either.

southpaw · October 17, 2023, 5:55

To be sure, which Source Forum Type did you select in the Vanilla Porter dropdown menu?

Could you tell us what type of computer you are using? The instructions for converting your file to UTF-8 will vary.
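
Since Ruby is already needed to run the import script, one option that works the same on Windows and Linux is a small one-off converter. This is a rough sketch under a big assumption: that the whole export is in a single legacy encoding (Windows-1252 is only a guess here). `convert_file_to_utf8` is a hypothetical name.

```ruby
# Hypothetical one-off converter: rewrite a file as UTF-8.
# Assumption: the source file is entirely in `from`.
def convert_file_to_utf8(source, target, from: "Windows-1252")
  # Read raw bytes, then label them with the encoding they are assumed to be in.
  text = File.binread(source).force_encoding(from)
  # Bytes with no mapping in `from` become U+FFFD instead of raising.
  utf8 = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  File.binwrite(target, utf8)
end

# Usage (paths are placeholders):
#   convert_file_to_utf8("export.txt", "export-utf8.txt")
```

If parts of the file are already UTF-8, this would double-encode them, so it only suits exports that use one encoding throughout.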

ddeveloper · October 17, 2023, 6:09

Thanks for taking the time to help a fellow community member.

I selected “Vanilla 2” as the Source Forum Type.

I have access to both Windows and Linux machines.
