Error importing from vanilla: invalid byte sequence in UTF-8

dpkoch · 14 December, 2018 23:30

I’m trying to import from a Vanilla forum using the instructions posted here. However, I get the following error when I run the vanilla.rb import script:

Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
parsing file...
reading file...
Traceback (most recent call last):
	5: from script/import_scripts/vanilla.rb:254:in `<main>'
	4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	3: from script/import_scripts/vanilla.rb:17:in `execute'
	2: from script/import_scripts/vanilla.rb:37:in `parse_file'
	1: from script/import_scripts/vanilla.rb:72:in `read_file'
script/import_scripts/vanilla.rb:72:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

I’ve tried changing the MySQL database character set to UTF8 following the instructions here and then re-exporting the porter file, but that didn’t resolve the issue. Any suggestions?
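Before changing anything, it can help to see which lines of the export are actually broken. The following is a minimal sketch and not part of vanilla.rb; the helper name `invalid_utf8_lines` and the file name in the usage note are made up for illustration.

```ruby
# Hypothetical helper: list the lines of an export that are not valid UTF-8.
# `io` is any IO-like object opened in binary mode (a File or a StringIO).
def invalid_utf8_lines(io)
  bad = []
  io.each_line.with_index(1) do |line, number|
    # Tag the raw bytes as UTF-8 without changing them, then validate.
    candidate = line.dup.force_encoding(Encoding::UTF_8)
    bad << [number, candidate.inspect] unless candidate.valid_encoding?
  end
  bad
end
```

For example, `File.open("vanilla_export.txt", "rb") { |f| invalid_utf8_lines(f).first(20).each { |n, text| puts "#{n}: #{text}" } }` would print the first twenty offending lines, with the bad bytes shown as `\x..` escapes. Those bytes are often a good hint about the real source encoding.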

pfaffman · 15 December, 2018 13:09

You either need to keep trying until the export really is UTF-8, or modify the import script to convert it. It is a frustrating problem.

Nick_Chomey · 4 November, 2021 14:11

@dpkoch Did you ever figure this out?

pfaffman · 4 November, 2021 14:45

You can google stuff about UTF-8 encoding. You need to do something that will coerce the table into UTF-8. The time that I did it, there were further complications because some rows were in one encoding and others in another. I think that I did some nonsense where I coerced things on a value-by-value basis.
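A value-by-value coercion of the kind described above could look roughly like this. It is only a sketch: `to_utf8` is a hypothetical name, and the fallback to Windows-1252 is an assumption about the data, so swap in whatever encoding the old forum really used.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of `value`.
# Assumption: anything that is not already valid UTF-8 was written as
# Windows-1252 (the `fallback` argument).
def to_utf8(value, fallback: Encoding::Windows_1252)
  text = value.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding? # already fine, leave it alone

  # Reinterpret the original bytes in the fallback encoding and transcode.
  # Bytes with no mapping become "?" instead of raising.
  value.dup.force_encoding(fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end
```

Checking validity first matters when rows are mixed: values that are already good UTF-8 pass through untouched, and only the broken ones are reinterpreted.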

Nick_Chomey · 4 November, 2021 14:52

Sounds awful… We’ll have to tinker with the table encoding and see what happens. Thanks!

pfaffman · 4 November, 2021 14:56

Oh. It’s awful. Your best bet, based on a vague recollection of a single time I did this over a year ago, is to play around with as many different conversions as you can until you finally hit on one that works for all or most of the data. I think that I did a bunch of one-by-one transformations that ended up being a waste of time when I stumbled on some conversion that worked for all (most?) data.

Here is what I did. Use at your own risk. (This was vbulletin, FWIW).

  def char_map(raw_original)
    raw = raw_original.dup
    debug = false # (raw.length > 50)

    # Treat the bytes as UTF-8, transcode to Windows-1252 (dropping anything
    # invalid or unmappable), then relabel the result as UTF-8 and scrub it.
    # This can undo text that was double-encoded on its way into the database.
    win_encoded = ''
    begin
      win_encoded = raw.force_encoding('utf-8').encode("Windows-1252",
                            invalid: :replace, undef: :replace, replace: ""
                           ).force_encoding('utf-8').scrub
    rescue => e
      puts "\n#{'-'*50}\nWin1252 failed for \n\n#{raw}\n\n"
      win_encoded = ''
    end

    # Use the converted text only if the conversion produced something different.
    if win_encoded.length > 0 && win_encoded != raw
      all = (debug ? "Win1252--" : '') + win_encoded
    else
      all = raw
    end
    # old_char_map is not shown in this post; presumably an earlier helper
    # from the same script.
    old_char_map(all)
  end
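The "try as many conversions as you can" step can be made less painful by running one suspicious sample through several candidate encodings and eyeballing the results. This is a rough sketch, not something from the vBulletin script; `try_conversions` and the candidate list are assumptions chosen for illustration.

```ruby
# Candidate source encodings to try; adjust to whatever seems plausible
# for the forum in question (this list is only a guess).
CANDIDATE_ENCODINGS = %w[Windows-1252 ISO-8859-1 ISO-8859-15].freeze

# Hypothetical helper: interpret `bytes` in each candidate encoding and
# transcode to UTF-8. Returns a hash of encoding name => converted text,
# or a failure note when the bytes cannot be converted.
def try_conversions(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each_with_object({}) do |name, results|
    begin
      results[name] = bytes.dup.force_encoding(name).encode(Encoding::UTF_8)
    rescue EncodingError => e
      results[name] = "failed: #{e.class}"
    end
  end
end
```

Feeding it a few of the offending values (for example with `try_conversions(sample).each { |name, text| puts "#{name}: #{text}" }`) usually shows quickly which candidate yields readable text.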

Nick_Chomey · 4 November, 2021 15:00

Is that code used within the import script or on the server/database side?

pfaffman · 4 November, 2021 15:01

In the import script. I don’t like to mess with the database.

Somewhere you call this function on raw to fix raw (and maybe titles?).

Nick_Chomey · 4 November, 2021 15:02

Ok, thanks very much! This should give me a huge head start in debugging this.

Nick_Chomey · 11 November, 2021 12:27

We got this to work by adding a simple command to encode the file as UTF-8 when reading it, using something like encode("UTF-8") on lines 76-80 of the vanilla.rb import script.

I’m just waiting for confirmation of the exact syntax from the guy who did it via the command line. I’ll update this when I have it.

Nick_Chomey · 13 November, 2021 21:59

Here is what he used to fix this, starting at line 76 of vanilla.rb:

def read_file
  puts "reading file..."
  string = File.read(@vanilla_file)
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\N", "")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\$\n/m, "\\n")
    .force_encoding('UTF-8').encode("UTF-8").gsub("\\,", ",")
    .force_encoding('UTF-8').encode("UTF-8").gsub(/(?<!\\)\\"/, '""')
    .force_encoding('UTF-8').encode("UTF-8").gsub(/\\\\\\"/, '\\\"\"')
  StringIO.new(string)
end
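A note on why this may not help everyone: as far as I know, calling `.encode("UTF-8")` with no options on a string that is already tagged as UTF-8 leaves its bytes untouched, so any invalid sequences survive the chain above. `String#scrub` (Ruby 2.1 and later) replaces them instead. Here is a minimal sketch under that assumption; `scrub_to_utf8` is a hypothetical name, and replacing bytes with "?" throws away whatever character was originally there.

```ruby
# Hypothetical helper: tag raw bytes as UTF-8 and replace every invalid
# byte sequence with `replacement`, so later regex calls cannot fail on them.
def scrub_to_utf8(bytes, replacement = "?")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# A stand-in for file contents: one stray Windows-1252 byte (\xE9)
# followed by the "\N" null marker that read_file strips.
sample = "caf\xE9 \\N".b
clean = scrub_to_utf8(sample)
puts clean.gsub("\\N", "").inspect
```

In read_file this would mean reading the file with `File.binread` and scrubbing once before the first gsub. If the stray bytes are real text in another encoding, transcoding from that encoding keeps the characters, whereas scrubbing only silences the error.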

Canapin · 17 October, 2023 09:52

A post was split to a new topic: How to edit an import script on the production server?

ddeveloper · 17 October, 2023 12:33

This doesn’t work. Same error.

pfaffman · 17 October, 2023 14:05

You need to google encoding and figure out how to fix your broken encoding.

southpaw · 17 October, 2023 15:02

Hi @ddeveloper,

I worked through this process just a couple of months ago (and I’m not a developer) and managed to migrate a self-hosted Vanilla forum to self-hosted Discourse successfully. One thing that was key for me was making sure, when exporting the data with Vanilla Porter, to select “Vanilla 2” as the Source Forum Type in the first dropdown menu.

I used version 2.6 of Vanilla Porter, available for download as a zip file here: Vanilla Porter 2.6 RC1 - Vanilla Forums, instead of the version 2.5 linked in Migrate a Vanilla forum to Discourse.

If I remember correctly, I did not run into the UTF-8 error again when using the newer Vanilla Porter script and the “Vanilla 2” forum type.

If those two suggestions don’t make a difference in your import, please provide some details about the steps you have taken so far and exactly what you are seeing. Sometimes there are slight variations in the “same error” that can make a big difference when troubleshooting.

ddeveloper · 17 October, 2023 17:06

I followed the same guide, except for using Porter version 2.6. I’ll export the file with version 2.6 and update here.

ddeveloper · 17 October, 2023 17:35

Okay, I tried Porter 2.6 and it resulted in the same UTF-8 error.

So far, I have followed this guide: Migrate a Vanilla forum to Discourse

Everything went fine until this UTF-8 encoding error. Some people have resolved this problem. I have tried their fixes and they have not worked for me.

I tried @Nick_Chomey’s solution above, forcing UTF-8 encoding when reading the txt file, but unfortunately that didn’t work either.

southpaw · 17 October, 2023 17:55

Just to be sure, which Source Forum Type did you select in the Vanilla Porter dropdown menu?

Could you tell us what kind of computer you are using? The instructions for converting your file to UTF-8 encoding will vary.

ddeveloper · 17 October, 2023 18:09

Thanks for taking the time to help a fellow Discourse user.

I selected “Vanilla 2” as the Source Forum Type.

I can use both Windows-based and Linux-based devices and have access to both.
