Character encoding problem in imported MyBB database

Hi, I have a number of posts, and even usernames, imported from a MyBB forum that are showing random characters such as ’ and Â.

From what I can gather from reports of similar behavior in WordPress, could this be an encoding problem between Latin1 and UTF-8?

Is there a simple way to remove them after the import?

Which original characters do these symbols actually correspond to? I can't work out what they could have replaced.

Also, I see that some imported posts contain a large amount of unprocessed MyCode. Is there any way to get this processed in Discourse?

Yes. That’s my guess. I’m working with an import now with similar problems. Most of them are things like curly quotes and em dashes.
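
To make the mapping concrete, here is a small Ruby sketch of how this kind of damage typically arises, assuming the original text was UTF-8 and some step along the way read its bytes as Windows-1252. The helper names `mangle` and `unmangle` are made up for illustration; only the standard `String#force_encoding` and `String#encode` are used.

```ruby
# Assumption: the text was correct UTF-8 and got misread as Windows-1252.
# mangle and unmangle are hypothetical helper names, not part of any library.

# Simulate the damage: reinterpret the UTF-8 bytes as Windows-1252,
# then convert that misreading back to UTF-8.
def mangle(text)
  text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Reverse it: turn the characters back into Windows-1252 bytes,
# then read those bytes as the UTF-8 they originally were.
def unmangle(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

puts mangle("\u2019")            # right single quote, prints ’
puts mangle("\u2014")            # em dash, prints â€”
puts mangle("\u00A0").inspect    # non-breaking space becomes "Â" plus a non-breaking space
puts unmangle(mangle("\u2014"))  # round trip gives the em dash back
```

So a stray Â is usually what is left of a non-breaking space, and sequences starting with â€ are usually curly quotes or dashes.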

It’s far from easy, but you can do some post-processing that either re-encodes the text with force_encoding or attempts to replace the characters one by one.

Something like

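Post.find_each do |post|
  # Re-encode the mis-decoded characters to their original bytes, then read those bytes as UTF-8.
  fixed = post.raw.encode("Windows-1252").force_encoding("UTF-8")
  next unless fixed.valid_encoding? && fixed != post.raw
  post.raw = fixed
  post.save!
  post.rebake!
rescue EncodingError
  next # characters outside Windows-1252: this post was not mangled this way
end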

But I’d test it extensively on a staging site before you run it on your live data.
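
For the replace-the-characters-one-by-one route, a minimal sketch could look like the following. The table `MOJIBAKE_FIXES` and the helper `fix_mojibake` are hypothetical names, and the table only covers a few common cases (assuming UTF-8 text misread as Windows-1252); you would extend it with whatever sequences actually show up in your data.

```ruby
# Hypothetical lookup table: mangled sequence => intended character.
# Assumes UTF-8 text that was misread as Windows-1252.
MOJIBAKE_FIXES = {
  "\u00E2\u20AC\u2122" => "\u2019", # ’ -> right single quote
  "\u00E2\u20AC\u0153" => "\u201C", # “ -> left double quote
  "\u00E2\u20AC\u201D" => "\u2014", # â€” -> em dash
  "\u00E2\u20AC\u201C" => "\u2013", # â€“ -> en dash
  "\u00C2\u00A0"       => "\u00A0"  # Â + nbsp -> non-breaking space
}.freeze

# One pattern that matches any of the mangled sequences.
MOJIBAKE_PATTERN = Regexp.union(MOJIBAKE_FIXES.keys)

# Replace each known mangled sequence; everything else is left untouched.
def fix_mojibake(text)
  text.gsub(MOJIBAKE_PATTERN, MOJIBAKE_FIXES)
end

puts fix_mojibake("It\u00E2\u20AC\u2122s fine") # prints It’s fine
```

Unlike re-encoding the whole string, this leaves text that is already correct alone, so it also works on posts that mix damaged and undamaged characters. The cost is that anything missing from the table stays broken.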

1 Like

Thanks Jay.
Is there any clever way of dealing with the issue at the source, i.e. re-exporting the database from the old forum and then re-importing it free of the character and MyCode issues?

If you haven’t gone live yet, so that starting over is still an option, that’s the best way to do it.

The site is not officially live, but what is the best way to deal with the character issues and MyCode parsing when exporting from MyBB?

Exporting all data in UTF-8, if possible, will solve those issues.

1 Like

I went back to the original MyBB installation and found a warning in the admin control panel under Tools and Maintenance / System Health:

It is recommended not to use different encodings in your database. This may cause unexpected behavior or MySQL errors.

The tables are listed, and I could see that most, but not all, were in UTF-8 format. It looked like some, particularly those associated with plugins, were in an older format.

Clicking a ‘Convert all’ link brought up a response saying that /inc/config.php needed editing to support full 4-byte UTF-8:

$config['database']['encoding'] = 'utf8mb4';

After editing config.php and trying the conversion again, all tables now show as matching. I will try re-importing to Discourse and report back on whether this helps with the character issues.

I'm still not sure how to deal with MyCode parsing, though.

1 Like

You didn’t include any examples or details of this. At this point, it may be best to start a new thread and keep this one focused on the follow-up for the character encoding.

3 Likes

Hi, a new thread with an example is here