How to fix formatting issues? - markdown badly migrated to HTML

We managed to migrate from Flarum (flarum.amybo.org) to Discourse (forum.amybo.org), but the forum is now riddled with formatting issues. For example:

@"Gerrit"#p174 I like the idea of a water based liquid (just like Rabaey's) so I checked out the [Nutrisorb Trace Minerals](https://www.biocare.co.uk/nutrisorbr-liquid-trace-minerals-15ml) ingredients:
1. Purified Water, 
2. Sodium Borate, 
3. Preservative (Citric Acid), 

Becomes:

<r><p><POSTMENTION discussionid="25" displayname="Gerrit" id="174" number="8">@"Gerrit"#p174</POSTMENTION> I like the idea of a water based liquid (just like Rabaey's) so I checked out the <URL url="https://www.biocare.co.uk/nutrisorbr-liquid-trace-minerals-15ml"><s>[</s>Nutrisorb Trace Minerals<e>](https://www.biocare.co.uk/nutrisorbr-liquid-trace-minerals-15ml)</e></URL> ingredients:</p>
<LIST type="decimal"><LI><s>1. </s>Purified Water, </LI>
<LI><s>2. </s>Sodium Borate, </LI>
<LI><s>3. </s>Preservative (Citric Acid), </LI>

Is there a recommended way to fix these?

It would be amazing if it could be automated across the whole forum (without showing each post as edited), but, if not, then a semi-automated fix that we could apply post by post would be better than needing to manually remove each HTML tag to revert to markdown.

The time to fix those was when you did the import. As someone who does a lot of migrations, people launching a forum with botched formatting like this is one of my greatest concerns. It would have been pretty easy to fix the import script, and much harder to fix post hoc now that you’ve launched. It would have been maybe an hour or two then, and now it’s 2-5X that much work.

Yeah, one could write a script that would clean the stuff up on the live forum and either (scary) skip the revision history (so there would be no edit and no notifications), or do the edit and tell it not to bump/notify. I’m pretty sure I’ve done it before. You would do something like:

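# Run in the Rails console; the pattern and replacement are placeholders.
script_user = Discourse.system_user
fixes = Post.where("raw LIKE ?", "%something broken%")
fixes.find_each do |p|
  # gsub, not gsub!: gsub! returns nil when nothing was replaced
  new_raw = p.raw.gsub(/some pattern/, "replacement")
  next if new_raw == p.raw
  PostRevisor.new(p).revise!(script_user, { raw: new_raw, edit_reason: "post fixer!" }, { bypass_bump: true })
end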

Does the raw have all of that HTML in it?


I believe you can use the posts:remap rake task for that (see the topic “Replace a string in all posts”); that command should not record an edit.


That rake task is fine for simple replacements, but quickly gets unwieldy.

What do you mean? Like if you want to manage complex regular expressions, that’s not practical?

Right. You’re calling the task in a shell, so quickly figuring out whether you’re escaping Bash, Ruby, or the regex becomes difficult or impossible. Also, it looks like the fixes are not going to be simple replacements.


I’m new to Discourse, but have now found how to access the raw, and yes, it does contain all the HTML:
https://forum.amybo.org/raw/56/9

Did you use the existing flarum_import.rb script? It’s hard to imagine that HTML is in the Flarum p.content field (which is what goes into raw in the import script). I am pretty sure that Flarum uses markdown, so I don’t know why you’d have HTML in raw. Or maybe the script is just that broken.

But strikethrough is what is in the HTML. Maybe you can just fix that with CSS.
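
If CSS turns out not to be enough, the tags could be removed from raw instead. Judging only from the example in the question, dropping every tag while keeping the text between the tags seems to give back the original markdown. Here is a minimal sketch of that idea in plain Ruby. The helper name strip_flarum_markup and the sample string are made up for illustration, and it assumes that literal `<` and `>` in post text are stored as entities and that no attribute value contains `>`, so check it against a few real posts first.

```ruby
require "cgi"

# Hypothetical helper (not part of Discourse or Flarum): removes every
# XML-style tag, keeps the text between the tags, then decodes entities
# such as &amp; and &lt;.
def strip_flarum_markup(raw)
  CGI.unescapeHTML(raw.gsub(/<[^>]+>/, ""))
end

# Made-up sample in the same shape as the broken post in the question.
sample = '<r><p>See the <URL url="https://example.com/a"><s>[</s>docs' \
         '<e>](https://example.com/a)</e></URL>:</p>' \
         "\n" \
         '<LIST type="decimal"><LI><s>1. </s>Purified Water</LI>' \
         "\n" \
         '<LI><s>2. </s>Sodium Borate</LI></LIST></r>'

puts strip_flarum_markup(sample)
```

A function like this could supply the new raw in a PostRevisor loop, ideally after a dry run that only prints the before and after for a handful of posts.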

Fixing the <POSTMENTION> is a bit more tricky since Discourse has quotes, but not mentions. The easy solution would be to just change those into a simple @${displayname} (and hope that the username is the same as before the import, or do a lookup in the user_custom_fields to find the updated username). Another thing you could do would be to include a link, something like @mention said [here](/t/-/<discourse ID for topic 25>/<8>).
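
Both variants could be sketched roughly like this in plain Ruby. The helper convert_postmentions and the topic_ids hash (old Flarum discussion ID mapped to new Discourse topic ID) are hypothetical, the ID 310 is invented, and the sketch assumes that the display name is still a valid username and that post numbers survived the import.

```ruby
# Matches one <POSTMENTION ...>...</POSTMENTION> element and captures its attributes.
POSTMENTION_TAG = %r{<POSTMENTION\b([^>]*)>.*?</POSTMENTION>}m

# Hypothetical helper: rewrites each POSTMENTION as a plain @mention, or as
# a mention plus a link when the Flarum discussion ID is found in topic_ids.
def convert_postmentions(raw, topic_ids = {})
  raw.gsub(POSTMENTION_TAG) do
    attrs = Regexp.last_match(1).scan(/(\w+)="([^"]*)"/).to_h
    mention = "@#{attrs["displayname"]}"
    topic_id = topic_ids[attrs["discussionid"]]
    topic_id ? "#{mention} said [here](/t/-/#{topic_id}/#{attrs["number"]})" : mention
  end
end

sample = '<POSTMENTION discussionid="25" displayname="Gerrit" id="174" ' \
         'number="8">@"Gerrit"#p174</POSTMENTION> I like the idea'

puts convert_postmentions(sample)
# prints: @Gerrit I like the idea

puts convert_postmentions(sample, { "25" => 310 }) # 310 is an invented topic ID
# prints: @Gerrit said [here](/t/-/310/8) I like the idea
```

Display names that contain spaces or other characters not allowed in usernames would still need the lookup mentioned above.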

A crazy solution would be to write a script that pulls the markdown from the flarum database and updates the raw field to include that. It would still need to be cleaned up a bit (as for mentions and POSTMENTIONS), but it would fix a lot of stuff.

Another idea would be to freeze your site, do a mass deletion of all of the imported data, fix the import script, and run it again.

But you don’t have a lot of new posts since you moved, so maybe you could save them somehow, do a fresh import on an empty database, and then add those back. Dealing with new users would be a bit harder.

If you’ve got a budget, you can contact me or post in marketplace.


Thanks Jay. I didn’t do the migration myself, so I don’t know which script was used. The member who did is down with flu, but I’ll point them to your excellent advice here when they return.
