Clean-up html tags in all posts after migration?

koen360 · December 3, 2020, 10:28am

Hi, as mentioned in "Makeshift but working flarum import script", I have migrated from Flarum to Discourse. It is my first site and I am totally digging your work, kudos to you all.

Something that at first glance in dev seemed to be just fine was the way the posts were all migrated. In the browser most look fine, but under the hood there are a lot of legacy html tags that break things such as loading pictures and importing them to local storage.

Is there a smart way to remove the html mark-up post migration of all the posts?
Other question, I also need to correct lots of urls from http to https.

Thanks for your help. By the way the makeshift script that I provided should definitely be amended to take better care of processing the posts to the Discourse post format (not in my skillset)…

pfaffman · December 3, 2020, 10:53am

It’s easiest to fix these issues in the importer. If you’ve gone live already and that’s not an option then it’s harder. You just write code to modify the raw text and rebake the posts. There is no magic, I’m afraid.
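
For the http to https part of the question, the "modify the raw text" step might look something like the sketch below. It is plain Ruby, so it can be tried without a Discourse console. The helper name `force_https` and the host `forum.example.com` are made up for illustration; the assumption is that you only rewrite hosts you know serve https.

```ruby
# Hypothetical helper: upgrade http:// links to https:// for known hosts.
# The host list is an assumption for this sketch, not taken from the thread.
HTTPS_HOSTS = ["forum.example.com"].freeze

def force_https(raw, hosts = HTTPS_HOSTS)
  # Regexp.union escapes each host, so the dots match literally.
  # The negative lookahead makes sure the host name does not continue
  # (so forum.example.com.other.org is left alone).
  pattern = %r{http://(#{Regexp.union(hosts).source})(?![\w.-])}i
  raw.gsub(pattern) { "https://#{Regexp.last_match(1)}" }
end

sample = "See http://forum.example.com/t/1 and http://other.example.org/x"
puts force_https(sample)
```

In a Rails console the result would be assigned back to `post.raw`, saved, and followed by `post.rebake!`.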

simon · December 3, 2020, 5:45pm

It sounds like you are running into the issue described here: Fix broken images for posts created by the WP Discourse and RSS plugins. My first reply in that topic gives some details about what causes the issue. The issue affects images in posts that were created with HTML. I’ll update that topic’s title to make it clear that it affects more than just posts created with the WP Discourse plugin or an RSS feed.

Ideally the Discourse markdown parser would be able to handle HTML image tags that are wrapped in other HTML tags. I think it’s a difficult problem to fix though.

koen360 · December 4, 2020, 1:52pm

Yes this is exactly the phenomenon with my broken images inside other HTML tags.

I have started correcting manually but it is laborious and is compounded by the fact that it bumps the post to the top of the latest list. That requires a manual bump reset etc.

I will go ahead and try to figure out the logic for removing the html tags by looking at a couple of really bad posts. Then I might need some help automating that across the whole database; I am going to try Data Explorer to figure that part out. Could Data Explorer act as an IDE for doing these post transformations?

Looking forward to the learning curve.

Koen

simon · December 4, 2020, 6:02pm

No, the Data Explorer plugin only allows you to read from the site’s database. It does not allow you to write to the site’s database.

koen360 · December 21, 2020, 11:12am

Hi I have figured out how to clean up in a manual fashion and also get the broken images fixed. However I’d like to do it in an automated fashion.

What I would like to do is find a way to remove all the html tags like [P] and [/P] and [BR/].

I have searched the forum but I can’t find anything close. I searched the import scripts too, but they don’t include a discourse to discourse importer to start from. I figure I need a script to:

access the posts table
iterate through all posts
iterate through the post:
–> remove P tags altogether
–> replace BR/ with a newline character?
–> do something smart for urls
–> do something smart for images
rebake all posts probably.
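
The first two "iterate through the post" steps above could be sketched roughly like this. It is plain Ruby operating on a string, so it runs outside Discourse; the helper name `strip_legacy_tags` is invented for the example, and it assumes the tags carry no attributes.

```ruby
# Hypothetical helper: drop <p>/</p> and turn <br>, <br/>, <br /> into a newline.
# Tags with attributes (e.g. <p class="x">) are deliberately not handled here.
def strip_legacy_tags(raw)
  raw
    .gsub(%r{</?p\s*>}i, "")      # remove opening and closing P tags
    .gsub(%r{<br\s*/?>}i, "\n")   # replace BR variants with a newline
end

sample = "<p>First line<br/>Second line</p>"
puts strip_legacy_tags(sample)

# Inside a Discourse Rails console the surrounding loop might look like this
# (untested sketch; it skips posts that would not change):
#   Post.find_each do |post|
#     cleaned = strip_legacy_tags(post.raw)
#     next if cleaned == post.raw
#     post.update!(raw: cleaned)
#     post.rebake!
#   end
```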

Can anyone guide me to some relevant discussion on Discourse, or does anyone have some script or snippets that can get me started? Not a seasoned developer but I can modify things that work…

Will share back to community once I have achieved my goal.

Koen

pfaffman · December 21, 2020, 6:47pm

Here’s code that does something similar:

posts=Post.where("raw like '%Sent from%using Tapatalk'")

posts.each do |post|
   post.raw.gsub!(/^Sent from my.+?using Tapatalk$/,"")
   post.save
   post.rebake!
end

I don’t think that you need to do something “smart” for images or URLs unless they are somehow broken.

You want something like

   post.raw.gsub!(/\[\/?p\]/i,"\n")

to replace [p] and [/p] with a newline (an extra newline won’t hurt, but you can remove the \n if you don’t think you need a newline), but I haven’t tested, so this is probably wrong. You can test in something like https://rubular.com/.
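
As an alternative to an online tester, a pattern like this can be checked in plain irb before it goes anywhere near real posts. The sketch below assumes the tags are literally `[p]` and `[/p]`; the constant name `BRACKET_P` is made up.

```ruby
# Matches [p] and [/p] in any letter case.
# The optional slash sits inside the brackets, right after the "[".
BRACKET_P = /\[\/?p\]/i

sample = "[P]Hello[/P][p]again[/p]"

# inspect makes the inserted newlines visible in the output.
puts sample.gsub(BRACKET_P, "\n").inspect
```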

koen360 · December 21, 2020, 10:24pm

Super, this put me totally on the right track.

I guess for my simple case going with rake posts:remap["find","replace"] should suffice right?

Going to give it a shot, thanks so much!

pfaffman · December 21, 2020, 10:25pm

It can be tricky (if it’s even possible) to figure out how to escape [ with that rake task.

koen360 · December 21, 2020, 10:31pm

Sorry, that "]" character is just what I put because I couldn’t figure out how to type the “<”.

I just need to remove a couple of these standard html tags.

Should be fine then right using the remap?

pfaffman · December 21, 2020, 10:32pm

Probably so. You quote things with backticks like this

`<`p`>`

or

`<p>`

neounix · December 22, 2020, 2:33am

FWIW @koen360

When we migrated our forum we had countless of these bbcode and tag issues from nearly two decades of forum posts.

We did not use the rake remap function for these and, in all cases, we used the technique that @pfaffman outlines in his code snippet:

This code snippet above using gsub() summarizes one of the best ways to clean up the raw posts after (or even better, during) migration.

Make sure you test your REGEX expressions BEFORE you actually implement them on the DB and have a full backup before you do operations like this directly on your DB.
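
One way to test an expression before touching the DB is a dry run that only reports what would change. The following is a sketch under that assumption; `preview_changes` is a hypothetical helper and the sample strings are invented.

```ruby
# Hypothetical dry run: list the strings a pattern would change, without
# saving anything. Returns one hash per string that would be modified.
def preview_changes(raws, pattern, replacement)
  raws.each_with_index.map do |raw, index|
    cleaned = raw.gsub(pattern, replacement)
    next if cleaned == raw
    { index: index, before: raw, after: cleaned }
  end.compact
end

samples = ["<p>imported post</p>", "a post written in Discourse"]

preview_changes(samples, %r{</?p>}i, "").each do |change|
  puts "sample #{change[:index]}: #{change[:before].inspect} -> #{change[:after].inspect}"
end
```

In a console the array of samples could be replaced by something like the raw text of a handful of known-bad posts, so the output can be read through before any `save` is called.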

koen360 · December 22, 2020, 10:51pm

Hi, below is the content of my script/cleanup.rb, which I launch using: RAILS_ENV=development bundle exec ruby script/cleanup.rb

File content:

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	test.raw.gsub!(/<(.|\/.)>/i,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

I tried d/rake posts:rebake, which rebakes 1757 posts. The script cycles through just 1712 posts, which is the imported stuff with the html tags; the remainder are new ones created in Discourse.

Think I am getting close but when I inspect the raw content in the UI I keep seeing all the html tags.

Tried rebooting the environment and relaunching unicorn but to no avail. So close… so close ;o)

pfaffman · December 22, 2020, 11:30pm

Did you test your regex somewhere? Do one post at a time to make sure that the raw got changed?
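
A small sketch of how that per-post check could be done: `String#gsub!` returns `nil` when the pattern matched nothing, so its return value says whether the raw was actually touched. The helper name `clean!` is invented for the example.

```ruby
# Hypothetical helper: strip <p>/</p> in place and report whether
# anything was changed.
def clean!(raw)
  # String#gsub! returns nil when the pattern matched nothing.
  !raw.gsub!(/<\/?p>/i, "").nil?
end

# Unary plus gives an unfrozen copy, so in-place modification is allowed.
tagged = +"<p>tagged</p>"
plain  = +"no tags here"

puts clean!(tagged)   # the string contained tags
puts tagged
puts clean!(plain)    # nothing to replace
```

It can also help to look at the return value of the save: in ActiveRecord, `save` returns `false` when validation fails, while `save!` raises, which makes a silent failure visible.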

koen360 · December 23, 2020, 9:27am

I used your rubular suggestion and now regexr.com, see below screenshot. I just went for the p and the r tags for now, to get those sorted before adding the more complex ones.

koen360 · December 25, 2020, 8:09pm

When I added a put statement to my little cleanup.rb to print out the raw post contents to the CLI I noticed there were no html tags printed out at all.

However, when I edit any post I do see the following, with the html tags on the hand side. This doesn’t seem to be the normal situation, because when I come back to edit the post I am now writing I don’t see html tags…

Anyone got a clue?

pfaffman · December 25, 2020, 9:17pm

That gsub needs a g after the /i to match multiple tags. But if what you see in your puts is different from what you see after the script has run, then I don’t have an explanation.

koen360 · December 26, 2020, 1:42pm

Weird I get:

script/cleanup.rb:9: unknown regexp option - g

when adding g to the i like this:

require_relative '../config/environment'

pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/ig,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"
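
The error comes from Ruby itself: regex literals accept options such as `i`, `m` and `x`, but there is no `g` option. Whether one or all matches are replaced is decided by the method, as this small plain-Ruby sketch shows (the sample text is invented):

```ruby
text = "<p>one</p><p>two</p>"
pattern = /<\/?p>/i   # note: no trailing g

# sub replaces only the first match.
puts text.sub(pattern, "")

# gsub replaces every match, which is the "global" behaviour.
puts text.gsub(pattern, "")
```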

neounix · December 27, 2020, 8:33am

As I recall (and this is how we generally match multilines using gsub …), a Ruby multiline REGEX requires m:

/./m - Any character (the m modifier enables multiline mode)

HTH
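
A quick illustration of what the `m` modifier changes, as a plain-Ruby sketch with an invented sample string: without it `.` stops at a newline, with it `.` matches the newline too.

```ruby
text = "<div>\n</div>"

# Without m, "." does not match the newline between the tags.
puts text.match?(/<div>.<\/div>/)

# With m, "." also matches "\n", so the pattern matches.
puts text.match?(/<div>.<\/div>/m)
```

Note that in Ruby `^` and `$` match at line boundaries with or without `m`; the modifier only affects `.`.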

koen360 · December 27, 2020, 10:22am

Thanks that worked from the regexp side of things.

It seems there are two things that I can’t explain using this script:

# Call it like this:
# RAILS_ENV=development bundle exec ruby script/cleanup.rb > cleanup.log

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/im,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

After running this a couple of times, cleanup.log keeps containing 10 instances of <p> in around 21000 lines of raw post material. Weirdly enough, these never get removed.
Even weirder (to me): when I fire up unicorn and access the site on my local machine, I still see the html tags in the raw view of the editor for every post I check.

Seems to me I am not looking at the same environment perhaps? Does the d/unicorn look at a local production environment while my script applies changes to dev?

I am trying to get this working locally first, in the dev environment from the docker guide, before going to my live site.

Must be some major newbie thing I am overlooking.
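
Separately from the environment question, one property of the pattern can be checked in isolation: `<(.|\/.)>` only matches tags whose name is a single character. A plain-Ruby sketch with an invented sample string:

```ruby
pattern = /<(.|\/.)>/im   # same pattern as in the script above

sample = "<p>text</p><br/><b>bold</b><img src=\"x.png\">"

# <p>, </p>, <b> and </b> are removed; <br/> and <img ...> are left in place
# because their names are longer than one character.
puts sample.gsub(pattern, "")
```

So longer tags such as `<br/>` or `<img>` would need their own patterns, and single-letter tags that might be wanted (like `<b>`) are stripped as well.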
