Clean-up html tags in all posts after migration?

Hi, as mentioned in "Makeshift but working flarum import script", I have migrated from Flarum to Discourse. This is my first site and I'm totally digging your work, kudos to you all.

Something that at first glance seemed fine in dev was the way the posts were migrated. In the browser most look fine, but under the hood there are a lot of legacy HTML tags that break things such as loading pictures and importing them to local storage.

Is there a smart way to remove the HTML markup from all the posts after migration?
Another question: I also need to correct lots of URLs from http to https.

Thanks for your help. By the way, the makeshift script that I provided should definitely be amended to take better care of processing the posts into the Discourse post format (not in my skillset)...

2 Likes

It's easiest to fix these issues in the importer. If you've gone live already and that's not an option, then it's harder. You just write code to modify the raw text and rebake the posts. There is no magic, I'm afraid.

3 Likes

It sounds like you are running into the issue described here: Fix broken images for posts created by the WP Discourse and RSS plugins. My first reply in that topic gives some details about what causes the issue. The issue affects images in posts that were created with HTML. I'll update that topic's title to make it clear that it affects more than just posts created with the WP Discourse plugin or an RSS feed.

Ideally the Discourse markdown parser would be able to handle HTML image tags that are wrapped in other HTML tags. I think it's a difficult problem to fix though.

2 Likes

Yes this is exactly the phenomenon with my broken images inside other HTML tags.

I have started correcting manually but it is laborious and is compounded by the fact that it bumps the post to the top of the latest list. That requires a manual bump reset etc.

I will go ahead and try to figure out the logic for removing the HTML tags by looking at a couple of really bad posts. Then I might need some help automating that across the whole database; I'm going to try Data Explorer to figure that part out. Could Data Explorer act as an IDE for doing these post transformations?

Looking forward to the learning curve.

Koen

1 Like

No, the Data Explorer plugin only allows you to read from the site's database. It does not allow you to write to the site's database.

2 Likes

Hi, I have figured out how to clean up manually and also get the broken images fixed. However, I'd like to do it in an automated fashion.

What I would like to do is find a way to remove all the html tags like [P] and [/P] and [BR/].

I have searched the forum but I can't find anything close. I searched the import scripts, but they don't contain a Discourse-to-Discourse importer to start from. I figure I need a script to:

access the posts table
iterate through all posts
iterate through the post:
--> remove P tags altogether
--> replace BR/ with a newline character?
--> do something smart for urls
--> do something smart for images
rebake all posts probably.
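
The per-post part of the steps above can be sketched in plain Ruby, so it can be tried on sample strings before touching any database. This is only an illustration under assumptions: `clean_raw` is a hypothetical helper name, the choice to turn `</p>` into a blank line is mine, and the http-to-https rewrite assumes every linked host actually serves HTTPS. Image tags are deliberately left alone.

```ruby
# Hypothetical helper (not part of Discourse): returns a cleaned copy of
# one post's raw text and leaves the original string untouched.
def clean_raw(raw)
  # <br>, <br/> and <br /> become a newline.
  cleaned = raw.gsub(%r{<br\s*/?>}i, "\n")
  # A closing </p> becomes a paragraph break (assumption: that is the
  # desired layout); an opening <p> or <p attr="..."> is dropped.
  cleaned = cleaned.gsub(%r{</p\s*>}i, "\n\n")
  cleaned = cleaned.gsub(/<p(\s[^>]*)?>/i, "")
  # Assumption: every linked host also answers on HTTPS.
  cleaned = cleaned.gsub(%r{\bhttp://}i, "https://")
  # Collapse runs of three or more newlines and trim the ends.
  cleaned.gsub(/\n{3,}/, "\n\n").strip
end

puts clean_raw("<p>Hello<br/>world</p><P>See http://example.com</P>")
```

Inside a loop such as `Post.find_each`, the returned string would then be assigned to the post's raw text, saved and rebaked.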

Can anyone guide me to some relevant discussion on Discourse, or does anyone have a script or snippets that can get me started? I'm not a seasoned developer, but I can modify things that work...

Will share back to community once I have achieved my goal.

Koen

1 Like

Here's code that does something similar:

posts=Post.where("raw like '%Sent from%using Tapatalk'")

posts.each do |post|
   post.raw.gsub!(/^Sent from my.+?using Tapatalk$/,"")
   post.save
   post.rebake!
end

I don't think that you need to do something "smart" for images or URLs unless they are somehow broken.

You want something like

   post.raw.gsub!(/\[\/?p\]/i,"\n")

to replace [p] and [/p] with a newline (an extra newline won't hurt, but you can remove the \n if you don't think you need a newline), but I haven't tested, so this is probably wrong. You can test in something like https://rubular.com/.

2 Likes

Super this put me totally on the right track.

I guess for my simple case going with rake posts:remap["find","replace"] should suffice right?

Going to give it a shot, thanks so much!

1 Like

It can be tricky (if it's even possible) to figure out how to escape [ with that rake task.

1 Like

Sorry, that "]" character is just what I put because I couldn't figure out how to put the "<".

I just need to remove a couple of these standard html tags.

Should be fine then right using the remap?

2 Likes

Probably so. You quote things with backticks like this

`<`p`>`

or

`<p>`

2 Likes

FWIW @koen360

When we migrated our forum we had countless of these bbcode and tag issues from nearly two decades of forum posts.

We did not use the rake remap function for these and, in all cases, we used the technique that @pfaffman outlines in his code snippet:

This code snippet above using gsub() summarizes one of the best ways to clean up the raw posts after (or even better, during) migration.

Make sure you test your REGEX expressions BEFORE you actually implement them on the DB and have a full backup before you do operations like this directly on your DB.
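
One way to follow that advice is a dry run that only reports what a substitution would change. The sketch below is an assumption-laden illustration, not an existing Discourse tool: `preview_changes` is a hypothetical helper, and the sample strings stand in for real post text.

```ruby
# Hypothetical dry-run helper: returns [before, after] pairs for the
# strings the substitution would change, without modifying anything.
def preview_changes(raws, pattern, replacement)
  raws.each_with_object([]) do |raw, changes|
    after = raw.gsub(pattern, replacement)
    changes << [raw, after] unless after == raw
  end
end

samples = ["<p>first</p>", "no tags here", "line<br/>break"]
preview_changes(samples, /<\/?p>/i, "").each do |before, after|
  puts "#{before.inspect} -> #{after.inspect}"
end
```

In a Rails console the samples could instead come from a small slice of real posts, for example `Post.limit(20).pluck(:raw)`, so the pattern is checked against actual data before anything is saved.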

1 Like

Hi, below the content of my script/cleanup.rb which I launch using: RAILS_ENV=development bundle exec ruby script/cleanup.rb

File content:

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	test.raw.gsub!(/<(.|\/.)>/i,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

I tried d/rake posts:rebake, which rebakes 1757 posts. The script cycles through just 1712 posts, which are the imported ones with the HTML tags; the remainder are new ones created in Discourse.

Think I am getting close but when I inspect the raw content in the UI I keep seeing all the html tags.

Tried rebooting the environment and relaunching unicorn but to no avail. So close... so close ;o)
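
It may help to check which tags that pattern can match at all. The sketch below runs the pattern from the script against a few sample tags; `BROADER_PATTERN` is my own assumption of what a wider match for p and br tags could look like, not something proposed in this thread.

```ruby
# The pattern from the script: "<", then one character (optionally
# preceded by "/"), then ">".
TAG_PATTERN = /<(.|\/.)>/i

# Assumption: a wider pattern for p and br tags, with or without
# attributes or a trailing slash.
BROADER_PATTERN = %r{</?(?:p|br)\b[^>]*>}i

samples = ["<p>", "</p>", "<b>", "<br>", "<br/>", "<img src=\"a.png\">"]
samples.each do |tag|
  puts "#{tag.ljust(20)} script: #{tag.match?(TAG_PATTERN)}  broader: #{tag.match?(BROADER_PATTERN)}"
end
```

Because the script's pattern allows only one character between the angle brackets, it can reach single-letter tags such as `<p>` or `<b>` but not `<br>`, `<br/>` or tags with attributes.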

2 Likes

Did you test your regex somewhere? Do one post at a time to make sure that the raw got changed?

1 Like

Used your rubular suggestion and now regexr.com see below screenshot. Just went for the p and the r tags for now and get those sorted before adding the more complex ones.

2 Likes

When I added a puts statement to my little cleanup.rb to print out the raw post contents to the CLI, I noticed there were no HTML tags printed out at all.

However, when I edit any post I do see the following, with the HTML tags on the hand side. This doesn't seem to be the normal situation, because when I come back to edit the post I am now editing I don't see HTML tags...

Anyone got a clue?

1 Like

That gsub needs a g after the /i to match multiple tags. But if what you see in your puts is different from what you see after the script has run, then I don't have an explanation.

1 Like

Weird I get:

script/cleanup.rb:9: unknown regexp option - g

when adding g to the i like this:

require_relative '../config/environment'

pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/ig,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"
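
For context on that error: Ruby regular expressions have no `g` option, because `gsub` already replaces every match while `sub` replaces only the first. A minimal sketch on a made-up string:

```ruby
raw = "<p>one</p><P>two</P>"

# sub stops after the first match; gsub replaces all of them.
first_only = raw.sub(/<\/?p>/i, "")
all_tags   = raw.gsub(/<\/?p>/i, "")

puts first_only
puts all_tags

# gsub! (with the bang) returns nil when nothing was replaced, so its
# return value should not be used as the cleaned string.
untouched = String.new("no tags here")
result = untouched.gsub!(/<\/?p>/i, "")
puts result.inspect
```

The options Ruby does accept after the closing slash include `i` (ignore case), `m` (let `.` match newlines) and `x` (extended syntax).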

1 Like

As I recall (and this is how we generally match multilines using gsub...), a Ruby multiline REGEX requires m:

/./m - Any character (the m modifier enables multiline mode)
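
A small sketch of what the `m` modifier changes, using a made-up string where the tag content spans two lines:

```ruby
text = "<p>line one\nline two</p>"

# Without m, "." stops at the newline, so ".*" cannot reach "</p>".
without_m = text.match?(/<p>.*<\/p>/)
# With m, "." also matches the newline.
with_m = text.match?(/<p>.*<\/p>/m)

puts "without m: #{without_m}"
puts "with m:    #{with_m}"
```

For a pattern like `/<(.|\/.)>/`, the `m` modifier only decides whether that single `.` may be a newline; it is not what makes `gsub` apply to every tag in a post.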

HTH

1 Like

Thanks that worked from the regexp side of things.

It seems there are two things that I can't explain using this script:

# Call it like this:
# RAILS_ENV=development bundle exec ruby script/cleanup.rb -> cleanup.log

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/im,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

  1. After running this a couple of times, cleanup.log still contains 10 instances of <p> in around 21000 lines of raw post material. Weirdly enough, these never get removed.
  2. Even weirder (to me): when I fire up unicorn and access the site on my local machine, I still see the HTML tags in the raw view of the editor for every post I check.

Seems to me I am not looking at the same environment perhaps? Does the d/unicorn look at a local production environment while my script applies changes to dev?

Trying to get this working using the dev environment using docker guide locally first before going to my live site.

Must be some major newbie thing I am overlooking. :sweat_smile:

1 Like