Something that seemed fine at first glance in dev was the way the posts were migrated. In the browser most of them look fine, but under the hood there are a lot of legacy HTML tags that break things such as loading pictures and importing them to local storage.
Is there a smart way to remove the HTML markup from all the posts after migration?
Another question: I also need to correct lots of URLs from http to https.
Thanks for your help. By the way, the makeshift script that I provided should definitely be amended to take better care of converting the posts to the Discourse post format (not in my skill set)…
It's easiest to fix these issues in the importer. If you've gone live already and that's not an option, then it's harder: you write code to modify the raw text and rebake the posts. There is no magic, I'm afraid.
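To make "modify the raw text" a bit more concrete for the http to https question, here is a minimal sketch in plain Ruby. The helper name `upgrade_links` and the `HTTPS_HOSTS` allow-list are made up for illustration, and `example.com` is a placeholder; only hosts you know serve HTTPS should go in the list.

```ruby
# Hypothetical allow-list of hosts that are known to serve HTTPS (placeholder values).
HTTPS_HOSTS = ["example.com", "www.example.com"].freeze

# Hypothetical helper: rewrite http:// to https:// only for allow-listed hosts.
def upgrade_links(raw)
  raw.gsub(%r{http://([^/\s"'<>\])]+)}i) do |match|
    host = Regexp.last_match(1)
    # Leave links to unknown hosts untouched.
    HTTPS_HOSTS.include?(host.downcase) ? "https://#{host}" : match
  end
end

puts upgrade_links("see http://example.com/a.png and http://other.org/x")
```

Inside a Rails console the result would be assigned back to the post's raw text before saving and rebaking, in the same way as the other snippets in this topic.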
It sounds like you are running into the issue described here: Fix broken images for posts created by the WP Discourse and RSS plugins. My first reply in that topic gives some details about what causes the issue. The issue affects images in posts that were created with HTML. I'll update that topic's title to make it clear that it affects more than just posts created with the WP Discourse plugin or an RSS feed.
Ideally the Discourse markdown parser would be able to handle HTML image tags that are wrapped in other HTML tags. I think it's a difficult problem to fix though.
Yes this is exactly the phenomenon with my broken images inside other HTML tags.
I have started correcting manually but it is laborious and is compounded by the fact that it bumps the post to the top of the latest list. That requires a manual bump reset etc.
I will go ahead and try to figure out the logic for removing the HTML tags by looking at a couple of really bad posts. Then I might need some help automating that across the whole database; I'm going to try Data Explorer to figure that part out. Could Data Explorer act as an IDE for doing these post transformations?
Hi, I have figured out how to clean up manually and also get the broken images fixed. However, I'd like to do it in an automated fashion.
What I would like to do is find a way to remove all the html tags like [P] and [/P] and [BR/].
I have searched the forum but I can't find anything close. I searched the import scripts too, but they don't include a Discourse-to-Discourse importer to start from. I figure I need a script to:
access the posts table
iterate through all posts
iterate through the post:
-> remove P tags altogether
-> replace BR/ with a newline character?
-> do something smart for URLs
-> do something smart for images
rebake all posts, probably.
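The text-handling part of the steps above could be sketched roughly like this. It is only an illustration: `clean_legacy_markup` is a made-up name, it assumes the tags are stored with real angle brackets, and it turns a closing P tag into a paragraph break rather than dropping it outright so that adjacent paragraphs don't run together.

```ruby
# Hypothetical helper: strip legacy <p> and <br> markup from a post's raw text.
def clean_legacy_markup(raw)
  raw
    .gsub(%r{<br\s*/?>}i, "\n")   # <br>, <br/> and <br /> become a newline
    .gsub(%r{</p\s*>}i, "\n\n")   # a closing </p> becomes a paragraph break
    .gsub(/<p(\s[^>]*)?>/i, "")   # an opening <p ...> is dropped (does not touch <pre>)
    .gsub(/\n{3,}/, "\n\n")       # collapse runs of blank lines
    .strip
end

puts clean_legacy_markup("<p>Hello<br/>world</p><p>Next</p>")
```

Keeping the transformation in a plain function like this means it can be tried on a few pasted samples first, and only later called from the loop that walks the posts table, saves, and rebakes.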
Can anyone guide me to some relevant discussion on Discourse, or does anyone have some script or snippets that can get me started? I'm not a seasoned developer, but I can modify things that work…
Will share back to community once I have achieved my goal.
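# Select only the posts that end with a Tapatalk signature.
posts = Post.where("raw LIKE '%Sent from%using Tapatalk'")
posts.find_each do |post|
  # Remove the signature line, then save and rebake so the cooked HTML is regenerated.
  post.raw = post.raw.gsub(/^Sent from my.+?using Tapatalk$/, "")
  post.save
  post.rebake!
end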
I don't think that you need to do something "smart" for images or URLs unless they are somehow broken.
You want something like
post.raw.gsub!(/\[\/?p\]/i, "\n")
to replace [p] and [/p] with a newline (an extra newline won't hurt, but you can remove the \n if you don't think you need a newline), but I haven't tested, so this is probably wrong. You can test in something like https://rubular.com/.
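For anyone who wants to try the substitution outside of Discourse first, here is a small standalone sketch. The method name `strip_bracket_p` is just for illustration. Note that Ruby regex literals take no `g` flag (only options such as `i`, `m` and `x`), because `gsub` already replaces every match.

```ruby
# Hypothetical helper: replace [p] and [/p] (any letter case) with a newline.
def strip_bracket_p(raw)
  # \[ and \] match literal brackets; \/? makes the slash optional.
  raw.gsub(/\[\/?p\]/i, "\n")
end

p strip_bracket_p("[P]one[/P][p]two[/p]")
```

All four tags in the sample are replaced in a single call, which is the behaviour you want when a post contains many tags.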
When we migrated our forum we had countless BBCode and tag issues like these from nearly two decades of forum posts.
We did not use the rake remap function for these; in all cases, we used the technique that @pfaffman outlines in his code snippet.
The code snippet above using gsub() summarizes one of the best ways to clean up the raw posts after (or even better, during) migration.
Make sure you test your regular expressions BEFORE you actually run them against the DB, and have a full backup before you do operations like this directly on your DB.
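One way to do that kind of testing without touching the database is a dry run over a handful of copied samples. The sketch below is plain Ruby; `preview` is a hypothetical helper, and the sample strings are invented. It returns before/after pairs only for the samples the pattern would actually change.

```ruby
# Hypothetical dry-run helper: show what a substitution would do, without saving anything.
def preview(samples, pattern, replacement = "")
  samples
    .map { |text| [text, text.gsub(pattern, replacement)] }
    .select { |before, after| before != after }
end

samples = ["Hi\nSent from my phone using Tapatalk", "No signature here"]
preview(samples, /^Sent from my.+?using Tapatalk$/).each do |before, after|
  puts "BEFORE: #{before.inspect}"
  puts "AFTER:  #{after.inspect}"
end
```

Once the output looks right on samples that cover the ugly cases, the same pattern can go into the loop that saves and rebakes.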
Hi, below is the content of my script/cleanup.rb, which I launch using: RAILS_ENV=development bundle exec ruby script/cleanup.rb
File content:
require_relative '../config/environment'

# Count how many posts the loop visits.
pm = 0
Post.find_each do |test|
  # Strip one-letter tags such as <p>, </p> and <b>.
  test.raw.gsub!(/<(.|\/.)>/i, "")
  test.save
  test.rebake!
  pm += 1
end
puts "cycled through #{pm} posts"
I tried d/rake posts:rebake, which rebakes 1757 posts. The script cycles through just 1712 posts, which is the imported material with the HTML tags; the remainder are new ones created in Discourse.
I think I am getting close, but when I inspect the raw content in the UI I keep seeing all the HTML tags.
I tried rebooting the environment and relaunching unicorn, but to no avail. So close… so close ;o)
I used your Rubular suggestion and now regexr.com (see the screenshot below). I just went for the p and the r tags for now, to get those sorted before adding the more complex ones.
When I added a puts statement to my little cleanup.rb to print the raw post contents to the CLI, I noticed there were no HTML tags printed at all.
However, when I edit any post I do see the following, with the HTML tags showing in the editor. This doesn't seem to be the normal situation, because when I come back to edit the post I am writing now I don't see HTML tags…
That gsub already replaces every match, so it doesn't need a g after the /i (Ruby regex literals don't accept one). But if what you see in your puts is different from what you see after the script has run, then I don't have an explanation.
Thanks that worked from the regexp side of things.
It seems there are two things that I can't explain using this script:
# Call it like this:
# RAILS_ENV=development bundle exec ruby script/cleanup.rb > cleanup.log
require_relative '../config/environment'

pm = 0
Post.find_each do |test|
  # Log the raw text as it is before the substitution runs.
  puts test.raw
  test.raw.gsub!(/<(.|\/.)>/im, "")
  test.save
  test.rebake!
  pm += 1
end
puts "cycled through #{pm} posts"
After running this a couple of times, cleanup.log still contains 10 instances of <p> in around 21,000 lines of raw post material. Weirdly enough, these never get removed.
Even weirder (to me): when I fire up unicorn and access the site on my local machine, I still see the HTML tags in the raw view of the editor for every post I check.
It seems I may not be looking at the same environment. Does d/unicorn look at a local production environment while my script applies changes to dev?
I'm trying to get this working locally first, using the Docker dev environment guide, before going to my live site.
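A quick way to compare environments might be to print the Rails environment and database name from the script, and then run the same lines in a console attached to the instance that unicorn serves. This is a sketch, not a diagnosis: `environment_report` is a made-up helper, and it assumes the PostgreSQL adapter, which provides `current_database`. Outside Rails it just reports that Rails is not loaded.

```ruby
# Hypothetical helper: report which Rails environment and database are in use.
def environment_report
  # Guard so the snippet also runs (harmlessly) outside a Rails process.
  return "Rails is not loaded" unless defined?(Rails) && defined?(ActiveRecord::Base)

  "env=#{Rails.env} db=#{ActiveRecord::Base.connection.current_database}"
end

puts environment_report
```

If the two reports differ, the script and the site are reading different databases, which would explain edits that never show up in the editor.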