Clean-up html tags in all posts after migration?

No, the Data Explorer plugin only allows you to read from the site’s database. It does not allow you to write to the site’s database.

2 Likes

Hi I have figured out how to clean up in a manual fashion and also get the broken images fixed. However I’d like to do it in an automated fashion.

What I would like to do is find a way to remove all the html tags like [P] and [/P] and [BR/].

I have searched the forum but I can’t find anything close. I also searched the import scripts, but they don’t contain a Discourse-to-Discourse importer to start from. I figure I need a script to:

access the posts table
iterate through all posts
iterate through the post:
–> remove P tags altogether
–> replace BR/ with a newline character?
–> do something smart for URLs
–> do something smart for images
rebake all posts probably.
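
The text-cleaning steps above can be sketched in plain Ruby, without Discourse loaded. This is only a rough sketch: `clean_raw` is a hypothetical helper name, the tags are assumed to be literal `<p>`, `</p>` and `<br/>`, and turning `</p>` into a blank line (rather than dropping it) is my own choice so that paragraphs stay separated. URLs and images are not handled here.

```ruby
# Hypothetical helper: returns a cleaned copy of one post's raw text.
def clean_raw(raw)
  raw
    .gsub(%r{<br\s*/?>}i, "\n")  # <br>, <br/> and <br /> become a newline
    .gsub(%r{</p>}i, "\n\n")     # a closing paragraph tag becomes a blank line
    .gsub(/<p>/i, "")            # an opening paragraph tag is dropped
    .gsub(/\n{3,}/, "\n\n")      # collapse runs of three or more newlines
    .strip
end

sample = "<P>Hello<BR/>world</P><p>Second paragraph</p>"
puts clean_raw(sample)
```

Inside a Rails console the same helper could presumably be applied per post with something like `post.raw = clean_raw(post.raw)` before saving and rebaking, but that part is untested here.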

Can anyone guide me to some relevant discussion on Discourse, or does anyone have some script or snippets that can get me started? Not a seasoned developer but I can modify things that work…

Will share back to community once I have achieved my goal.

Koen

1 Like

Here’s code that does something similar:

posts=Post.where("raw like '%Sent from%using Tapatalk'")

posts.each do |post|
   post.raw.gsub!(/^Sent from my.+?using Tapatalk$/,"")
   post.save
   post.rebake!
end

I don’t think that you need to do something “smart” for images or URLs unless they are somehow broken.

You want something like

   post.raw.gsub!(/\[\/?p\]/i, "\n")

to replace [p] and [/p] with a newline (an extra newline won’t hurt, but you can remove the \n if you don’t think you need a newline), but I haven’t tested, so this is probably wrong. You can test in something like https://rubular.com/.
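
A quick way to try a pattern like this locally, in plain Ruby with no Discourse involved. The sample text is made up, and the pattern is one possible way to match `[p]` and `[/p]` in either case, with the optional slash inside the brackets.

```ruby
# Matches [p], [/p], [P] and [/P]; the brackets are escaped because
# they are special characters in a regular expression.
pattern = /\[\/?p\]/i

text = "[P]First[/P][p]Second[/p]"
cleaned = text.gsub(pattern, "\n")

p cleaned
```

Printing with `p` shows the newlines as `\n`, which makes it easier to see exactly what was replaced.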

2 Likes

Super this put me totally on the right track.

I guess for my simple case going with rake posts:remap["find","replace"] should suffice right?

Going to give it a shot, thanks so much!

1 Like

It can be tricky (if it’s even possible) to figure out how to escape [ with that rake task.

1 Like

Sorry, that “]” character is just what I put because I couldn’t figure out how to type the “<”.

I just need to remove a couple of these standard html tags.

Using the remap should be fine then, right?

2 Likes

Probably so. You quote things with backticks like this

`<`p`>`

or

`<p>`
2 Likes

FWIW @koen360

When we migrated our forum we had countless BBCode and tag issues like these, from nearly two decades of forum posts.

We did not use the rake remap function for these; in all cases, we used the technique that @pfaffman outlines in his code snippet above.

That snippet, using gsub(), summarizes one of the best ways to clean up the raw posts after (or, even better, during) migration.

Make sure you test your regular expressions BEFORE you actually run them against the DB, and have a full backup before you do operations like this directly on your DB.
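
One way to do that kind of testing is a dry run that only reports what would change and saves nothing. The sketch below is plain Ruby; `preview_changes` is a hypothetical helper name and the sample strings are made up.

```ruby
# Hypothetical dry-run helper: lists the strings a pattern would change,
# without modifying the input or touching any database.
def preview_changes(raws, pattern, replacement = "")
  changes = []
  raws.each_with_index do |raw, index|
    cleaned = raw.gsub(pattern, replacement) # gsub returns a new string
    changes << { index: index, before: raw, after: cleaned } if cleaned != raw
  end
  changes
end

samples = ["<p>Hello</p>", "No tags here", "a < b and c > d"]
preview_changes(samples, /<(.|\/.)>/i).each do |change|
  puts "#{change[:index]}: #{change[:before].inspect} -> #{change[:after].inspect}"
end
```

With Discourse loaded, the list of strings could presumably come from a small sample such as `Post.limit(20).pluck(:raw)`, so that the results can be reviewed before any post is saved.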

1 Like

Hi, below the content of my script/cleanup.rb which I launch using: RAILS_ENV=development bundle exec ruby script/cleanup.rb

File content:

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	test.raw.gsub!(/<(.|\/.)>/i,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

I tried d/rake posts:rebake, which rebakes 1757 posts. The script cycles through just 1712 posts: the imported ones with the html tags, plus the remainder, which are new ones created in Discourse.

Think I am getting close but when I inspect the raw content in the UI I keep seeing all the html tags.

Tried rebooting the environment and relaunching unicorn but to no avail. So close… so close ;o)

2 Likes

Did you test your regex somewhere? Do one post at a time to make sure that the raw got changed?

1 Like

Used your rubular suggestion and now regexr.com see below screenshot. Just went for the p and the r tags for now and get those sorted before adding the more complex ones.
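
For reference, the same kind of check can be done in plain Ruby. This sketch runs the pattern from the script above against a few made-up sample tags; it suggests the pattern only matches tags whose name is a single character.

```ruby
pattern = /<(.|\/.)>/i

# Sample tags are made up for illustration.
["<p>", "</p>", "<P>", "<b>", "<br>", "<br/>", "<img src='x.png'>"].each do |tag|
  puts "#{tag.inspect} matches: #{tag.match?(pattern)}"
end
```

Here `<p>`, `</p>`, `<P>` and `<b>` match, while `<br>`, `<br/>` and the `<img>` tag do not, so longer tags would need their own patterns.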

2 Likes

When I added a puts statement to my little cleanup.rb to print out the raw post contents to the CLI, I noticed there were no html tags printed out at all.

However, when I edit any post I do see the following, with the html tags showing in the editor. This doesn’t seem to be the normal situation, because when I come back to edit the post I am now editing I don’t see html tags…

Anyone got a clue?

1 Like

That gsub needs a g after the /i to match multiple tags. But if what you see in your puts is different from what you see after the script has run, then I don’t have an explanation.

1 Like

Weird I get:

script/cleanup.rb:9: unknown regexp option - g

when adding g to the i like this:

require_relative '../config/environment'

pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/ig,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

1 Like

As I recall (and this is how we generally match multilines using gsub …), a Ruby multiline REGEX requires m:

/./m - Any character (the m modifier enables multiline mode)

See:

https://ruby-doc.org/core-2.7.2/Regexp.html


HTH
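
A small plain-Ruby illustration of what the `m` option changes, using a made-up two-line string: by default `.` does not match a newline, and with `m` it does.

```ruby
text = "line one\nline two"

# Without m, "." stops at the newline, so the pattern cannot span two lines.
p text.match?(/one.line/)

# With m, "." also matches the newline character.
p text.match?(/one.line/m)
```

The first call prints `false` and the second prints `true`. For a pattern like `/<(.|\/.)>/`, this would only make a difference if the single character inside the tag were itself a newline.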

1 Like

Thanks that worked from the regexp side of things.

It seems there are two things that I can’t explain using this script:

# Call it like this:
# RAILS_ENV=development bundle exec ruby script/cleanup.rb -> cleanup.log

require_relative '../config/environment'
pm = 0
Post.find_each do |test|
	puts test.raw
	test.raw.gsub!(/<(.|\/.)>/im,"")
	test.save
	test.rebake!
	pm = pm + 1
end
puts "cycled through #{pm} posts"

  1. After running this a couple of times, cleanup.log still contains 10 instances of <p> in around 21000 lines of raw post material. Weirdly enough, these never get removed.
  2. Even weirder (to me): when I fire up unicorn and access the site on my local machine, I still get the html tags in all the posts that I check in the raw view of the editor.

Seems to me I am not looking at the same environment perhaps? Does the d/unicorn look at a local production environment while my script applies changes to dev?

I am trying to get this working locally first, following the guide for a dev environment using Docker, before going to my live site.

Must be some major newbie thing I am overlooking. :sweat_smile:

1 Like

Oops. Sorry. That’s what the g in gsub does! I don’t know what I was thinking.
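
To illustrate with a made-up string: `sub` replaces only the first match, while `gsub` replaces every match, so no extra option is needed on the regex.

```ruby
text = "<p>one</p><p>two</p>"
pattern = /<(.|\/.)>/i

puts text.sub(pattern, "")   # only the first match is removed
puts text.gsub(pattern, "")  # every match is removed
```

The first line prints `one</p><p>two</p>` and the second prints `onetwo`.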

If you put the puts after the gsub do you see the replacement being done?

Also, what I would do is run the code by hand one line at a time and see if it does what you expect, and see if changes made there are affecting the database you’re trying to affect.
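
One thing that can help when checking by hand: `gsub!` returns `nil` when it replaced nothing, so its return value tells you whether a given string was actually changed. A plain-Ruby sketch with made-up strings:

```ruby
pattern = /<(.|\/.)>/i

raw = "<p>Hello</p>".dup        # dup gives a mutable copy of the literal
result = raw.gsub!(pattern, "")
puts result.inspect             # the modified string, since tags were removed
puts raw.inspect                # raw itself was changed in place

untouched = "Hello".dup
result_untouched = untouched.gsub!(pattern, "")
puts result_untouched.inspect   # nil, because nothing matched
```

In a cleanup loop, that return value could be used to count only the posts that really changed, instead of every post visited.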

1 Like

So… I think I know what has caused my confusion…

'rails c' vs 'd/rails c'

My absolute inexperience in working with containers was wreaking havoc here. My script was working fine, removing the tags outside of the container, but then when I fired up d/unicorn all the tags were still there. Now when I run d/rails c I do see all the tags I want to get rid of…

OMG. This must have been so obvious a thing to do that you @pfaffman didn’t think of this biting me LOL.

Anyways, I think I am all set to clean up my posts in an automated fashion on my live site too.

2 Likes

That’s a new one on me! I figured that you somehow were accessing two databases, but even now that you’ve explained it, I don’t quite know what was happening!

Glad you got it!

1 Like
