Reprocessing imported posts

(Dean Taylor) #1

I would like to reprocess all imported posts (700,000+ posts) …

The basic problem is that a very large number of images were hosted on Photobucket, and the import process didn’t handle them correctly.

Where code like this:


Was turned into this:

Photobucket is the easy case to handle because its URLs all look alike, so I’ll tackle that one first and then look for other broken items.

… basically I would like to loop over all posts (PMs, forum content) and:

  • Run a regex replace on the latest revision (original_content), creating a new in-memory string (revised_content)
  • If revised_content differs from original_content, add a new revision containing revised_content
  • Adding that revision should not bump the topic’s activity date
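
The steps above can be sketched in plain Ruby as follows. This is a minimal sketch, not Discourse code: `PostStub`, `pending_revisions`, and the bbcode pattern are hypothetical names and shapes chosen for illustration, and the real markup in the imported posts may differ. It only works out which posts would change. Saving the revision without bumping the topic is left out; as far as I know Discourse's `PostRevisor#revise!` accepts a `bypass_bump` option for that, but treat this as an assumption and check it against the installed version.

```ruby
# Hypothetical stand-in for a post record (a real Discourse post is an ActiveRecord model).
PostStub = Struct.new(:id, :raw)

# Returns { post_id => revised_content } for every post whose content would change.
# Posts that the pattern does not touch are skipped, so they would get no new revision.
def pending_revisions(posts, pattern, replacement)
  posts.each_with_object({}) do |post, pending|
    original_content = post.raw
    revised_content = original_content.gsub(pattern, replacement)
    pending[post.id] = revised_content unless revised_content == original_content
  end
end

# Hypothetical pattern: a bbcode image wrapped in a link to a photobucket page.
# Capture group 1 is the page URL, which is kept; the image tag is dropped.
pattern = %r{\[url=(https?://[^\]\s]*photobucket[^\]\s]*)\]\[img\][^\[]*\[/img\]\[/url\]}i

posts = [
  PostStub.new(1, "look [url=http://s1.photobucket.com/a/b.html][img]http://i1.photobucket.com/a/b.jpg[/img][/url]"),
  PostStub.new(2, "no images here")
]

pending_revisions(posts, pattern, '\1').each do |id, revised_content|
  puts "post #{id} would get a new revision: #{revised_content}"
end
```

Collecting the changes first makes it easy to do a dry run and eyeball a sample before anything is written back.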

Just wondering if anybody has done this kind of thing already and has a code-snippet or two to point me at before I get started?

(Michael - #2

Yes, we encountered the EXACT same issue, even the HTML/bbCode looked the same as yours.

I will post our ‘fix’ script later today. We didn’t use revisions though, just regex on raw and then a rebake.

(Michael - #3


First go into the database and issue the following SQL query:

update posts set raw=regexp_replace(raw, '(http://[^\.]*\[^\]]*)\]\[img\]([^\[]*)\[\/img', '\1', 'g');

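Then open a command prompt and start a Rails console:

bundle exec rails c

In the console, rebake every post that still mentions photobucket:

Post.where("cooked like ?", "%photobucket%").find_each do |post|
  post.rebake!   # the rebake step mentioned in #2: re-cook the post from its updated raw
  puts "X"       # progress marker, one per post
  sleep 2        # throttle between posts
end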

I had to put the sleep in there to avoid Photobucket getting annoyed and Sidekiq queuing up a lot. But maybe you don’t need that.

(Dean Taylor) #4

Thanks for providing what you have @michaeld

However, your regular expression doesn’t actually match what I have in my posts.

I have:

Whilst you seem to be matching the original bbcode version.

(Michael - #5

Ok. Well… important lesson: you need to keep the .html part, not the jpg. If you leave it in the post on a single line, oneboxing will do the rest.

So you need a regex for that. I think


will do.
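
Purely as an illustration of the advice in this post, and not the expression originally posted here, the idea of keeping the .html page link and dropping the .jpg could look roughly like this. The input shape is an assumption: an HTML `<a>` pointing at a photobucket `.html` page and wrapping an `<img>`. The names `PHOTOBUCKET_LINKED_IMAGE` and `keep_page_link` are hypothetical, and the pattern would need adjusting to whatever the importer actually produced.

```ruby
# Assumed (hypothetical) markup: <a href="...photobucket...html"><img src="...jpg"></a>
# Capture group 1 is the href of the photobucket page.
PHOTOBUCKET_LINKED_IMAGE = %r{<a\s[^>]*href="(https?://[^"]*photobucket[^"]*\.html[^"]*)"[^>]*>\s*<img\s[^>]*>\s*</a>}i

# Replaces each linked image with the bare page URL on a line of its own,
# so that oneboxing can pick it up when the post is rebaked.
def keep_page_link(raw)
  raw.gsub(PHOTOBUCKET_LINKED_IMAGE) { "\n#{Regexp.last_match(1)}\n" }
end

sample = 'Before <a href="http://s1.photobucket.com/albums/x/pic.html" target="_blank">' \
         '<img src="http://i1.photobucket.com/albums/x/pic.jpg" border="0"></a> after'
puts keep_page_link(sample)
```

Running it against a handful of real posts first is a cheap way to confirm the pattern matches before touching the whole table.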

(Dean Taylor) #6

Thanks for the feedback @michaeld, I’ll report back what I end up doing.