Reprocessing imported posts


(Dean Taylor) #1

I would like to reprocess all imported posts (700,000+ posts) …

The basic problem is that a very large number of images were hosted on Photobucket, and the import process didn’t handle them correctly.

BBCode like this:

[URL=http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.html][IMG]http://i662.photobucket.com/albums/uu345/jinzin2008/thWolverine_51_013.jpg[/IMG][/URL]

Was turned into this:

http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.htmlhttp://i662.photobucket.com/albums/uu345/jinzin2008/thWolverine_51_013.jpg

Photobucket is the easy one to handle because its URLs are so similar, so I’ll tackle it first and then look for other broken items.

… basically I would like to loop over all posts (PMs, forum content) and:

  • Run a regex replace on the latest revision (original_content) to create a new in-memory string (revised_content)
  • If revised_content differs from original_content, add a new revision with revised_content
  • This action should not bump the topic’s activity date.

Just wondering if anybody has done this kind of thing already and has a code-snippet or two to point me at before I get started?
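
A minimal sketch of the steps above, assuming a Discourse Rails console. PATTERN and REPLACEMENT are hypothetical placeholders (here they only put the two fused URLs on separate lines), and the `bypass_bump` and `skip_validations` options passed to `post.revise` are assumptions to verify against the installed Discourse version. The loop is guarded so that the helper can be tried in plain Ruby first.

```ruby
# Hypothetical placeholder pattern: the point where a ".jpg.html" page URL
# runs straight into the next "http://" URL. Substitute your own.
PATTERN = %r{(\.jpg\.html)(?=http://)}
REPLACEMENT = "\\1\n"

# Returns the rewritten text, or nil when the pattern changes nothing,
# so posts that need no new revision can be skipped.
def revised_content(original_content, pattern = PATTERN, replacement = REPLACEMENT)
  revised = original_content.gsub(pattern, replacement)
  revised == original_content ? nil : revised
end

# Only runs where Discourse's Post model is loaded (e.g. inside `rails c`).
if defined?(Post)
  Post.find_each do |post|
    revised = revised_content(post.raw)
    next if revised.nil?

    # Assumed options: bypass_bump is meant to leave the topic's activity
    # date alone; skip_validations avoids failures on old imported content.
    post.revise(Discourse.system_user, { raw: revised },
                bypass_bump: true, skip_validations: true)
  end
end
```

Returning nil for unchanged posts keeps the loop from writing a revision for every one of the 700,000+ posts.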


(Michael - DiscourseHosting.com) #2

Yes, we encountered the EXACT same issue, even the HTML/bbCode looked the same as yours.

I will post our ‘fix’ script later today. We didn’t use revisions though, just regex on raw and then a rebake.


(Michael - DiscourseHosting.com) #3

Ok.

First, go into the database and run the following SQL query:

update posts set raw=regexp_replace(raw, '(http://[^\.]*\.photobucket.com[^\]]*)\]\[img\]([^\[]*)\[\/img', '\1', 'g');

Then, at a command prompt, start a Rails console and run the following:

bundle exec rails c

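# Rebake every post whose cooked HTML mentions Photobucket.
Post.where("cooked like ?", "%photobucket%").find_each do |post|
  post.rebake!
  puts "X"  # progress marker, one per post
  sleep 2   # throttle so Photobucket and Sidekiq are not flooded
end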

I had to put the sleep in there to avoid annoying Photobucket and to stop Sidekiq from queuing up a lot of jobs, but maybe you don’t need that.
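
Before running the UPDATE on a live database, the pattern can be tried on a sample string. The sketch below is a Ruby approximation of the SQL pattern (the two regex dialects are not identical), applied to the BBCode sample from the first post. One thing it makes visible: PostgreSQL's regexp_replace is case-sensitive unless the 'i' flag is given, so with 'g' alone the lowercase `img` would not match uppercase [IMG] tags. If the stored raw uses uppercase tags, 'gi' may be needed. The constant and method names are made up for illustration.

```ruby
# Ruby approximation of the pattern used in the SQL UPDATE above.
PATTERN_SOURCE = '(http://[^.]*\.photobucket.com[^\]]*)\]\[img\]([^\[]*)\[/img'
CASE_SENSITIVE   = Regexp.new(PATTERN_SOURCE)                      # like flags 'g'
CASE_INSENSITIVE = Regexp.new(PATTERN_SOURCE, Regexp::IGNORECASE)  # like flags 'gi'

# BBCode sample copied from the first post.
SAMPLE = "[URL=http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.html]" \
         "[IMG]http://i662.photobucket.com/albums/uu345/jinzin2008/thWolverine_51_013.jpg[/IMG][/URL]"

# Applies the same substitution as the SQL: keep capture group 1 only.
def preview(raw, pattern)
  raw.gsub(pattern, '\1')
end

puts preview(SAMPLE, CASE_SENSITIVE)    # uppercase tags: nothing is replaced
puts preview(SAMPLE, CASE_INSENSITIVE)  # the [IMG]...[/IMG] part is dropped
```

With the case-insensitive variant, the sample collapses to the [URL=...] tag wrapping only the .html link.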


(Dean Taylor) #4

Thanks for providing what you have @michaeld

However, your regular expression doesn’t actually match what I have in my posts.

I have:

http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.htmlhttp://i662.photobucket.com/albums/uu345/jinzin2008/thWolverine_51_013.jpg

Your pattern, by contrast, seems to match the original BBCode version.


(Michael - DiscourseHosting.com) #5

Ok. Well… important lesson: you need to keep the .html URL, not the .jpg one. If you leave it in the post on a line by itself, oneboxing will do the rest.

So you need a regex for that. I think

(http:\/\/[^\.]*\.photobucket\.com[^\.]*\.jpg\.html)http:\/\/[^\.]*\.photobucket\.com[^\.]*\.jpg

will do.
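
As a quick check, here is a small Ruby sketch that applies this pattern, with the literal dots escaped, to the mangled sample quoted earlier in the thread and keeps only the captured .html URL. The names PHOTOBUCKET_PAIR and keep_html_url are made up for illustration. Note that `[^.]*` stops at the first dot, so paths containing extra dots, or images that are not .jpg, would not match and would need their own pattern.

```ruby
# Page URL (captured) immediately followed by the direct image URL (dropped).
PHOTOBUCKET_PAIR = %r{(http://[^.]*\.photobucket\.com[^.]*\.jpg\.html)http://[^.]*\.photobucket\.com[^.]*\.jpg}

# Replaces each fused URL pair with just the .html page URL.
def keep_html_url(raw)
  raw.gsub(PHOTOBUCKET_PAIR, '\1')
end

mangled = "http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.html" \
          "http://i662.photobucket.com/albums/uu345/jinzin2008/thWolverine_51_013.jpg"

puts keep_html_url(mangled)
# prints http://s662.photobucket.com/user/jinzin2008/media/thWolverine_51_013.jpg.html
```

Whether the surviving URL ends up on a line by itself still depends on the text around it in each post.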


(Dean Taylor) #6

Thanks for the feedback, @michaeld. I’ll report back on what I end up doing.