Advice on approach to post migration

bletch · May 10, 2018, 6:05pm

Hi, is there a preferred Ruby HTML parsing library, preferably one that is already installed with Discourse, that I can use to parse the structure of the HTML posts that I’m migrating?

Background

I’m working on a migration from MVCForum. So far I have Users, Categories, Topics, Posts, Likes, Reported posts, Badges and Badge Grants working. I have Tags and Personal Messages to do next.

However, I now need to adjust how the post content is processed so that it works well with Discourse. Content comes out of MVCF as HTML and can feature lots of nested replies. I need a way to unpick the HTML and work out its structure and hierarchy.

With each blockquote reply, I can get the import post ID that was being replied to. This means that instead of simply creating a “reply to topic” post, I can instead create a Discourse “reply to post” post and therefore drop the deeply nested blockquotes.

My question is how can I model the HTML in a way that allows me to do that?

I have very little Ruby experience and am coming at this from a Python background. In Python I would use an HTML parsing library like Beautiful Soup or lxml. My guess is that Discourse already uses one that will be installed in the runtime. This would mean that anyone running the MVCF migration in the future would not need to install an additional gem. If one isn’t already present with Discourse, can you recommend one that would let me work with the HTML as if it were a DOM-like object?

Thanks!

pfaffman · May 10, 2018, 7:07pm

Look for importers that use the Nokogiri gem.

disqus.rb
ipboard.rb
jive_api.rb
jive.rb

bletch · May 10, 2018, 7:12pm

Perfect! Exactly what I’m looking for.

Cheers, Dan.
