Hi, is there a preferred Ruby HTML parsing library, ideally one that is already installed with Discourse, that I can use to parse the structure of the HTML posts I'm migrating?
Background
I’m working on a migration from MVCForum. So far I have Users, Categories, Topics, Posts, Likes, Reported posts, Badges and Badge Grants working. I have Tags and Personal Messages to do next.
However, I now need to clean up the post content so that it works well with Discourse. Content comes out of MVCForum (MVCF) as HTML and can contain many levels of nested quoted replies. I need a way to unpick the HTML and work out its structure and hierarchy.
For each blockquote reply, I can get the import ID of the post that was being replied to. This means that rather than creating a plain “reply to topic” post, I can create a Discourse “reply to post” post and drop the deeply nested blockquotes.
My question is how can I model the HTML in a way that allows me to do that?
I have very little Ruby experience and am coming at this from a Python background. In Python I would use an HTML parsing library like Beautiful Soup or lxml. My guess is that Discourse already depends on one, so it would be available in the runtime. That would mean anyone running the MVCF migration in the future would not need to install an additional gem. If Discourse doesn't already ship with one, can you recommend a library that would let me work with the HTML as if it were a DOM-like object?
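To make the goal concrete, here is a rough sketch of the blockquote handling described above, written in Python because that is what I know. It uses only the standard library `html.parser` module. Note that the `data-post-id` attribute is purely an assumption for illustration (the real MVCF markup exposes the ID differently), and `ReplyExtractor` and `split_reply` are just names I made up. I'm after the Ruby equivalent of something like this:

```python
from html.parser import HTMLParser


class ReplyExtractor(HTMLParser):
    """Collects text outside <blockquote> and the ID on the outermost quote."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # current blockquote nesting level
        self.reply_to_id = None   # ID taken from the first outermost blockquote
        self.body_parts = []      # text found outside any blockquote

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1
            # ASSUMPTION: the quoted post's ID lives in a data-post-id attribute.
            post_id = dict(attrs).get("data-post-id")
            if self.depth == 1 and self.reply_to_id is None and post_id:
                self.reply_to_id = int(post_id)

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only the author's own text; everything inside a quote is dropped.
        if self.depth == 0 and data.strip():
            self.body_parts.append(data.strip())


def split_reply(html):
    """Return (reply_to_id, body_text) for one post's HTML."""
    parser = ReplyExtractor()
    parser.feed(html)
    parser.close()
    return parser.reply_to_id, " ".join(parser.body_parts)


sample = (
    '<blockquote data-post-id="42"><blockquote data-post-id="7">'
    "<p>older</p></blockquote><p>quoted</p></blockquote><p>My answer</p>"
)
print(split_reply(sample))  # (42, 'My answer')
```

The idea is that the outermost quote tells me which post to reply to, and everything inside the quotes can be discarded because Discourse will show the reply relationship itself.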
Thanks!