Importing HTML into Discourse

Good day,

When Yahoo Groups was sunset, the admins of an existing group exported all messages to HTML. They then uploaded all the HTML files to Groups.io, and that was their migration process.

Today there are a few hundred of these HTML files, with titles such as HtmlDigest001. Each file contains hundreds of different subjects, and each subject has dozens of messages.

I’ve been struggling with Python scripts, trying to extract the text, organised by subject and posting date, into individual Word documents, but with no success.

I am now wondering whether Discourse could import these HTML files and somehow convert them into separate messages, or whether there is a tool capable of doing this task.

Thank you for your time and help.

Regards

Well, anything is possible. You’d need to write something that would parse the files and, say, push the messages into a database. You can look at one of the JSON or CSV importers as a starting point. The Nokogiri gem can help with the HTML parsing.
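
As a rough illustration of the "parse them" step, here is a minimal sketch in Python (since that is what you have been scripting in) using only the standard library. The markup it expects is purely an assumption: I have not seen your digest files, so the `message`, `subject`, `date` and `body` class names, the date format, and the `DigestParser` and `group_by_subject` names are all hypothetical and would need adjusting after you inspect one real file.

```python
import re
from collections import defaultdict
from datetime import datetime
from html.parser import HTMLParser

# ASSUMPTION: each message is a <div class="message"> containing
# <span class="subject">, <span class="date"> and <div class="body">.
# These class names are hypothetical; real digest files will differ.
FIELDS = {"subject", "date", "body"}


class DigestParser(HTMLParser):
    """Collects one dict per message from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._current = None  # message currently being built
        self._field = None    # field currently collecting text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "message":
            self._current = {name: "" for name in FIELDS}
            self.messages.append(self._current)
        elif cls in FIELDS and self._current is not None:
            self._field = cls

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field, so
        # nested markup inside a body would cut the text short.
        self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data


def normalise_subject(subject):
    # Strip leading "Re:" / "Fw:" / "Fwd:" prefixes so replies join their thread.
    return re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject, flags=re.I).strip()


def group_by_subject(messages, date_format="%Y-%m-%d %H:%M"):
    # date_format is an assumption; match it to the dates in your files.
    threads = defaultdict(list)
    for msg in messages:
        posted = datetime.strptime(msg["date"].strip(), date_format)
        key = normalise_subject(msg["subject"])
        threads[key].append((posted, msg["body"].strip()))
    # Sort each thread's posts chronologically.
    return {subject: sorted(posts) for subject, posts in threads.items()}
```

You would feed each file's text to `DigestParser().feed(...)` and pass the resulting `messages` list to `group_by_subject`. The grouped result could then be written out as JSON or CSV in whatever shape the importer you pick expects. In Ruby, Nokogiri would fill the role that `html.parser` plays here.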
