HTML to Markdown and rebake

Mike_Gehrhardt · October 17, 2019, 1:30pm

I need your help with a problem. We have a lot of topics stored as HTML in the database (raw_data), but it’s HTML that was “migrated” from another system. That was done before we bought the website, and we would never have done it this way. What we want to achieve is to convert the HTML string (which contains <div>, <link>, <br />, <span>, <blockquote> and <small>, but no <p>, plus some non-HTML markup like [quote][/quote]) into Markdown and then rebake the posts, so that they end up as Discourse-style HTML and are optimized by Discourse (e.g. for the crawler view). At the moment the plain old HTML content is used (cooked_method=2), which leads to many crawling issues and soft 404s in Google Search Console.
We have to do this for about 4-5 million posts, so it will be a very expensive job.

Any ideas?

Best, Mike

sam · October 21, 2019, 1:24am

We have a built-in HTML → Markdown converter library; it is not perfect, but it does the job for the quote function.

You could pass all the posts through that, I guess, but what you are describing here sounds to me like a large amount of custom work. I would recommend reaching out to the community on Marketplace and putting a $$$ value on the job.
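
To give a rough idea of what such a conversion pass involves, here is a minimal sketch in Python. It is not the Discourse converter and makes no claim to match its output. It only assumes the tag subset mentioned above (<div>, <br />, <span>, <small>, <blockquote>, <link>), and the class and function names are hypothetical. Text such as [quote][/quote] is passed through unchanged, and Markdown special characters are not escaped.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not Discourse's rules).
BLOCK_TAGS = {"div", "blockquote"}


class LegacyHtmlToMarkdown(HTMLParser):
    """Hypothetical converter for a small legacy HTML subset."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # finished output lines
        self.buf = ""    # text of the line currently being built
        self.depth = 0   # blockquote nesting level

    def _prefix(self):
        return "> " * self.depth

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one line.
        text = " ".join(self.buf.split())
        if text:
            self.out.append(self._prefix() + text)
        self.buf = ""

    def _blank(self):
        # Separate blocks with one blank line (a bare ">" inside a quote).
        if self.out and self.out[-1].strip("> ") != "":
            self.out.append(self._prefix().rstrip())

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._flush()              # <br> becomes a plain newline
        elif tag in BLOCK_TAGS:
            self._flush()
            self._blank()
            if tag == "blockquote":
                self.depth += 1
        # span, small, link and anything else: keep the text, drop the tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            if tag == "blockquote":
                self.depth = max(0, self.depth - 1)
            self._blank()

    def handle_data(self, data):
        self.buf += data

    def result(self):
        self._flush()
        lines = list(self.out)
        while lines and lines[-1].strip("> ") == "":
            lines.pop()                # drop trailing separators
        return "\n".join(lines)


def html_to_markdown(html):
    parser = LegacyHtmlToMarkdown()
    parser.feed(html)
    parser.close()                     # flush any text still buffered by the parser
    return parser.result()


sample = ("Hello<br />world<blockquote>quoted <small>text</small></blockquote>"
          "<div>After <span>the</span> quote</div>")
print(html_to_markdown(sample))
```

Whether a single newline renders as a line break depends on the Markdown renderer's settings, so the <br /> handling here may need adjusting. For millions of posts a pass like this would have to run in batches and be spot-checked against real content before rebaking.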

pfaffman · October 21, 2019, 2:10am

That’s the kind of thing that I do. You can email Jay@literatecomputing.com.

louquillio · October 7, 2020, 2:12am

Actually, I like your html2markdown parser very much and would like to put it to use outside of Discourse, for my daily work. Any tips on extracting it into a textarea dingus?

There’s no shortage of html2markdown parsers. Heck, Aaron Swartz wrote one.

The difference is that I trust yours to do what I want — no more, no less.

Thanks.

LQ
