I need your help with a problem. We have a lot of topics stored as HTML in the database (raw_data), but it’s HTML that was “migrated” from another system. That was done before we took over the website, and we would never have done it this way. What we want to achieve is to convert the HTML string, which contains <div>, <link>, <br />, <span>, <blockquote> and <small> but no <p>, plus non-HTML markup such as [quote][/quote], into Markdown and then rebake the posts to get Discourse-style HTML, so that they are optimized by Discourse (e.g. for the crawler view). At the moment the plain old HTML content is used (cooked_method=2), which leads to many crawling issues and soft 404s in Google Search Console.
We have to do this for about 4-5 million posts, so it will be a very expensive job.
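
To make the kind of conversion I mean more concrete, here is a minimal sketch in Python using only the standard library. It is not how Discourse does it and not our production code. The names `LegacyHtmlToMarkdown` and `html_to_markdown` are hypothetical, and it assumes that [quote] blocks should end up as Markdown blockquotes, that <span> and <small> can be reduced to their text, and that <link> can be dropped.

```python
import re
from html.parser import HTMLParser

# Assumption: BBCode quotes are treated like <blockquote>.
QUOTE_OPEN = re.compile(r"\[quote(?:=[^\]]*)?\]", re.IGNORECASE)
QUOTE_CLOSE = re.compile(r"\[/quote\]", re.IGNORECASE)


def _is_blank(line):
    # A separator line is empty or consists only of quote markers.
    return line.strip("> ") == ""


class LegacyHtmlToMarkdown(HTMLParser):
    """Hypothetical converter for the tag set named in the post."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []  # finished output lines
        self.buf = []    # text fragments of the line being built
        self.depth = 0   # current blockquote nesting level

    def _flush(self):
        # Emit the buffered text as one line; return True if text was written.
        text = re.sub(r"\s+", " ", "".join(self.buf)).strip()
        self.buf = []
        if text:
            self.lines.append("> " * self.depth + text)
        return bool(text)

    def _blank(self):
        # Add one separator line, never at the start and never twice in a row.
        if self.lines and not _is_blank(self.lines[-1]):
            self.lines.append(("> " * self.depth).rstrip())

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A <br> ends the line; a <br> with nothing before it starts a paragraph.
            if not self._flush():
                self._blank()
        elif tag == "div":
            self._flush()
            self._blank()
        elif tag == "blockquote":
            self._flush()
            self._blank()
            self.depth += 1
        # <span>, <small>, <link> and unknown tags: the tag itself is dropped.

    def handle_endtag(self, tag):
        if tag == "div":
            self._flush()
            self._blank()
        elif tag == "blockquote":
            self._flush()
            self.depth = max(0, self.depth - 1)
            self._blank()

    def handle_data(self, data):
        self.buf.append(data)


def html_to_markdown(raw):
    raw = QUOTE_OPEN.sub("<blockquote>", raw)
    raw = QUOTE_CLOSE.sub("</blockquote>", raw)
    parser = LegacyHtmlToMarkdown()
    parser.feed(raw)
    parser.close()
    parser._flush()
    lines = parser.lines
    while lines and _is_blank(lines[-1]):
        lines.pop()  # drop trailing separator lines
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        '<div>Hello<br />world</div><link rel="stylesheet" href="x.css" />'
        "<blockquote><small>Alice wrote:</small><br />Old text</blockquote>"
        "[quote]BB quoted[/quote]<span>Bye &amp; thanks</span>"
    )
    print(html_to_markdown(sample))
```

The sketch does not escape Markdown special characters such as `*`, `_` or `#` in the text, and it relies on single newlines being rendered as line breaks, so it would need testing on real posts before being run over millions of rows.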