Convert HTML to Markdown and rebake

I need your help with a problem. We have a lot of topics stored as HTML in the database (raw_data), but it's HTML that was "migrated" from another system. That was done before we bought the website, and we would never have done it this way. What we want to achieve is to convert the HTML string, which contains <div>, <link>, <br />, <span>, <blockquote>, and <small> but no <p>, plus some non-HTML markup like [quote][/quote], into Markdown and then rebake the posts, so that they end up as Discourse-style HTML and benefit from Discourse's optimizations (e.g. the crawler view). At the moment the plain old HTML content is used (cooked_method=2), which leads to many crawling issues and soft 404s in Google Search Console.
We have to do this for about 4-5 million posts, so it will be a very expensive job.

Any ideas?

Best, Mike

2 Likes

We have a built-in HTML → Markdown converter library. It is not perfect, but it does the job for the quote function.

You could pass all the posts through that, I guess, but what you are describing here sounds to me like a large amount of custom work. I would recommend reaching out to the community on marketplace and putting a $$$ value on the job.
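
To give a feel for why this is custom work, here is a minimal sketch of such a conversion pass in Python, using only the standard library. It is not the Discourse converter and does not reproduce its behavior; the names `LegacyHtmlToMarkdown` and `html_to_markdown` are hypothetical. It assumes only the tags named in the question occur, treats `<div>`, `<br />`, and `<blockquote>` as line boundaries, keeps the text of `<span>` and `<small>` without their markup, ignores `<link>`, and leaves `[quote]`/`[/quote]` in place on their own lines on the assumption that the target Markdown engine understands them. It does not escape Markdown special characters.

```python
import re
from html.parser import HTMLParser

# BBCode quote tags such as [quote], [quote="name"] and [/quote].
QUOTE_TAG = re.compile(r"(\[/?quote[^\]]*\])", re.IGNORECASE)


class LegacyHtmlToMarkdown(HTMLParser):
    """Hypothetical helper: converts a small, known subset of HTML."""

    BLOCK_TAGS = {"div", "br", "blockquote"}  # tags that end the current line

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and friends
        self.out = []    # finished lines as (quote_depth, text) pairs
        self.buf = []    # text fragments of the line being built
        self.depth = 0   # current <blockquote> nesting level

    def _flush(self):
        # Collapse whitespace and store the line if it is not empty.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            self.out.append((self.depth, text))

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()
        if tag == "blockquote":
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.buf.append(data)

    def markdown(self):
        self._flush()
        rendered, previous_depth = [], None
        for depth, text in self.out:
            # A blank line separates quoted text from unquoted text.
            if previous_depth is not None and depth != previous_depth:
                rendered.append("")
            rendered.append("> " * depth + text)
            previous_depth = depth
        return "\n".join(rendered)


def html_to_markdown(raw_html: str) -> str:
    # Force BBCode quote tags onto their own lines before parsing.
    prepared = QUOTE_TAG.sub(r"<br />\1<br />", raw_html)
    parser = LegacyHtmlToMarkdown()
    parser.feed(prepared)
    parser.close()  # pushes any text still buffered by the parser
    return parser.markdown()
```

For example, `html_to_markdown('<div>Hello<br />there</div><blockquote>quoted</blockquote>')` yields two plain lines followed by a blank line and `> quoted`. Real migrated content will have many more edge cases (links, entities inside quotes, malformed nesting), which is where the effort goes.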

3 Likes

That’s the kind of thing that I do. You can email Jay@literatecomputing.com.

2 Likes

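Actually, I really like your html2markdown parser and would like to use it outside of Discourse for my day-to-day work. Any suggestions on extracting it into a textarea widget?

html2markdown parsers are a dime a dozen. In fact, Aaron Swartz wrote one too.

The difference is that I trust your parser to do what I want, no more and no less.

Thanks.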

LQ

1 Like