Is there a way to convert cooked content back to Markdown?

U4EA · August 19, 2024, 2:57pm

EDIT: mistake in my original post. I meant to refer to the cooked field, not the raw field (CORRECTED).

I recently acquired some JSON data from a Discourse forum, where the post data is in the “cooked” form. I was wondering if there is any way to convert this back to Markdown? I am new to Discourse and have searched but can’t find a way to do this. Seeing as the cooked data appears to be used to create the HTML, I am guessing an alternative route would be to use the function that converts cooked to HTML and then convert the HTML to Markdown.

Any help greatly appreciated.

Thanks.

wal · August 19, 2024, 3:00pm

There are a lot of options on Google, but my first choice for these things is usually pandoc. See the Stack Overflow question “pandoc - How to convert HTML to Markdown while retaining non-markdown HTML tags?”

merefield · August 19, 2024, 3:01pm

Try this. Simples

U4EA · August 19, 2024, 3:06pm

I already have the data, I can’t grab it again. It was from a forum that was taken down, I am trying to clean the data so it can be revived.

What format exactly is the “raw” data in?

leonardo · August 19, 2024, 3:14pm

Markdown!

If by raw you mean a field called raw, then you’re looking at the actual Markdown source that we store. For an example, this is the JSON endpoint for your last post just now.

The raw field there is the actual text you composed in the Markdown editor, and we store it as-is so it doesn’t get more pure than that.

Instead, if you generally mean “the raw HTML” as scraped without using JSON endpoints, then you can turn that HTML into Markdown externally with pandoc as suggested above, or any other software.
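
To give a feel for what “any other software” has to do, here is a toy sketch using only the Python standard library. The class and function names are made up for illustration, and it only understands paragraphs, links, bold and italics, so treat it as a demonstration of the idea rather than a replacement for pandoc.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical helper, tiny tag subset)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # collected Markdown fragments
        self.href = None   # href of the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "p":
            # Paragraphs are separated by a blank line in Markdown.
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


if __name__ == "__main__":
    print(html_to_markdown('<p>Hello <a href="/u/eloy">@eloy</a>, <em>welcome</em>!</p>'))
```

Anything outside that small tag set (lists, code blocks, images, quotes) is passed through as plain text, which is why a full converter is the better choice for real data.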

U4EA · August 19, 2024, 3:20pm

Please accept my apologies, I made a mistake in my first post (since corrected). I meant to refer to the cooked data as opposed to the raw data (it’s been a long day…).

What form is the cooked data in and is there any way to convert it to Markdown or HTML? Thanks.

U4EA · August 19, 2024, 3:20pm

I made a mistake in my first post, since corrected. My apologies.

leonardo · August 19, 2024, 3:28pm

Ah, that makes more sense. The cooked field is the HTML rendered from Markdown.

You can simply run that through pandoc to get Markdown; you won’t get full fidelity to the corresponding raw because there are some non-standard Markdown tags like [quote] which get rendered to certain HTML patterns, but if you simply need the content as Markdown, pandoc should work well enough.
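
As a rough sketch of that pandoc step, assuming pandoc is installed and on your PATH: the function names here are just illustrative, and `gfm` (GitHub-Flavored Markdown) is only one reasonable choice of output format.

```python
import shutil
import subprocess


def pandoc_command(to_format="gfm"):
    """Build the pandoc invocation: read HTML on stdin, write Markdown to stdout."""
    return ["pandoc", "--from", "html", "--to", to_format]


def cooked_to_markdown(cooked_html, to_format="gfm"):
    """Convert one post's cooked HTML to Markdown by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        pandoc_command(to_format),
        input=cooked_html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    cooked = '<p><a class="mention" href="/u/merefield">@merefield</a>, hello!</p>'
    print(cooked_to_markdown(cooked))
```

You would call `cooked_to_markdown` once per post in your JSON dump. Discourse-specific constructs such as quotes will come out as whatever pandoc makes of their HTML, not as the original [quote] tags.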

U4EA · August 19, 2024, 4:19pm

Thank you very much. I will get on to that now.

I assume the post data (content) is actually stored in the DB as Markdown?

merefield · August 19, 2024, 4:38pm

Both raw and cooked are stored. Here’s an example:

 #<Post:0x00007fbb78416f50
 id: 2203,
 user_id: -4,
 topic_id: 590,
 post_number: 6,
 raw: "@merefield, it looks like @eloy has mentioned that their favourite colour is red!",
 cooked:
  "<p><a class=\"mention\" href=\"/u/merefield\">@merefield</a>, it looks like <a class=\"mention\" href=\"/u/eloy\">@eloy</a> has mentioned that their favourite colour is red!</p>",
 created_at: Sun, 18 Aug 2024 11:15:32.487912000 UTC +00:00,
 updated_at: Sun, 18 Aug 2024 11:15:32.487912000 UTC +00:00,
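
Since both fields are stored, it is worth checking whether any posts in a JSON dump still carry raw before converting cooked. A minimal sketch, assuming each post is a JSON object with optional "raw" and "cooked" keys (the sample data and the function name are made up for illustration):

```python
import json


def markdown_source(post):
    """Return ("raw", text) when the original Markdown survived, else ("cooked", html)."""
    raw = post.get("raw")
    if raw:
        return "raw", raw
    # No raw available: the cooked HTML still needs an HTML-to-Markdown pass.
    return "cooked", post.get("cooked", "")


# Hypothetical sample data standing in for a real export.
posts = json.loads("""
[
  {"id": 1, "raw": "Hello *world*", "cooked": "<p>Hello <em>world</em></p>"},
  {"id": 2, "cooked": "<p>Only HTML survived</p>"}
]
""")

if __name__ == "__main__":
    for post in posts:
        kind, text = markdown_source(post)
        print(post["id"], kind, text)
```

Posts that come back as "raw" can be reused as-is, and only the "cooked" ones need to go through a converter such as pandoc.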
