Is there a way to convert cooked content back to Markdown?

EDIT: mistake in my original post. I meant to refer to the cooked field, not the raw field (CORRECTED).

I recently acquired some JSON data from a Discourse forum, where the post data is in the “cooked” form. Is there any way to convert this back to Markdown? I am new to Discourse and have searched but can’t find a way to do this. Since the cooked data appears to be used to create the HTML, I am guessing an alternative route would be to use the function that converts cooked to HTML and then convert the HTML to Markdown.

Any help greatly appreciated.

Thanks.

There are a lot of options on Google, but my first choice for these things is usually pandoc. See “pandoc - How to convert HTML to Markdown while retaining non-markdown HTML tags?” on Stack Overflow.

Try this. Simples

I already have the data, I can’t grab it again. It was from a forum that was taken down, I am trying to clean the data so it can be revived.

What format exactly is the “raw” data in?

Markdown!

If by raw you mean a field called raw, then you’re looking at the actual Markdown source that we store. For an example, this is the JSON endpoint for your last post just now.

The raw field there is the actual text you composed in the Markdown editor, and we store it as-is so it doesn’t get more pure than that.

Instead, if you generally mean “the raw HTML” as scraped without using JSON endpoints, then you can turn that HTML into Markdown externally with pandoc as suggested above, or any other software.

Please accept my apologies, I made a mistake in my first post (since corrected). I meant to refer to the cooked data as opposed to the raw data (it’s been a long day…).

What form is the cooked data in and is there any way to convert it to Markdown or HTML? Thanks.

I made a mistake in my first post, since corrected. My apologies.

Ah, that makes more sense. The cooked field is the HTML rendered from Markdown.

You can simply run that through pandoc to get Markdown. You won’t get full fidelity to the corresponding raw, because some non-standard Markdown tags such as [quote] are rendered to particular HTML patterns, but if you simply need the content as Markdown, pandoc should work well enough.
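
As a rough illustration of the pandoc route, here is a minimal Python sketch. It assumes the dump is a JSON list of post objects that each carry a "cooked" key (the real layout of your file may differ), and that a pandoc binary is on your PATH. The helper names cooked_fields and html_to_markdown are made up for this example. The gfm output format and --wrap=none are standard pandoc options, but plain markdown or commonmark may suit you better.

```python
import json
import shutil
import subprocess


def cooked_fields(posts):
    """Return the non-empty "cooked" HTML string of each post dict, in order."""
    return [p["cooked"] for p in posts if p.get("cooked")]


def html_to_markdown(html, pandoc="pandoc"):
    """Pipe an HTML fragment through pandoc and return GitHub-flavored Markdown."""
    result = subprocess.run(
        [pandoc, "--from=html", "--to=gfm", "--wrap=none"],
        input=html,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical dump layout: a JSON list of post objects.
    dump = json.loads('[{"id": 1, "cooked": "<p>Hello <strong>world</strong></p>"}]')
    for html in cooked_fields(dump):
        if shutil.which("pandoc"):
            print(html_to_markdown(html))
        else:
            print("pandoc not found on PATH; install it to run the conversion")
```

Running pandoc once per post keeps failures isolated to a single post, at the cost of starting one process per post. For a large dump it may be worth batching.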

Thank you very much. I will get on to that now.

I assume the post data (content) is actually stored in the DB as Markdown?

Both. Here’s an example:

 #<Post:0x00007fbb78416f50
 id: 2203,
 user_id: -4,
 topic_id: 590,
 post_number: 6,
 raw: "@merefield, it looks like @eloy has mentioned that their favourite colour is red!",
 cooked:
  "<p><a class=\"mention\" href=\"/u/merefield\">@merefield</a>, it looks like <a class=\"mention\" href=\"/u/eloy\">@eloy</a> has mentioned that their favourite colour is red!</p>",
 created_at: Sun, 18 Aug 2024 11:15:32.487912000 UTC +00:00,
 updated_at: Sun, 18 Aug 2024 11:15:32.487912000 UTC +00:00,
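
The record above also shows one of the Discourse-specific HTML patterns: each @mention in raw becomes an a tag with class "mention" in cooked. A generic HTML-to-Markdown converter will most likely turn these into ordinary links instead of plain @name text. One possible workaround, sketched here in Python, is to unwrap mentions before conversion. The unlink_mentions helper is hypothetical, and the regex assumes the exact attribute order seen in the example record; a proper HTML parser would be more robust.

```python
import re

# Matches mention anchors in the shape shown in the example record:
# <a class="mention" href="/u/NAME">@NAME</a>
MENTION = re.compile(r'<a class="mention" href="/u/[^"]+">(@[^<]+)</a>')


def unlink_mentions(cooked):
    """Replace each mention anchor with its plain @name text."""
    return MENTION.sub(r"\1", cooked)


if __name__ == "__main__":
    cooked = (
        '<p><a class="mention" href="/u/merefield">@merefield</a>, it looks like '
        '<a class="mention" href="/u/eloy">@eloy</a> has mentioned that their '
        "favourite colour is red!</p>"
    )
    print(unlink_mentions(cooked))
```

Other patterns, such as quotes, would need their own handling and are not covered by this sketch.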