Converting links from raw markdown to HTML

hellekin · November 20, 2018, 9:16pm

Context

I’m trying to retrieve specific posts to turn them into a single corpus (e.g., a book) to be processed elsewhere (e.g., with Pandoc).

‘Simple’ Approach

Discourse provides two readily accessible ways to download specific posts:

1. Appending .json to a URL gives its JSON representation, which includes the cooked HTML version.
2. Taking the topic and post numbers from the original URL and replacing the prefix with /raw gives the original Markdown version, e.g., from https://discourse.example/t/some-topic/123/4 to https://discourse.example/raw/123/4.
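As a rough sketch of the two URL rewrites listed above (the function name `discourse_variants` is made up for illustration, and it assumes the instance lives at the domain root and that the URL has the `/t/<slug>/<topic>/<post>` shape shown in the example):

```python
from urllib.parse import urlsplit, urlunsplit


def discourse_variants(post_url: str) -> dict:
    """Derive the .json and /raw URLs from a /t/<slug>/<topic>/<post> URL.

    Hypothetical helper: assumes a slug is present and the site is not
    installed under a subfolder.
    """
    parts = urlsplit(post_url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected shape: ["t", slug, topic_id, post_number]; post_number is optional.
    if len(segments) < 3 or segments[0] != "t" or not segments[2].isdigit():
        raise ValueError(f"not a topic URL: {post_url}")
    topic_id = segments[2]
    post_number = segments[3] if len(segments) > 3 else "1"
    json_path = parts.path.rstrip("/") + ".json"
    raw_path = f"/raw/{topic_id}/{post_number}"
    return {
        "json": urlunsplit((parts.scheme, parts.netloc, json_path, "", "")),
        "raw": urlunsplit((parts.scheme, parts.netloc, raw_path, "", "")),
    }
```

Calling it on https://discourse.example/t/some-topic/123/4 gives back the /raw/123/4 URL from the example, plus the same address with .json appended.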

Using the second approach, I list topic numbers in a file and fetch the Markdown for each. This works fine as long as a post contains no hyperlinks or attached files.

API Approach?

Two problems to solve:

- Relative links must be turned into absolute links if we want them clickable outside Discourse.
- upload:// links must be converted to absolute links as well, or to relative links if the assets are downloaded.
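For the first problem, a minimal sketch of what the rewrite could look like on the raw Markdown. It only handles inline links and images of the form `[text](dest)`, not reference-style links or destinations in angle brackets, and `absolutize_markdown` is a made-up name. upload:// links are deliberately left untouched here, since they are a separate problem.

```python
import re
from urllib.parse import urljoin

# Captures "[text](" or "![alt](" and then the link destination.
INLINE_LINK = re.compile(r'(!?\[[^\]]*\]\()([^)\s]+)')
# Anything that starts with a URI scheme, e.g. https:, mailto:, upload:
HAS_SCHEME = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:')


def absolutize_markdown(markdown: str, base_url: str) -> str:
    """Turn relative inline link destinations into absolute URLs."""
    def repl(match):
        prefix, dest = match.group(1), match.group(2)
        # Leave absolute URLs, upload:// links and in-page anchors alone.
        if HAS_SCHEME.match(dest) or dest.startswith("#"):
            return match.group(0)
        return prefix + urljoin(base_url, dest)

    return INLINE_LINK.sub(repl, markdown)
```

A regex is the quick route; a real Markdown parser would be more robust against links inside code blocks.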

The JSON produced in (1) does not seem to allow this conversion, at least not obviously (maybe there’s a way to re-parse the Markdown input and loop against the links list), so I’m wondering whether the API can do that: retrieve the original Markdown, but with externally usable URLs. In fact, I wonder if such a view could be made available the way raw is. This would immensely simplify reusing content outside Discourse, or importing it into another Discourse instance (think [fediverse]).

I could go for the HTML and work from there, but it seems much cleaner to process the Markdown files (also because other sources we want to combine with might use Markdown). Any suggestions on how to approach this problem, and on possible courses of action to understand the issue better or make it work in future versions, are welcome!

gerhard · November 21, 2018, 12:14am

First, you need to decide if you want to have Markdown or HTML.
Next, you need to either parse the Markdown or HTML and replace relative paths with absolute URLs, or use a simple regex for find and replace. If you go with Markdown, you’ll also need to handle the special upload:// URLs.
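If you go with HTML, the “simple regex” variant could look roughly like this sketch. It assumes attributes are double-quoted, as in typical cooked output, and only touches href and src; the function name is just for illustration.

```python
import re
from urllib.parse import urljoin

# Double-quoted href/src attributes only; srcset and friends are ignored.
URL_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize_html(cooked: str, base_url: str) -> str:
    """Replace relative href/src values with absolute URLs."""
    def repl(match):
        name, value = match.group(1), match.group(2)
        # Keep empty values and in-page anchors as they are.
        if not value or value.startswith("#"):
            return match.group(0)
        # urljoin returns already-absolute URLs unchanged.
        return f'{name}="{urljoin(base_url, value)}"'

    return URL_ATTR.sub(repl, cooked)
```

For anything beyond quick scraping, a proper HTML parser is the safer choice.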

You could either do this stuff with the API or write a plugin that adds a new route for rendering the modified Markdown or HTML. If you want everything to look exactly like it does in Discourse, I’d choose to generate HTML using a plugin.

hellekin · November 21, 2018, 2:13pm

Thanks @gerhard for this starter. You propose creating a plugin to provide the wanted views. This works fine if I’m in control of the Discourse instance. But this won’t fly for random scraping from various sources.

Of course, having Discourse render externally usable Markdown files would be useful, but that is not the case yet, unless the API already allows it.

How would one go about converting these upload links? Is it a simple conversion from upload://X to https://discourse.example/path/to/X, or is there more magic involved? Is this “path/to” discoverable, or do I have to know it beforehand? These are the questions I’m asking myself…
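To make the question concrete, this is what the “simple conversion” would look like if it holds. The prefix is the unknown “path/to” part; the default used below, /uploads/short-url/, is only an assumption about the target instance and is not confirmed anywhere in this thread.

```python
import re

# upload://<identifier>[.<extension>]
UPLOAD_LINK = re.compile(r'upload://([A-Za-z0-9]+(?:\.[A-Za-z0-9]+)?)')


def rewrite_upload_links(markdown: str, base_url: str,
                         path_prefix: str = "/uploads/short-url/") -> str:
    """Naively map upload://X to <base_url><path_prefix>X.

    path_prefix is an assumption; check it against the instance you scrape.
    """
    base = base_url.rstrip("/")
    return UPLOAD_LINK.sub(lambda m: f"{base}{path_prefix}{m.group(1)}", markdown)
```

If the prefix turns out to differ per instance, it would have to be supplied alongside each source URL.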
