Converting links from raw markdown to HTML

Context

I’m trying to retrieve specific posts to turn them into a single corpus (e.g., a book) to be processed elsewhere (e.g., with Pandoc).

‘Simple’ Approach

Discourse provides two readily accessible ways to download specific posts:

  1. Appending .json to a URL gives its JSON representation, which includes the cooked HTML version
  2. Using the topic/post numbers from the original URL and replacing the prefix with /raw gives the original Markdown version, e.g., from https://discourse.example/t/some-topic/123/4 to https://discourse.example/raw/123/4
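A minimal sketch of both URL rewrites, assuming post URLs always have the shape /t/<slug>/<topic>/<post> shown above. The helper name `post_urls` is made up for illustration and is not part of Discourse.

```python
from urllib.parse import urlsplit, urlunsplit


def post_urls(post_url: str) -> dict:
    """Derive the .json and /raw URLs from a /t/<slug>/<topic>[/<post>] URL.

    Hypothetical helper: it assumes the slug is always present in the path.
    """
    parts = urlsplit(post_url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected segments: ["t", slug, topic_id] or ["t", slug, topic_id, post_number]
    if len(segments) < 3 or segments[0] != "t":
        raise ValueError(f"not a topic URL: {post_url}")
    ids = segments[2:4]
    if not all(s.isdigit() for s in ids):
        raise ValueError(f"topic/post numbers must be numeric: {post_url}")

    raw_path = "/raw/" + "/".join(ids)
    json_path = parts.path.rstrip("/") + ".json"
    return {
        "raw": urlunsplit((parts.scheme, parts.netloc, raw_path, "", "")),
        "json": urlunsplit((parts.scheme, parts.netloc, json_path, "", "")),
    }


urls = post_urls("https://discourse.example/t/some-topic/123/4")
print(urls["raw"])
print(urls["json"])
```

Feeding it a file of post URLs, one per line, gives the list of /raw addresses to fetch.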

Using the second approach, I list the topic numbers in a file and fetch the Markdown source for each. This works fine as long as the posts contain no hyperlinks or attached files.

API Approach?

Two problems to solve:

  • relative links must be turned into absolute links if we want them clickable outside Discourse
  • upload:// links must be converted to absolute links as well, or to relative links if the assets are downloaded.

The JSON output from (1) does not seem to allow this conversion, at least not obviously (maybe there’s a way to re-parse the Markdown input and match it against the links list), so I’m wondering whether the API can do this: retrieve the original Markdown, but with externally usable URLs. In fact, I wonder if such a view could be made available the way raw is. This would immensely simplify reusing content outside Discourse, or importing content into another Discourse instance (think [fediverse]).

I could go for the HTML and work from there, but it seems much cleaner to process the Markdown files (especially since other sources we want to combine them with might also use Markdown). Any suggestions on how to approach this problem, and on possible courses of action to understand the issue better or make it work in future versions, are welcome!

First, you need to decide if you want to have Markdown or HTML.
Next, you need to either parse the Markdown or HTML and replace relative paths with absolute URLs or use a simple regex for find and replace. If you go with Markdown you’ll also need to handle the special upload:// URLs.
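The regex route for Markdown could look roughly like this. It is a sketch under stated assumptions, not a full Markdown parser: it only handles inline links and images of the form `](target)`, ignores reference-style links, and will also rewrite matches inside code blocks. The name `absolutize_links` is made up for illustration.

```python
import re
from urllib.parse import urljoin

# Captures the target of an inline Markdown link or image: "](target"
LINK_TARGET = re.compile(r'(\]\()([^)\s]+)')
# Anything starting with a URI scheme (https:, mailto:, upload:, ...)
HAS_SCHEME = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:')


def absolutize_links(markdown: str, base_url: str) -> str:
    """Rewrite relative inline link targets so they point at base_url."""

    def fix(match: re.Match) -> str:
        target = match.group(2)
        # Leave absolute URLs, upload:// links and in-page anchors untouched
        if HAS_SCHEME.match(target) or target.startswith("#"):
            return match.group(0)
        return match.group(1) + urljoin(base_url, target)

    return LINK_TARGET.sub(fix, markdown)


text = "See [this](/t/other/5) and [that](https://a.example/x)."
print(absolutize_links(text, "https://discourse.example/"))
```

The upload:// links are deliberately skipped here, since they need their own handling.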

You could either do this stuff with the API or write a plugin that adds a new route for rendering the modified Markdown or HTML. If you want everything to look exactly like it does in Discourse, I’d choose to generate HTML using a plugin.

Thanks @gerhard for this starter. You propose creating a plugin to provide the wanted views. This works fine if I’m in control of the Discourse instance. But this won’t fly for random scraping from various sources.

Of course, having Discourse render externally usable Markdown files would be useful, but that is not possible yet, unless the API already allows it.

How would one go about converting these upload links? Is it a simple conversion from upload://X to https://Discourse.example/path/to/X, or is there more magic involved? Is this “path/to” discoverable, or do I have to know it beforehand? These are the questions I’m asking myself…
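If the conversion does turn out to be a plain prefix swap, it could be sketched as below. The default prefix `/uploads/short-url/` is an assumption that would need to be checked against the target instance; it is exactly the “path/to” in question, so it is kept as a parameter. The name `rewrite_uploads` is made up for illustration.

```python
import re

# upload://<identifier>[.<extension>]
UPLOAD_LINK = re.compile(r'upload://([A-Za-z0-9]+(?:\.[A-Za-z0-9]+)?)')


def rewrite_uploads(markdown: str, base_url: str,
                    prefix: str = "/uploads/short-url/") -> str:
    """Replace upload://X with <base_url><prefix>X.

    ASSUMPTION: the prefix is a guess to be verified per instance.
    """
    base = base_url.rstrip("/")
    return UPLOAD_LINK.sub(lambda m: base + prefix + m.group(1), markdown)


text = "![diagram|100x50](upload://abc123.png)"
print(rewrite_uploads(text, "https://discourse.example"))
```

If the assets are downloaded instead, the same substitution can point at a local directory by passing a relative path in place of the base URL and prefix.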