A basic Discourse archival tool

kyle315a · July 31, 2022, 10:09pm

A Discourse forum that I use is being taken offline in a couple of weeks, so I set out to archive the site. After a lot of research and trial and error, I found a simple solution with HTTrack. Here’s everything I learned.

Archive a Discourse site with HTTrack
For Windows users, the best solution appears to be HTTrack. It worked great and archived the site to HTML files. All categories, threads, and posts were archived, including all pages with relative navigation links.

A basic tutorial on HTTrack is here. I left the settings at their defaults except for the following custom settings.

Web Addresses:
- https://forums.gearboxsoftware.com/c/homeworld/
- https://forums.gearboxsoftware.com/c/homeworld-dok/
Scan Rules:
- -gearboxsoftware.com/* -forums.gearboxsoftware.com/* +forums.gearboxsoftware.com/c/homeworld/* +forums.gearboxsoftware.com/c/homeworld-dok/* +forums.gearboxsoftware.com/t/* +forums.gearboxsoftware.com/user_avatar/* +sea2.discourse-cdn.com/*
Browser ID (aka User Agent):
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
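
For anyone who prefers the command line over the HTTrack GUI, here is a rough sketch of how the same settings might be passed to the `httrack` command. I only used the GUI myself, so treat this as an assumption: it relies on `-O` (output folder) and `-F` (user agent) and on scan rules being given as plain `+`/`-` arguments. The output folder name `discourse-archive` is made up for the example. The script only builds and prints the command, it does not run it.

```python
import shlex

# Values copied from the settings listed above.
START_URLS = [
    "https://forums.gearboxsoftware.com/c/homeworld/",
    "https://forums.gearboxsoftware.com/c/homeworld-dok/",
]
SCAN_RULES = [
    "-gearboxsoftware.com/*",
    "-forums.gearboxsoftware.com/*",
    "+forums.gearboxsoftware.com/c/homeworld/*",
    "+forums.gearboxsoftware.com/c/homeworld-dok/*",
    "+forums.gearboxsoftware.com/t/*",
    "+forums.gearboxsoftware.com/user_avatar/*",
    "+sea2.discourse-cdn.com/*",
]
USER_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_httrack_command(output_dir):
    """Return the argument list for an httrack run. Nothing is executed here."""
    # Assumed CLI mapping: -O = output folder, -F = user agent string.
    return ["httrack", *START_URLS, "-O", output_dir, *SCAN_RULES, "-F", USER_AGENT]


if __name__ == "__main__":
    # "discourse-archive" is a hypothetical output folder name.
    print(shlex.join(build_httrack_command("discourse-archive")))
```

Printing the command first makes it easy to check the quoting of the scan rules before pasting it into a terminal.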

Note: There’s a CSS issue preventing category links from working, but that can easily be fixed as described below.

CSS Issue
When viewing category pages as Googlebot, the thread links don’t work. An example is [here](https://web.archive.org/web/20220731051419/https://forums.gearboxsoftware.com/c/homeworld/57).

This makes navigation impossible on category pages in HTTrack, archive.org, and Google cache. This appears to be a Discourse issue in a CSS file. To fix the links, simply block/delete the following CSS file:

stylesheets/desktop_theme_10_1965d1d398092f2d9f956b36e08b127e00f53b70.css?__ws=forums.gearboxsoftware.com
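
If you would rather leave the file in place, another option is to strip the stylesheet reference from the archived HTML pages. This is only a sketch under a few assumptions: that the pages load the file through a `<link>` tag, that the saved file name still contains `desktop_theme_`, and that the mirror is a folder of `.html` files. The hash in the name will differ on other sites. `clean_mirror` and its folder argument are names I made up for the example.

```python
import re
from pathlib import Path

# Matches any <link ...> tag whose attributes mention the theme stylesheet.
# The "desktop_theme_" prefix is taken from the file name quoted above.
THEME_LINK = re.compile(r"<link\b[^>]*desktop_theme_[^>]*>\s*", re.IGNORECASE)


def strip_theme_stylesheet(html):
    """Return the HTML with the theme stylesheet <link> tags removed."""
    return THEME_LINK.sub("", html)


def clean_mirror(root):
    """Rewrite every .html file under root that references the stylesheet.

    Returns the number of files that were changed.
    """
    changed = 0
    for page in Path(root).rglob("*.html"):
        # surrogateescape keeps any odd bytes intact on the round trip.
        original = page.read_text(encoding="utf-8", errors="surrogateescape")
        cleaned = strip_theme_stylesheet(original)
        if cleaned != original:
            page.write_text(cleaned, encoding="utf-8", errors="surrogateescape")
            changed += 1
    return changed
```

Run `clean_mirror` on a copy of the archive first, since it edits the files in place.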

@codinghorror - Can you guys address this?

Challenges
I ran into the following challenges and eventually overcame them after much trial and error.

- Discourse pages are dynamically generated with JavaScript, which makes for poor results with most archive/crawler tools.
- Most threads load with only roughly the first 20 posts; the rest don’t appear until you scroll down. Pressing Ctrl+P loads a /print page with all posts visible. Users are limited to printing 5 pages an hour with print mode, but this limit can be increased by a Discourse site admin.
- Adrelanos noted that multi-page threads weren’t being archived properly by HTTrack. I suspect this was due to his HTTrack settings, as I did not have this issue.
- Saving a page to PDF won’t include any collapsed details sections.
- Pages can be loaded in basic HTML by adding ?_escaped_fragment_ to the end of a URL, but this trick only works for threads, not categories.
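
The two URL tricks mentioned above (the /print page and ?_escaped_fragment_) can be sketched as small helpers. This is just an illustration: it assumes the print page lives at the topic URL with `/print` appended, and `forum.example.com` in the usage note is a placeholder, not a real forum. The function names are my own.

```python
from urllib.parse import urlsplit, urlunsplit


def print_url(topic_url):
    """Return the print-mode URL for a topic (assumes /print is appended to the topic path)."""
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/") + "/print"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def escaped_fragment_url(topic_url):
    """Return the topic URL with _escaped_fragment_ added to the query string."""
    parts = urlsplit(topic_url)
    if parts.query:
        query = parts.query + "&_escaped_fragment_"
    else:
        query = "_escaped_fragment_"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

For example, `escaped_fragment_url("https://forum.example.com/t/some-topic/123")` gives `https://forum.example.com/t/some-topic/123?_escaped_fragment_`. Keep the print-mode rate limit in mind if you generate many /print URLs.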

The above challenges aren’t a concern once you learn that all Discourse pages/content can be rendered properly as HTML for crawlers. To get the HTML version of pages, you must change your crawler’s or browser’s user agent to Googlebot.
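
To check quickly what a crawler will see, you can request a page with the Googlebot user agent from a script. Here is a minimal sketch using only the Python standard library. The user agent string is the same one I used in HTTrack; the function names and the 30 second timeout are my own choices, not anything Discourse requires.

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def googlebot_request(url):
    """Build a request that identifies itself with the Googlebot user agent."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_html(url, timeout=30):
    """Download a page as Googlebot and return its body as text."""
    with urllib.request.urlopen(googlebot_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html` on a thread URL and comparing the result with what a normal browser request returns should show whether the forum serves the crawler version.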

Archive.org
If you use the “Save Page Now” feature on web.archive.org, it will archive the JavaScript version of Discourse with poor results. Archive.org uses the user agent of the person requesting the archive, so you must change your user agent to Googlebot. You can get a Chrome extension called “User-Agent Switcher for Chrome”. In the options add:

Name: Googlebot
String: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Group: Chrome
Indicator Flag: 1

Alternative Archive Tools
Many tools are listed here: Archive an old forum "in place" to start a new Discourse forum
I also briefly tested GUI tools like Cyotek WebCopy, A1 Website Download, and WAIL.
Command line tools include mcmcclur’s tool and wget. A tutorial on wget is [here](https://letswp.justifiedgrid.com/download-discourse-forum-wget/).
However, for Windows users, the best solution appears to be HTTrack.

Note: Since I’m a new user, I’m limited to two links in a post. Hence I turned some links into preformatted text.
