A basic Discourse archival tool

A Discourse forum that I use is being taken offline in a couple of weeks, so I set out to archive the site. After a lot of research and trial and error, I found a simple solution with HTTrack. Here’s everything I learned.

Archive a Discourse site with HTTrack
For Windows users, the best solution appears to be HTTrack. It worked great and archived the site to HTML files: all categories, threads, and posts, including all pages, with relative navigation links.

A basic tutorial on HTTrack is here. I left the settings at their defaults, except for the following custom settings.

  • Web Addresses:
    • https://forums.gearboxsoftware.com/c/homeworld/
    • https://forums.gearboxsoftware.com/c/homeworld-dok/
  • Scan Rules:
    • -gearboxsoftware.com/* -forums.gearboxsoftware.com/* +forums.gearboxsoftware.com/c/homeworld/* +forums.gearboxsoftware.com/c/homeworld-dok/* +forums.gearboxsoftware.com/t/* +forums.gearboxsoftware.com/user_avatar/* +sea2.discourse-cdn.com/*
  • Browser ID (aka User Agent):
    • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
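For anyone who prefers HTTrack's command-line version over the GUI, the same settings can be sketched as an argument list. This is a rough sketch under assumptions, not the exact run described above: it assumes an `httrack` executable is on the PATH, the output folder name `homeworld-archive` is made up, and it relies on HTTrack's `-O` (output path) and `-F` (user-agent) options, with scan rules passed as `+`/`-` filter arguments.

```python
# Values copied from the settings listed above.
START_URLS = [
    "https://forums.gearboxsoftware.com/c/homeworld/",
    "https://forums.gearboxsoftware.com/c/homeworld-dok/",
]
SCAN_RULES = [
    "-gearboxsoftware.com/*",
    "-forums.gearboxsoftware.com/*",
    "+forums.gearboxsoftware.com/c/homeworld/*",
    "+forums.gearboxsoftware.com/c/homeworld-dok/*",
    "+forums.gearboxsoftware.com/t/*",
    "+forums.gearboxsoftware.com/user_avatar/*",
    "+sea2.discourse-cdn.com/*",
]
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_httrack_command(output_dir):
    """Return the argument list for a command-line HTTrack run.

    output_dir is a hypothetical folder name chosen by the caller.
    """
    return ["httrack", *START_URLS, "-O", output_dir, *SCAN_RULES, "-F", GOOGLEBOT_UA]


if __name__ == "__main__":
    # Print one argument per line so the command is easy to inspect.
    for arg in build_httrack_command("homeworld-archive"):
        print(arg)
```

Passing the list to `subprocess.run(build_httrack_command("homeworld-archive"), check=True)` would start the crawl without any shell quoting problems around the `*` wildcards and the user agent string.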

Note: A CSS issue prevents the thread links on category pages from working, but it can easily be fixed as described below.

CSS Issue
When viewing category pages as Googlebot, the thread links don’t work. An example is [here](https://web.archive.org/web/20220731051419/https://forums.gearboxsoftware.com/c/homeworld/57).

This makes navigation impossible on category pages in HTTrack, archive.org, and Google’s cache. It appears to be a Discourse issue in a CSS file. To fix the links, simply block or delete the following CSS file:

  • stylesheets/desktop_theme_10_1965d1d398092f2d9f956b36e08b127e00f53b70.css?__ws=forums.gearboxsoftware.com

@codinghorror - Can you guys address this?
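If the mirror has already been downloaded, one alternative to blocking the file is to strip the stylesheet reference out of the saved pages. The sketch below is an assumption-heavy illustration rather than the fix described above: it assumes the pages are UTF-8 `.html` files, that the stylesheet is pulled in through a `<link>` tag, and that matching on the `desktop_theme_10_` prefix is enough (the hash in the file name may differ on other sites). Try it on a copy of the archive first.

```python
import re
from pathlib import Path

# Matches any <link ...> tag whose attributes mention the theme stylesheet.
# Only the "desktop_theme_10_" prefix is matched, not the full hashed name.
THEME_LINK = re.compile(r"<link\b[^>]*desktop_theme_10_[^>]*>\s*", re.IGNORECASE)


def strip_theme_stylesheet(html):
    """Return html with <link> tags for the desktop_theme_10 stylesheet removed."""
    return THEME_LINK.sub("", html)


def fix_mirror(mirror_dir):
    """Rewrite matching .html files under mirror_dir in place.

    Returns the number of files that were changed. Assumes UTF-8 pages.
    """
    changed = 0
    for path in Path(mirror_dir).rglob("*.html"):
        original = path.read_text(encoding="utf-8")
        cleaned = strip_theme_stylesheet(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling `fix_mirror("homeworld-archive")` (a made-up folder name) would process every saved page; running it a second time should report zero changes.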

Challenges
I ran into the following challenges and eventually overcame them after much trial and error.

  • Discourse pages are dynamically generated with JavaScript. This makes for poor results with most archive/crawler tools.
  • Most threads load only the first 20 or so posts; the rest don’t appear until you scroll down. Pressing Ctrl+P loads a /print page with all posts visible. Users are limited to printing 5 pages an hour in print mode, but a Discourse site admin can raise this limit.
  • Adrelanos noted that multi-page threads weren’t being archived properly by HTTrack, but I suspect this was due to his HTTrack settings, as I did not run into it.
  • Saving a page to PDF won’t include any collapsed details sections.
  • Pages can be loaded in basic HTML by adding ?_escaped_fragment_ to the end of a URL, but this trick only works for threads, not categories.
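The two URL tricks in the list above can be sketched as small helpers. This is only an illustration: the function names are made up, the example forum address is hypothetical, and it assumes the print view lives at the thread URL with `/print` appended.

```python
from urllib.parse import urlsplit, urlunsplit


def print_url(topic_url):
    """Append /print to a thread URL's path (assumed location of the print view)."""
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/") + "/print"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def escaped_fragment_url(topic_url):
    """Add the bare _escaped_fragment_ parameter to a thread URL."""
    parts = urlsplit(topic_url)
    if parts.query:
        # Keep any existing query string and append the parameter to it.
        query = parts.query + "&_escaped_fragment_"
    else:
        query = "_escaped_fragment_"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


if __name__ == "__main__":
    topic = "https://forum.example.com/t/some-topic/123"  # hypothetical thread
    print(print_url(topic))
    print(escaped_fragment_url(topic))
```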

The above challenges stop being a concern once you learn that Discourse can render all of its pages and content as plain HTML for crawlers. To get the HTML version of pages, change your crawler’s or browser’s user agent to Googlebot.
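To check this for a single page outside of a crawler, the user agent switch can be sketched with Python's standard library. The function names and the 30 second timeout are arbitrary choices for illustration; the user agent string is the Googlebot one used elsewhere in this post.

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def googlebot_request(url):
    """Build a request that identifies itself with the Googlebot user agent."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_as_googlebot(url, timeout=30):
    """Download url and return the response body decoded as text."""
    with urllib.request.urlopen(googlebot_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Saving the result of `fetch_as_googlebot(...)` for a category URL and opening it in a browser is a quick way to see what a crawler would receive before starting a full archive run.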

Archive.org
If you use the “Save Page Now” feature on web.archive.org, it will archive the JavaScript version of Discourse, with poor results. Archive.org uses the user agent of the person requesting the archive, so you must change your user agent to Googlebot. One way to do this is a Chrome extension called “User-Agent Switcher for Chrome”. In its options, add:

  • Name: Googlebot
  • String: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Group: Chrome
  • Indicator Flag: 1

Alternative Archive Tools
Many tools are listed here: Archive an old forum "in place" to start a new Discourse forum
I also briefly tested GUI tools like Cyotek WebCopy, A1 Website Download, and WAIL.
Command-line tools include mcmcclur’s tool and wget. A tutorial on wget is [here](https://letswp.justifiedgrid.com/download-discourse-forum-wget/).
However, for Windows users, the best solution appears to be HTTrack.

Note: Since I’m a new user, I’m limited to two links in a post. Hence I turned some links into preformatted text.
