A basic Discourse archival tool

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

  • I definitely like my version better; no surprise there because I designed it the way I want it to look.
  • The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
  • I think it might make sense to start httrack at a subpage to generate something like this.
  • It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
  • My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"
  • The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
  • The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
  • The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
  • The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The few lines of my first input look like so:

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'

But again, there should be some changes made to the code to get it to work for non-mathematical forums.

5 Likes