@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:
Here are a few comments:
- I definitely like my version better; no surprise there because I designed it the way I want it to look.
- The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
- I think it might make sense to start httrack at a subpage to generate something like this.
- It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
- My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.
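A minimal sketch of how that MathJax detection might work, assuming each post is available as an HTML or Markdown string. The names `MATH_PATTERN` and `needs_mathjax` are hypothetical and not part of the archival tool, and the heuristic is only a guess at what counts as math. It looks for `$$...$$`, `\(...\)`, `\[...\]`, and `\begin{...}`, and deliberately ignores single `$` so that prices don't trigger it, which means it would miss inline math written with single dollar signs.

```python
import re

# Hypothetical heuristic: look for common TeX delimiters in a post.
# A single "$" is ignored on purpose because it often appears as currency.
MATH_PATTERN = re.compile(
    r"\$\$.+?\$\$"            # display math: $$ ... $$
    r"|\\\(.+?\\\)"           # inline math: \( ... \)
    r"|\\\[.+?\\\]"           # display math: \[ ... \]
    r"|\\begin\{[a-z*]+\}",   # LaTeX environments such as \begin{align}
    re.DOTALL,
)


def needs_mathjax(text):
    """Return True if the text appears to contain TeX-style math."""
    return MATH_PATTERN.search(text) is not None
```

A page generator could then include the MathJax script tag only when `needs_mathjax` returns `True` for at least one post on the page.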
The httrack command
The httrack version was generated with a command that looks like so:
httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"
- The `-https://uskojarukous.fi/users* -*.rss` filters prevent httrack from downloading files matching those patterns.
- The `-x -o` combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
- The `-M10000000` option restricts the total amount downloaded to 10MB. There appears to be some post-processing and downloading of supplemental files that makes the total larger than this anyway.
- The `--user-agent "Googlebot"` option should not be necessary if the forum is powered by a recent version of Discourse.
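Since the archival tool is Python anyway, the same httrack call could be assembled from Python. The following is only a sketch under that assumption: the variable names `site`, `httrack_args`, and `command` are made up for illustration, and the flags are exactly the ones from the command above with the comments restating the descriptions in the list.

```python
import shlex

site = "https://uskojarukous.fi"

# Same arguments as the command above, one per list element.
httrack_args = [
    "httrack", site,
    "-" + site + "/users*",       # filter: skip user profile pages
    "-*.rss",                     # filter: skip RSS feeds
    "-O", "uskojarukous_arxiv",   # output directory
    "-x",                         # replace external links with local error pages
    "-o",                         # generate a local page for errors
    "-M10000000",                 # cap on total download: 10,000,000 bytes
    "--user-agent", "Googlebot",
]

# shlex.join (Python 3.8+) quotes arguments containing "*" for the shell.
command = shlex.join(httrack_args)
print(command)

# To actually run it (requires httrack on the PATH):
# import subprocess
# subprocess.run(httrack_args, check=True)
```

Passing the list to `subprocess.run` without `shell=True` bypasses the shell entirely, so the `*` in the filter patterns reaches httrack unexpanded.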
The archival tool code
For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook, but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The first few lines of my first input look like so:
# Imports needed by the lines below
import os
from datetime import date

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
    date.today().strftime("%A %B %d, %Y") + '.'
Later, in input 6, I define `max_more_topics = 2`. Essentially, that defines a bound on `k` in this code here:
'/latest.json?page=k'
But again, the code needs some changes to work for non-mathematical forums.
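To make the role of `max_more_topics` concrete, here is a rough sketch of how a bound on `k` in `'/latest.json?page=k'` could drive the download loop. It is not the archival tool's actual code. The function names `latest_page_urls` and `collect_topics` and the injected `fetch_json` callable are hypothetical, I'm assuming `k` runs from 0 through `max_more_topics` inclusive, and I'm assuming each page's JSON holds its topics under `topic_list` and then `topics`.

```python
from urllib.parse import urljoin


def latest_page_urls(base_url, max_more_topics):
    """Yield the /latest.json URLs for pages 0 through max_more_topics.

    Assumption: the bound is inclusive, so 2 means pages 0, 1, and 2.
    """
    for k in range(max_more_topics + 1):
        yield urljoin(base_url, '/latest.json?page=' + str(k))


def collect_topics(base_url, max_more_topics, fetch_json):
    """Gather topic entries page by page, stopping early on an empty page.

    fetch_json is any callable that maps a URL to the decoded JSON dict.
    Assumption: topics live under page['topic_list']['topics'].
    """
    topics = []
    for url in latest_page_urls(base_url, max_more_topics):
        page = fetch_json(url)
        page_topics = page.get('topic_list', {}).get('topics', [])
        if not page_topics:
            break  # no more topics, so later pages are skipped
        topics.extend(page_topics)
    return topics


urls = list(latest_page_urls('https://uskojarukous.fi/', 2))
```

Passing the fetcher in as an argument keeps the sketch testable without network access; in a notebook it could be something like `lambda url: requests.get(url).json()` if the `requests` library is installed.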