A basic Discourse archival tool

A few tidbits for anyone else looking for httrack tips (httrack works great for my purposes).

  • A complete list of command line flags: HTTrack Website Copier - Offline Browser
  • The -s0 flag tells httrack to ignore robots.txt (useful if your site is not open to spiders)
  • If your site is behind a login, log in with your browser, export the cookies to a .txt file using a Chrome extension like cookies.txt, and place that file in the directory you’re running httrack from.
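For anyone scripting this, here is a minimal sketch of how the tips above could be combined into a single httrack invocation. The forum URL and output directory are placeholders, and the helper name `build_httrack_command` is made up for illustration. It assumes httrack is on your PATH and, per the tip above, that cookies.txt sits in the directory you run it from.

```python
import shlex

def build_httrack_command(url, output_dir, ignore_robots=True):
    """Build the argument list for an httrack mirror run.

    -O sets the output path; -s0 tells httrack never to follow robots.txt.
    """
    cmd = ["httrack", url, "-O", output_dir]
    if ignore_robots:
        cmd.append("-s0")
    return cmd

# Placeholder values: substitute your own forum URL and archive directory.
cmd = build_httrack_command("https://forum.example.com/", "./archive")

# Print the command so it can be pasted into a shell or a crontab entry.
# To run it from Python instead: subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if the URL or path ever contains unusual characters.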
6 Likes

I’m using httrack via cron to create an offline archive of our Discourse site. However, the user that httrack logs in as gets counted as a “view” on each topic, which badly inflates the view counts (the cron job runs every hour).

Is there a way to exclude a certain user from being recorded in the statistics / view stats for the site as a whole?

6 Likes

Good point, where would this be intercepted @sam?

1 Like

We have this method for tracking page views:

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

Hacking this out so certain users are not tracked would require either a plugin or some sort of daily query that nukes all the views by the user and also remembers to decrement the views count in the topics table.
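To illustrate the logic of such a cleanup query, here is a rough sketch run against an in-memory SQLite database. The schema is an assumption made for the demo: a `topic_views` table with one row per recorded view and a `topics` table with a `views` counter. A real Discourse install runs on PostgreSQL and its actual tables may differ, so treat this as the shape of the idea, not something to run against production.

```python
import sqlite3

def make_demo_db():
    """Create a toy database with an ASSUMED schema (not Discourse's real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topics (id INTEGER PRIMARY KEY, views INTEGER NOT NULL);
        CREATE TABLE topic_views (topic_id INTEGER, user_id INTEGER, viewed_at TEXT);
        INSERT INTO topics VALUES (1, 10), (2, 3);
        INSERT INTO topic_views VALUES
            (1, 7, '2024-01-01'), (1, 7, '2024-01-02'),
            (1, 8, '2024-01-01'), (2, 7, '2024-01-01');
    """)
    return conn

def purge_user_views(conn, user_id):
    """Delete every view row for user_id and decrement each topic's counter.

    Returns a dict mapping topic_id to the number of views removed.
    """
    cur = conn.cursor()
    # Count how many view rows this user contributed to each topic.
    cur.execute(
        "SELECT topic_id, COUNT(*) FROM topic_views "
        "WHERE user_id = ? GROUP BY topic_id",
        (user_id,),
    )
    per_topic = dict(cur.fetchall())
    # Reduce each topic's counter by that amount, never going below zero.
    for topic_id, removed in per_topic.items():
        cur.execute(
            "UPDATE topics SET views = MAX(views - ?, 0) WHERE id = ?",
            (removed, topic_id),
        )
    # Finally remove the view rows themselves.
    cur.execute("DELETE FROM topic_views WHERE user_id = ?", (user_id,))
    conn.commit()
    return per_topic

demo = make_demo_db()
# User id 7 stands in for the archive user in this demo.
print(purge_user_views(demo, 7))
```

Doing the counter update before the delete matters: once the rows are gone there is no way to know how much to subtract from each topic.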

4 Likes

For my purposes (a very minimally used site for internal comms), even a boilerplate script that I could run manually on occasion that says “nuke all views by user:archive” would be great.

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing.

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: GitHub - kitsandkats/ArchiveDiscourse: Code for archiving a Discourse site into static HTML, forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.
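For readers curious about the "first twenty" issue, here is a rough sketch of the idea, not the code from the repo. It assumes a topic's JSON contains a `post_stream` object with a `stream` list of every post ID and a `posts` list holding only the first batch of loaded posts. The helper names are made up for illustration.

```python
def missing_post_ids(topic_json):
    """Return IDs listed in post_stream.stream that are not yet loaded in post_stream.posts."""
    post_stream = topic_json["post_stream"]
    loaded = {post["id"] for post in post_stream["posts"]}
    # Preserve the stream's order so posts can be fetched in sequence.
    return [post_id for post_id in post_stream["stream"] if post_id not in loaded]

def chunked(ids, size=20):
    """Split a list of IDs into batches of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Toy topic with 45 posts, of which only the first 20 are loaded.
topic = {
    "post_stream": {
        "stream": list(range(1, 46)),
        "posts": [{"id": n} for n in range(1, 21)],
    }
}
batches = chunked(missing_post_ids(topic))
print([len(batch) for batch in batches])
```

Each batch of IDs would then be requested separately and appended to the archive in stream order.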

I’m very happy with how it turned out. Thanks again!

9 Likes

Hi, I just read through this whole thread and wanted to check whether this tool works if the Discourse forum is behind a login and password. How would I edit the code so that it can archive the site?

1 Like

As it is currently written, the code is not designed to access any material that requires a login. It should be pretty easy to set that up, though. The code interacts with the Discourse site via the Python Requests library, which does offer authentication. It’s possible that adding an auth=('user', 'pass') to the code at the appropriate points is all that’s required. I’m not currently running a Discourse site, so I can’t test that at the moment.
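As an untested sketch of what that might look like: a `requests.Session` can carry credentials so that every request made through it is authenticated. Note that `auth=('user', 'pass')` sends HTTP Basic authentication, which may not be what a given Discourse site expects. As an alternative, the sketch also shows the `Api-Key` and `Api-Username` headers used by the Discourse API. The URL, credentials, and helper name `make_session` are placeholders.

```python
import requests

def make_session(user=None, password=None, api_key=None, api_username=None):
    """Return a requests.Session that attaches credentials to every request."""
    session = requests.Session()
    if user is not None and password is not None:
        # Equivalent to passing auth=(user, password) on each call (HTTP Basic auth).
        session.auth = (user, password)
    if api_key is not None and api_username is not None:
        # Alternative: authenticate with a Discourse API key.
        session.headers.update({"Api-Key": api_key, "Api-Username": api_username})
    return session

# Placeholder credentials and URL; nothing is sent over the network here.
session = make_session(user="user", password="pass")
prepared = session.prepare_request(
    requests.Request("GET", "https://forum.example.com/latest.json")
)
# Inspect the header that would be sent with the request.
print(prepared.headers["Authorization"].split()[0])
```

Swapping the code's `requests.get(...)` calls for `session.get(...)` would then apply the credentials everywhere without touching each call site.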

5 Likes