A basic Discourse archival tool

Just a few tidbits for anyone else looking for httrack tips (it works great for my purposes).

  • A complete list of command line flags: HTTrack Website Copier - Offline Browser
  • The -s0 flag ignores robots.txt (useful if your account can't otherwise be spidered)
  • If your site is behind a login, you can export your cookies (once logged in) to a cookies.txt file using a Chrome extension like cookies.txt, and place that file in the directory you're running httrack from.
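Putting those tips together, a minimal invocation might look like this (the forum URL and output directory are placeholders; it assumes the exported cookies.txt sits in the working directory):

```shell
# Hypothetical: mirror a login-protected forum, ignoring robots.txt (-s0).
# httrack picks up cookies.txt from the directory it is run from.
httrack "https://forum.example.com/" -O ./forum-archive -s0
```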

I’m using httrack via cron to create an offline archive of our Discourse site. However, the user that httrack logs in as registers a “view” on each topic, giving super-inflated view counts for every topic (the cron job runs every hour).
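For context, the cron setup is roughly like this (URL and paths are placeholders; --update re-scans an existing mirror in place rather than starting over):

```shell
# Hypothetical crontab entry: refresh the mirror at the top of every hour.
# -s0 ignores robots.txt; --update updates the existing mirror without confirmation.
0 * * * * httrack "https://forum.example.com/" -O /srv/forum-archive -s0 --update
```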

Is there a way to exclude a certain user from being recorded in the statistics / view stats for the site as a whole?


Good point, where would this be intercepted @sam?


We have this method for tracking page views:

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

Hacking this out so that certain users are not tracked would require either a plugin or some sort of daily query that nukes all views by that user and also remembers to reduce the views count in the topics table.


For my purposes (a very minimally used site for internal comms), even a boilerplate script I could run manually on occasion that says “nuke all views by user:archive” would be great.
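A daily cleanup along the lines described above might look like the following. The `topic_views` table and `topics.views` column names are assumptions based on the description of per-user view rows and a denormalized count in the topics table, so verify them against your schema and try this on a backup first:

```sql
-- Hypothetical cleanup: decrement each topic's denormalized view count
-- by the number of views recorded for the archiving user...
UPDATE topics t
SET views = t.views - sub.n
FROM (
  SELECT tv.topic_id, COUNT(*) AS n
  FROM topic_views tv
  JOIN users u ON u.id = tv.user_id
  WHERE u.username = 'archive'
  GROUP BY tv.topic_id
) sub
WHERE t.id = sub.topic_id;

-- ...then remove those per-user view rows themselves.
DELETE FROM topic_views
USING users u
WHERE u.id = topic_views.user_id
  AND u.username = 'archive';
```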

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing 🙂

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: GitHub - kitsandkats/ArchiveDiscourse: Code for archiving a Discourse site into static HTML, forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.

I’m very happy with how it turned out. Thanks again!


Hi, I just read through this whole thread and wanted to check: does this tool work if the Discourse forum is behind a login and password? How would I edit the code so it will allow me to archive the site?


As it is currently written, the code is not designed to access any material that requires a login. It should be pretty easy to set that up, though. The code interacts with the Discourse site via the Python Requests library which does offer authentication. It’s feasible that adding an auth=('user', 'pass') to the code at the appropriate points is all that’s required. I’m not currently running a Discourse site so I can’t test that at the moment.
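As a sketch of that suggestion (the URL, credentials, and helper function below are placeholders for illustration, not part of the original code), the change would be along these lines:

```python
# Hypothetical sketch: pass HTTP Basic credentials to the Requests calls
# that fetch pages from the Discourse site.
import requests


def fetch_json(url, user, password):
    """Fetch a Discourse JSON endpoint using HTTP Basic auth."""
    resp = requests.get(url, auth=(user, password), timeout=30)
    resp.raise_for_status()
    return resp.json()


# Usage (placeholder URL and credentials):
# topic = fetch_json("https://forum.example.com/t/some-topic/123.json",
#                    "archive_user", "secret")
```

Whether plain Basic auth is enough depends on how your Discourse instance handles logins; a cookie-based session may be needed instead.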


httrack does not work for me. Using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is quite promising, but long forum threads with multiple pages come out incomplete. Once I click on “page 2”, it does not work, i.e.:

  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (does not fetch from external resources), but
  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Perhaps httrack can be told somehow to “use print mode”?

Perhaps httrack can be told to “append /print at the end”?

Is there a user agent setting which shows the whole forum thread on a single page? If not, could you please add this feature? You already implemented print mode, so most of the work is done. What’s left is a user agent that causes the “print mode” content to be served to the crawler. Alternatively, if you don’t like the idea of a custom user agent for this purpose, what about an HTTP header or cookie instead?

ArchiveDiscourse, improved/forked by @kitsandkats, is also broken for me.

Could you please also consider implementing /print for the front page and category pages?

Quoting myself in I don't like infinite scrolling and want to disable it:

(Temporarily) disabling infinite scroll (for some user agents) would make it possible to archive Discourse with the httrack web archive tool.


Python Requests will automatically use .netrc for authentication if required (but it needs to get a 401 HTTP response first).
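For anyone unfamiliar with the format: a .netrc entry is one line per host, and the stdlib netrc module shows exactly what Requests would read from it (hostname and credentials below are placeholders):

```python
# Sketch: what a .netrc entry looks like and what gets parsed out of it.
# Requests falls back to ~/.netrc when no auth= is given and the server
# answers 401; here we parse an example file with the stdlib netrc module.
import netrc
import os
import tempfile

netrc_text = "machine forum.example.com login archive_user password secret\n"

with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(netrc_text)
    path = f.name

login, account, password = netrc.netrc(path).authenticators("forum.example.com")
print(login, password)  # → archive_user secret
os.remove(path)
```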


I’ve gotten good results with wget, including authentication. Described here:
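For the record, an authenticated wget mirror typically looks something like this (the URL is a placeholder, and reusing a browser-exported cookies.txt is one common approach, not necessarily the exact method described above):

```shell
# Hypothetical: mirror a login-protected site with wget, reusing browser cookies.
# --mirror recurses with timestamping; --convert-links rewrites links for
# offline browsing; --page-requisites grabs CSS/JS/images.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --load-cookies cookies.txt \
     "https://forum.example.com/"
```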