A basic Discourse archival tool

mcmcclur · May 12, 2017, 8:20

Archival tool updated with Codex in May 2026

It seems that saving an entire Discourse site in static form is fairly tricky. According to this post by Jeff Atwood, it’s “much harder than you’d think”. It also doesn’t seem to be a priority for the Discourse team, which is perfectly understandable.

For my part, though, I’ve found that I really need a way to generate basic static HTML versions of my Discourse sites. I’ve used Discourse for a few years as a discussion forum for the college math classes I teach; every few months, I retire a site or two and launch one or two new ones. Obviously, the discussions on the retired sites have value, so I really needed a way to save them. In the end, I decided to build my own tool.

The basic idea is simple: use the Discourse API to walk the site, fetch the cooked version of each post, and turn it into HTML. The tool focuses mainly on my own needs as a college math professor who uses small Discourse forums to support my classes. So mathematical content, such as f(x)=e^{-x^2}, should be typeset automatically with MathJax V4, and fenced code blocks tagged sage should be converted into live Sage cells.
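To make the "cooked post to HTML" step concrete, here is a minimal sketch. It is not the tool's actual code. It assumes the topic JSON has the shape returned by /t/<id>.json (a title plus post_stream.posts entries carrying username and cooked); the function name render_topic is made up for illustration.

```python
from html import escape


def render_topic(topic):
    """Return a static HTML page for one topic, given its decoded JSON.

    Assumes `topic` has "title" and "post_stream" -> "posts", where each
    post carries "username" and "cooked" (HTML already rendered by Discourse).
    """
    title = escape(topic["title"])
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
    ]
    for post in topic["post_stream"]["posts"]:
        author = escape(post["username"])
        # "cooked" is already HTML, so it is inserted without escaping.
        parts.append(f"<article><h2>{author}</h2>{post['cooked']}</article>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

A real archive page would also need the MathJax and Sage cell scripts in the head; those are left out here to keep the sketch short.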

If you’re interested, you can take a look at

A small portion of Discourse Meta,
The forum for my Mathematics for Machine Learning class, and/or
The GitHub repository.

Note

The update of the archival tool was done largely with Codex.

codinghorror · May 12, 2017, 8:29

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

Falco · May 12, 2017, 8:30

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.

mcmcclur · July 19, 2017, 3:01

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

My code explicitly supports MathJax, which is essential for my work.
(I’ll probably need to update my code to work with the new MathPlugin sometime)
I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if they were not downloaded.

Silvanus · July 19, 2017, 1:08

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums were (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

I scan the phpbb2 database into Discourse (there’s a migration tool)
I create a static HTML archive using Discourse.
I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

mcmcclur · July 19, 2017, 1:43

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

Silvanus · July 19, 2017, 2:14

I’m still in the phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. Been trying to install phpbb3 for a while now, but I get some weird problems all the time.

I’ll have to try that httrack, thanks.

mcmcclur · July 19, 2017, 4:18

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

I definitely like my version better; no surprise there because I designed it the way I want it to look.
The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
I think it might make sense to start httrack at a subpage to generate something like this.
It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.
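On the last point, one possible way to detect whether a page needs MathJax is to scan each cooked post for common TeX delimiters. The sketch below is only a heuristic and rests on an assumption: that the math still appears in the cooked HTML wrapped in $...$, $$...$$, \(...\) or \[...\]. The name needs_mathjax is made up for illustration.

```python
import re

# Heuristic patterns for TeX delimiters left in the cooked HTML.
_MATH_PATTERNS = [
    re.compile(r"\$\$.+?\$\$", re.DOTALL),   # display math: $$ ... $$
    re.compile(r"\\\(.+?\\\)", re.DOTALL),   # inline math: \( ... \)
    re.compile(r"\\\[.+?\\\]", re.DOTALL),   # display math: \[ ... \]
    # Inline math: $ ... $ on one line, with no space just inside the
    # delimiters, so that prices like "$5 and $10" are not matched.
    re.compile(r"(?<!\$)\$(?!\s)[^$\n]+?(?<!\s)\$(?!\$)"),
]


def needs_mathjax(cooked_html):
    """Return True if the cooked HTML appears to contain TeX math."""
    return any(p.search(cooked_html) for p in _MATH_PATTERNS)
```

The archiver could then add the MathJax script tag only to pages where at least one post makes needs_mathjax return True. False positives are harmless here; they just load a script that finds nothing to typeset.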

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"

The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook, but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The first few lines of my first input look like so:

import os
from datetime import date

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'

But again, there should be some changes made to the code to get it to work for non-mathematical forums.
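To illustrate the paging bound described above, here is a minimal sketch rather than the notebook's actual code. It assumes k runs from 0 through max_more_topics inclusive, and that each listing page has the topic_list.topics shape that /latest.json returns. The fetch_json callable is hypothetical: anything that maps a path to decoded JSON, for example a thin wrapper around an HTTP GET against base_url.

```python
def latest_page_paths(max_more_topics):
    """Return the /latest.json paths for pages 0 through max_more_topics."""
    return [f"/latest.json?page={k}" for k in range(max_more_topics + 1)]


def collect_topic_ids(fetch_json, max_more_topics):
    """Walk the listing pages and return topic ids in first-seen order.

    fetch_json is a hypothetical callable mapping a path to decoded JSON.
    """
    seen = []
    for path in latest_page_paths(max_more_topics):
        topics = fetch_json(path)["topic_list"]["topics"]
        if not topics:  # an empty page means the listing is exhausted
            break
        for topic in topics:
            # The same topic can show up on more than one page.
            if topic["id"] not in seen:
                seen.append(topic["id"])
    return seen
```

With max_more_topics = 2 this requests at most three listing pages, which is why the result is only a partial archive of a large forum.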

Silvanus · July 19, 2017, 4:35

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

mcmcclur · July 19, 2017, 4:39

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

Silvanus · July 19, 2017, 10:42

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format!

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive…

Antroden · October 18, 2018, 2:25

A few tips for anyone looking for advice on httrack (which works great for my needs).

A complete list of command-line options: HTTrack Website Copier - Offline Browser
The -s0 option ignores the robots.txt file (useful if your site is not accessible to web crawlers).
If your site is behind a login, you can download a .txt file containing the cookies (once logged in) using a Chrome extension such as cookies.txt, then place that file in the directory from which you run httrack.

I use httrack via cron to create an offline archive of our Discourse site. However, the user that httrack logs in as is counted as a “view” on every topic, which considerably inflates the per-topic view counts (the cron job runs hourly).

Is there a way to exclude a specific user from the view statistics across the whole site?

codinghorror · October 18, 2018, 8:22

Good point, where would this be intercepted @sam?

sam · October 19, 2018, 12:48

We have this method for tracking page views:

github.com/discourse/discourse, app/controllers/topics_controller.rb (f0af61da4):

def should_track_visit_to_topic?
  !!((!request.format.json? || params[:track_visit]) && current_user)
end

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

github.com/discourse/discourse, app/models/topic_view_item.rb (f0af61da4):

# Only store a view once per day per thing per (user || ip)

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.

kamcc · January 15, 2019, 9:57

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: kitsandkats/ArchiveDiscourse (Code for archiving a Discourse site into static HTML), forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.

I’m very happy with how it turned out. Thanks again!
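For readers wondering what "grab all posts, not just the first twenty" involves, here is a minimal sketch of one way to do it. It is not necessarily how the fork does it. It assumes the topic JSON lists every post id in post_stream.stream while post_stream.posts holds only the first batch. The fetch_posts callable is hypothetical: something that takes a topic id and a list of post ids and returns the matching post dicts, for instance by wrapping a request to /t/<topic_id>/posts.json with post_ids[] parameters.

```python
def missing_post_ids(topic):
    """Return ids listed in post_stream.stream but absent from post_stream.posts."""
    loaded = {post["id"] for post in topic["post_stream"]["posts"]}
    return [pid for pid in topic["post_stream"]["stream"] if pid not in loaded]


def chunked(ids, size=20):
    """Split a list of ids into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_all_posts(topic, fetch_posts):
    """Return every post of a topic, in stream order.

    fetch_posts is a hypothetical callable taking (topic_id, post_ids)
    and returning a list of post dicts.
    """
    posts = {post["id"]: post for post in topic["post_stream"]["posts"]}
    for batch in chunked(missing_post_ids(topic)):
        for post in fetch_posts(topic["id"], batch):
            posts[post["id"]] = post
    # Rebuild the list in the order given by the stream.
    return [posts[pid] for pid in topic["post_stream"]["stream"] if pid in posts]
```

Fetching in batches keeps each request small; the batch size of 20 is an arbitrary choice for the sketch.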

johnnyboi5858 · December 4, 2019, 6:15

Hi, I just read through this whole thread and wanted to know whether this tool works if the Discourse forum is protected by a username and password. How can the code be modified to allow archiving the site?

mcmcclur · December 4, 2019, 1:05

As currently written, the code isn’t designed to access anything that requires a login. That should be fairly easy to set up, though. The code interacts with the Discourse site through the Python Requests library, which supports authentication. It’s quite possible that adding auth=('user', 'pass') to the code in the appropriate places is all that’s needed. I’m not currently running a Discourse site, so I can’t test this at the moment.
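A minimal sketch of that suggestion, with the same caveat that it is untested against a real Discourse site. Rather than adding auth= to every call, it sets the credentials once on a requests.Session. The helper names fetch_json and make_authenticated_session are made up for illustration, and whether a given forum accepts HTTP Basic credentials at all is an assumption.

```python
def fetch_json(session, base_url, path):
    """GET base_url + path with the given session and return the decoded JSON."""
    response = session.get(base_url.rstrip("/") + path, timeout=30)
    response.raise_for_status()  # fail loudly on 401/403 instead of archiving an error page
    return response.json()


def make_authenticated_session(user, password):
    """Return a requests.Session that sends HTTP Basic credentials on every request."""
    import requests  # third-party; imported here so fetch_json stays dependency-free

    session = requests.Session()
    session.auth = (user, password)  # same effect as auth=('user', 'pass') on each call
    return session
```

Usage would look like fetch_json(make_authenticated_session('user', 'pass'), base_url, '/latest.json').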

adrelanos · May 26, 2020, 1:53

httrack isn’t working for me. I’m using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is very promising, but long forum threads that span multiple pages are incomplete. As soon as I click on “page 2”, it stops working. That is:

file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (it doesn’t load external resources), but
file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Maybe httrack can be configured to “use print mode”?

example of the standard view of a forum thread
example of the print view of a forum thread (same URL, but with /print appended)

Maybe httrack can be configured to “append /print to the end”?

Is there a user-agent setting that shows the whole forum thread on a single page? If not, could you add that feature? You’ve already implemented print mode, so most of it is already there. All that’s missing is a user agent that would serve the crawler the content generated for “print mode”. Alternatively, if you don’t like the idea of a custom user agent for this purpose, how about an HTTP header or a cookie that could be used instead?

ArchiveDiscourse as enhanced/forked by @kitsandkats is also broken for me.

Could you also consider implementing /print for the front page and category pages?

Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3

(Temporarily) disabling infinite scrolling (for certain user agents) would make it possible to archive Discourse with the web archiving tool httrack.

saper · January 31, 2021, 12:30

Python Requests will automatically use the .netrc file for authentication if needed (but it has to receive an HTTP 401 response).

brechtm · March 1, 2021, 6:09

I’ve had good results with wget, including for authentication. Described here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14
