A basic Discourse archival tool

mcmcclur · May 12, 2017, 8:20pm

Archival tool updated with Codex as of May 2026

It seems that saving an entire Discourse site as a static version is rather complicated. According to this post by Jeff Atwood, it is “much harder than you might think”. Nor does this appear to be a priority for the Discourse team, which is perfectly understandable.

For my own purposes, however, I found that I really needed a way to generate basic static HTML versions of my Discourse sites. I’ve been using Discourse for a couple of years as a discussion forum for my college math classes, so every few months I retire one or two sites and bring up one or two new ones. Of course, the discussions on the retired sites have value, so I really needed a way to save them. In the end, I decided to write my own tool.

The basic idea is simple: use the Discourse API to crawl the site, retrieve the “cooked” version of each post, and turn it into HTML. The tool focuses mainly on my needs as a college math professor who uses small Discourse forums to support my classes. As a result, mathematical content such as f(x)=e^{-x^2} must be typeset automatically with MathJax V4, and fenced code blocks tagged as sage must be converted into live Sage Cells.
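A minimal sketch of that idea, not the tool's actual code. It assumes the topic JSON served at /t/<id>.json carries a post_stream.posts list whose entries have username and cooked fields; the function names are hypothetical, and MathJax and Sage Cell handling are left out.

```python
import json
from html import escape
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def topic_to_html(topic):
    """Render decoded topic JSON as a single static HTML page."""
    title = escape(topic["title"])
    parts = [
        f"<html><head><title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
    ]
    for post in topic["post_stream"]["posts"]:
        # "cooked" is HTML already rendered by Discourse, so it is inserted as-is.
        parts.append(f'<div class="post"><h2>{escape(post["username"])}</h2>')
        parts.append(post["cooked"])
        parts.append("</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


def archive_topic(base_url, topic_id):
    """Fetch one topic from a live forum and return its static HTML (assumed URL layout)."""
    return topic_to_html(fetch_json(f"{base_url.rstrip('/')}/t/{topic_id}.json"))
```

Keeping the rendering step separate from the network call means the HTML output can be checked against a saved JSON file without touching the forum.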

If you’re interested, you can take a look at

A small portion of Discourse Meta,
The forum for my Mathematics for Machine Learning course, and/or
The GitHub repository.

Note

The archival tool was updated primarily with Codex.

codinghorror · May 12, 2017, 8:29pm

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

Falco · May 12, 2017, 8:30pm

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.

mcmcclur · July 19, 2017, 3:01am

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

My code explicitly supports MathJax, which is essential for my work.
(I’ll probably need to update my code to work with the new MathPlugin sometime)
I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if not downloaded.

Silvanus · July 19, 2017, 1:08pm

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums were (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

I scan the phpbb2 database into Discourse (there’s a migration tool)
I create a static HTML archive using Discourse.
I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

mcmcclur · July 19, 2017, 1:43pm

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

Silvanus · July 19, 2017, 2:14pm

I’m still in the phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. Been trying to install phpbb3 for a while now, but I get some weird problems all the time.

I’ll have to try that httrack, thanks.

mcmcclur · July 19, 2017, 4:18pm

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

I definitely like my version better; no surprise there because I designed it the way I want it to look.
The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
I think it might make sense to start httrack at a subpage to generate something like this.
It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"

The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.
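For anyone who wants to script this (from cron, say), here is a small sketch that assembles the same argument list in Python. The helper names are hypothetical; the flags are exactly the ones quoted above, and running it assumes the httrack binary is installed.

```python
import subprocess


def build_httrack_command(site_url, output_dir, max_bytes=10_000_000, user_agent=None):
    """Assemble the httrack argument list described in the post."""
    base = site_url.rstrip("/")
    command = [
        "httrack", base,
        f"-{base}/users*",  # do not download user pages
        "-*.rss",           # do not download RSS feeds
        "-O", output_dir,   # directory that receives the mirror
        "-x", "-o",         # replace external links and errors with local error files
        f"-M{max_bytes}",   # cap on the total amount downloaded, in bytes
    ]
    if user_agent:
        # Per the post, only needed for forums on older Discourse versions.
        command += ["--user-agent", user_agent]
    return command


def run_httrack(command):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the * patterns.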

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The first few lines of my first input look like so:

import os
from datetime import date

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'
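To make the role of max_more_topics concrete, here is a hedged sketch of the paging loop. It assumes k runs from 0 through max_more_topics inclusive and that each page's JSON has a topic_list.topics list of objects with an id field; the function names are hypothetical and the notebook may structure this differently.

```python
import json
from urllib.request import urlopen


def latest_page_urls(base_url, max_more_topics):
    """URLs for pages k = 0..max_more_topics of the topic listing (inclusive bound assumed)."""
    root = base_url.rstrip("/")
    return [f"{root}/latest.json?page={k}" for k in range(max_more_topics + 1)]


def collect_topic_ids(pages):
    """Gather unique topic ids, in first-seen order, from decoded listing pages."""
    seen, ids = set(), []
    for page in pages:
        for topic in page.get("topic_list", {}).get("topics", []):
            if topic["id"] not in seen:
                seen.add(topic["id"])
                ids.append(topic["id"])
    return ids


def crawl_topic_ids(base_url, max_more_topics):
    """Fetch the listing pages from a live forum, stopping early at an empty page."""
    pages = []
    for url in latest_page_urls(base_url, max_more_topics):
        with urlopen(url) as response:
            page = json.load(response)
        if not page.get("topic_list", {}).get("topics"):
            break
        pages.append(page)
    return collect_topic_ids(pages)
```

With max_more_topics = 2 this requests at most three listing pages, which is why the result is only a partial archive of a large forum.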

But again, there should be some changes made to the code to get it to work for non-mathematical forums.

Silvanus · July 19, 2017, 4:35pm

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

mcmcclur · July 19, 2017, 4:39pm

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

Silvanus · July 19, 2017, 10:42pm

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format!

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive…

Antroden · October 18, 2018, 2:25pm

Just a few tidbits for anyone else looking for some httrack tips (which works great for my purposes).

A complete list of command line flags: HTTrack Website Copier - Offline Browser
Using the -s0 flag ignores the robots.txt (if you have a non-spider-able account)
If your site is behind a login, you can download a .txt file of the cookie (once logged in) using a chrome extension like cookies.txt and place that in the directory you’re running httrack from.

I’m using httrack via cron to create an offline archive of our Discourse site. However, the user that is logging in under httrack gets marked as a “view” for each topic, giving super-inflated numbers of views for each topic (the cron runs every hour).

Is there a way to exclude a certain user from being recorded in the statistics / view stats for the site as a whole?

codinghorror · October 18, 2018, 8:22pm

Good point, where would this be intercepted @sam?

sam · October 19, 2018, 12:48am

We have this method for tracking page views:

github.com/discourse/discourse
app/controllers/topics_controller.rb (f0af61da4)

    def should_track_visit_to_topic?
      !!((!request.format.json? || params[:track_visit]) && current_user)
    end

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

github.com/discourse/discourse
app/models/topic_view_item.rb (f0af61da4)

    # Only store a view once per day per thing per (user || ip)

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.
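To illustrate why an hourly crawl still adds up, here is a toy Python model of the "once per day per thing per (user || ip)" rule quoted above. It is only an illustration of the rule as stated in that comment, not Discourse's implementation; the tuple layout and function name are made up for the example.

```python
from datetime import date


def stored_views(hits):
    """Return the views kept when each (topic, day, viewer) combination is stored once."""
    kept = set()
    for topic_id, day, user_id, ip in hits:
        # A logged-in hit is keyed by user; an anonymous hit falls back to the IP.
        viewer = ("user", user_id) if user_id is not None else ("ip", ip)
        kept.add((topic_id, day, viewer))
    return kept


# An hourly crawler logged in as user 42 requests topic 7 twenty-four times in one day.
hourly_hits = [(7, date(2018, 10, 18), 42, "203.0.113.5")] * 24
print(len(stored_views(hourly_hits)))
```

Under this rule the crawler contributes one view per topic per day rather than one per run, so the inflation grows with the number of topics and days, not with the cron frequency.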

kamcc · January 15, 2019, 9:57pm

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: kitsandkats/ArchiveDiscourse on GitHub (code for archiving a Discourse site into static HTML), forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.
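One way to get past the first twenty posts, sketched under assumptions and not necessarily how the fork does it: the topic JSON is assumed to list every post id in post_stream.stream while post_stream.posts holds only the first batch, and the remaining ids are assumed to be fetchable in batches from /t/<id>/posts.json with repeated post_ids[] parameters. All function names here are hypothetical.

```python
from urllib.parse import urlencode


def missing_post_ids(topic):
    """Post ids listed in the topic's stream but absent from the posts already loaded."""
    loaded = {post["id"] for post in topic["post_stream"]["posts"]}
    return [pid for pid in topic["post_stream"]["stream"] if pid not in loaded]


def chunked(items, size=20):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def posts_url(base_url, topic_id, post_ids):
    """URL requesting a specific batch of posts from one topic (assumed endpoint)."""
    query = urlencode([("post_ids[]", pid) for pid in post_ids])
    return f"{base_url.rstrip('/')}/t/{topic_id}/posts.json?{query}"
```

A crawler would call posts_url once per batch from chunked(missing_post_ids(topic)) and append the returned posts to the ones it already has.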

I’m very happy with how it turned out. Thanks again!

johnnyboi5858 · December 4, 2019, 6:15am

Hi, I just read through this whole thread and wanted to check whether this tool also works if the Discourse forum is protected by a login and password. How would I need to modify the code so that I can archive the site?

mcmcclur · December 4, 2019, 1:05pm

As currently written, the code isn’t designed to access material that requires a login. It should be fairly simple to set that up, though. The code interacts with the Discourse site through the Python Requests library, which offers authentication support. It’s plausible that adding auth=('user', 'pass') to the code at the appropriate places is all that’s needed. I’m not running a Discourse site at the moment, so I can’t test it.
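With Requests, that change would look like requests.get(url, auth=('user', 'pass')) on each call. The standard-library sketch below shows the HTTP Basic header such a tuple amounts to; the helper name is hypothetical, and whether a given Discourse site accepts Basic authentication at all is an untested assumption, as the post says.

```python
import base64
from urllib.request import Request


def basic_auth_request(url, user, password):
    """Build a request carrying an HTTP Basic Authorization header."""
    # Basic auth is the base64 encoding of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})
```

The returned object can be passed to urllib.request.urlopen in place of a plain URL string.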

adrelanos · May 26, 2020, 1:53pm

httrack isn’t working for me. I’m using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is very promising, but long forum threads that span multiple pages come out incomplete. Clicking “page 2” doesn’t work. That is:

file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (it doesn’t download external resources), but
file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Maybe httrack can be told to “use print mode”?

example of the standard view of a forum thread
example of the print view of a forum thread (same URL, just add /print at the end)

Maybe httrack can be told to “append /print at the end”?

Is there a user agent setting that shows the whole forum thread on a single page? If not, could you add that feature? You’ve already implemented print mode, so most of the work is done. What’s missing is a user agent that serves crawlers the content generated for “print mode”. Alternatively, if you don’t like the idea of a custom user agent for this, how about an HTTP header or a cookie that could be used for this purpose?

The improved/forked ArchiveDiscourse by @kitsandkats is also broken for me.

Could you consider implementing /print for the front page and category pages as well?

Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3:

Temporarily disabling infinite scrolling (for some user agents) would make it possible to archive Discourse with the web archiving tool httrack.

saper · January 31, 2021, 12:30pm

Python requests will automatically use .netrc for authentication if needed (but you need to get an HTTP 401 response).

brechtm · March 1, 2021, 6:09pm

I’ve had good results with wget, including authentication. Described here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14
