A basic Discourse archival tool

mcmcclur · May 12, 2017, 8:20pm

Archival tool updated with Codex, May 2026

It seems to be quite complicated to save an entire Discourse site as a static version. According to this post by Jeff Atwood, it's “much harder than you think”. It also doesn't seem to be a priority for the Discourse team, which is perfectly understandable.

For my purposes, however, I realized that I really needed some way to generate basic static HTML versions of my Discourse sites. I've been using Discourse for a few years as a discussion forum for the college math classes I teach, so every few months I retire a site or two and start a new one or two. The discussions on the retired sites obviously have value, so I really needed some way to save them. In the end, I decided to write my own tool.

The basic idea is simple: use the Discourse API to crawl the site, grab the rendered version of each post, and turn it into HTML. The tool is geared mainly toward my own needs as a college math teacher who uses small Discourse forums to support his classes. Thus mathematical content such as f(x)=e^{-x^2} should be typeset automatically with MathJax V4, and fenced code blocks tagged sage are converted into live Sage Cells.
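As a rough illustration of that idea (not the actual tool), the sketch below walks the first page of /latest.json, fetches each topic's JSON, and glues the rendered HTML of each post (the `cooked` field in the API response) into a bare page. The function names and the `fetch` parameter are hypothetical, chosen for this example only; MathJax and Sage Cell handling are left out.

```python
import html
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def topic_to_html(topic):
    """Turn one topic's JSON into a minimal static HTML page."""
    parts = ["<h1>%s</h1>" % html.escape(topic["title"])]
    for post in topic["post_stream"]["posts"]:
        parts.append("<h2>%s</h2>" % html.escape(post["username"]))
        # "cooked" is already HTML rendered by Discourse, so it is not escaped.
        parts.append(post["cooked"])
    return "<!DOCTYPE html>\n<html><body>\n%s\n</body></html>" % "\n".join(parts)


def archive_latest(base_url, fetch=fetch_json):
    """Return {topic_id: html_page} for the topics on the first /latest page."""
    listing = fetch(urljoin(base_url, "latest.json"))
    pages = {}
    for summary in listing["topic_list"]["topics"]:
        topic = fetch(urljoin(base_url, "t/%d.json" % summary["id"]))
        pages[summary["id"]] = topic_to_html(topic)
    return pages
```

Passing `fetch` as a parameter keeps the crawl logic separate from the network, so it can be tried against canned JSON before pointing it at a live forum.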

If you're interested, you can view

Note

The update to the archival tool was done largely with Codex.

codinghorror · May 12, 2017, 8:29pm

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

Falco · May 12, 2017, 8:30pm

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.
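For anyone scripting this rather than using wget, a minimal sketch of requesting a page with a crawler user agent might look like the following. It uses only the Python standard library; the function names are made up for this example, and "Googlebot" is simply the user agent string used elsewhere in this topic.

```python
from urllib.request import Request, urlopen


def crawler_request(url, user_agent="Googlebot"):
    """Build a request that identifies itself as a crawler."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_basic_html(url):
    """Fetch the page as a crawler would see it and return the decoded HTML."""
    with urlopen(crawler_request(url)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```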

mcmcclur · July 19, 2017, 3:01am

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

My code explicitly supports MathJax, which is essential for my work.
(I’ll probably need to update my code to work with the new MathPlugin sometime)
I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if not downloaded.

Silvanus · July 19, 2017, 1:08pm

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums ran (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

I scan the phpbb2 database into Discourse (there’s a migration tool)
I create a static HTML archive using Discourse.
I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

mcmcclur · July 19, 2017, 1:43pm

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

Silvanus · July 19, 2017, 2:14pm

I’m still in the phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. Been trying to install phpbb3 for a while now, but I get some weird problems all the time.

I’ll have to try that httrack, thanks.

mcmcclur · July 19, 2017, 4:18pm

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

I definitely like my version better; no surprise there because I designed it the way I want it to look.
The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
I think it might make sense to start httrack at a subpage to generate something like this.
It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"

The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.
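If the command needs to be run from a script (for example, from a scheduled job), one possible way to assemble it in Python is sketched below. The helper name and its parameters are hypothetical; the flags are exactly the ones listed above and nothing else is assumed about httrack.

```python
import shlex


def httrack_command(site_url, output_dir, max_bytes=10_000_000, user_agent="Googlebot"):
    """Assemble the httrack argument list used in the post above."""
    base = site_url.rstrip("/")
    return [
        "httrack", base,
        "-" + base + "/users*",   # skip user pages
        "-*.rss",                 # skip RSS feeds
        "-O", output_dir,         # output directory
        "-x", "-o",               # flags copied from the post; see its description
        "-M%d" % max_bytes,       # cap on total bytes downloaded
        "--user-agent", user_agent,
    ]


command = httrack_command("https://uskojarukous.fi", "uskojarukous_arxiv")
print(shlex.join(command))
# To actually run it (requires httrack to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting problems with the `*` patterns when it is handed to `subprocess.run`.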

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The few lines of my first input look like so:

import os
from datetime import date

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'

But again, there should be some changes made to the code to get it to work for non-mathematical forums.
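To make the role of max_more_topics concrete, here is a small sketch of paging through /latest.json. It assumes k runs from 0 up to and including the bound and that an empty page means there is nothing more to fetch; the real tool may count differently. The `fetch` argument is a hypothetical stand-in for whatever function downloads and decodes the JSON.

```python
from urllib.parse import urljoin

max_more_topics = 2  # value quoted in the post


def collect_topic_summaries(base_url, fetch, max_pages=max_more_topics):
    """Gather topic summaries from /latest.json?page=k for k = 0..max_pages."""
    topics = []
    for k in range(max_pages + 1):
        listing = fetch(urljoin(base_url, "latest.json?page=%d" % k))
        page_topics = listing["topic_list"]["topics"]
        if not page_topics:
            break  # ran out of topics before hitting the bound
        topics.extend(page_topics)
    return topics
```

Raising the bound archives more of the forum at the cost of more requests; a small value such as 2 gives only a partial archive, which matches the "partial archive" wording in the blurb above.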

Silvanus · July 19, 2017, 4:35pm

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

mcmcclur · July 19, 2017, 4:39pm

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

Silvanus · July 19, 2017, 10:42pm

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format!

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive…

Antroden · October 18, 2018, 2:25pm

Just a few tips for anyone else looking for httrack hints (it works very well for my purposes).

A complete list of command-line flags: HTTrack Website Copier - Offline Browser
Using the -s0 flag ignores the robots.txt file (if you have an account that isn't accessible to spiders)
If your site is behind a login, you can download a .txt file with your cookies (after logging in) using a Chrome extension such as cookies.txt and place it in the directory where you run httrack.

I'm using httrack via cron to create an offline archive of our Discourse site. However, the user that httrack logs in as is counted as a “view” on every topic, which produces heavily inflated view counts for each topic (the cron job runs every hour).

Is there any way to exclude a particular user from being recorded in the view statistics / the site-wide statistics?

codinghorror · October 18, 2018, 8:22pm

Good point, where would this be intercepted @sam?

sam · October 19, 2018, 12:48am

We have this method for tracking page views:

github.com/discourse/discourse

app/controllers/topics_controller.rb

f0af61da4

def should_track_visit_to_topic?
  !!((!request.format.json? || params[:track_visit]) && current_user)
end

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

github.com/discourse/discourse

app/models/topic_view_item.rb

f0af61da4

# Only store a view once per day per thing per (user || ip)

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.
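To see what the once-per-day rule means for an hourly crawler, here is a toy model in Python. It is not Discourse's code; the class and its fields are invented for illustration, and it only mirrors the rule stated in the comment above (one view per day per topic per user or IP).

```python
from datetime import date


class ViewCounter:
    """Toy model: count one view per day per topic per (user or IP)."""

    def __init__(self):
        self.seen = set()
        self.views = {}

    def record(self, topic_id, day, user_id=None, ip=None):
        viewer = ("user", user_id) if user_id is not None else ("ip", ip)
        key = (topic_id, day, viewer)
        if key in self.seen:
            return False  # already counted today
        self.seen.add(key)
        self.views[topic_id] = self.views.get(topic_id, 0) + 1
        return True


counter = ViewCounter()
for hour in range(24):  # an hourly cron job hitting the same topic for one day
    counter.record(topic_id=42, day=date(2018, 10, 18), user_id=7)
for day in range(19, 26):  # one visit per day over the following week
    counter.record(topic_id=42, day=date(2018, 10, day), user_id=7)
print(counter.views[42])
```

Under this model the 24 hourly visits collapse into a single view, so the crawler adds about one view per topic per day rather than one per run.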

kamcc · January 15, 2019, 9:57pm

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing.

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: GitHub - kitsandkats/ArchiveDiscourse: Code for archiving a Discourse site into static HTML, forked from @mcmcclur’s original repo and stored as a Python file instead of a Jupyter notebook.
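For readers wondering how posts beyond the first batch can be reached, here is one possible sketch; it is not necessarily how the fork does it. It relies on the topic JSON listing every post id under post_stream.stream while post_stream.posts holds only the first batch, and on the /t/{id}/posts.json endpoint accepting post_ids[] parameters. The function name and chunk size are assumptions made for this example.

```python
from urllib.parse import urlencode, urljoin


def remaining_post_urls(base_url, topic, chunk_size=20):
    """Build URLs that fetch the posts missing from a topic's first JSON page."""
    topic_id = topic["id"]
    loaded = {post["id"] for post in topic["post_stream"]["posts"]}
    # "stream" lists every post id in the topic, in order.
    missing = [pid for pid in topic["post_stream"]["stream"] if pid not in loaded]
    urls = []
    for start in range(0, len(missing), chunk_size):
        chunk = missing[start:start + chunk_size]
        query = urlencode([("post_ids[]", pid) for pid in chunk])
        urls.append(urljoin(base_url, "t/%d/posts.json?%s" % (topic_id, query)))
    return urls
```

Each returned URL can then be fetched like any other JSON page, and the posts it contains appended to the ones already loaded.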

I’m very happy with how it turned out. Thanks again!

johnnyboi5858 · December 4, 2019, 6:15am

Hi, I just read through this whole topic and wanted to check whether this tool works if the Discourse forum is behind a username and password. How would I edit the code to let me archive the site?

mcmcclur · December 4, 2019, 1:05pm

As the code is currently written, it isn't designed to access any material that requires a login. Setting that up should be fairly simple, though. The code interacts with the Discourse site through the Python Requests library, which supports authentication. It's plausible that adding auth=('user', 'pass') to the code at the appropriate points is all that's needed. I'm not running a Discourse site at the moment, so I can't test this right now.
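For context, passing auth=('user', 'pass') to Requests sends HTTP Basic credentials, that is, an Authorization header built from the user name and password. The sketch below builds the same header with the standard library only, so it shows what goes over the wire without needing Requests installed; the helper names are hypothetical. Whether a given forum accepts Basic authentication is an assumption: it would help where a Basic password prompt sits in front of the site, while Discourse's own API is normally authenticated with an API key instead.

```python
import base64
from urllib.request import Request


def basic_auth_header(user, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def authenticated_request(url, user, password):
    """Build a request carrying the same header that auth=(user, password) adds."""
    return Request(url, headers={"Authorization": basic_auth_header(user, password)})
```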

adrelanos · May 26, 2020, 1:53pm

httrack doesn't work for me. I'm using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is quite promising, but long forum threads with multiple pages come out incomplete. Clicking “page 2” doesn't work. That is:

file:///home/user/Meus%20Sites%20Web/my-forums/my-forum.org/t/titulo-do-fio-do-fórum/83394658.html looks very good (it doesn't fetch external resources), but
file:///home/user/Meus%20Sites%20Web/my-forums/my-forum.org/t/titulo-do-fio-do-fórum/83394658.html?page=2 is broken.

Any suggestions?

Maybe it's possible to tell httrack to “use print mode”?

example of the standard forum thread view
example of the print view of a forum thread: same URL, only /print appended at the end

Maybe it's possible to tell httrack to “append /print at the end”?
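Outside of httrack, rewriting topic URLs to their print view is easy to script. The sketch below is only an illustration of the "same URL plus /print" rule described above; the function name is made up, and it assumes the input is a plain topic URL of the form /t/slug/id, possibly with a ?page=N query string that should be dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def print_url(topic_url):
    """Return the print-view URL for a topic URL by appending /print."""
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/print"):
        path += "/print"
    # Drop any query string (such as ?page=2) and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

A list of such URLs could then be fed to a downloader one by one, so each thread is saved as a single page.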

Is there a user agent setting that shows the whole forum thread on a single page? If not, could you add that feature? You've already implemented print mode, so most of the work is done. What's missing is a user agent that causes the “print mode” content to be served to the crawler. Alternatively, if you don't like the idea of a custom user agent for this, how about an HTTP header or a cookie that could be used for the same purpose?

The improved ArchiveDiscourse fork by @kitsandkats is also broken for me.

Could you consider implementing /print for the home page / category pages as well?

Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3

(Temporarily) disabling infinite scrolling (for some user agents) would make it possible to archive Discourse with the httrack web archiving tool.

saper · January 31, 2021, 12:30pm

Python Requests will automatically use .netrc for authentication if needed (but it has to receive an HTTP 401 response).
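For reference, a .netrc entry is a line of the form `machine HOST login USER password PASS`. The sketch below reads such an entry with Python's standard netrc module, which is a quick way to check that a file is well-formed before relying on it. The host name, credentials, and helper name are placeholders, and a throwaway file is used instead of the real ~/.netrc.

```python
import netrc
import os
import tempfile


def lookup_credentials(host, netrc_path=None):
    """Return (login, password) for host from a netrc file, or None if absent."""
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Demonstration with a throwaway file instead of the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "netrc")
    with open(demo_path, "w") as handle:
        handle.write("machine forum.example login archiver password s3cret\n")
    demo = lookup_credentials("forum.example", demo_path)
    missing = lookup_credentials("other.example", demo_path)
```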

brechtm · March 1, 2021, 6:09pm

I've had good results with wget, including authentication. Described here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14
