A basic Discourse archival tool

mcmcclur · 12 May, 2017 20:20

Archival tool updated with Codex in May 2026

It appears to be quite complicated to save a complete Discourse site as a static version. According to this post by Jeff Atwood, it's "much harder than you would think". Nor does this appear to be a priority for the Discourse team, which is perfectly understandable.

For my purposes, though, I found that I really needed some way to generate basic static HTML versions of my Discourse sites. I've been using Discourse for a couple of years as a discussion board when teaching my college math classes so, every few months, I retire a site or two and start one or two new ones. Obviously, the discussions on the retired sites have value, so I really needed some way to save them. Ultimately, I decided to create my own tool.

The basic idea is simple: use the Discourse API to traverse the site, grab the cooked version of each post, and transform it into HTML. The tool is focused mainly on my own needs as a college math professor who uses small Discourse forums to support my math classes. Thus, mathematical content such as f(x)=e^{-x^2} should be typeset automatically with MathJax V4, and fenced code blocks tagged as sage are translated into live Sage cells.
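As a rough illustration of that idea (not the tool's actual code), the sketch below turns the JSON for a single topic into a bare HTML page. It assumes the topic JSON has the shape Discourse returns from /t/<id>.json, with a title and a post_stream.posts list whose entries carry username and cooked fields. The helper name render_topic and the sample data are hypothetical.

```python
import html


def render_topic(topic):
    """Build a minimal static HTML page from one topic's JSON (as a dict)."""
    title = html.escape(topic.get("title", "Untitled topic"))
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
    ]
    for post in topic.get("post_stream", {}).get("posts", []):
        author = html.escape(post.get("username", "unknown"))
        # "cooked" is already rendered HTML, so it is inserted as-is.
        parts.append(f"<article><h2>{author}</h2>{post.get('cooked', '')}</article>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical sample data standing in for a real API response.
sample = {
    "title": "Gaussians & friends",
    "post_stream": {"posts": [{"username": "mcmcclur", "cooked": "<p>Hello</p>"}]},
}
page = render_topic(sample)
```

A real archive would also need to add the MathJax script and rewrite sage code blocks, which this sketch leaves out.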

If you're interested, you can take a look at

A small portion of Discourse Meta,
The forum for my Math for Machine Learning class, and/or
The GitHub repository.

Note

The update to the archival tool was done largely with Codex.

codinghorror · 12 May, 2017 20:29

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

Falco · 12 May, 2017 20:30

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.

mcmcclur · 19 July, 2017 03:01

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

My code explicitly supports MathJax, which is essential for my work.
(I’ll probably need to update my code to work with the new MathPlugin sometime)
I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if they were not downloaded.

Silvanus · 19 July, 2017 13:08

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums were (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

I scan the phpbb2 database into Discourse (there’s a migration tool)
I create a static HTML archive using Discourse.
I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

mcmcclur · 19 July, 2017 13:43

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

Silvanus · 19 July, 2017 14:14

I’m still in phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. I've been trying to install phpbb3 for a while now, but I get some weird problems all the time.

I’ll have to try that httrack, thanks.

mcmcclur · 19 July, 2017 16:18

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

I definitely like my version better; no surprise there because I designed it the way I want it to look.
The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
I think it might make sense to start httrack at a subpage to generate something like this.
It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"

The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.
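For anyone who wants to drive that command from a script, here is a minimal sketch that assembles the same arguments in Python. It uses only the flags listed above. The helper names build_httrack_command and run_httrack are hypothetical, and the dry-run default is a choice made for this example. Passing the arguments as a list, without a shell, also keeps the * patterns from being expanded by the shell.

```python
import subprocess


def build_httrack_command(site_url, output_dir, max_bytes=10_000_000):
    """Assemble the httrack arguments described above as a list."""
    base = site_url.rstrip("/")
    return [
        "httrack",
        base,
        f"-{base}/users*",   # filter: skip user pages
        "-*.rss",            # filter: skip RSS feeds
        "-O", output_dir,    # output directory
        "-x", "-o",          # replace external links and errors with local files
        f"-M{max_bytes}",    # cap on the total number of bytes downloaded
        "--user-agent", "Googlebot",
    ]


def run_httrack(command, dry_run=True):
    """Print the command, and only execute it when dry_run is False."""
    print(" ".join(command))
    if dry_run:
        return None
    # Requires httrack to be installed and on the PATH.
    return subprocess.run(command, check=True)


command = build_httrack_command("https://uskojarukous.fi", "uskojarukous_arxiv")
run_httrack(command)  # dry run: prints the command without executing it
```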

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The few lines of my first input look like so:

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'
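To make the role of that bound concrete, here is a hedged sketch of paging through the topic list. It is not the tool's own code. It assumes that pages are numbered from 0, that each response carries its topics under topic_list.topics, and that a more_topics_url key is present only when another page exists. Whether the tool treats max_more_topics as an inclusive bound is also an assumption. The fetch_json argument is a hypothetical callable that maps a path to decoded JSON, so the demo runs against canned data.

```python
def iter_topic_ids(fetch_json, max_more_topics=2):
    """Yield topic ids from /latest.json?page=k for k = 0..max_more_topics."""
    seen = set()
    for k in range(max_more_topics + 1):
        data = fetch_json(f"/latest.json?page={k}")
        topic_list = data.get("topic_list", {})
        for topic in topic_list.get("topics", []):
            # Guard against a topic showing up on more than one page.
            if topic["id"] not in seen:
                seen.add(topic["id"])
                yield topic["id"]
        if "more_topics_url" not in topic_list:
            break  # assumed signal that there are no further pages


# Canned responses standing in for a live forum.
pages = {
    "/latest.json?page=0": {
        "topic_list": {
            "topics": [{"id": 1}, {"id": 2}],
            "more_topics_url": "/latest?page=1",
        }
    },
    "/latest.json?page=1": {"topic_list": {"topics": [{"id": 2}, {"id": 3}]}},
}
topic_ids = list(iter_topic_ids(pages.__getitem__))
```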

But again, there should be some changes made to the code to get it to work for non-mathematical forums.

Silvanus · 19 July, 2017 16:35

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

mcmcclur · 19 July, 2017 16:39

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

Silvanus · 19 July, 2017 22:42

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format!

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive…

Antroden · 18 October, 2018 14:25

Just a few notes for anyone looking for httrack tips (it works very well for my purposes).

A complete list of the command line flags: HTTrack Website Copier - Offline Browser
Using the -s0 flag ignores robots.txt (if you have an account that cannot be crawled by spiders).
If your site is behind a login, you can download a .txt file of the cookies (once you have logged in) using a Chrome extension such as cookies.txt and place it in the directory that you run httrack from.

I'm using httrack via cron to create an offline archive of our Discourse site. However, the user that logs in under httrack is counted as a "view" for every topic, which gives super-inflated view counts for each topic (the cron job runs every hour).

Is there any way to exclude a given user from being recorded in the view stats, or in the site stats in general?

codinghorror · 18 October, 2018 20:22

Good point, where would this be intercepted @sam?

sam · 19 October, 2018 00:48

We have this method for tracking page views:

github.com/discourse/discourse, app/controllers/topics_controller.rb (f0af61da4):

  def should_track_visit_to_topic?
    !!((!request.format.json? || params[:track_visit]) && current_user)
  end

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

github.com/discourse/discourse, app/models/topic_view_item.rb (f0af61da4):

  # Only store a view once per day per thing per (user || ip)

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.

kamcc · 15 January, 2019 21:57

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: kitsandkats/ArchiveDiscourse on GitHub (code for archiving a Discourse site into static HTML), forked from @mcmcclur’s original repo and stored as a Python file instead of a Jupyter notebook.

I’m very happy with how it turned out. Thanks again!
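One way the "grab all posts, not just the first twenty" step could work is sketched below. This is not the code from the fork. It assumes the topic JSON lists every post id under post_stream.stream while post_stream.posts holds only the first batch, and that the remaining posts can be requested from /t/<id>/posts.json with post_ids[] query parameters. The fetch_json callable and the fake responses are hypothetical stand-ins for real HTTP calls.

```python
def fetch_all_posts(fetch_json, topic_id, chunk_size=20):
    """Return every post in a topic, fetching the ones missing from the first batch."""
    topic = fetch_json(f"/t/{topic_id}.json")
    stream = topic["post_stream"]
    posts = list(stream["posts"])
    have = {post["id"] for post in posts}
    missing = [pid for pid in stream.get("stream", []) if pid not in have]
    # Request the remaining posts a chunk at a time to keep URLs short.
    for start in range(0, len(missing), chunk_size):
        chunk = missing[start:start + chunk_size]
        query = "&".join(f"post_ids[]={pid}" for pid in chunk)
        extra = fetch_json(f"/t/{topic_id}/posts.json?{query}")
        posts.extend(extra["post_stream"]["posts"])
    return posts


def fake_fetch(path):
    """Canned responses: topic 7 has five posts, two of them in the first batch."""
    if path == "/t/7.json":
        return {"post_stream": {"posts": [{"id": 1}, {"id": 2}],
                                "stream": [1, 2, 3, 4, 5]}}
    # Parse the requested ids back out of the query string.
    ids = [int(part.split("=")[1]) for part in path.split("?")[1].split("&")]
    return {"post_stream": {"posts": [{"id": i} for i in ids]}}


all_posts = fetch_all_posts(fake_fetch, 7, chunk_size=2)
```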

johnnyboi5858 · 4 December, 2019 06:15

Hi, I just read through this whole thread and wanted to ask whether this tool works if the Discourse forum is behind a login and password. How should I edit the code so that it lets me archive the site?

mcmcclur · 4 December, 2019 13:05

As currently written, the code is not designed to access any material that requires a login. It should be fairly easy to set up, though. The code interacts with the Discourse site through the Python Requests library, which supports authentication. It's plausible that adding auth=('user', 'pass') to the code at the appropriate points is all that is needed. I'm not running a Discourse site at the moment, so I can't test it.
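A minimal sketch of what that change might look like follows; it is untested against a live forum. The auth tuple is passed straight to requests.get, which covers HTTP Basic auth, for example a site placed behind a web server password prompt. As far as I know, Discourse's own API authenticates with Api-Key and Api-Username request headers instead, so the sketch accepts headers as well. The function name fetch_json, the injectable get parameter, and every URL and credential shown are placeholders.

```python
def fetch_json(url, auth=None, headers=None, get=None):
    """GET a URL and decode its JSON body, optionally with credentials."""
    if get is None:
        import requests  # third-party; imported lazily so a stub can be injected
        get = requests.get
    response = get(url, auth=auth, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on 401/403 instead of parsing an error page
    return response.json()


# HTTP Basic auth, as suggested above (placeholder credentials):
# data = fetch_json("https://forum.example.com/latest.json", auth=("user", "pass"))
#
# Discourse API key headers (placeholder values):
# data = fetch_json(
#     "https://forum.example.com/latest.json",
#     headers={"Api-Key": "YOUR_KEY", "Api-Username": "archiver"},
# )
```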

adrelanos · 26 May, 2020 13:53

httrack doesn't work for me. I am using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is quite promising, but long forum threads with multiple pages are incomplete. Once I click on "page 2", it doesn't work. That is:

file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (it doesn't fetch data from external resources), but
file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Maybe httrack could somehow be told to "use print mode"?

example of the standard forum discussion view
example of the print view of a forum discussion: same URL, with just /print added at the end

Maybe httrack could be told to "append /print at the end"?
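One workaround is to generate the print-view URLs up front and hand that list to whatever crawler is in use. The sketch below only does the URL rewriting. The helper name print_url is hypothetical, and it assumes the print view of a topic is reached by appending /print to the topic path, as in the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def print_url(topic_url):
    """Return the print-view URL for a topic URL by appending /print to its path."""
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/print"):
        path += "/print"
    # Drop any ?page=N query; this assumes the print view is not paginated that way.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


topic_urls = [
    "https://my-forums.org/t/forum-thread-title/83394658",
    "https://my-forums.org/t/forum-thread-title/83394658?page=2",
]
# Both inputs map to the same print URL, so a set removes the duplicate.
print_urls = sorted({print_url(url) for url in topic_urls})
```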

Is there a user agent setting that shows the entire forum thread on a single page? If not, could you add this feature? You have already implemented print mode. What's missing is a user agent that causes the content generated for "print mode" to be served to the crawler. Alternatively, if you don't like the idea of a custom user agent for this purpose, how about an HTTP header or a cookie that could be used for this?

ArchiveDiscourse, as improved/forked by @kitsandkats, is also broken for me.

Could you also consider implementing /print for the main page or category pages as well?

Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3

(Temporarily) disabling infinite scrolling (for some user agents) would make it possible to archive Discourse with the httrack web archiving tool.

saper · 31 January, 2021 12:30

Python requests will automatically use .netrc for authentication if needed (but it needs to receive an HTTP 401 response).

brechtm · 1 March, 2021 18:09

I've had good results with wget, including authentication. Described here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14
