A basic Discourse archival tool

(Mark McClure) #1

It seems that it’s pretty tricky to save an entire Discourse site as a static version. According to this post by Jeff Atwood, it’s “much harder than you’d think”. It doesn’t appear that this is a priority for the Discourse team, either, which is perfectly understandable.

For my purposes, though, I found that I really needed some way to generate basic, static HTML versions of my Discourse sites. I’ve been using Discourse for a couple of years now as a discussion board when teaching my college math classes so, every few months, I retire one or two sites and start one or two more. Obviously, the discussions on the retiring sites have value so I really needed some way to save them. Ultimately, I figured I’d build my own tool.

The basic idea is simple: Rather than scan the HTML and use the HTTP protocol to crawl the site, I figured I’d use the Discourse API to crawl the site. You can view the result of applying the tool to this Discourse Meta on my webpage.

Before looking at it though, please temper your expectations. I’m a college math professor, not a professional web developer. And, while I’d like it to look pretty nice, I’m mainly interested in simplicity. My guess is that most folks here would consider this to be proof of concept, rather than a serious, working tool. Taking that into account, here are some features/limitations:

• The code grabs the site logo and places it in a fixed banner at the top. If no site logo is found, it uses the site logo at the top of meta by default.
• It uses the API to grab the topic list and generates a new page for each topic. You can limit the number of times you respond to more_topics_url.
• There is a single main page that links to those topics.
• MathJax is important for my needs so every page loads and configures MathJax.
• There is no other JavaScript, and no other plugins are considered.
• There are no user pages or category pages.
• It’s not very configurable without messing with the code directly.
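For anyone curious what the API crawl looks like in outline, here is a minimal sketch of the topic-list loop. It is not the tool's actual code. The names `crawl_topics`, `to_json_url`, and `max_pages` are made up for illustration, and it assumes the site serves its topic list at `/latest.json` with a `topic_list` object containing `topics` and an optional `more_topics_url`.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def to_json_url(url):
    """Put ".json" before the query string so the server answers with JSON."""
    path, sep, query = url.partition("?")
    if not path.endswith(".json"):
        path += ".json"
    return path + sep + query


def crawl_topics(base_url, max_pages=3, fetch=fetch_json):
    """Collect topic summaries, fetching at most max_pages topic-list pages."""
    topics = []
    url = urljoin(base_url, "/latest.json")  # assumed entry point
    for _ in range(max_pages):
        topic_list = fetch(url).get("topic_list", {})
        topics.extend(topic_list.get("topics", []))
        more = topic_list.get("more_topics_url")
        if not more:
            break  # no further pages advertised
        url = to_json_url(urljoin(base_url, more))
    return topics
```

Calling `crawl_topics("https://your.forum.example")` would then return a list of topic dictionaries, each of which can be rendered to its own static page. The `fetch` parameter is only there so the loop can be tried against canned data without touching the network.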

In spite of all the limitations, it’s sufficient for my needs and I’m rather happy with it. I have no particular plans to expand it, other than incrementally as needed. If anyone is interested, the code (which is Python) is available here:

Perhaps, someone will push it further or just be inspired by the idea?

Converting discourse topics to read only pages
Make Discourse play nice with the Wayback Machine
Migration away from Discourse
Archive an old forum "in place" to start a new Discourse forum
How do I export comlete forum as static html-pages
Recommended way to close and archive a Discourse forum itself?
(Jeff Atwood) #2

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

(Rafael dos Santos Silva) #3

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.
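As a rough illustration, requesting a page with a crawler user agent from Python might look like the sketch below. The helper names are hypothetical, and it assumes that a user agent containing `ia_archiver` is enough for the site to treat the request as a crawler; check the site's actual crawler pattern before relying on it.

```python
from urllib.request import Request, urlopen

# Assumption: this token is matched by the site's crawler detection.
CRAWLER_USER_AGENT = "ia_archiver"


def build_crawler_request(url, user_agent=CRAWLER_USER_AGENT):
    """Build a request that identifies itself with a crawler user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_basic_html(url):
    """Download the page as served to crawlers and return it as text."""
    with urlopen(build_crawler_request(url)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

`fetch_basic_html` performs a real network request, so it is best tried against a forum you administer.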

(Jeff Wong) #4

This is awesome! I’m pretty sure that being able to expose static, archivable HTML pages for consumption via the Wayback Machine would be a pretty big win, too.

I remember seeing some discussion about that at some point on meta. It seems to currently archive blank pages of Discourse forums. It’d be great to have a forum archive long after a site disappears.

(Jeff Atwood) #5

Not relevant, IMO.

That’s much more of a bug with the wayback machine; it can’t deal at all with JS-only sites. In that case it should be sending in a web spider header and it would get served 1996 era HTML pages it could understand.

(Mark McClure) #6

@Falco I was unaware of the _escaped_fragment_ param - thank you very much. It looks like it might be very easy to incorporate that into my code to reconstruct a static version of the site that matches the Discourse look and feel more closely. I might just work on that a bit next week.

Thanks again!

(Rafael dos Santos Silva) #7

Keep in mind that sending a user agent that matches this regex will make it even easier; you don’t need the fragment:

(Sawood Alam) #8

What do you mean by a web spider header? Is there a generic way of identifying a crawler as a web spider other than identifying based on a known set of user-agents?

(Jeff Atwood) #9

User agent header is the standard and expected way and has been since 1996 or so…

(Sawood Alam) #10

Thanks @codinghorror, I am aware of that. However, the user-agent does not advertise any classification (such as a crawler or a web browser) by itself. This means it is the responsibility of the web application (in this case, Discourse) to identify all web crawlers and react accordingly. I know ia_archiver was added to the list a while ago, but there are many more web archives than just the Internet Archive.

(Jeff Atwood) #11

Sorry, what is your point? I do not understand what point you are trying to make here?

Statistically speaking, there is not some wild explosion of important web spiders, there is just Google and a handful of barely relevant others. For example here’s a forum that is self hosted…

(Sawood Alam) #12

@codinghorror, let me begin by introducing myself. Web Science and Web Archiving is my field of research. When you used the term web spider header, I thought there was something magical out there that can tell a server, “hey, I am a web spider, please serve me the non-JS friendly version of your page” (other than the User-Agent header, which is not a “web spider” header, but a header to identify the user agent that may or may not be a web spider). I was not challenging anything; I was simply curious to know if there was actually a way to convey that.

Talking about the relevance of bots, the activity of a bot might not always be the only factor of relevance. Web archives, for example, do not crawl each site very frequently, but when they do, their purpose is often to preserve the site for ages to come as a historical record. If they fail to crawl the content of a Discourse forum just because the crawler software was not capable of executing the JS and performing various actions, and the forum software did not identify it as a limited-capability user-agent, then a fair portion of the conversation of our era would be lost as the original sites fade away with the passage of time.

The Internet Archive is not the only web archive in town. There are many other significant web archives. For example, many European countries have national archives (often mandated to archive their national domains), countries like Japan and Canada have their own web archives, and the Library of Congress has a web archive. Many of these archives, including the Internet Archive, use the Heritrix crawler, but each archive advertises its own User-Agent, which is not recognized by Discourse.

We also run a web archive aggregator, so we know that various smaller archives play a significant role in archiving our web history. The Internet Archive, being the biggest animal in its field, is also the easiest target for censorship. For example, it is banned in China and Russia, and it was banned for a brief period in India as well. If the US government decides one day to shut it down, then what we would be left with is the aggregation of those small web archives (which can’t capture Discourse sites properly).

I don’t know of a true solution to this, but I explained all this to show why I was curious when you used the term web spider header. Some archives are trying to implement more capable crawlers (for example, based on headless browsers) to execute JS and capture the deferred representation, but this process is very slow. One solution would be to identify about a dozen well-known and significant web archives and add their user-agent strings to the known bots’ list (and update it periodically as more web archives come to life). Another solution would be to ask web archives to include a common term in their user-agent string to identify that they are web archives with limited rendering capability, and hence that a pre-rendered response is desired.

(Rafael dos Santos Silva) #13

If you have a good user agent that is used a lot in the wild in archiving tools, please send a PR adding it to our regex.

(Eli the Bearded) #14

I checked. ia_archiver, the one from Internet Archive, is already in there.

(Sawood Alam) #15

I will try to convince various web archives to add a common term in their user-agent along with their specific identifier. Then we can target that common term. Alternatively, I can find out user-agents of all well-known archives, but that would be something we will have to keep updating as new archives come.

This makes me feel that the crawler detection pattern would work better as a configuration option than as a hard-coded string. I imagine something like a couple of fields in the admin control panel that allow us to list user agents to be blocked and user agents to be treated as static crawlers. The latter can be pre-populated with all well-known user agents while giving admins the option to pick and choose as they see fit. This, for example, would allow an Estonian Discourse forum to add the Estonian Archive’s user-agent to the crawlers’ field and be preserved well by their national web archive. Also, one could add some test user-agents to the crawlers’ field and test crawling locally, without pretending to be one of the well-known search engine bots.

(Eli the Bearded) #16

Almost all web robots would be caught by:

/bot|spider|[Cc]rawler|curl|wget|libwww/

The first one covers (in my estimation) 90% of them, with basically no false positives. The last three are for lazier bots that use common libraries. Then there’s stuff like SiteSnagger, which maybe they don’t want to accommodate.
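To see how that pattern behaves, here is a small sketch that runs it against a few user-agent strings. The sample strings are illustrative, and the helper name `looks_like_robot` is made up for this example.

```python
import re

# The pattern from the post, unchanged (note that it is case-sensitive).
ROBOT_PATTERN = re.compile(r"bot|spider|[Cc]rawler|curl|wget|libwww")


def looks_like_robot(user_agent):
    """Return True if the user agent matches the robot pattern."""
    return ROBOT_PATTERN.search(user_agent) is not None


samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "curl/7.88.1",
    "Wget/1.21.3",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

for ua in samples:
    print(looks_like_robot(ua), ua)
```

One thing the run shows: Wget identifies itself with a capital W, so the lowercase `wget` alternative does not catch it unless the pattern is compiled with `re.IGNORECASE`.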

(Jeff Atwood) #17

The three letters “bot” appearing anywhere in the user agent seems really dangerous and ill advised to me.

(Sawood Alam) #18

A more generic approach would be to flip the crawler detection logic. Rather than finding a match for potential crawlers (using a known list), consider every user-agent as non-capable (static site friendly) by default. If the user-agent string contains the name of one of the well-known web browsers or other user-agents that are known to be capable of executing JS, then serve the rich content; otherwise, serve the static one. This works because the list of capable user-agents is more deterministic and finite than the list of incapable ones. This solution would be nicer not only for web archives, but for any other script or crawler, for that matter.
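A minimal sketch of that flipped check, to make the idea concrete. The token list and the function name `wants_rich_version` are assumptions for illustration only; a real list would need to be curated, since some crawlers also include browser tokens in their user-agent strings.

```python
# Hypothetical allowlist of substrings that mark a JS-capable browser.
CAPABLE_TOKENS = ("Firefox/", "Chrome/", "Safari/", "Edg/", "OPR/")


def wants_rich_version(user_agent):
    """Serve the JS application only to allowlisted agents; static HTML otherwise."""
    user_agent = user_agent or ""  # a missing header counts as non-capable
    return any(token in user_agent for token in CAPABLE_TOKENS)
```

With this default, an unknown archive crawler, a bare `curl` request, or a request with no User-Agent at all would receive the static pages without anyone having to add it to a list first.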

As far as the Internet Archive’s current “Save Page Now” feature is concerned, I am hopeful that they will either change the user-agent hijacking or use corresponding headless browsers for high-fidelity archiving. Their regular Heritrix crawler (along with many other web archives) sends a unique user-agent.

(Gerhard Schlager) #19

That’s very optimistic and assumes that the user or a proxy server didn’t change the user-agent. That has the potential of worsening the user experience for a lot of power users. I wouldn’t want to face the angry mob.

(Sawood Alam) #20

On the contrary, I think general-purpose web proxies would rather not change the user-agent, and if they do, they would prefer to change it to something that mimics a web browser, because they know some web applications reject requests coming from unknown user-agents. I have experienced many sites that would return a splash page or reject the connection when fetched using curl or a Python library without setting a custom user-agent that contains something like “Mozilla”.

Your concern is legitimate, but it can be addressed by log analysis of a busy site. For example, remove all the requests that contain the name of one of the well-known browsers, crawlers, and other popular user agents, then plot the histogram of the remaining user agents to identify potential proxies. Those can be added to the capable user-agents list. I still believe that capable user-agents are more limited in number and deterministic than otherwise.
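The log analysis could be sketched roughly as follows, assuming the user-agent strings have already been extracted from the access log into a list. The token list and the function name `unknown_agent_histogram` are hypothetical.

```python
from collections import Counter

# Hypothetical substrings for agents that are already classified.
KNOWN_TOKENS = ("Firefox", "Chrome", "Safari", "bot", "spider", "curl")


def unknown_agent_histogram(user_agents, known_tokens=KNOWN_TOKENS):
    """Count user agents that match none of the known tokens, most frequent first."""
    lowered = [token.lower() for token in known_tokens]
    counts = Counter(
        ua for ua in user_agents
        if not any(token in ua.lower() for token in lowered)
    )
    return counts.most_common()
```

The most frequent entries in the result are the candidates to inspect by hand and, where appropriate, add to the capable list.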

Alternatively, to deal with such false negatives, we can add a small piece of JS code to the static response that, if executed, would inform the server to serve the rich version. This will trigger only one redirect for the base page, and the rest of the experience will happen via Ajax/Fetch as usual.