Make Discourse play nice with the Wayback Machine

dandv · October 16, 2015, 2:39am

I’ve noticed that when archiving Discourse topics using archive.org, the CSS is mangled. Here is an example:

codinghorror · October 16, 2015, 3:23am

Cannot be resolved, as the Wayback Machine does not understand and cannot archive pure JavaScript sites at all. Either you serve it plain crawler HTML, or it will show nothing.

You may want to raise this with them and ask whether they can make their crawler archive pure JavaScript sites, but that requires an extremely advanced crawler that runs headless browsers. Pure text curl retrieval is no longer sufficient; they have to retrieve the JS and execute it.

riking · October 16, 2015, 3:50am

orrrrr we could build a topic view that looks like this (from my AMPproject.org experiment branch)

This layout is both extremely easy to crawl, and works with both wide and narrow screens.

However, I seem to have forgotten the topic’s category.

codinghorror · October 16, 2015, 4:35am

Very strongly opposed to adding another renderer, one that has to be kept in sync with the primary JavaScript renderer, plus the crawler 1996 HTML renderer we already have.

riking · October 16, 2015, 6:27am

The way I have it implemented, this replaces the crawling view outright on topics.

FSanches · November 25, 2016, 6:58pm

May I suggest we focus back on the original issue here?

There are people who want to use the Internet Archive Wayback Machine (myself included) to keep copies of Discourse threads for posterity, and right now that seems to be impossible with the current implementation (of both Discourse and Wayback).

What are the things that can be done to the Discourse codebase that can improve this? And what are the things that can be done on the Archive’s crawling system that could help as well?

codinghorror · November 25, 2016, 7:00pm

I believe the CSS should look a lot better with @falco’s latest changes in that area. Can you quickly check using the archive.org tools @falco?

Falco · November 25, 2016, 7:05pm

I think all we need is to detect the archive.org UA as a crawler; let me take a look.
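
A minimal sketch of what user-agent-based detection could look like, written in Python rather than taken from Discourse's own code. The token list and the names `is_crawler` and `choose_view` are assumptions for illustration; `archive.org_bot` is the token the Internet Archive's crawler is commonly reported to send, but check real server logs before relying on it.

```python
# Hypothetical token list for illustration; not Discourse's actual crawler list.
CRAWLER_TOKENS = ("googlebot", "bingbot", "archive.org_bot")


def is_crawler(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def choose_view(user_agent: str) -> str:
    """Pick the plain HTML view for crawlers and the JavaScript app otherwise."""
    return "crawler_html" if is_crawler(user_agent) else "js_app"
```

A check like this only helps if the archiving request identifies itself; a request that carries an ordinary browser user agent would still get the JavaScript app.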

FSanches · November 25, 2016, 7:06pm

I actually experienced a server error while attempting to save a Discourse page on the Wayback Machine today:

https://web.archive.org/save/_embed//t/recuperando-dados-de-disquetes-antigos/35

This was the original URL:

And I was trying to save it this way:

Falco · November 25, 2016, 7:58pm

OK, this is kinda strange.

They are using the User Agent of the user who asked to save the page to download the HTML, and after that they try to show the page with a bunch of injected JS.

This is what this topic looks like when I click the save button using the Googlebot user agent: Make Discourse play nice with the Wayback Machine - feature - Discourse Meta

Sent an email to archive.org, let’s see.

ibnesayeed · December 2, 2016, 10:32pm

That’s not enough actually, because there are many other web archives in town and new web archives come to life every now and then.

ibnesayeed · December 2, 2016, 10:56pm

There are many efforts at headless-browser-based archiving, including http://archive.is/, which is an on-demand single-page archiving system. It renders the page using PhantomJS and then archives the rendered DOM plus the necessary assets. However, doing this on a massive scale (not just for on-demand pages) takes a lot of time, because PhantomJS or any other renderer is orders of magnitude slower than traditional vanilla crawlers such as Heritrix, which is used by the Internet Archive and many other web archives.

Here is a relevant piece of research on the topic: "Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations". Below is a blog post summarizing the research work and related resources.

Falco · December 2, 2016, 11:38pm

Actually that didn’t work because they aren’t respecting the user agent.

Discourse is already crawler friendly; you just need to ask for the crawler version.
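
One way to ask for the crawler version is to send a crawler user agent with the request. A minimal Python sketch using only the standard library: the Googlebot string is an example value, the function names are made up for illustration, and whether a given Discourse instance treats that string as a crawler is an assumption.

```python
import urllib.request

# Example crawler user-agent string in Googlebot's published format.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_crawler_request(url: str, user_agent: str = GOOGLEBOT_UA) -> urllib.request.Request:
    """Build a GET request that identifies itself as a crawler."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_crawler_html(url: str) -> str:
    """Fetch the page with a crawler user agent and return the decoded HTML."""
    with urllib.request.urlopen(build_crawler_request(url), timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_crawler_html` with a topic URL should return the plain HTML rendering if the server recognizes the user agent as a crawler.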

grahamperrin · May 14, 2017, 11:50am

Attempts by me to save this topic result in:

I cannot describe the procedure, or the end result, as playing nicely.

codinghorror · May 14, 2017, 2:28pm

We should follow up and see if we can improve this somehow @falco, perhaps in the next few weeks or early after 1.9 begins beta.

Falco · May 15, 2017, 9:31pm

The problem, as described above, is that the Wayback Machine is not being nice: they hijack the user-agent string of the user who requested the save, making it impossible for us to serve a non-JS crawler version to them.

I tried their e-mail but got 0 responses.

ibnesayeed · May 15, 2017, 9:52pm

@Falco, the Wayback machine is the archival replay system. The crawler used at Internet Archive is Heritrix. That said, would you mind telling me the exact user-agent string you are seeing from them? I might be able to approach some people at Internet Archive on personal channels and see what’s going on.

Falco · May 15, 2017, 9:56pm

Wow, this is amazing!

It’s like I said here:

For example, if you are using Firefox 52 and click on save, they will use your user agent (the Firefox 52 user agent) when requesting the page.

If they add something that can distinguish these requests, we can serve the correct page.
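
To make that concrete: if the save tool sent some marker alongside the borrowed browser user agent, the server could branch on it. The header name `X-Archive-Request` below is purely hypothetical (no archive is known to send it), as is the function name; this is a Python sketch of the idea, not Discourse's implementation.

```python
# Purely hypothetical marker header; no archive is known to send this.
ARCHIVE_MARKER_HEADER = "x-archive-request"


def wants_crawler_view(headers: dict) -> bool:
    """Return True if the request carries the hypothetical archive marker."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return ARCHIVE_MARKER_HEADER in normalized


# A save request that reuses the visitor's browser user agent but adds the marker.
save_request = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "X-Archive-Request": "1",
}
```

With a marker like this, the user agent could stay whatever the visitor's browser sends, and the server would still know to return the crawler HTML.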

ibnesayeed · May 15, 2017, 10:12pm

I will try to talk to some friends at the Internet Archive later this week. Next month I will be meeting with many web archive folks (including Internet Archive) at an International Internet Preservation Consortium conference. I will raise this issue there.

grahamperrin · May 16, 2017, 5:35am

Thanks, and to clarify: I did not suggest that Discourse did not play nicely. Please see earlier versions of my post, keyword:

/print

– I meant to say that the print workaround is not nice. I do not recommend it.
