Make Discourse play nice with the Wayback Machine

The way I have it implemented, this replaces the crawling view outright on topics.

13 Likes

May I suggest we focus back on the original issue here?

There are people (myself included) who want to use the Internet Archive Wayback Machine to keep copies of Discourse threads for posterity, and right now that seems impossible with the current implementations of both Discourse and the Wayback Machine.

What can be done in the Discourse codebase to improve this? And what can be done in the Archive’s crawling system to help as well?

2 Likes

I believe the CSS should look a lot better with @falco’s latest changes in that area. Can you quickly check using the archive.org tools @falco?

3 Likes

I think all we need is to detect the archive.org UA as a crawler; let me take a look.
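To illustrate the idea, here is a minimal sketch of user-agent-based crawler detection. It is not Discourse's actual implementation: the fragment list and the `is_crawler` helper are assumptions made up for this example.

```python
import re

# Hypothetical list of substrings that mark a request as coming from a crawler.
# This is NOT Discourse's real list; it only illustrates the approach.
CRAWLER_UA_FRAGMENTS = ["Googlebot", "bingbot", "archive.org_bot"]

# One case-insensitive pattern that matches any of the fragments.
_CRAWLER_RE = re.compile(
    "|".join(re.escape(fragment) for fragment in CRAWLER_UA_FRAGMENTS),
    re.IGNORECASE,
)


def is_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler fragment."""
    if not user_agent:
        # A missing or empty header is treated as a regular browser here.
        return False
    return _CRAWLER_RE.search(user_agent) is not None
```

A server using this check would send the plain HTML crawler view when `is_crawler` returns `True` and the JavaScript app otherwise.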

2 Likes

I actually got a server error while trying to save a Discourse page on the Wayback Machine today:

https://web.archive.org/save/_embed//t/recuperando-dados-de-disquetes-antigos/35

This was the original URL:

And I was trying to save it this way:

1 Like

OK, this is kinda strange.

They are using the User Agent of the user who asked to save the page to download the HTML, and after that they try to show the page with a bunch of injected JS.

This is this topic when I click the save button while using the Googlebot user agent: Make Discourse play nice with the Wayback Machine - feature - Discourse Meta

Sent an email to archive.org, let’s see.

4 Likes

That’s not enough actually, because there are many other web archives in town and new web archives come to life every now and then.

1 Like

There are many headless-browser-based archiving efforts, including http://archive.is/, which is an on-demand single-page archiving system. It renders the page using PhantomJS and then archives the rendered DOM plus the necessary assets. However, doing this on a massive scale (not just for on-demand pages) takes a lot of time, because PhantomJS or any other renderer is orders of magnitude slower than traditional vanilla crawlers such as Heritrix, which is used by the Internet Archive and many other web archives.

Here is a relevant piece of research on the topic: Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations. Below is a blog post summarizing the research and related resources.

3 Likes

Actually that didn’t work because they aren’t respecting user agent.

Discourse is already crawler friendly; you just need to ask for the crawler version.

Attempts by me to save this topic result in:

I cannot describe the procedure, or the end result, as playing nicely.

We should follow up and see if we can improve this somehow @falco, perhaps in the next few weeks or early after 1.9 begins beta.

2 Likes

The problem, as described above, is that the Wayback Machine is not being nice: they replace their user-agent string with the user's, making it impossible for us to serve a non-JS crawler version to them.

I tried their e-mail but got 0 responses.

3 Likes

@Falco, the Wayback machine is the archival replay system. The crawler used at Internet Archive is Heritrix. That said, would you mind telling me the exact user-agent string you are seeing from them? I might be able to approach some people at Internet Archive on personal channels and see what’s going on.

6 Likes

Wow, this is amazing! :tada:

It’s like I said here:

For example, if you are using Firefox 52 and click save, they will use your user agent (the Firefox 52 user agent) when requesting the page.

If they add something that can distinguish these requests, we can serve the correct page.
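As a rough sketch of what "something that can distinguish these requests" could look like on the server side: the `X-Archive-Request` header below is purely hypothetical (nothing in this thread says the Internet Archive sends it), and `choose_view` and the fragment list are invented for illustration.

```python
# Hypothetical crawler fragments, for illustration only.
CRAWLER_FRAGMENTS = ["googlebot", "archive.org_bot"]

# Hypothetical marker header; the archive would have to agree to send it.
ARCHIVE_MARKER_HEADER = "x-archive-request"


def choose_view(headers):
    """Pick which version of a page to serve: 'crawler' (plain HTML) or 'app' (JS)."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # An explicit marker wins even when the User-Agent looks like a normal browser.
    if ARCHIVE_MARKER_HEADER in normalized:
        return "crawler"

    # Otherwise fall back to plain user-agent sniffing.
    user_agent = normalized.get("user-agent", "").lower()
    if any(fragment in user_agent for fragment in CRAWLER_FRAGMENTS):
        return "crawler"

    return "app"
```

With only a Firefox 52 user agent this returns `"app"`, which is the situation described above; the same request with the marker header added returns `"crawler"`.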

1 Like

I will try to talk to some friends at the Internet Archive later this week. Next month I will be meeting with many web archive folks (including Internet Archive) at an International Internet Preservation Consortium conference. I will raise this issue there.

7 Likes

Thanks, and to clarify: I did not suggest that Discourse did not play nicely. Please see earlier versions of my post, keyword:

  • /print

– I meant to say that the print workaround is not nice. I do not recommend it.

1 Like

Any news from the IA people? It would be great to see this issue fixed…

I’ve tried to archive this historic post from the Let’s Encrypt people, and archive.org made a page that says “Oops! That page doesn’t exist or is private”, so clearly there’s something more to be done here. The print version does render correctly, but it’s not visible to the user and I wouldn’t have found it without looking here, so I’m not sure it’s a good fix.

In general, this connects with complaints I’ve heard from would-be Discourse users about the heavy use of Javascript in the user interface. Obviously, this was discussed many times here, but having something show up when no Javascript loads at all would be a huge benefit to many users, not just crawlers. In Firefox with Javascript disabled, this site just says “Cannot load app” and basically tells me to go away. Many security-worried folks browse the web with Javascript turned off. As the article @ibnesayeed shared says, this is not specific to Discourse and browsing without Javascript promises a world of pain and emptiness, but it would be nice if Discourse would degrade more gracefully than “Go away, you’re too old”. :slight_smile:

Thanks for all the work! I can already appreciate all the work that’s been done to tailor to all those corner cases, for what it’s worth…

Set your user agent to google and try again. I think you’re missing something here.

(Also the IA user agent is explicitly in our list of crawler user agents, as I recall)

weird. i might be missing something indeed: if I disable javascript/XHR (in uMatrix) and hit “reload”, I get the “Cannot load app” message. buuut if i force-reload i get the crawler version. so I’m not sure what’s going on here…

also, setting my useragent to Googlebot works: I see the crawler version with javascript enabled or not. It would still be nice to show that version when Javascript is disabled, for example…
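For anyone who wants to try this outside the browser, here is a small sketch that builds a request with a Googlebot user agent using Python's standard library. The topic URL in the usage note is a placeholder and `build_crawler_request` is just a name made up for this example.

```python
import urllib.request

# Publicly documented Googlebot user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_crawler_request(url):
    """Build (but do not send) a GET request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
```

Passing the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_crawler_request("https://forum.example.com/t/some-topic/123"))`, would fetch whatever the forum serves to that user agent, which should be the crawler version if the site sniffs the user agent as discussed here.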

but that’s kind of drifting out of the original topic here… maybe i should make a “i’m a paranoid non-javascript old fart and i still want discourse to work” discussion? :wink:

It does. In Chrome when I disable JS and refresh the Let’s Encrypt page I get a fully readable page.

There are several reports of bad browser extensions that don’t work on service-worker-enabled sites: ServiceWorker thinks I'm offline when I'm not

Last I checked, they send down the user agent of the user who requested the archive, making our user-agent approach useless :sadpanda:

4 Likes