Make Discourse play nice with the Wayback Machine

You should ping them directly; I know they hang out on IRC, etc.

2 Likes

OK I think I know what’s happening:

  1. User starts archiving process in the Wayback Machine

  2. The Wayback Machine just loads standard Discourse in your browser, under its domain, with some injected JS.

  3. We serve the correct HTML/JS/CSS, but since we are in a strange domain:
    https://web.archive.org/web/20180313203411/https://community.letsencrypt.org/t/acme-v2-and-wildcard-certificate-support-is-live/55579?u=falcotesting
    The Ember router doesn’t know how to resolve the route and renders the 404 template.

Any ideas @eviltrout ?

5 Likes

That is going to be tricky. Not sure how to fix it other than having Discourse recognize the Wayback Machine URLs and strip them?

5 Likes

Tip: also handle URLs where the date is suffixed with “if_” (for in-page content, i.e. subresources such as "if"rames).
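To make the URL-stripping idea concrete, here is a minimal sketch, not anything Discourse actually ships. `stripWaybackPrefix` and `WAYBACK_PATH` are hypothetical names. It assumes replay URLs have the shape `/web/<timestamp>/<original URL>`, as in the example earlier in this topic, with an optional two-letter modifier such as `if_` right after the timestamp.

```javascript
// Hypothetical helper, not part of Discourse.
// Assumed replay URL shape:
//   https://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
// where <modifier>_ is e.g. "if_" (assumption: two letters plus underscore).
const WAYBACK_PATH = /^\/web\/(\d{1,14})([a-z]{2}_)?\/(https?:\/\/.+)$/i;

function stripWaybackPrefix(href) {
  const url = new URL(href);
  if (url.hostname !== "web.archive.org") {
    return null; // not a Wayback Machine replay URL
  }

  // The original query string ends up as the query string of the replay URL,
  // so match against pathname + search.
  const match = WAYBACK_PATH.exec(url.pathname + url.search);
  if (!match) {
    return null;
  }

  const original = new URL(match[3]);
  return {
    timestamp: match[1],
    modifier: match[2] || null,
    original: original.href,
    // This is the part a client-side router would need to see.
    path: original.pathname + original.search,
  };
}
```

For the Let’s Encrypt capture above, `path` would come out as `/t/acme-v2-and-wildcard-certificate-support-is-live/55579?u=falcotesting`, which is what the router expects to see.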

I’ve tried to poke one of the IA employees I know and see if he knows who to poke at IA/how to poke them.

6 Likes

This looks OK to me now, or am I missing something? cc @falco @ibnesayeed @anarcat

http://web.archive.org/web/20190202031942/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579

We’re getting the crawler version which seems to work fine and is a good outcome?

11 Likes

I don’t see the same result here, strangely:

“Oops! That page doesn’t exist or is private.”

I tried to save the page with the “save page now” feature, maybe that’s the trouble?

2 Likes

Do you mean some oddball browser plugin, or a web page where you paste in the URL?

Sort of, going to page two from there does not work. Try it.

I have a Javascript bookmarklet for the internet archive.

javascript:void(window.open('https://web.archive.org/save/'+location.href));

But I would assume the same would happen if you added the URL on the “save page now” page.

I… don’t understand what you mean there, sorry. :slight_smile:

2 Likes

This is a source of giant confusion, I just hit it. I want this fixed and am happy to have Discourse chuck money at this important problem.

Wayback machine has this thing called: liveweb proxy. https://github.com/internetarchive/liveweb

This little Python thing has not been touched since 2013. What it does is attempt to offload “waybacking” to consumers.

If I head to wayback machine and plug in a URL it does not have I get:

So I head off and click that button and get:

This screenshot is a lie, cause I can see the page properly as anonymous if I hit Discourse.

What happens here technically is that they run a proxy in California that intercepts all the traffic from my browser to meta. This proxy uses the same user agent as the one I have, so we think this is coming from Firefox and give it a proper desktop view which is not desirable at all.

There are 3 possible solutions to this problem

  • Get Wayback to add a specific header to the liveweb proxy so we can detect that traffic is coming from it and switch to the crawler view. I don’t think it has one, because I looked at the source and cannot see it.

  • We teach our ember app/router to understand the liveweb proxy is playing funny games with our web app and have it “allow” for it. This is a nightmare as @eviltrout will attest and not something I want us to do.

  • Convince archive.org to switch user agent to the same user agent they are using when they crawl us.

I think sorting this out is very important, cause each time people submit pages to wayback machine it is getting “rubbish” that it is considering adding to its index.

Do we have any friends at wayback we can talk to? Seems like a trivial fix on their end.

9 Likes

… actually… looks like we have a header to key off…

@maja can you fix this up? Treat requests as a “crawler” if the Via header is set to web.archive.org.

( you get that page by trying to archive https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending?randomstring )

Then test once deployed on https://archive.org/web/ by adding a URL.
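Roughly, the check I have in mind looks like the sketch below. It is illustrative only: the real change would live in the Ruby crawler detection code. `isWaybackRequest` and `viewFor` are made-up names, and `headers` is assumed to be a plain object mapping header names to values.

```javascript
// Hypothetical sketch, not the actual Discourse implementation.
// `headers` is assumed to be a plain object of request headers.
function isWaybackRequest(headers) {
  // Header names are case-insensitive, so look the Via header up that way.
  const entry = Object.entries(headers).find(
    ([name]) => name.toLowerCase() === "via"
  );
  if (!entry) {
    return false;
  }
  // Assumption: the proxy identifies itself with "web.archive.org"
  // somewhere inside the Via value.
  return String(entry[1]).toLowerCase().includes("web.archive.org");
}

// Pick which rendering to serve: the static crawler view or the full JS app.
function viewFor(headers) {
  return isWaybackRequest(headers) ? "crawler" : "full";
}
```

A substring match rather than strict equality is a deliberate choice here, since a Via value normally carries a protocol version (and possibly a comment) next to the host name.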

13 Likes

Thanks to @maja we now detect wayback machine and thanks to @awesomerobot and @saurabhp we style it much nicer, so we have:

I feel this is done enough to close :blush: we can look at refining it further down the line.

22 Likes

Replay is currently broken, the JS runs and the Ember Router breaks due to the pathname change.

Thanks to the improved browser detection from @david, there is an extremely ugly but also tempting fix to get new captures to render properly: just patch browser-detect so it detects the replay and yanks out the noscript version.

https://github.com/riking/discourse/commit/6a83c83bd3acab37f7a3e24f6aa4a14081bb2249

The problem is, if we start serving that script, and by some miracle the JS starts being able to run, all the old archived pages are forced into the no-js view.

Now that I write that out, you know, that’s probably not too bad of a price to pay for getting working archive playbacks today. (Draft PR) I have been talked out of actually doing this.

3 Likes

(Was @dan, not me)

Is our existing wayback machine bypass broken?
https://github.com/discourse/discourse/blob/cb8f8de422b8b270dc57f3614d3d9d718bfc40ef/lib/crawler_detection.rb#L18-L18

3 Likes

Is there any particular reason we are not checking for their user agent (archive.org_bot)? It seems to be a less fragile solution.

https://archive.org/details%2Farchive.org_bot%2F

1 Like

Their “liveweb” thing does not send the user agent, I think:

4 Likes

I believe some things changed (see the dates). I think we should be checking both of them.

EDIT: Submitted a PR for this:

https://github.com/discourse/discourse/pull/9777
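Checking both signals could look roughly like this. It is a sketch of the idea only, not the code in the PR above: `isWaybackCrawler` is a hypothetical name, and the two substrings are the ones mentioned in this topic (`archive.org_bot` for the user agent, `web.archive.org` for the Via header).

```javascript
// Hypothetical sketch: treat a request as coming from the Wayback Machine
// if EITHER the user agent or the Via header points at archive.org.
const WAYBACK_UA_TOKEN = "archive.org_bot"; // user agent token from this topic
const WAYBACK_VIA_TOKEN = "web.archive.org"; // Via header token from this topic

function isWaybackCrawler(userAgent, via) {
  // Either value may be missing on a given request.
  const ua = (userAgent || "").toLowerCase();
  const viaValue = (via || "").toLowerCase();
  return ua.includes(WAYBACK_UA_TOKEN) || viaValue.includes(WAYBACK_VIA_TOKEN);
}
```

Using both means detection keeps working if only one of the two is present on a given request.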

10 Likes