Make Discourse play nice with the Wayback Machine

I’ve poked one of the IA employees I know to see if he knows who to contact at IA and how to reach them.

6 Likes

This looks OK to me now, or am I missing something? cc @falco @ibnesayeed @anarcat

http://web.archive.org/web/20190202031942/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579

We’re getting the crawler version which seems to work fine and is a good outcome?

11 Likes

I don’t see the same result here, strangely:

“Oops! That page doesn’t exist or is private.”

I tried to save the page with the “save page now” feature, maybe that’s the trouble?

2 Likes

Do you mean some oddball browser plugin, or a web page where you paste in the URL?

Sort of, going to page two from there does not work. Try it.

I have a JavaScript bookmarklet for the Internet Archive:

javascript:void(window.open('https://web.archive.org/save/'+location.href));

But I would assume the same would happen if you added the URL on the “Save Page Now” page.

I… don’t understand what you mean there, sorry. :slight_smile:

2 Likes

This is a source of giant confusion, I just hit it. I want this fixed and am happy to have Discourse chuck money at this important problem.

The Wayback Machine has a component called the liveweb proxy (GitHub: internetarchive/liveweb, “Liveweb proxy of the Wayback Machine project”).

This little Python thing has not been touched since 2013. What it does is attempt to offload “waybacking” to consumers.

If I head to the Wayback Machine and plug in a URL it does not have, I get:

So I head off and click that button and get:

This screenshot is a lie, because I can see the page properly as an anonymous user if I hit Discourse directly.

What happens here technically is that they run a proxy in California that intercepts all the traffic from my browser to meta. This proxy uses the same user agent as my browser, so we think the request is coming from Firefox and give it the full desktop view, which is not desirable at all.

There are 3 possible solutions to this problem:

  • Get Wayback to add a specific header to the liveweb proxy so we can detect that traffic is coming from it and switch to the crawler view. I don’t think it has one, because I looked at the source and cannot see it.

  • We teach our ember app/router to understand the liveweb proxy is playing funny games with our web app and have it “allow” for it. This is a nightmare as @eviltrout will attest and not something I want us to do.

  • Convince archive.org to switch the liveweb proxy’s user agent to the same one they use when they crawl us.

I think sorting this out is very important, because each time people submit pages to the Wayback Machine, it gets “rubbish” that it considers adding to its index.

Do we have any friends at wayback we can talk to? Seems like a trivial fix on their end.

9 Likes

… actually… looks like we have a header to key off…

@maja can you fix this up? Treat requests as a “crawler” if the Via header is set to web.archive.org.

(You get that page by trying to archive https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending?randomstring)

Then test once deployed on https://archive.org/web/ by adding a URL.
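
A minimal sketch of the check described above, in JavaScript for illustration only. This is not Discourse's actual implementation: the function name `isWaybackRequest` and the plain-object `headers` shape are assumptions made for this example. The exact Via value sent by the liveweb proxy is not quoted in this thread, so the sketch only does a case-insensitive substring match on `web.archive.org`.

```javascript
// Hypothetical helper: decide whether a request came through the Wayback
// Machine's liveweb proxy by looking at the Via header.
// `headers` is assumed to be a plain object of header name -> value.
function isWaybackRequest(headers) {
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so normalize before comparing.
    if (name.toLowerCase() === "via" && typeof value === "string") {
      return value.toLowerCase().includes("web.archive.org");
    }
  }
  return false;
}

// Made-up header values, not captured from a real request.
console.log(isWaybackRequest({ Via: "HTTP/1.0 web.archive.org (example)" })); // true
console.log(isWaybackRequest({ "User-Agent": "Mozilla/5.0" })); // false
```

A request flagged this way would then be served the crawler view instead of the full JavaScript app.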

13 Likes

Thanks to @maja we now detect the Wayback Machine, and thanks to @awesomerobot and @saurabhp we style it much more nicely, so we have:

I feel this is done enough to close :blush: we can look at refining it further down the line.

22 Likes

Replay is currently broken: the JS runs and the Ember Router breaks due to the pathname change.

Thanks to the improved browser detection from @david, there is an extremely ugly but also tempting fix to get new captures to render properly: just patch browser-detect so it detects the replay and yanks out the noscript version.

The problem is, if we start serving that script, and by some miracle the JS starts being able to run, all the old archived pages are forced into the no-js view.

Now that I write that out, you know, that’s probably not too bad of a price to pay for getting working archive playbacks today. (Draft PR) I have been talked out of actually doing this.
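
For reference, the pathname change mentioned above can be seen in the archive link earlier in this thread: on replay the page lives under `/web/<timestamp>/<original URL>` rather than `/t/...`. Below is a rough sketch of how a script might recognize that shape. The helper name `parseWaybackPath` is hypothetical, this is not the draft PR's code, and the pattern only covers the URL form shown in this thread plus an assumed optional two-letter modifier suffix after the timestamp.

```javascript
// Hypothetical helper: recognize a Wayback Machine replay pathname such as
//   /web/20190202031942/https://meta.discourse.org/t/.../34579
// and recover the capture timestamp and the original URL.
// Assumption: the timestamp is a run of up to 14 digits, optionally followed
// by a short modifier suffix (two letters and an underscore). Other replay
// URL shapes are not handled and yield null.
function parseWaybackPath(pathname) {
  const match = /^\/web\/(\d{1,14})([a-z]{2}_)?\/(https?:\/\/.+)$/.exec(pathname);
  if (!match) return null;
  return { timestamp: match[1], originalUrl: match[3] };
}

const replay = parseWaybackPath(
  "/web/20190202031942/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579"
);
console.log(replay.timestamp); // 20190202031942
console.log(replay.originalUrl); // https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579
console.log(parseWaybackPath("/t/some-topic/1")); // null
```

A router that expects `/t/...` sees the longer replay path instead, which is why the app fails to match a route during playback.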

3 Likes

(Was @udan11, not me)

Is our existing wayback machine bypass broken?

3 Likes

Is there any particular reason we are not checking for their user agent (archive.org_bot)? It seems to be a less fragile solution.

https://archive.org/details%2Farchive.org_bot%2F

1 Like

Their “liveweb” thing does not send the user agent, I think:

4 Likes

I believe some things changed (see the dates). I think we should be checking both of them.
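
To illustrate what “checking both” could mean, here is a small sketch that treats a request as coming from the Internet Archive if either signal is present. It is written in JavaScript purely as an illustration and is not the code from the PR. The function name `isInternetArchiveRequest` and the lowercase-keyed `headers` object are assumptions, and the sample header values are made up.

```javascript
// Hypothetical helper: flag a request as Internet Archive traffic if EITHER
//   1. the user agent contains "archive.org_bot" (the crawler), or
//   2. the Via header mentions web.archive.org (the liveweb proxy).
// `headers` is assumed to be a plain object with lowercase header names.
function isInternetArchiveRequest(headers) {
  const userAgent = String(headers["user-agent"] || "").toLowerCase();
  const via = String(headers["via"] || "").toLowerCase();
  return userAgent.includes("archive.org_bot") || via.includes("web.archive.org");
}

// Made-up header values for illustration only.
console.log(isInternetArchiveRequest({ "user-agent": "Mozilla/5.0 (compatible; archive.org_bot)" })); // true
console.log(isInternetArchiveRequest({ "user-agent": "Mozilla/5.0", via: "1.1 web.archive.org" })); // true
console.log(isInternetArchiveRequest({ "user-agent": "Mozilla/5.0" })); // false
```

Checking both means the detection keeps working if only one of the two signals is sent on a given request.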

EDIT: Submitted a PR for this:

9 Likes

Would be lovely to see this working again. I am promoting Discourse as a central hub for Solid Project, especially for core team members and experts working on standardization of Solid, but this issue is an important reason for them to be unwilling to do so.

1 Like

The PR was merged, so it should be working.

4 Likes

Just confirmed by doing a “save outlinks” on a /top/yearly… working fully right now.

6 Likes