Make Discourse play nice with the Wayback Machine

This is a source of giant confusion, and I just hit it myself. I want this fixed and am happy to have Discourse chuck money at this important problem.

The Wayback Machine has a component called the liveweb proxy (GitHub: internetarchive/liveweb, "Liveweb proxy of the Wayback Machine project").

This little Python project has not been touched since 2013. It attempts to offload “waybacking” to consumers.

If I head to the Wayback Machine and plug in a URL it does not have, I get:

So I head off and click that button and get:

This screenshot is a lie, because I can see the page properly as an anonymous user if I hit Discourse directly.

What happens here technically is that they run a proxy in California that intercepts all the traffic from my browser to meta. This proxy uses the same user agent as my browser, so we think the request is coming from Firefox and serve it the full desktop view, which is not desirable at all.

There are 3 possible solutions to this problem:

  • Get the Wayback Machine to add a specific header to the liveweb proxy, so we can detect that traffic is coming from it and switch to the crawler view. I don’t think it sends one, because I looked at the source and cannot see it.

  • Teach our Ember app/router to understand that the liveweb proxy is playing funny games with our web app and have it “allow” for it. This is a nightmare, as @eviltrout will attest, and not something I want us to do.

  • Convince archive.org to switch the proxy’s user agent to the same one they use when they crawl us.
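To make the first and third options concrete, here is a minimal sketch of the server-side check they would enable. It is not Discourse code. The header name `x-liveweb-proxy` is hypothetical (as far as I can tell the proxy sends no such header today), and the `archive.org_bot` user agent token is my assumption about how their crawler identifies itself.

```python
# Hypothetical header name; the liveweb proxy does not appear to send one today.
LIVEWEB_HEADER = "x-liveweb-proxy"

# Assumed substring of the user agent archive.org uses when crawling.
CRAWLER_UA_TOKENS = ("archive.org_bot",)


def wants_crawler_view(headers):
    """Return True if the request should get the crawler view."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Option 1: the proxy announces itself with a dedicated header.
    if LIVEWEB_HEADER in normalized:
        return True

    # Option 3: the proxy sends the crawler user agent instead of the browser's.
    user_agent = normalized.get("user-agent", "").lower()
    return any(token in user_agent for token in CRAWLER_UA_TOKENS)
```

Either change on their side would be enough on its own. Right now neither signal is present, so the proxy is indistinguishable from a regular Firefox visit.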

I think sorting this out is very important, because each time people submit pages to the Wayback Machine it gets “rubbish” that it then considers adding to its index.

Do we have any friends at the Wayback Machine we can talk to? This seems like a trivial fix on their end.
