You should ping them directly; I know they hang out on IRC etc.
OK I think I know what’s happening:
User starts archiving process in the Wayback Machine
Wayback Machine just loads standard Discourse in your browser, under their domain, with some injected JS.
We serve the correct HTML/JS/CSS, but since we are in a strange domain:
The Ember router doesn’t know how to handle the route and renders the 404 template.
Any ideas @eviltrout ?
That is going to be tricky. Not sure how to fix it outside of having Discourse recognize the Wayback Machine URLs and strip them?
Tip: also handle URLs where the date is suffixed with “if_” (used for in-page content, i.e. subresources such as iframes).
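A minimal sketch of the stripping idea (illustrative only; the helper name and regex are assumptions, not actual Discourse code). Wayback replay paths look like `/web/<14-digit timestamp>[if_]/<original URL>`, so recovering the original route is mostly a matter of peeling that prefix off:

```ruby
require "uri"

# Hypothetical sketch: recover the original Discourse route from a Wayback
# Machine replay pathname. The optional "if_" suffix on the timestamp marks
# in-page subresources such as iframes.
WAYBACK_PATH = %r{\A/web/\d{14}(?:if_)?/(?<original>.+)\z}

def original_route(wayback_path)
  match = WAYBACK_PATH.match(wayback_path)
  return wayback_path unless match # not a Wayback URL; leave untouched
  URI(match[:original]).path
end

original_route("/web/20180814203538if_/https://meta.discourse.org/t/wayback/1234")
# => "/t/wayback/1234"
```

In practice the router fix would have to run client-side, but the same pattern match applies.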
I’ve tried to poke one of the IA employees I know to see if he knows who to poke at IA, and how to poke them.
We’re getting the crawler version which seems to work fine and is a good outcome?
I don’t see the same result here, strangely:
“Oops! That page doesn’t exist or is private.”
I tried to save the page with the “save page now” feature, maybe that’s the trouble?
Do you mean some oddball browser plugin, or a web page where you paste in the URL?
Sort of, going to page two from there does not work. Try it.
But I would assume the same would happen if you added the URL on the “save page now” page.
I… don’t understand what you mean there, sorry.
This is a source of giant confusion, I just hit it. I want this fixed and am happy to have Discourse chuck money at this important problem.
The Wayback Machine has this thing called the liveweb proxy: https://github.com/internetarchive/liveweb
This little Python thing has not been touched since 2013. What it does is attempt to offload “waybacking” to consumers.
If I head to wayback machine and plug in a URL it does not have I get:
So I head off and click that button and get:
This screenshot is a lie, cause I can see the page properly as anonymous if I hit Discourse.
What happens here technically is that they run a proxy in California that intercepts all the traffic from my browser to meta. This proxy uses the same user agent as mine, so we think the request is coming from Firefox and serve a proper desktop view, which is not desirable at all.
There are three possible solutions to this problem:
Get Wayback to add a specific header to the liveweb proxy so we can detect traffic coming from it and switch to the crawler view. I don’t think it has one, because I looked at the source and cannot see it.
We teach our Ember app/router to understand that the liveweb proxy is playing funny games with our web app and have it “allow” for it. This is a nightmare, as @eviltrout will attest, and not something I want us to do.
Convince archive.org to switch the user agent to the same one they use when they crawl us.
I think sorting this out is very important, cause each time people submit pages to the Wayback Machine it gets “rubbish” that it then considers adding to its index.
Do we have any friends at wayback we can talk to? Seems like a trivial fix on their end.
… actually… looks like we have a header to key off…
@maja, can you fix this up? Treat stuff as a “crawler” if the via header is set to
( you get that page by trying to archive
Then test once deployed on https://archive.org/web/ by adding a URL.
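The shape of the fix would be something like the following (a rough sketch, not the shipped Discourse code; the method name and the exact Via-header value to match are assumptions):

```ruby
# Hypothetical sketch: serve the crawler view either when the user agent looks
# like a crawler, or when the Via header shows the request came through the
# Wayback Machine's liveweb proxy (which otherwise forwards the user's own UA).
CRAWLER_USER_AGENT = /bot|spider|crawler/i

def serve_crawler_view?(user_agent, via_header)
  return true if via_header.to_s.include?("web.archive.org") # liveweb proxy (assumed value)
  CRAWLER_USER_AGENT.match?(user_agent.to_s)
end
```

The key point is that for “save page now” traffic the user agent is useless (it is the submitter’s own browser), so the Via header is the only reliable signal.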
I feel this is done enough to close; we can look at refining it further down the line.
Replay is currently broken: the JS runs and the Ember router breaks due to the pathname change.
Thanks to the improved browser detection from @david, there is an extremely ugly but also tempting fix to get new captures to render properly: just patch browser-detect so it detects the replay and yanks out the
The problem is, if we start serving that script, and by some miracle the JS starts being able to run, all the old archived pages are forced into the no-js view.
Now that I write that out, you know, that’s probably not too bad of a price to pay for getting working archive playbacks today. (Draft PR) I have been talked out of actually doing this.
(Was @dan, not me)
Is our existing wayback machine bypass broken?
Is there any particular reason we are not checking for their user agent (archive.org_bot)? It seems to be a less fragile solution.
Their “liveweb” thing does not send that user agent, I think:
I believe some things changed (see the dates). I think we should be checking both of them.
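Checking both would look roughly like this (illustrative only; the method name is hypothetical and the matched strings are assumptions): regular crawls identify themselves with the archive.org_bot user agent, while “save page now” traffic comes through the liveweb proxy and is only identifiable by its Via header.

```ruby
# Hypothetical sketch of checking both signals, since either one alone misses
# one of the two Wayback Machine traffic paths.
def wayback_request?(user_agent, via_header)
  user_agent.to_s.include?("archive.org_bot") || # regular crawls
    via_header.to_s.include?("web.archive.org")  # liveweb proxy saves (assumed value)
end
```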
EDIT: Submitted a PR for this: