Make Discourse play nice with the Wayback Machine

Any news from the IA people? It would be great to see this issue fixed…

I’ve tried to archive this historic post from the Let’s Encrypt people and archive.org made a page that says “Oops! That page doesn’t exist or is private”, so clearly there’s something more to be done here. The print version does render correctly, but it’s not visible to the user and I wouldn’t have found it without looking here, so I’m not sure it’s a good fix.

In general, this connects with complaints I’ve heard from would-be Discourse users about the heavy use of Javascript in the user interface. Obviously, this was discussed many times here, but having something show up when no Javascript loads at all would be a huge benefit to many users, not just crawlers. In Firefox with Javascript disabled, this site just says “Cannot load app” and basically tells me to go away. Many security-worried folks browse the web with Javascript turned off. As the article @ibnesayeed shared says, this is not specific to Discourse and browsing without Javascript promises a world of pain and emptiness, but it would be nice if Discourse would degrade more gracefully than “Go away, you’re too old”. :slight_smile:

Thanks for all the work! I can already appreciate all the work that’s been done to tailor to all those corner cases, for what it’s worth…

Set your user agent to google and try again. I think you’re missing something here.

(Also the IA user agent is explicitly in our list of crawler user agents, as I recall)
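
For anyone following along, user-agent based crawler detection boils down to something like the sketch below. This is only an illustration: `CRAWLER_PATTERNS` and `isCrawlerUserAgent` are made-up names, and the patterns are example values, not the actual list Discourse ships.

```javascript
// Hypothetical pattern list for illustration only; the real crawler
// list in Discourse is configurable and may look different.
const CRAWLER_PATTERNS = ["Googlebot", "bingbot", "archive.org_bot"];

// Returns true when the User-Agent string contains one of the patterns
// (case-insensitive substring match).
function isCrawlerUserAgent(userAgent) {
  if (!userAgent) {
    return false;
  }
  const ua = userAgent.toLowerCase();
  return CRAWLER_PATTERNS.some((pattern) => ua.includes(pattern.toLowerCase()));
}
```

A request whose user agent matches would get the plain HTML crawler view; everything else gets the full JavaScript app.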

weird. i might be missing something indeed: if I disable javascript/XHR (in uMatrix) and hit “reload”, I get the “Cannot load app” message. buuut if i force-reload i get the crawler version. so I’m not sure what’s going on here…

also, setting my useragent to Googlebot works: I see the crawler version with javascript enabled or not. It would still be nice to show that version when Javascript is disabled, for example…

but that’s kind of drifting out of the original topic here… maybe i should make a “i’m a paranoid non-javascript old fart and i still want discourse to work” discussion? :wink:

It does. In Chrome when I disable JS and refresh the Let’s Encrypt page I get a fully readable page.

There are several reports of bad browser extensions that don’t work on service-worker-enabled sites: ServiceWorker thinks I'm offline when I'm not - #19 by supermathie

Last I checked, they send down the user agent of the user who asked the archive, making our user-agent approach useless :sadpanda:

Uh! Well that’s interesting, you’re right! In Firefox, when I disable JS through the javascript.enabled config knob, i do get the proper page here. No idea why uMatrix fails to do the right thing here… I guess that’s a bug on uMatrix’s side?

That would be a problem indeed. Isn’t there anything IA sends that could be picked up by Discourse? There must be something…

I would love to get some input from the wayback team on this.

Looks like their current approach injects this into <body>:

<!-- BEGIN WAYBACK TOOLBAR INSERT -->
<script type="text/javascript" src="/static/js/timestamp.js?v=1520907403.0" charset="utf-8"></script>
<script type="text/javascript" src="/static/js/graph-calc.js?v=1520907403.0" charset="utf-8"></script>
<script type="text/javascript" src="/static/js/auto-complete.js?v=1520907403.0" charset="utf-8"></script>
<script type="text/javascript" src="/static/js/toolbar.js?v=1520907403.0" charset="utf-8"></script>

<style type="text/css">
body {
  margin-top:0 !important;
  padding-top:0 !important;
  /*min-width:800px !important;*/
}
.wb-autocomplete-suggestions {
    text-align: left; cursor: default; border: 1px solid #ccc; border-top: 0; background: #fff; box-shadow: -1px 1px 3px rgba(0,0,0,.1);
    position: absolute; display: none; z-index: 2147483647; max-height: 254px; overflow: hidden; overflow-y: auto; box-sizing: border-box;
}
.wb-autocomplete-suggestion { position: relative; padding: 0 .6em; line-height: 23px; white-space: nowrap; overflow: hidden; text-overflow: ellipsis; font-size: 1.02em; color: #333; }
.wb-autocomplete-suggestion b { font-weight: bold; }
.wb-autocomplete-suggestion.selected { background: #f0f0f0; }
</style>
<div id="wm-ipp" lang="en" style="display:none;direction:ltr;">
<div style="position:fixed;left:0;top:0;right:0;">
<div id="wm-ipp-inside">
  <div style="position:relative;">
    <div id="wm-logo" style="float:left;width:130px;padding-top:10px;">
      <a href="/web/" title="Wayback Machine home page"><img src="/static/images/toolbar/wayback-toolbar-logo.png" alt="Wayback Machine" width="110" height="39" border="0" /></a>
    </div>
    <div class="r" style="float:right;">
      <div id="wm-btns" style="text-align:right;height:25px;">
                  <div id="wm-save-snapshot-success">success</div>
          <div id="wm-save-snapshot-fail">fail</div>
          <a href="#"
             onclick="__wm.saveSnapshot('https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest', '20180313183529')"
             title="Share via My Web Archive"
             id="wm-save-snapshot-open"
          >
            <span class="iconochive-web"></span>
          </a>
          <a href="https://archive.org/account/login.php"
             title="Sign In"
             id="wm-sign-in"
          >
            <span class="iconochive-person"></span>
          </a>
          <span id="wm-save-snapshot-in-progress" class="iconochive-web"></span>
        	<a href="http://faq.web.archive.org/" title="Get some help using the Wayback Machine" style="top:-6px;"><span class="iconochive-question" style="color:rgb(87,186,244);font-size:160%;"></span></a>
	<a id="wm-tb-close" href="#close" onclick="__wm.h(event);return false;" style="top:-2px;" title="Close the toolbar"><span class="iconochive-remove-circle" style="color:#888888;font-size:240%;"></span></a>
      </div>
      <div id="wm-share" style="text-align:right;">
	<a href="#" onclick="window.open('https://www.facebook.com/sharer/sharer.php?u=https://web.archive.org/web/20180313183529/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest', '', 'height=400,width=600'); return false;" title="Share on Facebook" style="margin-right:5px;" target="_blank"><span class="iconochive-facebook" style="color:#3b5998;font-size:160%;"></span></a>
	<a href="#" onclick="window.open('https://twitter.com/intent/tweet?text=https://web.archive.org/web/20180313183529/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest&amp;via=internetarchive', '', 'height=400,width=600'); return false;" title="Share on Twitter" style="margin-right:5px;" target="_blank"><span class="iconochive-twitter" style="color:#1dcaff;font-size:160%;"></span></a>
      </div>
    </div>
    <table class="c" style="">
      <tbody>
	<tr>
	  <td class="u" colspan="2">
	    <form target="_top" method="get" action="/web/submit" name="wmtb" id="wmtb"><input type="text" name="url" id="wmtbURL" value="https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest" onfocus="this.focus();this.select();" /><input type="hidden" name="type" value="replay" /><input type="hidden" name="date" value="20180313183529" /><input type="submit" value="Go" /></form>
	  </td>
	  <td class="n" rowspan="2" style="width:110px;">
	    <table>
	      <tbody>
		<!-- NEXT/PREV MONTH NAV AND MONTH INDICATOR -->
		<tr class="m">
		  <td class="b" nowrap="nowrap">Feb</td>
		  <td class="c" id="displayMonthEl" title="You are here: 18:35:29 Mar 13, 2018">MAR</td>
		  <td class="f" nowrap="nowrap">Apr</td>
		</tr>
		<!-- NEXT/PREV CAPTURE NAV AND DAY OF MONTH INDICATOR -->
		<tr class="d">
		  <td class="b" nowrap="nowrap"><img src="/static/images/toolbar/wm_tb_prv_off.png" alt="Previous capture" width="14" height="16" border="0" /></td>
		  <td class="c" id="displayDayEl" style="width:34px;font-size:24px;white-space:nowrap;" title="You are here: 18:35:29 Mar 13, 2018">13</td>
		  <td class="f" nowrap="nowrap"><img src="/static/images/toolbar/wm_tb_nxt_off.png" alt="Next capture" width="14" height="16" border="0" /></td>
		</tr>
		<!-- NEXT/PREV YEAR NAV AND YEAR INDICATOR -->
		<tr class="y">
		  <td class="b" nowrap="nowrap">2017</td>
		  <td class="c" id="displayYearEl" title="You are here: 18:35:29 Mar 13, 2018">2018</td>
		  <td class="f" nowrap="nowrap">2019</td>
		</tr>
	      </tbody>
	    </table>
	  </td>
	</tr>
	<tr>
	  <td class="s">
	    	    <div id="wm-nav-captures">
	      	      <a class="t" href="/web/20180313183529*/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest" title="See a list of every capture for this URL">1 capture</a>
	      <div class="r" title="Timespan for captures of this URL">13 Mar 2018</div>
	      </div>
	  </td>
	  <td class="k">
	    <a href="" id="wm-graph-anchor">
	      <div id="wm-ipp-sparkline" title="Explore captures for this URL" style="position: relative">
		<canvas id="wm-sparkline-canvas" width="575" height="27" border="0"></canvas>
	      </div>
	    </a>
	  </td>
	</tr>
      </tbody>
    </table>
    <div style="position:absolute;bottom:0;right:2px;text-align:right;">
      <a id="wm-expand" class="wm-btn wm-closed" href="#expand" onclick="__wm.ex(event);return false;"><span id="wm-expand-icon" class="iconochive-down-solid"></span> <span style="font-size:80%">About this capture</span></a>
    </div>
  </div>
    <div id="wm-capinfo" style="border-top:1px solid #777;display:none; overflow: hidden">
            <div style="background-color:#666;color:#fff;font-weight:bold;text-align:center">COLLECTED BY</div>
    <div style="padding:3px;position:relative" id="wm-collected-by-content">
      <div style="display:inline-block;vertical-align:top;width:49%;">
			<span class="c-logo" style="background-image:url(https://archive.org/services/img/liveweb)"></span>
		<div>Collection: <a style="color:#33f;" href="https://archive.org/details/liveweb" target="_new"><span class="wm-title">Live Web Proxy Crawls</span></a></div>
		<div style="max-height:75px;overflow:hidden;position:relative;">
	  <div style="position:absolute;top:0;left:0;width:100%;height:75px;background:linear-gradient(to bottom,rgba(255,255,255,0) 0%,rgba(255,255,255,0) 90%,rgba(255,255,255,255) 100%);"></div>
	  Content crawled via the <a href="http://archive.org/web/web.php">Wayback Machine</a> Live Proxy mostly by the Save Page Now feature on web.archive.org.
<br /><br />
Liveweb proxy is a component of Internet Archive’s wayback machine project. The liveweb proxy captures the content of a web page in real time, archives it into a ARC or WARC file and returns the ARC/WARC record back to the wayback machine to process. The recorded ARC/WARC file becomes part of the wayback machine in due course of time.
<br /><br />
	</div>
	      </div>
    </div>
    <div style="background-color:#666;color:#fff;font-weight:bold;text-align:center" title="Timestamps for the elements of this page">TIMESTAMPS</div>
    <div>
      <div id="wm-capresources" style="margin:0 5px 5px 5px;max-height:250px;overflow-y:scroll !important"></div>
      <div id="wm-capresources-loading" style="text-align:left;margin:0 20px 5px 5px;display:none"><img src="/static/images/loading.gif" alt="loading" /></div>
    </div>
  </div></div></div></div><script type="text/javascript">
__wm.bt(575,27,25,2,"web","https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579?u=falcotest","2018-03-13",1996);
</script>
<!-- END WAYBACK TOOLBAR INSERT -->

Also, their proxy is failing to load our third-party JS script at https://web.archive.org/web/20180313183530/https://d11a6trkgmumsb.cloudfront.net/assets/plugin-third-party-a2f584b233a72491ff3314b6b98a7d5eb77b3cf099111c1266dcf03974044e9e.js.

You should ping them directly, I know they hang out on IRC etc

OK I think I know what’s happening:

  1. User starts archiving process in the Wayback Machine

  2. Wayback Machine just loads standard Discourse in your browser, under their domain, with some injected JS.

  3. We serve the correct HTML/JS/CSS, but since we are in a strange domain:
    https://web.archive.org/web/20180313203411/https://community.letsencrypt.org/t/acme-v2-and-wildcard-certificate-support-is-live/55579?u=falcotesting
    the Ember router doesn’t know how to route the URL and renders the 404 template.

Any ideas @eviltrout ?

That is going to be tricky. Not sure how to fix it, other than having Discourse recognize the Wayback Machine URLs and strip the prefix?

Tip: also handle URLs where the date is suffixed with “if_” (used for in-page content, i.e. subresources such as iframes).
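
A rough sketch of what that stripping could look like, in case it helps. `stripWaybackPrefix` is a made-up name, not existing Discourse code, and I’m assuming the prefix always has the shape `web.archive.org/web/<digits>` with an optional two-letter modifier such as “if_” after the digits.

```javascript
// Assumed shape of a Wayback Machine replay prefix:
//   http(s)://web.archive.org/web/<timestamp>[<two-letter modifier>_]/
// e.g. .../web/20180313183529/ or .../web/20180313183529if_/
const WAYBACK_PREFIX = /^https?:\/\/web\.archive\.org\/web\/\d{1,14}(?:[a-z]{2}_)?\//;

// Returns the original URL with the replay prefix removed.
// URLs that do not start with the prefix are returned unchanged.
function stripWaybackPrefix(url) {
  return url.replace(WAYBACK_PREFIX, "");
}

// The archived topic URL from this thread maps back to the live one.
console.log(
  stripWaybackPrefix(
    "http://web.archive.org/web/20190202031942/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579"
  )
);
```

The router would then have to work from the stripped URL instead of `location.href`, which is the tricky part and is not covered here.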

I’ve tried to poke one of the IA employees I know and see if he knows who to poke at IA/how to poke them.

This looks OK to me now, or am I missing something? cc @falco @ibnesayeed @anarcat

http://web.archive.org/web/20190202031942/https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579

We’re getting the crawler version which seems to work fine and is a good outcome?

I don’t see the same result here, strangely:

“Oops! That page doesn’t exist or is private.”

I tried to save the page with the “save page now” feature, maybe that’s the trouble?

Do you mean some oddball browser plugin, or a web page where you paste in the URL?

Sort of, going to page two from there does not work. Try it.

I have a Javascript bookmarklet for the internet archive.

javascript:void(window.open('https://web.archive.org/save/'+location.href));

But I would assume the same would happen if you added the URL on the “Save Page Now” page.

I… don’t understand what you mean there, sorry. :slight_smile:

This is a source of giant confusion, I just hit it. I want this fixed and am happy to have Discourse chuck money at this important problem.

Wayback machine has this thing called: liveweb proxy. GitHub - internetarchive/liveweb: Liveweb proxy of the Wayback Machine project

This little Python thing has not been touched since 2013. What it does is attempt to offload “waybacking” to consumers.

If I head to wayback machine and plug in a URL it does not have I get:

So I head off and click that button and get:

This screenshot is a lie, cause I can see the page properly as anonymous if I hit Discourse.

What happens here technically is that they run a proxy in California that intercepts all the traffic from my browser to meta. This proxy uses the same user agent as the one I have, so we think this is coming from Firefox and give it a proper desktop view which is not desirable at all.

There are 3 possible solutions to this problem:

  • Get wayback to add a specific header to the liveweb proxy so we can detect that traffic is coming from it and switch to the crawler view. I don’t think it has one, cause I looked at the source and cannot see it.

  • We teach our Ember app/router to understand that the liveweb proxy is playing funny games with our web app and have it “allow” for it. This is a nightmare, as @eviltrout will attest, and not something I want us to do.

  • Convince archive.org to switch user agent to the same user agent they are using when they crawl us.

I think sorting this out is very important, cause each time people submit pages to wayback machine it is getting “rubbish” that it is considering adding to its index.

Do we have any friends at wayback we can talk to? Seems like a trivial fix on their end.

… actually… looks like we have a header to key off…

@maja can you fix this up? Treat stuff as a “crawler” if the Via header is set to web.archive.org.

( you get that page by trying to archive https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending?randomstring )

Then test once deployed on https://archive.org/web/ by adding a URL.
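
Roughly, the check I have in mind looks like the sketch below. It is not the actual Discourse change; `isWaybackRequest` is a made-up name, and I’m assuming it is enough to look for “web.archive.org” anywhere in the Via header value, since proxies usually put a protocol version in front of the host.

```javascript
// Returns true when the request headers carry a Via header that
// mentions web.archive.org. Header names are compared
// case-insensitively because HTTP header names are case-insensitive.
function isWaybackRequest(headers) {
  for (const [name, value] of Object.entries(headers || {})) {
    if (
      name.toLowerCase() === "via" &&
      String(value).toLowerCase().includes("web.archive.org")
    ) {
      return true;
    }
  }
  return false;
}

// Combined with the usual user-agent check, the decision would be:
// serve the crawler view if either check matches.
function shouldServeCrawlerView(headers, userAgentLooksLikeCrawler) {
  return userAgentLooksLikeCrawler || isWaybackRequest(headers);
}
```

This way a request with a normal Firefox user agent still gets the crawler view when it arrives through the liveweb proxy.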

Thanks to @maja we now detect wayback machine and thanks to @awesomerobot and @saurabhp we style it much nicer, so we have:

I feel this is done enough to close :blush: we can look at refining it further down the line.
