Unable to archive Discourse with the Internet Archive "Save Page Now" button


(Evert Meulie) #1

Continuing the discussion from Unable to archive Discourse pages with the Internet Archive…?:

Sorry I haven’t been able to reply earlier, but this problem is not resolved yet. I just tested with the same URL as before and ended up on a 404 page…


#2

Tried with http://archive.is / http://archive.today and it works just fine: https://archive.today/oiHqT

But yup, can confirm it’s broken on archive.org.


(Jeff Atwood) #3

Hmm, not sure why. Here’s what our robots detection regex looks like:

def self.crawler?(user_agent)
  !/Googlebot|Mediapartners|AdsBot|curl|Twitterbot|facebookexternalhit|bingbot|Baiduspider|ia_archiver/.match(user_agent).nil?
end

According to this page, the correct Internet Archive user agent is

User-agent: ia_archiver

However, when I index something like whatsmyuseragent.com with the “Save Page Now” button on Internet Archive: Wayback Machine

… I get back

User-agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 (via Wayback Save Page)

It looks like this Save Page Now button is somehow driving your personal browser to issue the save request which results in oddities.

So … uh, I guess? I’ll just add Wayback Save Page to the user agent detector to make that a crawler, too.
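
For anyone following along, here is a rough sketch of what that change could look like. It is not the actual Discourse commit: the module name CrawlerDetection and the constant CRAWLER_PATTERN are made up for illustration, and the to_s guard for a missing User-Agent header is an added assumption. The pattern itself is the one quoted above with "Wayback Save Page" appended.

```ruby
# Hypothetical module name; the thread only shows the method itself.
module CrawlerDetection
  # Pattern quoted earlier in this thread, with the "Wayback Save Page"
  # marker appended as one more alternative.
  CRAWLER_PATTERN = /Googlebot|Mediapartners|AdsBot|curl|Twitterbot|facebookexternalhit|bingbot|Baiduspider|ia_archiver|Wayback Save Page/

  def self.crawler?(user_agent)
    # to_s guards against a missing (nil) User-Agent header (an assumption,
    # not something discussed in the thread).
    !CRAWLER_PATTERN.match(user_agent.to_s).nil?
  end
end

# The User-Agent reported by whatsmyuseragent.com via "Save Page Now".
save_page_ua = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 " \
               "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 (via Wayback Save Page)"

puts CrawlerDetection.crawler?(save_page_ua)   # true
puts CrawlerDetection.crawler?("ia_archiver")  # true
```

The same Chrome string without the "(via Wayback Save Page)" suffix is still treated as a regular browser, so ordinary visitors should be unaffected.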


(Jeff Atwood) #4

Ok there ya go, for what it’s worth:

This isn’t the “real” Internet Archive archiver, it’s somehow harnessing your local browser to do a loopback request which bypasses the correct User-Agent we expect from the Internet Archive…


(Jeff Atwood) #5