Make Discourse play nice with the Wayback Machine

Many headless-browser-based archiving efforts exist, including http://archive.is/, an on-demand single-page archiving system. It renders the page using PhantomJS and then archives the rendered DOM plus the necessary assets. However, doing this at massive scale (not just for on-demand pages) takes a lot of time, because PhantomJS, like any other renderer, is orders of magnitude slower than traditional vanilla crawlers such as Heritrix, which is used by the Internet Archive and many other web archives.

Here is a relevant research paper on the topic: "Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations". Below is a blog post summarizing the research and related resources.
