# A basic Discourse archival tool

That’s very optimistic and assumes that the user or a proxy server didn’t change the user-agent. It has the potential to worsen the user experience for a lot of power users. I wouldn’t want to face the angry mob.

3 Likes

On the contrary, I think general-purpose web proxies would rather not change the user-agent, and if they do, they would prefer to change it to something that mimics a web browser, because they know some web applications reject requests coming from unknown user-agents. I have seen many sites that return a splash page or reject the connection when fetched with curl or a Python library unless a custom user-agent containing something like “Mozilla” is set.

Your concern is legitimate, but it can be addressed by log analysis of a busy site. For example, remove all the requests whose user-agent contains the name of one of the well-known browsers, crawlers, or other popular agents, then plot a histogram of the remaining user-agents to identify potential proxies. Those can then be added to the capable user-agents list. I still believe that capable user-agents are more limited in number and more deterministic than the alternative.
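As a rough sketch of that log-analysis idea (the well-known-agent list below is purely illustrative, not Discourse’s actual crawler list, and a real log pipeline would first parse the user-agent field out of each access-log line):

```python
# Sketch: drop hits from well-known agents, then count what's left to
# surface candidate proxy user-agents worth reviewing by hand.
from collections import Counter

# Illustrative substrings only -- not Discourse's real crawler list.
KNOWN_AGENTS = ('Chrome', 'Firefox', 'Safari', 'Googlebot', 'bingbot', 'curl')

def candidate_proxies(user_agents):
    """Histogram of user-agent strings that match no well-known agent."""
    unknown = [ua for ua in user_agents
               if not any(name in ua for name in KNOWN_AGENTS)]
    return Counter(unknown)
```

Sorting the resulting counter by count would surface the heavy hitters, which could then be vetted and added to the capable user-agents list.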

Alternatively, to deal with such false negatives, we could add a small piece of JS code to the static response that, if executed, would inform the server to serve the rich version. This would trigger only one redirect for the base page, and the rest of the experience would happen via Ajax/Fetch as usual.
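A minimal sketch of the shape of that handshake (all names here are invented for illustration; this is not Discourse’s actual code): the static page carries a tiny script that sets a cookie and reloads once, and the server keys off that cookie from then on. Crawlers never execute the script, so they keep receiving the static version.

```python
# Hypothetical upgrade handshake. The snippet is embedded in the static
# page; a JS-capable client runs it, sets a cookie, and reloads once.
UPGRADE_SNIPPET = (
    "<script>"
    "if (!document.cookie.includes('rich=1')) {"
    "  document.cookie = 'rich=1; path=/';"
    "  location.reload();"
    "}"
    "</script>"
)

def choose_version(cookies):
    """Server-side switch: serve 'rich' once the client proves it runs JS."""
    return "rich" if cookies.get("rich") == "1" else "static"
```

On the first request `choose_version` sees no cookie and the static page (plus snippet) goes out; every subsequent request from that browser gets the rich version with no further redirects.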

2 Likes

I submit that you should examine your user agent logs.

Absolutely not; you design for the world you want to see, not the hellish, post-apocalyptic wasteland you desperately hope it does not become.

3 Likes

I totally respect the attitude and philosophy of any given piece of software. However, in this case there was a problem in front of me, and I was simply proposing a practical solution that would be maintainable while being more accommodating to tools with different purposes.

For example, say “the world we want to see” is one where all web clients are very capable and smart, behaving as if a full-blown web browser were being operated by a real person. So we created a very rich experience optimized for that. Say we put no thought into our design for “the hellish, post-apocalyptic wasteland you desperately hope it does not become” and ignored it completely, just because we don’t want the world to be like that. What we would end up with is a shiny, functional, and fantastic piece of software whose great content won’t be discovered by search engines unless those engines start simulating real users at crawl time and performing every possible interaction (clicking/hovering over all possible targets, panning, zooming, endless scrolling, and whatnot) to load all possible content and build the interactivity tree. Then we step back and compromise with the “post-apocalyptic wasteland”, because we realize that our forward-thinking, world-changing idea might not go forward without the help of the existing wasteland.

I personally believe in being more inclusive as long as it is practical. Every piece of software has its own purpose. For instance, I would perhaps not like curl to start supporting interactive JS-rich HTML rendering, or wget to build its crawl frontier from the post-JS-execution representation. It is on us as software designers to decide whether we want these non-browser players to interact with our web services in any meaningful way. Beyond these two examples, there are many more use cases and purpose-built tools that won’t migrate to our shiny new world, because either they can’t or doing so would kill their purpose. Sometimes we decide to be inclusive, or not so inclusive, based on factors like the effort it would take or the value it would bring, and sometimes solely based on the attitude of the software in the making.

Here is a good relevant read that I only partially agree with:

2 Likes

@Falco (and others who’ve been discussing the user-agent)

Now that I know how to grab plain HTML, it’s not as hard as I thought to use a web crawler to generate a workable archive of a Discourse site. Using httrack, you can do something like so:

httrack yoursite -M1000000 -E60 --user-agent "Googlebot"

That command will spend up to a minute (-E60) archiving up to 1 MB (-M1000000) of your site. I applied it to this meta and generated this result. There are some quirks, but the overall result looks pretty good.

A couple of questions:

• Note the user-agent param. That, of course, is specifically set so that Discourse will respond with vanilla HTML. Is this acceptable behavior, though? Masquerading as Google doesn’t seem quite right.
• Perhaps it would be a good idea to add HTTrack to the list of detectable crawlers? I suggest this, in part, because httrack is the recommended archival tool presented here.
8 Likes

Yes, HTTrack is the first tool that comes up when you search for a crawler, so please send a PR adding its default user agent to the list.

5 Likes

For reference, on Ubuntu 16.04, the default user-agent for httrack is:

User-Agent: Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

@Falco Pull request has been created - I appreciate the invitation to do so.

4 Likes

sadly, the code is gone and the first post is basically useless now!
all the single live comparisons are dead, and nobody has been inspired by the idea yet. :’(

quite ironic to find so many broken links in a “how i’ve archived my online data” post.

in any case, perhaps @mcmcclur didn’t care about this because httrack is doing a good enough job with the “basic HTML escaped fragment” trick, which is already the default behavior after his PR, so simply installing it and running:

httrack yoursite

was enough for my case.

it still doesn’t properly render youtube oneboxes or the spoiler plugin. user profiles are overly reduced, and category pages are bugging out in my case: they never get to the next page, or they don’t properly group the posts that are there (no idea why). but it’s as simple as it gets and, apparently, all of the most relevant content (the posts) is there.

1 Like

If there are ways to make the crawler views better definitely propose them! Since this improves it for Googlebot (and httrack) as well.

i don’t know if enabling the spoiler plugin or “fixing” the youtube onebox would make things better for bots, but it’d sure make a better printable/escaped version, as these miss pretty relevant content from the original:

[screenshots: the spoiler and the escaped versions compared]

here’s a quick, dirty, and hackish fix for them, if you’re in a rush and lazy like me:

```javascript
;(function (discoUrsa, undefined) { // jquery-ish namespace
  // inject the spoiler styles the static page is missing
  var style = document.createElement('style')
  style.type = 'text/css'
  style.innerHTML = `
    .spoiler.spoiled {background-color: rgba(0, 0, 0, 0); color: rgba(0, 0, 0, 0); text-shadow: gray 0px 0px 10px; user-select: none; cursor: pointer;}
    .spoiled.half-spoiled {text-shadow: gray 0px 0px 5px;}
    .spoiler {color: gray; cursor: pointer;}
  `
  document.head.appendChild(style)

  function fixSpoiler () {
    // restore click-to-reveal behavior on spoiler spans
    for (const s of document.querySelectorAll('.spoiler')) {
      s.onclick = function () { this.classList.toggle('spoiled') }
      s.onmouseenter = function () { this.classList.add('half-spoiled') }
      s.onmouseleave = function () { this.classList.remove('half-spoiled') }
    }
  }

  function fixOnebox () {
    // swap each lazy youtube onebox for a real embedded iframe
    for (const o of document.querySelectorAll('.lazyYT')) {
      o.innerHTML = `<iframe width="${o.getAttribute('data-width')}" height="${o.getAttribute('data-height')}" src="https://www.youtube.com/embed/${o.getAttribute('data-youtube-id')}?${o.getAttribute('data-parameters')}" frameborder="0" allowfullscreen></iframe>`
    }
  }

  fixSpoiler()
  fixOnebox()
}(window.discoUrsa = window.discoUrsa || {}))
```
1 Like

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this Discourse Meta is now here. I even put it up on GitHub, so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought, but I still strongly prefer my version for two main reasons:

• My code explicitly supports MathJax, which is essential for my work. (I’ll probably need to update my code to work with the new MathPlugin sometime.)
• I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way the httrack output points to user links, even when they weren’t downloaded.
9 Likes

No problem! I made the first post wiki!

3 Likes

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our previous two forums ran (I think) phpBB2 or something like that. I have resolved to archive them using Discourse, like so:

1. I import the phpBB2 database into Discourse (there’s a migration tool).
2. I create a static HTML archive using Discourse.
3. I put the static HTML archive up for public use (preferably in the same place where our dynamic Discourse forum lives).

According to the first message:

There are no user pages or category pages

Could it somehow be extended so that creating category views would also be possible?

Also, any help on how to use the Jupyter notebook thing? This is the first time I’ve heard of it…

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

I’m still in phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpBB conversion, I’ll get back to this. It feels very, very hard; I’ve been trying to install phpBB3 for a while now, but I keep running into weird problems.

I’ll have to try that httrack, thanks.

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

• I definitely like my version better; no surprise there because I designed it the way I want it to look.
• The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
• I think it might make sense to start httrack at a subpage to generate something like this.
• It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
• My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

### The httrack command

The httrack version was generated with a command using the following options:

• The -x -o combo replaces both external links and errors with a local file indicating the error, so, for example, we don’t link to user profiles on the original site that weren’t downloaded locally.
• The -M10000000 flag restricts the total download to about 10 MB. There appears to be some post-processing and downloading of supplemental files that makes the total somewhat larger than this anyway.
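The command line itself didn’t survive in this copy of the thread. Judging purely from the flags described in the bullets above, it was presumably something along these lines (the URL and the flag order are guesses):

```shell
httrack https://uskojarukous.fi/ -x -o -M10000000
```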

### The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook, but the exact same code could be run as a Python script with the appropriate libraries installed. Of course, you need to tell it which forum you want to download. The first few lines of my first input cell look like so:

```python
base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
    date.today().strftime("%A %B %d, %Y") + '.'
```

Later, in input 6, I define max_more_topics = 2. Essentially, that bounds k in this snippet:

'/latest.json?page=k'

But again, some changes to the code will be needed to get it to work for non-mathematical forums.
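Under that reading (the function and variable names below are mine, not the notebook’s), the topic-list pages the tool walks are simply:

```python
def latest_json_urls(base_url, max_more_topics):
    """URLs of the topic-list pages to fetch: /latest.json?page=k
    for k = 0 .. max_more_topics (mirroring the notebook's bound)."""
    base = base_url.rstrip('/')
    return [f"{base}/latest.json?page={k}" for k in range(max_more_topics + 1)]
```

With base_url = 'https://uskojarukous.fi/' and max_more_topics = 2, this yields pages 0, 1, and 2 of the site’s topic list; each page of JSON names a batch of topics whose individual .json endpoints can then be fetched.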

4 Likes

Very cool, thank you for all the clarifications. Just a quick note: it seems that your tool can’t handle sub-categories (which is why many of the messages appear to have no category).

3 Likes

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.