A basic Discourse archival tool

httrack does not work for me. Using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is quite promising, but long forum thread with multiple pages are incomplete. Once I click on “page 2” it does not work. I.e.

  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (does not fetch from external resources), but
  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Perhaps httrack can be told somehow to “use print mode”?

Perhaps httrack can be told to “append /print at the end”?

Is there a user agent setting which shows the whole forum thread on a single page? If not, could you please add this feature? You already implemented print mode. Most is already implemented. What’s left is a user agent to which results in providing contents generated for “print mode” to the crawler? Alternatively, if you don’t like the idea of a custom user agent for this purpose, what about a http header or cookie that could be used for this purpose?

ArchiveDiscourse improved/forked by by @kitsandkats is also broken for me.

Could you please consider also implementing /print also for front page / category pages?

Quote myself in https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3

(Temporarily) disabling infinite scroll (for some user agents) would make it possible to archive discourse with the htttrack web archive tool.

1 Like