httrack does not work for me. Using:
httrack https://my-forums.org --user-agent "Googlebot"
httrack is quite promising, but long forum threads with multiple pages are incomplete. Once I click on “page 2”, it does not work. I.e.
file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (does not fetch from external resources), but file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.
Any suggestions?
Perhaps httrack can be told somehow to “use print mode”?
- example standard forum discussion view
- example print forum discussion view (same URL, just with /print appended at the end)
Perhaps httrack can be told to “append /print at the end”?
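To make the “append /print” idea concrete, here is a minimal sketch in Python of the URL rewrite I have in mind. The helper name `to_print_url` is hypothetical, the topic URL is the placeholder from above, and it assumes that the print view contains the whole thread, so any `?page=N` query can be dropped. Whether httrack can apply such a rewrite by itself is exactly what I don't know; this only shows the transformation.

```python
from urllib.parse import urlsplit, urlunsplit


def to_print_url(topic_url: str) -> str:
    """Return the topic URL with /print appended to its path.

    Hypothetical helper: assumes the forum serves a print view at
    <topic URL>/print, as in the example links above.
    """
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/print"):
        path += "/print"
    # Drop query and fragment: the print view is assumed to hold the
    # whole thread, so ?page=2 and similar are no longer needed.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Placeholder URLs; both pages of the thread map to one print URL.
urls = [
    "https://my-forums.org/t/forum-thread-title/83394658",
    "https://my-forums.org/t/forum-thread-title/83394658?page=2",
]
for url in sorted({to_print_url(u) for u in urls}):
    print(url)
```

A list of print URLs produced this way could then be handed to a crawler as start URLs instead of the paginated topic URLs.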
Is there a user agent setting which shows the whole forum thread on a single page? If not, could you please add this feature? You already implemented print mode, so most of the work is done. What’s left is a user agent that causes the content generated for “print mode” to be served to the crawler. Alternatively, if you don’t like the idea of a custom user agent for this purpose, what about an HTTP header or cookie that could be used instead?
ArchiveDiscourse, improved/forked by @kitsandkats, is also broken for me.
Could you please also consider implementing /print for front page / category pages?
Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3:
(Temporarily) disabling infinite scroll (for some user agents) would make it possible to archive Discourse with the httrack web archive tool.