A basic Discourse archival tool

It seems that it’s pretty tricky to save an entire discourse site to a static version. According to this post by Jeff Atwood, it’s “much harder than you’d think”. It doesn’t appear that this is a priority for the Discourse team, either, which is perfectly understandable.

For my purposes, though, I found that I really needed some way to generate basic, static HTML versions of my Discourse sites. I’ve been using Discourse for a couple of years now as a discussion board when teaching my college math classes so, every few months, I retire one or two sites and start one or two more. Obviously, the discussions on the retiring sites have value so I really needed some way to save them. Ultimately, I figured I’d build my own tool.

The basic idea is simple: Rather than scan the HTML and use the HTTP protocol to crawl the site, I figured I’d use the Discourse API to crawl the site. You can view the result of applying the tool to this Discourse Meta on my webpage.

Before looking at it though, please temper your expectations. I’m a college math professor, not a professional web developer. And, while I’d like it to look pretty nice, I’m mainly interested in simplicity. My guess is that most folks here would consider this to be proof of concept, rather than a serious, working tool. Taking that into account, here are some features/limitations:

  • The code grabs the site logo and places it in a fixed banner at the top. If no site logo is found, it uses the site logo at the top of meta by default.
  • It uses the API to grab the topic list and generates a new page for each topic. You can limit the number of times you respond to more_topics_url.
  • There is a single main page that links to those topics.
  • MathJax is important for my needs so every page loads and configures MathJax.
  • There is no other JavaScript and no other plugins are considered.
  • There are no user pages or category pages.
  • It’s not very configurable without messing with the code directly.
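To make the "single main page that links to those topics" idea concrete, here is a minimal sketch, not the tool's actual code. It assumes each topic is a dict with `id` and `title` keys (the shape I would expect from the topic list) and that topic pages are saved under a hypothetical `t/<id>.html` layout; `render_index` is an illustrative name.

```python
from html import escape


def render_index(site_title, topics):
    """Return a static main page that links to one local file per topic.

    Assumptions (not from the original tool): each topic is a dict with
    'id' and 'title' keys, and topic pages live at 't/<id>.html'.
    """
    links = []
    for topic in topics:
        href = "t/{}.html".format(topic["id"])  # hypothetical file layout
        links.append(
            '<li><a href="{}">{}</a></li>'.format(href, escape(topic["title"]))
        )
    title = escape(site_title)
    return "\n".join([
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"><title>{}</title></head>'.format(title),
        "<body>",
        "<h1>{}</h1>".format(title),
        "<ul>",
        *links,
        "</ul>",
        "</body></html>",
    ])
```

Escaping titles matters here because topic titles are user-supplied text that ends up inside static HTML.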

In spite of all the limitations, it’s sufficient for my needs and I’m rather happy with it. I have no particular plans to expand it, other than incrementally as needed. If anyone is interested, the code (which is Python) is available here:

Perhaps, someone will push it further or just be inspired by the idea?

42 Likes

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

24 Likes

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.
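For anyone who would rather script this than use wget, here is a minimal sketch of requesting a page with a crawler user agent from Python's standard library. The `Googlebot` string is just one crawler user agent that appears in this thread; the function names are illustrative, and whether a given site serves the basic HTML version to it is an assumption to check.

```python
from urllib.request import Request, urlopen


def crawler_request(url, user_agent="Googlebot"):
    """Build a request that identifies itself as a crawler."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_basic_html(url, user_agent="Googlebot"):
    """Download the page body as text (performs a real network request)."""
    with urlopen(crawler_request(url, user_agent), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_basic_html` on a topic URL should then return the crawler-oriented HTML, which could be written straight to disk.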

16 Likes

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

  • My code explicitly supports MathJax, which is essential for my work.
    (I’ll probably need to update my code to work with the new MathPlugin sometime)
  • I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if not downloaded.

9 Likes

I’m hosting a forum that, in its third iteration, currently runs Discourse. Our last two forums ran (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

  1. I scan the phpbb2 database into Discourse (there’s a migration tool)
  2. I create a static HTML archive using Discourse.
  3. I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

I’m still in the phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. Been trying to install phpbb3 for a while now, but I get some weird problems all the time. :frowning:

I’ll have to try that httrack, thanks.

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

  • I definitely like my version better; no surprise there because I designed it the way I want it to look.
  • The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
  • I think it might make sense to start httrack at a subpage to generate something like this.
  • It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
  • My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.
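One way the "detect if MathJax is necessary" idea could look is a simple scan of the post text for math delimiters. This is only a sketch: the delimiter set (`$...$`, `$$...$$`, `\(...\)`, `\[...\]`) is my assumption about a typical MathJax configuration, and `needs_mathjax` is a hypothetical helper, not part of the tool.

```python
import re

# Assumed delimiters; adjust to match the forum's actual MathJax configuration.
MATH_PATTERN = re.compile(
    r"\$\$.+?\$\$"                          # display math: $$ ... $$
    r"|\\\(.+?\\\)"                         # inline math: \( ... \)
    r"|\\\[.+?\\\]"                         # display math: \[ ... \]
    r"|\$(?=\S)[^$\n]+?(?<=\S)\$(?!\d)",    # inline math: $ ... $
    re.DOTALL,
)


def needs_mathjax(text):
    """Return True if the text appears to contain math delimiters."""
    return MATH_PATTERN.search(text) is not None
```

The single-dollar rule requires non-space characters just inside both dollar signs, so ordinary prices such as "$5 and $10" are not mistaken for math. It is a heuristic and will still misjudge some posts.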

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"
  • The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
  • The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
  • The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
  • The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.
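If you want to run that command from a script (say, for several forums), a minimal sketch is to build the argument list in Python. It uses only the flags shown above; `build_httrack_command` and `run_httrack` are hypothetical names, and it assumes `httrack` is on your PATH.

```python
import subprocess


def build_httrack_command(site_url, output_dir, max_bytes=10_000_000,
                          user_agent="Googlebot"):
    """Assemble the httrack argument list described in the bullets above."""
    site = site_url.rstrip("/")
    return [
        "httrack", site,
        "-{}/users*".format(site),   # skip user pages
        "-*.rss",                    # skip RSS feeds
        "-O", output_dir,            # output directory
        "-x", "-o",                  # replace external links and errors
        "-M{}".format(max_bytes),    # cap on total bytes downloaded
        "--user-agent", user_agent,
    ]


def run_httrack(command):
    """Run httrack; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(command, check=True)
```

Passing a list rather than a single string means the shell never sees the `*` wildcards, so they reach httrack unexpanded.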

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The few lines of my first input look like so:

import os
from datetime import date

base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'
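As a rough illustration of how such a bound on k might be applied (a sketch, not the notebook's actual code), the loop below requests pages 0 through `max_more_topics` and stops early when no further page is advertised. The JSON shape (`topic_list.topics` and `topic_list.more_topics_url`) is my assumption about the API response, and `collect_topics` and `fetch_json` are hypothetical names.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the body as JSON (performs a network request)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_topics(base_url, max_more_topics=2, fetch=fetch_json):
    """Collect topics from /latest.json?page=k for k = 0..max_more_topics."""
    base = base_url.rstrip("/")
    topics = []
    for k in range(max_more_topics + 1):
        data = fetch("{}/latest.json?page={}".format(base, k))
        topic_list = data.get("topic_list", {})      # assumed response shape
        topics.extend(topic_list.get("topics", []))
        if not topic_list.get("more_topics_url"):    # no further pages
            break
    return topics
```

The `fetch` parameter is there so the loop can be exercised with canned data before pointing it at a live forum.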

But again, there should be some changes made to the code to get it to work for non-mathematical forums.

5 Likes

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

3 Likes

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format! :slight_smile:

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive… :slight_smile:

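Here are a few tips for anyone else looking for httrack pointers (it fits my needs very well).

  • Full list of command-line flags: HTTrack Website Copier - Offline Browser
  • Use the -s0 flag to ignore robots.txt (if you have an account that crawlers cannot access)
  • If your site requires login, you can use a Chrome extension such as cookies.txt to download the cookies of a logged-in session as a .txt file, and place it in the directory where you run httrack.

I use httrack via cron to create an offline archive of our Discourse site. However, the user that httrack logs in as is recorded as a “view” on every topic, which artificially inflates each topic’s view count (the cron job runs hourly).

Is there a way to exclude a specific user from the stats/view counts so that it is not recorded in the site-wide statistics?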

6 Likes

Good point, where would this be intercepted @sam?

1 Like

We have this method for tracking page views:

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.
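To illustrate the shape of such a daily cleanup (strictly a sketch, not something to run against a production database as-is), here is the idea using an in-memory SQLite stand-in. The table and column names (`topic_views` with `topic_id` and `user_id`, and `topics` with `id` and `views`) are assumptions for illustration, as is the `purge_user_views` name; a real Discourse install runs PostgreSQL, where `GREATEST` would replace SQLite's two-argument `MAX`.

```python
import sqlite3


def purge_user_views(conn, user_id):
    """Remove one user's recorded views and lower topic view counts to match.

    Table and column names are assumed for illustration only.
    """
    with conn:  # run both statements in a single transaction
        conn.execute(
            """
            UPDATE topics
               SET views = MAX(views - (
                       SELECT COUNT(*) FROM topic_views
                        WHERE topic_views.topic_id = topics.id
                          AND topic_views.user_id = ?), 0)
             WHERE id IN (SELECT topic_id FROM topic_views WHERE user_id = ?)
            """,
            (user_id, user_id),
        )
        cursor = conn.execute(
            "DELETE FROM topic_views WHERE user_id = ?", (user_id,)
        )
    return cursor.rowcount  # number of view rows removed
```

The counts are adjusted before the rows are deleted, since the deleted rows are what tell us how much to subtract from each topic.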

4 Likes

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing :slight_smile:

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: GitHub - kitsandkats/ArchiveDiscourse: Code for archiving a Discourse site into static HTML., forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.
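For readers curious what "grab all posts, not just the first twenty" can involve, here is a hedged sketch; it is not the code from the fork. It assumes the topic JSON at `/t/<id>.json` carries a `post_stream` with the first batch in `posts` and every post id in `stream`, and that the rest can be requested from `/t/<id>/posts.json?post_ids[]=...`. Those endpoint shapes are my assumption about the API, and `fetch_all_posts` and `fetch` are hypothetical names.

```python
def fetch_all_posts(base_url, topic_id, fetch, chunk_size=20):
    """Return every post of a topic, in stream order.

    `fetch` is any callable that maps a URL to parsed JSON. The endpoint
    and field names below are assumptions about the Discourse API.
    """
    base = base_url.rstrip("/")
    topic = fetch("{}/t/{}.json".format(base, topic_id))
    stream = topic["post_stream"]
    posts = list(stream["posts"])                 # first batch only
    all_ids = stream.get("stream", [])            # ids of every post
    have = {post["id"] for post in posts}
    missing = [pid for pid in all_ids if pid not in have]

    # Request the remaining posts in chunks to keep URLs short.
    for start in range(0, len(missing), chunk_size):
        chunk = missing[start:start + chunk_size]
        query = "&".join("post_ids[]={}".format(pid) for pid in chunk)
        extra = fetch("{}/t/{}/posts.json?{}".format(base, topic_id, query))
        posts.extend(extra["post_stream"]["posts"])

    order = {pid: n for n, pid in enumerate(all_ids)}
    posts.sort(key=lambda post: order.get(post["id"], len(order)))
    return posts
```

Passing `fetch` in keeps the function independent of any particular HTTP library, so it can be tried out with canned data first.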

I’m very happy with how it turned out. Thanks again!

11 Likes

Hi, I just read through this whole thread and wanted to check whether this tool works if the Discourse forum is behind a login and password. How would I edit the code so that it lets me archive the site?

2 Likes

As it is currently written, the code is not designed to access any material that requires a login. It should be pretty easy to set that up, though. The code interacts with the Discourse site via the Python Requests library which does offer authentication. It’s feasible that adding an auth=('user', 'pass') to the code at the appropriate points is all that’s required. I’m not currently running a Discourse site so I can’t test that at the moment.
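A minimal sketch of that idea, untested against a real Discourse site: rather than adding `auth=` to each call, a Requests session can carry the credentials for every request it makes. The helper names are hypothetical, and whether a given forum accepts HTTP Basic credentials this way is an assumption.

```python
import requests


def make_session(user, password):
    """Create a session that sends HTTP Basic credentials on every request."""
    session = requests.Session()
    session.auth = (user, password)
    return session


def prepared_get(session, url):
    """Prepare (but do not send) a GET so its headers can be inspected."""
    return session.prepare_request(requests.Request("GET", url))
```

With a session in place, calls such as `session.get(url)` would replace the plain `requests.get(url)` calls, and `prepared_get` lets you confirm the `Authorization` header is present without touching the network.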

7 Likes

httrack does not work properly for me. I am using:

httrack https://my-forums.org --user-agent "Googlebot"

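httrack is very promising, but long forum threads that span multiple pages are not captured completely. As soon as you click “page 2”, it stops working. For example:

  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks perfect (it fetches no external resources), but
  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 does not work.

Any suggestions?

Is it possible to make httrack somehow “use print mode”?

Is it possible to make httrack “append /print at the end”?

Is there some user agent setting that makes an entire forum thread show on a single page? If not, could this be added? You have already implemented print mode, so most of the work is done. What remains is a user agent for crawlers that receives the content generated by “print mode”. Or, if you don’t like using a custom user agent for this purpose, could an HTTP header or cookie be considered instead?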


@kitsandkats’s improved/forked ArchiveDiscourse does not work for me either.


Could you consider implementing /print for the front page and category pages as well?


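Quoting myself from https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3:

(Temporarily) disabling infinite scrolling (for certain user agents) would make it possible to archive Discourse with the httrack web archiving tool.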

1 Like

If authentication is required, Python requests automatically uses the .netrc file (but it needs to receive a 401 HTTP response).

4 Likes

I got very good results using wget with authentication. Details here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14

3 Likes