Saving an entire discussion site as a static copy appears to be quite complicated. According to this post by Jeff Atwood, it is "harder than you'd think". It doesn't seem to be a priority for the Discourse team either, which is entirely understandable.
For my own purposes, though, I found that I badly needed a way to generate simple, static HTML versions of my Discourse sites. I've used Discourse for a few years as a discussion platform when teaching college math classes so, every few months, I retire a site or two and start a couple more. Clearly, the discussions on the retired sites have value, so I really needed a way to preserve them. In the end, I decided to build my own tool.
The basic idea is simple: use the Discourse API to explore the site, fetch the rendered ("cooked") version of each post, and turn it into HTML. The tool is heavily focused on my own needs as a college math professor who uses small Discourse forums to support my math classes. Thus, mathematical content such as f(x)=e^{-x^2} should be typeset automatically with MathJax V4, and code blocks fenced with sage tags are translated into live Sage cells.
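The "fetch the cooked posts and turn them into HTML" step can be illustrated with a minimal sketch. This is not the tool's actual code: `render_topic` is a hypothetical helper, the sample dictionary is hand-written rather than real forum data, and the field names (`title`, `post_stream`, `posts`, `username`, `cooked`) are assumed to match what a topic's `.json` endpoint returns.

```python
from html import escape


def render_topic(topic: dict) -> str:
    """Build a minimal static HTML page from a topic JSON dictionary."""
    title = escape(topic["title"])
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
    ]
    for post in topic["post_stream"]["posts"]:
        author = escape(post["username"])
        # "cooked" is already server-rendered HTML, so it is inserted as-is.
        parts.append(f"<div class='post'><h2>{author}</h2>{post['cooked']}</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Hand-written sample in the assumed shape of a topic JSON response.
sample = {
    "title": "Gaussian <integrals>",
    "post_stream": {
        "posts": [
            {"username": "alice", "cooked": "<p>Consider $f(x)=e^{-x^2}$.</p>"},
            {"username": "bob", "cooked": "<p>Its integral is $\\sqrt{\\pi}$.</p>"},
        ]
    },
}
page = render_topic(sample)
```

The title and usernames are escaped because they are plain text, while the cooked body is trusted as HTML; a real archiver would probably also rewrite links and add a stylesheet.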
We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.
With the meta topic, others can follow along and edit / contribute as needed.
Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.
I’d like to edit the original post, but I seem to be past the edit window.
Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:
My code explicitly supports MathJax, which is essential for my work.
(I’ll probably need to update my code to work with the new MathPlugin sometime)
I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if they were not downloaded.
I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums ran (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:
I scan the phpbb2 database into Discourse (there’s a migration tool)
I create a static HTML archive using Discourse.
I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).
According to the first message
There are no user pages or category pages
Could it be somehow advanced so that creating category views would be also possible?
Also, any help on how to use the Jupyter notebook thing? First time I hear of this…
I’m still in phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. I’ve been trying to install phpbb3 for a while now, but I keep running into weird problems.
@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:
I definitely like my version better; no surprise there because I designed it the way I want it to look.
The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
I think it might make sense to start httrack at a subpage to generate something like this.
It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.
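One way the "detect if MathJax is necessary" idea might be approached is a simple scan of the cooked posts for TeX delimiters. The sketch below is only a heuristic under stated assumptions: `needs_mathjax` is a hypothetical name, and the delimiter set (`$...$`, `$$...$$`, `\(...\)`, `\[...\]`) is a guess at what a mathematical forum would use, not something taken from the tool.

```python
import re

# Assumed TeX delimiters: $$...$$, \[...\], \(...\) and single $...$.
# The single-dollar branch requires non-space characters just inside the
# delimiters, which avoids treating prices like "$5 and $10" as math.
MATH_PATTERN = re.compile(
    r"\$\$.+?\$\$"
    r"|\\\[.+?\\\]"
    r"|\\\(.+?\\\)"
    r"|(?<![\\$])\$(?!\s)[^$\n]+?(?<!\s)\$(?!\d)",
    re.DOTALL,
)


def needs_mathjax(cooked_posts):
    """Return True if any cooked post appears to contain TeX markup."""
    return any(MATH_PATTERN.search(cooked) for cooked in cooked_posts)
```

A page template could then include the MathJax script tag only when `needs_mathjax` returns True for the posts on that page. Being a heuristic, it can still misfire on unusual text, so erring on the side of including MathJax may be the safer default.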
The httrack command
The httrack version was generated with a command that looks like so:
The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.
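The exact command was not preserved in this thread, so the following is only a sketch of how the flags described above might be assembled, not the command that was actually run. The `-O` output option, the output directory name, and the helper `build_httrack_command` are assumptions added for illustration.

```python
import shlex


def build_httrack_command(base_url, out_dir, max_bytes=10_000_000, user_agent=None):
    """Assemble an httrack argument list from the flags discussed above."""
    cmd = [
        "httrack", base_url,
        "-O", out_dir,                          # assumed: where the mirror is written
        f"-{base_url.rstrip('/')}/users*",      # filter: skip user pages
        "-*.rss",                               # filter: skip RSS feeds
        "-x", "-o",                             # replace external links and errors with local pages
        f"-M{max_bytes}",                       # cap on the total amount downloaded
    ]
    if user_agent:
        # Reportedly unnecessary on recent Discourse versions.
        cmd += ["--user-agent", user_agent]
    return cmd


cmd = build_httrack_command(
    "https://uskojarukous.fi/", "./uskojarukous-httrack", user_agent="Googlebot"
)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

Building the command as a list keeps the wildcard filters from being expanded by the shell if it is later passed to `subprocess.run`.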
The archival tool code
For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The few lines of my first input look like so:
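import os
from datetime import date

# Forum to archive and the local directory that will hold the static pages
base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
    date.today().strftime("%A %B %d, %Y") + '.'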
Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:
'/latest.json?page=k'
But again, there should be some changes made to the code to get it to work for non-mathematical forums.
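To make the role of max_more_topics concrete, here is a small sketch of how a bounded walk over the `/latest.json?page=k` pages might look. It assumes k runs from 0 up to and including max_more_topics, which may not match the tool's exact bound, and it assumes each page carries its topics under `topic_list` → `topics`. The helper names and the sample page are hypothetical.

```python
from urllib.parse import urljoin


def latest_page_urls(base_url, max_more_topics):
    """URLs of the 'latest' listing pages for k = 0 .. max_more_topics."""
    return [
        urljoin(base_url, f"/latest.json?page={k}")
        for k in range(max_more_topics + 1)
    ]


def topic_ids(page):
    """Extract (id, slug) pairs from one decoded latest.json page."""
    return [(t["id"], t["slug"]) for t in page["topic_list"]["topics"]]


urls = latest_page_urls("https://uskojarukous.fi/", 2)

# Hand-written sample in the assumed shape of a latest.json response.
sample_page = {"topic_list": {"topics": [{"id": 12, "slug": "welcome"},
                                         {"id": 15, "slug": "rules"}]}}
ids = topic_ids(sample_page)
```

Each URL would be fetched in turn, and a real crawler would likely stop early once a page comes back with no topics.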
Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).
@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format!
So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome. Then again, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive…
Using the -s0 flag ignores the robots.txt (if you have a non-spider-able account)
If your site is behind a login, you can download a .txt file of the cookie (once logged in) using a chrome extension like cookies.txt and place that in the directory you’re running httrack from.
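The same exported cookie file could, in principle, be reused from Python as well. The sketch below assumes the export is in the Netscape cookies.txt format and reads it with the standard library; the cookie name, domain, and value are made-up placeholders, and `load_cookies` is a hypothetical helper.

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar


def load_cookies(path):
    """Read a Netscape-format cookies.txt file into a name -> value dict."""
    jar = MozillaCookieJar(path)
    # Keep session cookies and ignore expiry so the export is read as-is.
    jar.load(ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in jar}


# A fake export with one placeholder cookie (tab-separated fields:
# domain, subdomain flag, path, secure flag, expiry, name, value).
sample = (
    "# Netscape HTTP Cookie File\n"
    ".forum.example.org\tTRUE\t/\tTRUE\t4102444800\t_t\tnot-a-real-token\n"
)
with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookies.txt")
    with open(cookie_path, "w") as handle:
        handle.write(sample)
    cookies = load_cookies(cookie_path)
```

The resulting dictionary could be passed as the `cookies=` argument of a Requests call. Such a file grants access to the logged-in account, so it should be kept out of any published archive.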
I’m using httrack via cron to create an offline archive of our Discourse site. However, the user that is logging in under httrack gets marked as a “view” for each topic, giving super-inflated numbers of views for each topic (the cron runs every hour).
Is there a way to exclude a certain user from being recorded in the statistics / view stats for the site as a whole?
We have additional methods for tracking user visits which would be even harder to override.
We only store one page view per day per user, but I get that it can add up.
Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.
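The "daily query" idea can be sketched against a toy database. The tables below are simplified stand-ins created in SQLite for illustration only; the real Discourse database is PostgreSQL and its schema may differ, and the sketch assumes each stored view row added exactly one to the topic's counter.

```python
import sqlite3


def purge_user_views(conn, user_id):
    """Delete one user's recorded views and lower each topic's counter to match."""
    with conn:  # both statements run in a single transaction
        conn.execute(
            """
            UPDATE topics
               SET views = MAX(0, views - (
                       SELECT COUNT(*) FROM topic_views
                        WHERE topic_views.topic_id = topics.id
                          AND topic_views.user_id = ?))
             WHERE id IN (SELECT topic_id FROM topic_views WHERE user_id = ?)
            """,
            (user_id, user_id),
        )
        conn.execute("DELETE FROM topic_views WHERE user_id = ?", (user_id,))


# Toy schema and data (hypothetical, not the real Discourse tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE topics (id INTEGER PRIMARY KEY, views INTEGER NOT NULL);
    CREATE TABLE topic_views (topic_id INTEGER, user_id INTEGER, viewed_at TEXT);
    INSERT INTO topics VALUES (1, 10), (2, 5);
    INSERT INTO topic_views VALUES
        (1, 99, '2024-01-01'), (1, 99, '2024-01-02'), (1, 7, '2024-01-01'),
        (2, 7, '2024-01-01');
    """
)
purge_user_views(conn, user_id=99)  # 99 plays the role of the crawler account
rows = conn.execute("SELECT id, views FROM topics ORDER BY id").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM topic_views").fetchone()[0]
```

The counter is adjusted before the rows are deleted, since the subquery needs them to know how much to subtract. As noted above, other visit-tracking tables would not be touched by a query like this.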
Hi, I’ve read through this whole topic and wanted to check whether this tool works if the Discourse forum is behind a login and password. How can I modify the code to let me archive the site?
As currently written, the code is not designed to access any material that requires a login. It should be relatively easy to set that up, though. The code interacts with the Discourse site via the Python Requests library, which supports authentication. It’s possible that adding auth=('user', 'pass') to the code at the appropriate points is all that’s needed. I’m not currently running a Discourse site, so I can’t test that at the moment.
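For reference, auth=('user', 'pass') in Requests means HTTP Basic authentication, which a Discourse site may not accept. The sketch below takes a different route and assumes the site's API accepts an admin-generated key sent in `Api-Key` and `Api-Username` headers. It is untested against a live forum; the key, username, URL, and helper names are placeholders, and the standard library is used so the snippet runs without extra packages.

```python
from urllib.request import Request


def api_headers(api_key, api_username):
    """Headers assumed to authenticate a request against the Discourse API."""
    return {"Api-Key": api_key, "Api-Username": api_username}


def build_request(base_url, path, headers):
    """Prepare (but do not send) an authenticated request for one API path."""
    return Request(base_url.rstrip("/") + path, headers=headers)


headers = api_headers("PLACEHOLDER_KEY", "archiver")
req = build_request("https://forum.example.org/", "/latest.json", headers)
```

With Requests, the same dictionary could be passed as `headers=` to each `requests.get` call, or set once on a `requests.Session` so every fetch in the tool picks it up.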
Maybe httrack can be told to “add /print at the end”?
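Whether httrack itself can rewrite URLs this way is not established here. One alternative would be to compute the print URLs beforehand and hand that list to the crawler. A minimal sketch, assuming topic URLs of the form /t/TITLE/ID/ and a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def to_print_url(topic_url):
    """Append /print to a topic URL, leaving already-converted URLs alone."""
    parts = urlsplit(topic_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/print"):
        path += "/print"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print_urls = [
    to_print_url(u)
    for u in [
        "https://forum.example.org/t/some-topic/123/",
        "https://forum.example.org/t/other-topic/456",
    ]
]
```

URLs that point at a specific post inside a topic would need the post number stripped first, which this sketch does not attempt.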
Is there a user agent setting that shows a whole forum thread on a single page? If not, could you add such a feature? You have already implemented print mode, so most of the pieces are in place. What remains is a user agent that causes the content generated for “print mode” to be served to the crawler. Alternatively, if you don’t like the idea of a dedicated user agent for this, how about an HTTP header or a cookie that could be used for the same purpose?