Archive an old forum "in place" to start a new Discourse forum

When possible, we recommend archiving old forums rather than importing them into Discourse. Sometimes this isn’t possible, for various reasons, but getting a fresh start and moving into a new community platform without carrying all your old baggage across is often appealing.

Maybe your community needs a reboot.

But how do you:

  • Keep the valuable information in the old forum around without keeping that ancient forum software running, too, with all its future security vulnerabilities?
  • Avoid losing the Google juice from existing links to good, useful community posts on your domain?

Converting your Forum to Static HTML and JavaScript

We recommend converting your old forum to static HTML and JavaScript, then serving it as plain-vanilla HTML pages from any web server.

The tool we recommend for this is HTTrack:

http://www.httrack.com/

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site’s relative link-structure. Simply open a page of the “mirrored” website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

It’s available for Windows and Linux.
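
If you prefer the command line, a minimal HTTrack invocation looks something like this (a sketch; the host name, output directory, and filter are placeholders for your own forum):

# Mirror the forum into ./forum-archive, staying on the forum's own host
httrack "http://oldforum.example.com/" -O ./forum-archive "+oldforum.example.com/*" -v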

For OSX there is SiteSucker.

SiteSucker on the Mac App Store

SiteSucker is a Macintosh application that automatically downloads Web sites from the Internet. It does this by asynchronously copying the site’s Web pages, images, backgrounds, movies, and other files to your local hard drive, duplicating the site’s directory structure. Just enter a URL (Uniform Resource Locator), press return, and SiteSucker can download an entire Web site.

SiteSucker can be used to make local copies of Web sites. By default, SiteSucker “localizes” the files it downloads, allowing you to browse a site offline, but it can also download sites without modification.

Depending on the URL structures generated, you might need a lookup table to convert links from the original format to the new plain HTML format, e.g.

http://boards.straightdope.com/sdmb/showthread.php?t=717214

might convert to

http://straightdope.com/forumarchives/sdmb/717214

This URL lookup translation could be performed at the webserver layer.
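
For example, with Nginx (which is used later in this guide) that translation could be a small rewrite rule. This is only a sketch based on the hypothetical URLs above; the paths will differ for your forum:

# Map old vBulletin-style thread URLs to the archived static pages
# (example paths only -- adjust to match your own URL scheme)
location = /sdmb/showthread.php {
    if ($arg_t) {
        return 301 http://straightdope.com/forumarchives/sdmb/$arg_t;
    }
}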


Another promising method is to mirror the forum with wget, described below.

Using WGET

@techapj and I have been working hard on this for a customer and we’ve learned a lot about archiving old forums. It’s … much harder than you’d think.

As far as wget goes, it can be an OK option if you use the regular-expression exclusion patterns (added in 2012), which conveniently apply to the entire URL. To do this you’ll want to compile the latest wget with PCRE regex support instead of the obsolete POSIX default. Here’s how on Ubuntu Server 14.04:

apt-get install build-essential
apt-get install libssl-dev
apt-get install pkg-config
apt-get install libpcre3 libpcre3-dev

wget {latest tar.gz at http://ftp.gnu.org/gnu/wget/}
tar -xvzf {latest-version-file}
cd {latest-version-folder}
./configure --with-ssl=openssl
make
make install
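
To confirm the freshly built binary actually picked up PCRE support (make install drops it in /usr/local/bin by default), a quick check like this should show +pcre in wget’s feature list:

# Expect "+pcre" in the output if the build worked
/usr/local/bin/wget --version | grep -o '[+-]pcre'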

Once you have it built, the general syntax is wget -m (mirror) with a generous dose of --reject-regex to prevent horrible, catastrophic mirror size growth:

wget -m -l1 --reject-regex '/archive|(sendmessage|showpost|search|report|profile|private|newreply|misc|login|subscription|calendar|usercp|printthread)\.php|\?(p|do|mode|find)=|&(mode|goto|daysprune|order)=' --regex-type pcre  http://oldforums.example.com/forumdisplay.php?f=129

wget -m -l1 --reject-regex '/archive|(sendmessage|showpost|search|report|profile|private|newreply|misc|login|subscription|calendar|usercp|printthread)\.php|\?(p|do|mode|find)=|&(mode|goto|daysprune|order)=' --regex-type pcre http://oldforums.example.com/showthread.php?t=405494

You’ll need to do a fair bit of trial and error with files and exclusions, so work with wget on your forum, limiting yourself to recursion depth level 1, in this order to minimize problems:

  1. very short topic URL (1 post, 2 posts)
  2. long topic URL (100 posts, multi-page)
  3. a very small subforum
  4. a larger subforum
  5. entire forum

Archiving a vBulletin forum? BEWARE: vBulletin in particular will add a meaningless s= (session) parameter to many URLs when it thinks the client doesn’t support cookies. This will cause a massive explosion of spidered URLs, with no recourse using wget alone. However, if you have a login/pass and use wget’s --save-cookies and --load-cookies options to mirror as a logged-in vBulletin user, as documented by Archive Team, you can thankfully avoid this annoying behavior.
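
Here is a rough sketch of that cookie-based approach. The login form field names are assumptions based on typical vBulletin installs; check your own forum’s login form before using it:

# 1. Log in once and store the session cookie
#    (vb_login_username / vb_login_password / do=login are assumed vBulletin form fields)
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'vb_login_username=USERNAME&vb_login_password=PASSWORD&do=login' \
     -O /dev/null 'http://oldforums.example.com/login.php?do=login'

# 2. Reuse the cookie jar for the actual mirror so vBulletin stops appending s= to URLs
wget -m -l1 --load-cookies cookies.txt --keep-session-cookies \
     --regex-type pcre --reject-regex '/archive|(sendmessage|showpost|search|report|profile|private|newreply|misc|login|subscription|calendar|usercp|printthread)\.php|\?(p|do|mode|find)=|&(mode|goto|daysprune|order)=' \
     http://oldforums.example.com/forumdisplay.php?f=129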

To get an idea of exclusion patterns, enter the folder you downloaded the content to and list all the downloaded URLs using

find . -print

then block the unwanted URLs via regex patterns. While doing this by hand, check the very robust list of “forum” exclusions Archive Team has built for inspiration:

{
    "name": "forums",
    "patterns": [
        "/cron\\.php\\?",
        "/external\\.php\\?type=rss",
        "/login\\.php\\?",
        "/newreply\\.php\\?",
        "/private\\.php\\?",
        "/privmsg\\.php\\?",
        "/register\\.php\\?",
        "/sendmessage\\.php\\?",
        "/subscription\\.php\\?",
        "/posting\\.php\\?",
        "/viewtopic\\.php\\?.+&view=(next|previous)",
        "/viewtopic\\.php\\?.+&hilit=",
        "/feed\\.php\\?",
        "/index\\.php\\?option=com_mailto",
        "&view=login&return=",
        "&format=opensearch",
        "/misc\\.php\\?do=whoposted",
        "/newthread\\.php\\?",
        "/post_thanks\\.php\\?",
        "/blog_post\\.php\\?do=newblog",
        "/forumdisplay\\.php.*[\\?&]do=markread",
        "/userpoll/vote\\.php\\?",
        "/showthread\\.php.*[\\?&]goto=(next(old|new)est|newpost)",
        "/editpost\\.php\\?",
        "/\\?view=getlastpost$",
        "/index\\.php\\?sharelink=",
        "/ucp\\.php\\?mode=delete_cookies"
    ],
    "type": "ignore_patterns"
}
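
While assembling your own exclusion list, it also helps to see which querystring parameters are inflating a test mirror. A rough sketch (run inside the mirror directory; the patterns are only examples):

# Count downloaded files per querystring parameter name, most common first
find . -type f -name '*\?*' -print \
  | grep -o '[?&][a-z_]*=' \
  | sort | uniq -c | sort -rn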

You will eventually find that wget has one rather catastrophic problem: complete and utter ignorance of querystrings. Ignoring querystrings, normalizing querystrings… none of that exists in wget.

This can easily make it unusable for archiving medium to large forums, or even small forums that use a lot of querystrings for navigation. Hard to spider when you lack a fundamental understanding of which querystrings matter, and which don’t.

Instead, check out wpull (covered in the next section) and ArchiveBot:

http://archivebot.readthedocs.org/en/latest/

Both of these leverage the excellent work of the Archive Team. Note that if the forum in question is public, ArchiveBot (which is active on the Internet Archive IRC channel) may be able to archive the whole site for you, so you don’t have to do anything!

Using WPULL

Install wpull

sudo apt-get install --no-install-recommends git build-essential python3-dev python3-pip
pip3 install wpull

Crawl and save subforum locally

wpull --no-robots --user-agent "Mozilla/5.0" --page-requisites --strip-session-id --recursive --level 1 http://forum.example.com/forumdisplay.php?f=123

Change filenames

In order to serve the .php files as static HTML, we need to change the ? in file names to _:

cd forum.example.com
rename -v 's/\?/_/g' *
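
Note that the glob above only renames files in the top-level directory. If the crawl produced subdirectories, a find-based variant (a sketch, assuming the Perl rename is what’s installed) handles them too:

# Rename every file containing "?" anywhere under the current directory
find . -depth -type f -name '*\?*' -execdir rename -v 's/\?/_/g' '{}' +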

Copy saved files to new directory

sudo mkdir -p /var/www/example.com/public_html
cp -a ~/forum.example.com /var/www/example.com/public_html

Install Nginx and set up Virtual Host

  • Install Nginx
  • Set up Nginx Virtual Host

Update Nginx virtual host

Open the Nginx virtual host file for example.com that you just set up (using the guides above):

sudo nano /etc/nginx/sites-available/example.com

Now, to serve the renamed .php files as static HTML, we need to update this block:

location ~ \.php$ {
    try_files "${uri}_${args}" $uri 404.html;
}

Restart Nginx server

sudo service nginx restart

That’s it! Now just navigate to the domain you used to set up the virtual host and you will see your site served as static HTML.
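
A quick way to spot-check the mapping from a terminal (a sketch; the exact path depends on where the copied files sit under your web root):

# Should return 200, served from the renamed file showthread.php_t=405494
curl -I 'http://example.com/showthread.php?t=405494'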

WARC Downloads Using ArchiveBot

If you do manage to get a full WARC downloaded using ArchiveBot, here’s how to host it with pywb, which is specifically designed to host WARCs with full fidelity:

pywb is a python implementation of web archival replay tools, sometimes also known as ‘Wayback Machine’.

pywb allows high-quality replay (browsing) of archived web data stored in standardized ARC and WARC, and it can also serve as a customizable rewriting proxy to live web content.

The replay system is designed to accurately replay complex dynamic sites, including video and audio content and sites with complex JavaScript.

Additionally, pywb includes an extensive index query api for querying information about archived content.

The software can run as a traditional web application or an HTTP or HTTPS proxy server, and has been tested on Linux, OS X and Windows platforms.
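
As a rough sketch of what hosting the WARC with pywb looks like in practice (assuming a recent pywb; the collection name and file name are examples):

# Install pywb, create a collection, add the WARC, and start the replay server
pip3 install pywb
wb-manager init my-forum
wb-manager add my-forum MY_FORUM.warc
wayback
# then browse http://localhost:8080/my-forum/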

Creating a Searchable WARC

  1. Download a copy of your Discourse site using wget to a WARC archive. Replace the capitalized strings with values specific to your site.
 wget --mirror \
      --warc-file=MY_FORUM \
      --warc-cdx \
      --page-requisites \
      --convert-links \
      --adjust-extension \
      --compression=auto \
      --reject-regex "/search" \
      --no-if-modified-since \
      --no-check-certificate \
      --execute robots=off \
      --random-wait \
      --wait=1 \
      --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
      --no-cookies \
      --header "Cookie: _t=LOGIN_COOKIE" \
      URL
  2. Using py-wacz, convert the WARC to WACZ, enabling page detection and full-text index generation:
 wacz create MY_FORUM.warc --output MY_FORUM.wacz --detect-pages --text
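
If the wacz tool isn’t installed yet, it’s available from PyPI (a sketch):

pip3 install wacz

The resulting .wacz file can then be replayed client-side with a tool such as ReplayWeb.page, where the --text index enables full-text search.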

Last Reviewed by @AlexDev on 2022-06-03
