I’d like to enquire about what options there are for the public archiving of a forum.
We have recently migrated from phpBB, and we previously had a script that would archive the public posts to a text file. Are there any options like that for Discourse? We have about 30 public categories that would need to be archived. The username/display name of each poster would need to be included, but not the email or IP addresses or any other private data. SQL so that the data could be processed like Stack Exchange’s SEDE would probably be most useful.
Alright, so I came up with this. Seems to work well to me, though I need to double check that I am filtering out all the hidden posts correctly. And some of the columns probably aren’t actually that useful.
This can be used from a cronjob, but the user will need to be added to the docker group. I added a step after the dump to scp the file to the archive server, but you’d could do pretty much anything else with it too.
#!/bin/bash
# Archive public data from the forum
EXEC="docker exec -i -u postgres -w /var/www/discourse app"
# Run commands to construct the exported tables
$EXEC psql discourse << EOF
SET client_min_messages TO warning;
DROP TABLE IF EXISTS users_export;
CREATE TABLE users_export AS
  SELECT id, username, created_at, name
  FROM users
  WHERE staged = false;
DROP TABLE IF EXISTS categories_export;
CREATE TABLE categories_export AS
  SELECT id, name, slug, parent_category_id
  FROM categories 
  WHERE read_restricted = false;
DROP TABLE IF EXISTS topics_export;
CREATE TABLE topics_export AS
  SELECT t.id, t.title, t.last_posted_at, t.created_at, t.updated_at,
    t.views, t.posts_count, t.user_id, t.last_post_user_id, t.reply_count,
    t.category_id, t.closed, t.slug
  FROM topics AS t
  INNER JOIN categories_export AS c ON t.category_id = c.id
  WHERE t.deleted_at is NULL AND t.archived = false;
DROP TABLE IF EXISTS posts_export;
CREATE TABLE posts_export AS
  SELECT p.id, p.user_id, p.topic_id, p.raw, p.created_at, p.updated_at,
    p.reply_to_post_number, p.reply_count, p.quote_count, p.like_count
  FROM posts AS p
  INNER JOIN topics_export AS t ON p.topic_id = t.id
  WHERE p.post_type = 1 AND p.deleted_at is NULL AND p.hidden = false;
EOF
# Dump data
echo Dumping data...
$EXEC pg_dump discourse -t users_export -t categories_export -t topics_export -t posts_export > forum-archive.sql
# Cleanup
$EXEC psql discourse << EOF
SET client_min_messages TO warning;
DROP TABLE IF EXISTS users_export;
DROP TABLE IF EXISTS categories_export;
DROP TABLE IF EXISTS topics_export;
DROP TABLE IF EXISTS posts_export;
EOF
