Deliberately created partial backups for local Postgres queries


(PJH) #1

Continuing the discussion from The (manual) backup (didn’t) failed:

  1. The full backup is over a gig (last night’s ticked over to 1.2G) - at least when it all downloads (which is fixed, I believe?)
  2. I only needed the SQL backup, which reduces the download to a much more manageable size (the last one was 135M)

Basically I’m only downloading the backups to get a recent database to play with, since I cannot query the live database{1}. I’m not personally archiving them{2} - we have Amazon for that.

{1} given that one of my recent queries for potential badges ran to 70 seconds, I’m not sure running it on the live site repeatedly would do much for the host.
{2} unless you call my omission in deleting them from /tmp a ‘backup’ O_o


Bug/unintended behaviour while posting: Cannot use brackets for footnotes if you also have links in the post.


(Jeff Atwood) #2

I see. So you’re downloading just the database to do queries.

Perhaps you can share a bit more of your workflow for others that might like to do the same for their Discourse instance?


(PJH) #3

Obvious Note: Please remember that database archives contain sensitive information, such as usernames with password hashes and Private Messages. Please ensure the machine you’re doing this on is secure.
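On the same note, it’s worth cleaning the archives out of /tmp once you’re done with them (see footnote {2} above). A sketch - the glob matches the standard backup filenames the script looks for:

```shell
# Remove downloaded Discourse backup archives from /tmp when finished;
# they contain password hashes and PMs, and /tmp is world-readable.
rm -f /tmp/*-20??-??-??-??????.tar.gz
```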


One-time Setup

1 - Linux system with Apache + PHP (the relevant parts of a LAMP stack) and PostgreSQL installed (yum/apt/synaptic/other favourite package manager), with both set to run on startup and currently running.

2 - phpPgAdmin installed: phpPgAdmin - start (this gives you an easy interface to spelunk through the database tables.)
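As a sketch of steps 1 and 2 on a Debian/Ubuntu-flavoured system (the package names are an assumption and vary by distribution - substitute yum/dnf/synaptic equivalents as appropriate):

```shell
# Web server + PHP + PostgreSQL + phpPgAdmin, set to start on boot.
# Package names are Debian/Ubuntu assumptions; adjust for your distro.
sudo apt-get install apache2 php5 postgresql phppgadmin
sudo update-rc.d postgresql enable
sudo update-rc.d apache2 enable
sudo service postgresql start
sudo service apache2 start
```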

3 - create a postgres user for playing with the database (let’s use discourse as the name for this example) and give it a memorable password:
 
 createuser discourse -P
 
Select Y when asked “Shall the new role be a superuser?”
 
4. Fire up phpPgAdmin: http://localhost/phpPgAdmin/

 
5. Open the server (highlighted above) and log in

 
6. Create a new database for your discourse user


 
7. Call it discourse for simplicity, change the template to template0 and choose/enter the following values in the other three boxes. Click Create
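If you’d rather skip the GUI for this step, the equivalent from a shell as the postgres user is a single createdb call. This is a sketch: the encoding and locale values below are assumptions - match them to what your phpPgAdmin boxes offer and what your Discourse dump expects (usually UTF8).

```shell
# Command-line equivalent of creating the database in phpPgAdmin.
# template0 is as per the step above; encoding/locale are assumptions.
createdb --owner=discourse --template=template0 --encoding=UTF8 \
         --lc-collate=C --lc-ctype=C discourse
```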

 
8. You should now have a new database called discourse owned by discourse:

 
9. I now present a script that will
  i. restore a recently downloaded backup to this database, and
  ii. give a couple of examples of queries that can be run against the database that (currently) don’t have direct equivalents in Discourse, to demonstrate how to create new ones.

[pjh@sofa tmp]$ cat /usr/local/bin/sql_discourse 
#!/bin/bash

DATABASE=${DATABASE:-discourse}
LIMIT=${LIMIT:-25}

function sql(){
    psql -d $DATABASE -e -c "$@
" | sed 1d
    return
}

function restore(){ # restore a backup from /tmp
    previous_backup=""
    most_recent_backup=$(basename $(ls /tmp/*-20??-??-??-??????.tar.gz -rt | tail -n1))
    if [ -e ~/.${DATABASE}_backup ]; then
        previous_backup=$(basename $(head -n1 ~/.${DATABASE}_backup))
    fi
    if [ "$most_recent_backup" == "$previous_backup" ]; then
        echo "Already using the most recent backup available: $previous_backup"
        exit 0
    fi
    backup_date=$(echo "$most_recent_backup" | sed -e 's/^.*\(20[0-9][0-9].*\)-.......tar.gz/\1/g')

    echo "$most_recent_backup - restore? [Y/n]"
    read yn
    if [[ $yn == "Y" || $yn == "y" || $yn = "" ]]; then
        echo "Restoring..."
    else
        exit 0
    fi
    
    mkdir -p ~/restores/$backup_date
    echo "Extracting $most_recent_backup"
    tar -C ~/restores/$backup_date -xf /tmp/$most_recent_backup dump.sql
    echo "Restoring..."
    psql $DATABASE < ~/restores/$backup_date/dump.sql
    psql -d $DATABASE -c "DROP SCHEMA IF EXISTS backup CASCADE"
    psql -d $DATABASE -c "ALTER SCHEMA public RENAME TO backup"
    psql -d $DATABASE -c "ALTER SCHEMA restore RENAME TO public"

    echo "$most_recent_backup" > ~/.${DATABASE}_backup

}

function backup_date(){ # When the backup was taken
        sql "SELECT created_at
FROM post_actions
ORDER BY created_at DESC
LIMIT 1" | grep "20[0-9][0-9]-"
}

function posts_days(){ # Posts read and days visited
        sql "
SELECT row_number() OVER (ORDER BY sum(posts_read) DESC) as n,
   sum(posts_read) as \"Posts read\",
   count(*) as \"Days visited\",
   u.username as \"User id\"
FROM user_visits v
JOIN users u on u.id = v.user_id
GROUP BY u.username
ORDER BY sum(posts_read) DESC
LIMIT $LIMIT"
}

function likes_liked(){ # Who's liked who the most
        sql "
SELECT l.username Liker, count(*), r.username Liked
FROM post_actions pa
INNER JOIN users l ON l.id=pa.user_id
INNER JOIN posts p on p.id=pa.post_id
INNER JOIN users r on r.id=p.user_id
WHERE pa.post_action_type_id=2
GROUP BY Liker, Liked
ORDER BY count(*) DESC
LIMIT $LIMIT"
}

user=`whoami`
if [ "$user" != "postgres" ]; then
    echo "Not running as postgres. Execute one of:"
    echo "   su - root -c \"su - postgres\""
    echo "   sudo su - postgres"
    echo "To continue."
    exit 1;
fi

function help(){ # Show this help
        grep "^function.*#" $0 | sed -e "s/^function \([^(]*\)(){ #\(.*\)$/\1 -- \2/g"
}

if [ "_$1" = "_" ]; then
    help
else
    grep "$1(" $0 | sed 's/^[^#]*//g'
    tic=$(($(date +%s%N)/1000000))
    "$@"
    toc=$(($(date +%s%N)/1000000))
    delta=$(($toc - $tic))
    LC_NUMERIC=C LC_COLLATE=C
    printf 'Elapsed: %gs\n' $(bc <<< "scale=3; ${delta}/1000")
    echo -n "Backup taken: "; backup_date
fi

This completes the one-time setup.

The script

The script is intended to be run as postgres - in fact it has been coded to refuse to run as any other user - so to change to that user:
su - root -c "su - postgres"

For those that can only use sudo use the following:
sudo su - postgres

Note that the - before postgres is required in both cases to ensure that you end up with postgres’s environment and not the calling user’s. You need to su to root first because the postgres account doesn’t actually have a password, so you need to escalate before coming back down to postgres without needing one.

Running the script with help or with no parameter will list all the parameters it recognises:

[postgres@sofa ~]$ sql_discourse 
restore --  restore a backup from /tmp
backup_date --  When the backup was taken
posts_days --  Posts read and days visited
likes_liked --  Who's liked who the most
help --  Show this help
[postgres@sofa ~]$ 

Restoring a backup

  1. Download the most recent backup for your site from /admin/backups to /tmp, or create a new backup (do not include files) and download that. (The latter option may, depending on your site’s content, produce a much smaller archive, since it only contains the database dump, not all the images etc. that your site has, which aren’t needed here.)
    Do not rename it - the script expects the filename in a certain format, part of which is the date.

  2. Run the script with the restore parameter:

[postgres@lenovo ~]$ sql_discourse restore
# restore a backup from /tmp
what-the-daily-wtf-2014-10-09-073110.tar.gz - restore? [Y/n]

Restoring...
Extracting what-the-daily-wtf-2014-10-09-073110.tar.gz
Restoring...
SET
ERROR:  unrecognized configuration parameter "lock_timeout"
SET
SET
SET
SET
DROP SCHEMA
CREATE SCHEMA
SET
SET
SET
CREATE TABLE
CREATE SEQUENCE
ALTER SEQUENCE
CREATE TABLE
CREATE SEQUENCE
ALTER SEQUENCE
[...]
Time: 455.629 ms
DROP SCHEMA
ALTER SCHEMA
Elapsed: 334.783s
Backup taken:  2014-10-01 14:05:50.514682
[postgres@sofa ~]$
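Incidentally, the “do not rename it” requirement comes down to two patterns in the script: the glob used to find backups in /tmp, and the sed that pulls out the date. You can check a filename against both without touching the database (the filename below is a made-up example):

```shell
f="example-site-2014-10-09-073110.tar.gz"   # hypothetical backup name

# The glob restore() uses to find candidate backups in /tmp:
case "$f" in
  *-20??-??-??-??????.tar.gz) echo "name ok" ;;
  *)                          echo "script will not see this file" ;;
esac

# The date component extracted for the ~/restores/<date> directory:
echo "$f" | sed -e 's/^.*\(20[0-9][0-9].*\)-.......tar.gz/\1/g'
# -> 2014-10-09
```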

This should (after a refresh) allow you to see the restored database in your browser:

To subsequently update your local copy, simply download a newer backup and rerun sql_discourse restore. (The backup is restored into a restore schema; on completion the old public schema is renamed to backup, and restore is renamed to public.)

Attempting to re-run a restore (i.e. restore the same archive again) without downloading a newer archive produces:

[postgres@lenovo ~]$ sql_discourse restore
# restore a backup from /tmp
Already using the most recent backup available: what-the-daily-wtf-2014-10-09-073110.tar.gz
[postgres@lenovo ~]$ 

If you really must re-run the restore, delete the .*_backup file you’ll find in ~postgres (/home/postgres/.discourse_backup if you’ve been following along.)

Running scripted queries

There are two queries in the script presented above. Where LIMIT $LIMIT is included in a query (as it is in both provided), it defaults to 25 but can be overridden on a per-invocation basis, as shown in the second example here, or by (e.g.) export LIMIT=100:

[postgres@sofa ~]$ sql_discourse posts_days
# Posts read and days visited
SELECT row_number() OVER (ORDER BY sum(posts_read) DESC) as n,
   sum(posts_read) as "Posts read",
   count(*) as "Days visited",
   u.username as "User id"
FROM user_visits v
JOIN users u on u.id = v.user_id
GROUP BY u.username
ORDER BY sum(posts_read) DESC
LIMIT 25

 n  | Posts read | Days visited |     User id     
----+------------+--------------+-----------------
  1 |     100230 |          136 | ben_lubar
  2 |      96189 |          104 | VinDuv
  3 |      95248 |            6 | ben_lubot
  4 |      93087 |          132 | Keith
  5 |      93060 |          135 | PJH
  6 |      92210 |           45 | accalia
  7 |      92202 |          134 | dkf
  8 |      91748 |          133 | ChaosTheEternal
  9 |      91058 |          128 | Maciejasjmj
 10 |      88617 |          135 | blakeyrat
 11 |      84501 |          111 | Arantor
 12 |      83657 |          122 | darkmatter
 13 |      83654 |           96 | Yamikuronue
 14 |      83085 |           93 | M_Adams
 15 |      81565 |          121 | antiquarian
 16 |      81090 |          127 | Luhmann
 17 |      80311 |           76 | Spencer
 18 |      77020 |          114 | HardwareGeek
 19 |      76559 |          107 | cartman82
 20 |      76071 |          117 | mott555
 21 |      75862 |          127 | boomzilla
 22 |      75691 |           92 | aliceif
 23 |      69927 |           97 | chubertdev
 24 |      68302 |          114 | jaloopa
 25 |      67390 |           44 | tarunik
(25 rows)

Elapsed: 0.038s
Backup taken:  2014-10-01 14:05:50.514682
[postgres@sofa ~]$ 
[postgres@sofa ~]$ LIMIT=10 sql_discourse likes_liked
# Who's liked who the most
SELECT l.username Liker, count(*), r.username Liked
FROM post_actions pa
INNER JOIN users l ON l.id=pa.user_id
INNER JOIN posts p on p.id=pa.post_id
INNER JOIN users r on r.id=p.user_id
WHERE pa.post_action_type_id=2
GROUP BY Liker, Liked
ORDER BY count(*) DESC
LIMIT 10

    liker    | count |   liked    
-------------+-------+------------
 Arantor     |   489 | Onyx
 Arantor     |   423 | chubertdev
 Arantor     |   394 | Matches
 antiquarian |   367 | Arantor
 Arantor     |   365 | boomzilla
 Arantor     |   335 | faoileag
 Onyx        |   330 | Arantor
 antiquarian |   328 | boomzilla
 DoctorJones |   324 | boomzilla
 cartman82   |   301 | blakeyrat
(10 rows)

Elapsed: 1.054s
Backup taken:  2014-10-01 14:05:50.514682
[postgres@sofa ~]$ 

Adding new queries

It is presumed that you will use phpPgAdmin to hone your latest idea for a query, using the SQL link at the top right of the window - next to History | Find | Logout.

Most queries will return lots and lots of records - and you’ll normally only be interested in a subset of them. $LIMIT has been provided for this.

Once you’re happy with it, you can add it to the script by adding a function including your query.

The functions take the general form of:

function parameter_provided_on_commandline(){ # Text to display when script called with `help`
        sql "
SELECT <fields>
FROM <tables>
INNER JOIN <field> etc.etc.
WHERE <whatever>
LIMIT $LIMIT"
}

Of note here are:

  1. the function name, which doubles as the parameter to pass to the script,
  2. the comment on the same line as the function declaration, which doubles as the help text, and
  3. the use of LIMIT $LIMIT.
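The help entries work because help() simply greps the script for function declarations and rewrites them with sed; you can see the transformation in isolation (the function name here is invented for illustration):

```shell
# How a "function name(){ # comment" line becomes a help entry:
line='function top_topics(){ # Topics with the most replies'
echo "$line" | sed -e "s/^function \([^(]*\)(){ #\(.*\)$/\1 -- \2/g"
# -> top_topics --  Topics with the most replies
```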

More example queries

function most_liked_users(){ # Most liked users
    sql "
SELECT row_number() OVER (ORDER BY SUM(badge_posts.like_count) DESC)  n,
   SUM(badge_posts.like_count) likes,
   users.username
FROM badge_posts, users
WHERE badge_posts.user_id = users.id
GROUP BY users.username
ORDER BY SUM(badge_posts.like_count) DESC
LIMIT $LIMIT"
}

(Jeff Atwood) #4

This is excellent! I know a number of people who want to run offline queries on their Discourse database, and I will refer them here.


(Sam Saffron) #5

Let’s move this to a top level howto!


(PJH) #6

Could we have it run past some of those Jeff has in mind, for sanity checking and any suggestions, before turning it into a howto?

Edit: e.g. just spotted one problem with the script which makes it work only for our forum…


(PJH) #7

Changes made along with some formatting tweaks, I’m a bit happier about it ‘going public’ as it were, but would still appreciate some feedback from others who actually try to use this.


(Sam Saffron) #8

One very interesting option here is creating a docker image for it, that makes it far easier for people to install.


(PJH) #9

Given the fact that installing docker in the past has been a problem for me, I’m not sure about that…


(Sam Saffron) #10

You need to be on an x64 host for Docker to work - 32-bit is a road block - that said, 32-bit on the desktop is quite rare these days.

Keep in mind, one use people will probably want is just running this side-by-side with Discourse; they already have Docker installed, so having an extra template for this is a huge win.


(PJH) #11

@sam, the whole reason I started doing this is because I don’t have docker/an actual install to play with. If I had I’d have probably just restored a backup into it and taken it from there without doing any of the phpPgAdmin stuff…

:slight_smile: