How big is your Discourse backup?

ljpp · March 21, 2016, 6:53am

It looks like my community is generating data at a pace of roughly 100MB per month, maybe a little more. With a linear projection, my backup size in December would be 1.2GB, and in December 2026 around 12GB. The images seem to be ~50% of the whole package.
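A minimal sketch of that linear projection, assuming growth stays flat at 100MB per month and counting 1GB as 1000MB. The function name and the zero starting size are illustrative assumptions, not anything Discourse provides:

```python
def projected_backup_gb(monthly_growth_mb: float, months: int, start_mb: float = 0.0) -> float:
    """Linear projection: the backup grows by a fixed amount every month."""
    # Decimal gigabytes (1 GB = 1000 MB), matching the rough numbers in the post.
    return (start_mb + monthly_growth_mb * months) / 1000


# One year and ten years of growth at 100MB per month.
for months in (12, 120):
    print(f"{months:>3} months: {projected_backup_gb(100, months):.1f} GB")
```

Twelve months at this rate gives the 1.2GB figure and 120 months gives roughly 12GB. Real growth is unlikely to stay linear if posting activity or image uploads change.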

That’s a lot of data. We spent the previous decade (2006-2016) on SMF, and the last backup was 175MB zipped, containing 245K posts but very few images. We seem to generate new posts at a much faster rate with Discourse.

How big is your backup?
How do you manage them when the size goes up?
Will I be in trouble after a few years dumping or restoring a backup with the size measured in several gigabytes?

codinghorror · March 21, 2016, 7:34am

Make sure you tell backup to exclude uploads, if you want smaller backups.

It really depends how many uploaded images there are. Images are huge, relative to text and db data.

ljpp · March 21, 2016, 7:42am

Yes, S3 migration & exclusion of attachments in backups is on my development roadmap. That will save 40…50%, but the backup will still be huge in a couple of years.

How big is yours here at Meta, or how do you manage backups of your commercial customers with a lot of data?

karussell · April 22, 2016, 7:28pm

Could this be related to some big logs? E.g. could it help to reduce delete_email_logs_after_days?

ljpp · April 22, 2016, 7:55pm

Interesting comment - never thought of that, and indeed the default value for email logs is 365d. The backup, however, does not contain any log files, only the SQL dump and the uploads folder.

Would altering this setting reduce the size of the .sql?
Is there any reason why I would want to log a full year of email activity?

I’ve since moved to S3-stored image files, but never received a response to my question regarding the /optimized folder, which still adds significant backup payload.

karussell · April 22, 2016, 8:12pm

I also have large backup files (3MB per user!? already compressed), so I was wondering where this comes from. While investigating some tables, I stumbled over a huge number of entries for email_reject_auto_generated and email_reject_empty in the email_logs table. Try:

psql -d discourse
SELECT count(email_type) as count,email_type FROM email_logs GROUP BY email_type;

codinghorror · April 22, 2016, 9:26pm

Aha, is this something we need to improve @zogstrip?

karussell · April 22, 2016, 9:31pm

I’m not sure. I think it could be related to the fact that I had some emailing issues at the beginning, during which a lot of meaningless email entries may have been stored. So if nothing goes wrong with emailing, this 1-year setting might be fine. With my rough understanding of Discourse: why is this stored at all? For debugging purposes?

michaeld · April 22, 2016, 10:26pm

Looking at some random examples, it seems like post_timings takes up 30-50% of the database space…

Maybe it’s a good idea to introduce a ‘minimal backup’ which doesn’t include tables like post_timings, post_search_data etcetera. Even posts.cooked could be regenerated if you’d really want to keep things at a minimum.
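A rough sketch of what such a “minimal backup” could look like if done by hand with pg_dump, which has an `--exclude-table-data` option that keeps a table’s schema but skips its rows. This is not how Discourse’s built-in backup works; the helper name, the `discourse` database name, and the assumption that these tables are safe to skip on a given site are all illustrative:

```python
import shlex

# Tables named in the post above. Skipping post_timings would drop per-post
# read timing data on restore, and post_search_data would need rebuilding,
# so treat this list as an assumption to verify rather than a recommendation.
SKIPPABLE_TABLES = ("post_timings", "post_search_data")


def minimal_dump_command(dbname: str = "discourse", tables=SKIPPABLE_TABLES) -> list[str]:
    """Build a pg_dump argv that dumps every schema but skips the rows of `tables`."""
    cmd = ["pg_dump", "--format=custom"]
    for table in tables:
        # --exclude-table-data keeps the table definition and omits its contents.
        cmd.append(f"--exclude-table-data={table}")
    cmd.append(dbname)
    return cmd


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(minimal_dump_command()))
```

The command is only printed here. Running it for real would need to happen wherever the database is reachable (for a standard install, inside the container).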

karussell · April 25, 2016, 4:04pm

Wow, cleaning this data up reduced my backups from 600MB to under 40MB. But again: I had an email configuration problem several months ago where basically my email provider and Discourse played ping pong. Since that was disabled all is fine, but I didn’t know that Discourse stores this (for that long).

ljpp · April 25, 2016, 6:38pm

I have to repeat my question:

The default setting is to keep email logs for 1 year (365d). To me that sounds like an awfully long time, but since Discourse generally comes with sane defaults, I have to ask: why would I keep them for so long?

If it consumes significant backup space, I would reduce it to 31 days or something.
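A back-of-the-envelope sketch of what a shorter retention window would do to the row count of email_logs, assuming a constant number of logged emails per day and that rows older than delete_email_logs_after_days are purged. The daily volume of 500 is a made-up figure, and the function name is illustrative:

```python
def steady_state_rows(emails_per_day: float, retention_days: int) -> float:
    """Rows kept once the table has reached a steady state under the retention window."""
    return emails_per_day * retention_days


daily_emails = 500  # hypothetical volume, replace with your own count

rows_365 = steady_state_rows(daily_emails, 365)
rows_31 = steady_state_rows(daily_emails, 31)

# The relative saving depends only on the two retention windows, not on the volume.
print(f"{rows_365:.0f} rows at 365 days, {rows_31:.0f} rows at 31 days")
print(f"reduction: {1 - rows_31 / rows_365:.0%}")
```

This only estimates the email_logs table itself. How much that matters for the whole backup depends on how large that table is relative to everything else in the dump.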

fefrei · April 25, 2016, 7:07pm

Remember that the email log is not just for debugging. It also stores reply keys, so a user can no longer reply to an email once the corresponding log entry is purged.

ljpp · April 25, 2016, 7:42pm

I take it that this is not relevant when replying by email is disabled?

fefrei · April 25, 2016, 7:53pm

Correct, although there may be other uses of the mail log.

ljpp · April 25, 2016, 7:55pm

…which gets us right back to my original question.

DeanMarkTaylor · April 25, 2016, 8:07pm

On one instance my backups are 9.6GB+, on another 7.8GB+, both compressed. My email_logs table is not an issue, at less than 60MB and less than 15MB respectively, uncompressed. Both instances have been running Discourse for well over 1 year.

I haven’t seen any issues with email_logs table growing unexpectedly so far.

What are the results of the following queries for your instance?

Get the count of items and date range of email log table contents:

SELECT count(*), MAX(created_at), MIN(created_at) FROM email_logs;

Get size of top 20 relations:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;

Get count of rows by email_type:

SELECT count(email_type) as count,email_type FROM email_logs GROUP BY email_type;
