It looks like my community is generating data at a pace of roughly 100 MB per month, maybe a little more. With a linear projection, my backup size in December would be about 1.2 GB, and by December 2026 around 12 GB. Images seem to make up ~50% of the whole package.
That's a lot of data. We spent the previous decade (2006–2016) on SMF, and the last backup was 175 MB zipped, containing 245K posts but very few images. We seem to generate new posts at a much faster rate with Discourse.
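If you want to compare like for like, the uncompressed database size (the figure the SQL dump grows from, before compression) can be checked with a single query in psql. This is just a sketch and assumes you are connected to the stock discourse database:
SELECT pg_size_pretty(pg_database_size(current_database()));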
How big is your backup?
How do you manage them when the size goes up?
Will I be in trouble in a few years when dumping or restoring a backup measured in several gigabytes?
Yes, S3 migration and excluding attachments from backups are on my development roadmap. That will save 40–50%, but the backup will still be huge in a couple of years.
How big is yours here at Meta, or how do you manage backups of your commercial customers with a lot of data?
Interesting comment. I never thought of that, and indeed the default value for email logs is 365 days. The backup, however, does not contain any log files, only the SQL dump and the uploads folder.
Would altering this setting reduce the size of the .sql?
Any reason why would I want to log a full year of email activity?
I’ve since moved to S3-stored image files, but I never received a response to my question about the /optimized folder, which still adds a significant backup payload.
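A rough way to see how much the originals versus the optimized thumbnails weigh is to sum the filesize columns of the two tables. This is only a sketch; the table and column names are assumed from a stock Discourse schema:
SELECT 'uploads' AS source, count(*) AS files,
       pg_size_pretty(sum(filesize)) AS total_size
FROM uploads
UNION ALL
SELECT 'optimized_images', count(*),
       pg_size_pretty(sum(filesize))
FROM optimized_images;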
I also have large backup files (3 MB per user, already compressed!), so I was wondering where this comes from. While investigating some tables I stumbled over a huge number of entries for email_reject_auto_generated and email_reject_empty in the email_logs table. Try:
psql -d discourse
SELECT count(email_type) AS count, email_type FROM email_logs GROUP BY email_type;
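As a follow-up, this sketch (assuming the standard created_at column) shows how many rows would disappear if the retention were cut from 365 to, say, 31 days, and how much space the table currently takes including its indexes:
SELECT count(*) AS rows_older_than_31_days,
       pg_size_pretty(pg_total_relation_size('email_logs')) AS table_size
FROM email_logs
WHERE created_at < now() - interval '31 days';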
I’m not sure, but I think it could be related to the emailing issues I had at the beginning, where potentially a lot of meaningless email entries were stored. So if nothing goes wrong with emailing, this one-year setting might be fine. With my rough understanding of Discourse: why is this stored at all, just for debugging purposes?
Looking at some random examples, it seems like post_timings takes up 30-50% of the database space…
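For anyone who wants to verify that on their own instance, this query (a sketch, table names assumed from a stock Discourse schema) shows what share of the whole database those tables take, indexes and TOAST included:
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       round(100.0 * pg_total_relation_size(oid)
             / pg_database_size(current_database()), 1) AS pct_of_db
FROM pg_class
WHERE relname IN ('post_timings', 'post_search_data', 'email_logs')
  AND relkind = 'r';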
Maybe it’s a good idea to introduce a ‘minimal backup’ which doesn’t include tables like post_timings, post_search_data, et cetera. Even posts.cooked could be regenerated if you really wanted to keep things at a minimum.
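If you wanted to experiment with that idea by hand, pg_dump can already skip the row data of individual tables. This is only a sketch: it assumes the stock database name discourse and a user that can connect to it, and it is not equivalent to an official Discourse backup:
pg_dump -d discourse \
  --exclude-table-data=post_timings \
  --exclude-table-data=post_search_data \
  -Fc -f /tmp/discourse-minimal.dump
The schema of the excluded tables is still dumped, so a restore gets empty tables rather than missing ones.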
Wow, cleaning this data up reduced my backups from 600 MB to under 40 MB. But again: I had an email configuration problem several months ago where my email provider and Discourse basically played ping pong. Since that was disabled all is fine, but I didn’t know that Discourse stores this data (for that long).
The default setting is to keep email logs for one year (365 days). To me that sounds like an awfully long time, but since Discourse generally ships with sane defaults, I have to ask: why would I keep them for so long?
If it consumes significant backup space, I would reduce it to 31 days or something.
Remember that the email log is not just for debugging. It also stores reply keys, so a user can no longer reply to an email once the corresponding log entry is purged.
On one instance my backups are 9.6 GB+ compressed and on another 7.8 GB+, yet the email_logs tables are not an issue at less than 60 MB and less than 15 MB uncompressed respectively. Both have been running Discourse for well over a year.
I haven’t seen any issues with email_logs table growing unexpectedly so far.
What are the results of the following queries for your instance?
Get the count of items and date range of email log table contents:
SELECT count(*), MAX(created_at), MIN(created_at) FROM email_logs;
Get size of top 20 relations:
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;
Get count of rows by email_type:
SELECT count(email_type) AS count, email_type FROM email_logs GROUP BY email_type;