Database growth

(Ian) #1

Hi all,

We have Discourse pumping out to our StatsD regularly, courtesy of the ‘nunes’ gem, and have noticed that the PeriodicalUpdates job causes our database to keep growing considerably each time it runs. In an earlier version of Discourse, our PostgreSQL database size ballooned over time from 1GB to 400GB+, even with autovacuuming enabled.

When overlaid in Graphite, the database size appears to step up by approximately 200MB every 15 minutes as the PeriodicalUpdates job runs and the current instance stands at around 16GB.

Naturally we’d prefer if the database didn’t keep growing without limit. Is there any way to control or optimise this behaviour? Is there a bug somewhere in the code causing the stepping, and will it plateau?

Previously the database reached around 27GB before we decided to kill the Sidekiq instance completely, but obviously that is detrimental to the performance of the server over time, as maintenance jobs will not run. However, if we have to choose between an ever-growing PostgreSQL dataset and something that is manageable, we’ll kill Sidekiq again.

We’ve already set sidekiq_options retry: false on the PeriodicalUpdates job to stop it repeatedly hammering our server, but even this doesn’t stop the stepped growth.

Any tips or help to stop this or at least control it better?

(Sam Saffron) #2

Can you determine which table is growing, and at what rate?
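One way to answer this is to ask PostgreSQL directly. A minimal sketch using the standard statistics views and size functions (nothing Discourse-specific is assumed; run it in psql against the Discourse database):

```sql
-- Per-table total, heap-only, and index sizes, largest first
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total,
       pg_size_pretty(pg_relation_size(relid))       AS table_only,
       pg_size_pretty(pg_indexes_size(relid))        AS indexes
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```

Running this on a schedule (or graphing the same numbers via StatsD) shows which table the growth is landing in and how fast.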

(Jeff Atwood) #3

Are you on the latest version of Discourse, GitHub master?

(Ian) #4

We’re about a day behind on the master at the moment (currently merged off commit 5caa7a0e4d58393aeee163d9ecca81059de60c26). I didn’t have time to pull in today’s changes.

However, we do have quite a large dataset, currently standing at:

  • 31,993 topics
  • 248,087 posts

The current database size is 16GB broken down into:

  • public.posts: 14GB split into 13GB table, 1224MB index
  • public.topics: 521MB split into 434MB table, 87MB index
  • public.post_search_data: 512MB split into 170MB table, 343MB index

The remaining tables are all pretty small, ranging from 87MB as the largest to 176KB.

When observing the database over time, only the posts table seems to grow, and when we graph the pg_db_size against the sidekiq.active_record.sql.count in Graphite, there’s a clear stepping motion at work when PeriodicalUpdates kicks in.

We initially saw this back on a much earlier release, where Sidekiq’s continued retries were causing our PostgreSQL server to jam completely and the dataset to balloon in size, requiring either a full vacuum or a dump/truncate/restore to shrink it back down. We had to change our database timeouts as a result, to prevent too much locking of the central server, and that’s when we added the sidekiq_options retry: false to that job, having discovered thousands of extra jobs stacking up in the wings with timeout errors.

However, I would like to add that in the past hour it seems to be stepping up in much smaller increments, so hopefully this is a plateau and the initial large jumps were related in some way to Sidekiq “catching up” with various changes in the database. The initial period from 11am to 1pm saw regular 200MB jumps every 30 minutes; these have now fallen to about 10MB every 30 minutes.

I’ll keep an eye on it to let you know what happens. The jumping behaviour definitely seems smoother than before, so fingers crossed it won’t reach the insane sizes of before.

(Régis Hanol) #5

Just pushed a fix. The issue was that one query was updating every row in the posts table, and that uses a lot of temporary space (which is easily vacuumed).
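For anyone hitting the same symptom: an UPDATE that touches every row in a large table leaves a dead copy of each row behind, which counts towards the file size until vacuum reclaims it. A hedged sketch for confirming that dead tuples are the culprit on the posts table, using the standard pg_stat_user_tables view:

```sql
-- Dead-tuple counts and last (auto)vacuum times for the posts table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'posts';

-- Plain VACUUM marks dead space reusable (it does not shrink the file);
-- VACUUM FULL rewrites the table and returns space to the OS, but takes
-- an exclusive lock for the duration.
VACUUM (VERBOSE, ANALYZE) posts;
```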

(Ian) #6

Thanks! We’ll try this on the staging server and see how it goes and then roll out to production.

(Sam Saffron) #7

Note: I am recategorizing this to support because it is not strictly a bug. The Postgres instance you have MUST have a sane vacuum setup; if that is not the case, all sorts of nasties can happen, see:
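For reference, one way to check what the autovacuum daemon is actually configured to do on a given instance (standard pg_settings catalog, no assumptions beyond a working psql connection):

```sql
-- Global autovacuum configuration
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';
```

If autovacuum_naptime is long, or the scale factors are high relative to the table sizes involved, a heavily updated table can bloat considerably between vacuum passes.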

(Ian) #8

Thanks again Sam. However, we do have an incredibly sane vacuum setup already running (both via cron/Jenkins and using the autovacuum daemon), as our main PostgreSQL server also hosts our company’s asset tracking databases, which are many, many magnitudes larger in scale; the tables are so large that a simple ALTER TABLE can take hours.

The original database growth was noticed by our main Technical Support staff purely because it was growing at a rate considerably greater than our asset tracking system, which deals with many millions of rows, indicating that something was going wrong with the Discourse installation. It was at that point that we added StatsD support to get full instrumentation and pinpoint the cause of the problem.

Anyway, we will roll out the patch tomorrow and hopefully this will rectify the issue.

(Sam Saffron) #9

Thanks, let me know how you go with the latest.

(Ian) #10

Thanks all, the recent database call changes have definitely helped. Database growth is now down to a respectable 200KB per hour. Much more manageable! :smile:

(Régis Hanol) #11