How to scale further (2+MM posts, +80k per month)

I’d like to tap into your experience on how to continue scaling our hardware.

Here’s our cornerstone data:

2.2 million posts, currently growing ~80k per month
35k users, growing ~1.5k per month

We’re now running on a Hetzner VM with 8 dedicated cores and 32 GB of RAM. We started to see limitations in December, first with Nginx worker connections, and raised them from the default of 768 to 1536. We hit that limit again yesterday and raised them to 3072. After these changes things are running smoothly again, but we’re nearing 100% server utilization.
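For reference, a rough way to see how close nginx is getting to that ceiling, using standard tools (on the stock Discourse Docker install nginx runs inside the container, so run this in there):

ss -s                                    # overall socket summary
ss -tan state established | wc -l        # established TCP connections right now
nginx -T 2>/dev/null | grep -E 'worker_(processes|connections)'   # effective limits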

db_work_mem is set to 80MB and db_shared_buffers to 8192MB. We’re not using anywhere near all of our available memory, but I’m not sure whether there’s room to use more and actually benefit from it. Thoughts?

#~> free -m

              total        used        free      shared  buff/cache   available
Mem:          31360        6105        1241        8423       24013       16412
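For reference on the memory question: one check worth doing is the Postgres buffer cache hit ratio; if it is already around 99%, more shared_buffers is unlikely to buy much. A sketch, assuming the stock Discourse Docker install (Postgres inside the app container, database named discourse):

cd /var/discourse && ./launcher enter app      # drop into the app container
su postgres -c "psql discourse -c \"
  SELECT round(sum(blks_hit) * 100.0 / nullif(sum(blks_hit + blks_read), 0), 2) AS cache_hit_pct
  FROM pg_stat_database;\""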

The next easy scaling option with Hetzner would be 16 cores and 64 GB of RAM, which is fine for us in terms of cost, but I wonder whether it still makes sense to scale vertically. I was wondering whether splitting the application and database onto separate servers would make more sense, or whether it introduces a lot more difficulty.

Who’s done that? What are your experiences?

Why are you looking to scale? Are your API response time metrics degrading?

What resource are you referring to here? CPU? Disk space? Memory?

How many page views are you getting?

Ignore all the advice you find on the internet and set db_shared_buffers to 40-50% of your memory.
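For scale: 40-50% of 32 GB works out to roughly 13-16 GB. A sketch of what that change would look like, assuming this refers to db_shared_buffers in containers/app.yml on the stock Docker install:

# containers/app.yml, params section (value shown is an example at roughly 40% of RAM):
#   db_shared_buffers: "13312MB"
# Rebuild so the new value ends up in postgresql.conf:
cd /var/discourse && ./launcher rebuild app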

Can you share some sar output?

Because I’m planning ahead. We might not need to scale right now, but given the growth we’re seeing and the increasing resource needs we’ve seen in the past, I want to think about possible next steps now.

Mainly CPU. After today’s update we’ve come down from a load of ~8-9 at peak hours to what looks like ~6 now.

12M page views per month, up from 6.5M per month a year ago.

I’ll get back to you with the sar output later; I didn’t have it running.
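In case it helps, a sketch of getting sar history collected, assuming a Debian/Ubuntu host where it comes from the sysstat package:

sudo apt-get install sysstat
sudo systemctl enable --now sysstat      # start the periodic collectors
# On older releases you may also need ENABLED="true" in /etc/default/sysstat.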

You will want to investigate what exactly is using that CPU. It can be:

  • Unicorn web workers
  • PostgreSQL
  • Sidekiq background jobs, such as image optimization
  • Redis
  • nginx (unlikely)

Depending on the culprit, we can suggest ways to offload that.
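A quick sketch for narrowing that down with standard tools (pidstat ships with sysstat; the interval and count below are just examples):

pidstat -u 5 3                                           # per-process CPU over 3 x 5-second samples
ps -eo pid,user,comm,%cpu,%mem --sort=-%cpu | head -15   # snapshot sorted by CPU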

Looking at CPU time, I’d say it’s mainly the unicorn workers.

Maybe ditch the dedicated CPU thing and use this offering here?

[image: Hetzner cloud offering]

Sounds like something more appropriate for your needs.

I was already thinking about that option. We used the non-dedicated CPUs earlier, and the “upgrade” to dedicated ones didn’t move us forward a whole lot.

I’m not sure if CPU is your bottleneck. Do you have some sar output to share in the meanwhile?

Yes, I do, but nothing from peak hours yet. Will have those by tomorrow.

CPU usage is definitely reduced since I raised the worker connections yesterday, though I’m not sure by how much.

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle

08:45:01 AM     all     40.22      2.21      7.44      0.27      0.00     49.86
08:55:01 AM     all     42.84      2.89      8.02      0.16      0.00     46.09
09:05:01 AM     all     38.81      0.86      7.68      0.12      0.00     52.53
09:15:01 AM     all     38.80      0.70      7.66      0.10      0.00     52.73
09:25:01 AM     all     38.71      2.14      7.88      0.12      0.00     51.16
09:35:01 AM     all     38.74      0.84      7.86      0.09      0.00     52.47
09:45:01 AM     all     40.31      1.07      7.95      0.10      0.00     50.57
09:55:01 AM     all     40.03      1.37      7.90      0.08      0.00     50.62
10:05:01 AM     all     39.00      1.29      7.90      0.09      0.00     51.72
10:15:01 AM     all     40.26      2.68      8.07      0.09      0.00     48.91
10:25:01 AM     all     41.59      0.93      8.31      0.08      0.00     49.09
10:35:01 AM     all     40.39      1.55      8.25      0.07      0.00     49.73
10:45:01 AM     all     45.44      2.37      9.08      0.08      0.00     43.03
10:55:01 AM     all     50.56      2.20      9.23      0.06      0.00     37.95
11:05:01 AM     all     41.82      1.54      8.55      0.08      0.00     48.02
11:15:01 AM     all     38.74      1.54      8.11      0.10      0.00     51.50
11:25:01 AM     all     45.41      1.59      9.27      0.19      0.00     43.55
11:35:01 AM     all     38.45      1.78      8.20      0.11      0.00     51.45
11:45:01 AM     all     41.03      1.60      8.48      0.14      0.00     48.75
11:55:01 AM     all     40.65      1.17      8.36      0.15      0.00     49.67
12:05:01 PM     all     40.03      1.29      8.40      0.13      0.00     50.15
12:15:01 PM     all     40.47      1.10      8.19      0.11      0.00     50.13

What would be some useful sar stats to consider?

I suspected that Postgres was not getting enough memory, which would result in a high %iowait.
But since the sar output above does not indicate an overloaded system, we’ll have to wait for new stats.
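On the question above about useful sar stats, a few standard views worth checking (stock sysstat flags; interval and count are illustrative):

sar -u 60 5    # CPU: %user / %system / %iowait / %steal
sar -q 60 5    # load average and run queue length
sar -r 60 5    # memory utilization
sar -W 60 5    # swapping activity
sar -b 60 5    # overall disk I/O transfer rates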

I’m trying to help someone else with performance issues. We increased the Postgres buffer size, and I think it helped a bit, but mini profiler still shows over 300 ms. CPU usage is about 30% across 16 CPUs.

Looks like that’s not going to change. I’ve watched it over two evenings now and didn’t see any higher CPU usage. Your assumption regarding Postgres might be right.

Regarding mini profiler: what numbers should we be aiming for there? We’re at around 300 ms right now; we were at 500-ish before.

I think if we achieve below 50 ms in all cases, irrespective of peak traffic hours, I would call that a fast server response time. The rest of the site will be faster too, as everything depends on the initial server response time.

It would also be ideal if it stayed below 50 ms for users in all geographic regions and remained consistent. As of now it keeps jumping around and is mostly higher, and that seems to be the big issue behind the slowness.
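A rough way to spot-check initial server response time from the shell (the URL is a placeholder; the timings come from curl’s standard --write-out variables):

curl -s -o /dev/null \
  -w "connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" \
  https://your-forum.example/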

As a little review of the last few days: things have kept running smoothly except for about half an hour this afternoon. I wasn’t at the server myself at the time, but we had backend timeouts reported in the nginx log. sar shows slightly higher %system utilization but no generally high CPU load. Not sure what it was, but it apparently kept unicorn/redis/postgres from working smoothly.

03:55:01 PM     all     34.99      1.87      6.67      0.12      0.00     56.35
04:05:01 PM     all     33.99      0.35      6.52      0.31      0.00     58.82
04:15:01 PM     all     35.24      1.17      7.14      0.13      0.00     56.31
04:25:02 PM     all     36.45      0.63      7.15      0.13      0.00     55.65
> 04:35:01 PM     all     39.09      0.71     16.78      0.11      0.00     43.32
> 04:45:01 PM     all     35.53      0.95     20.16      0.08      0.00     43.27
> 04:55:01 PM     all     41.64      4.29     15.44      0.24      0.00     38.39
05:05:01 PM     all     36.75      2.47      7.78      0.13      0.00     52.87
05:15:01 PM     all     35.96      1.29      7.81      0.10      0.00     54.85
05:25:01 PM     all     38.69      1.35      8.00      0.09      0.00     51.87
05:35:01 PM     all     37.01      4.53      7.92      0.07      0.00     50.46

It’s the lines marked with >.

I can’t see any high traffic at that time, so I’m kind of clueless as to what it was, but it really only happened in that half hour.
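For reference, sar keeps per-day history, so a window like that can be replayed after the fact. A sketch, assuming Debian/Ubuntu paths where the daily files live under /var/log/sysstat/saDD:

SAFILE=/var/log/sysstat/sa$(date +%d)           # today's data file (DD = day of month)
sar -u -f "$SAFILE" -s 16:30:00 -e 17:00:00     # CPU breakdown for the window
sar -q -f "$SAFILE" -s 16:30:00 -e 17:00:00     # load average / run queue
sar -w -f "$SAFILE" -s 16:30:00 -e 17:00:00     # context switches and task creation
sar -W -f "$SAFILE" -s 16:30:00 -e 17:00:00     # swapping activity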

Overall, looking at CPU time, Redis has taken the lead and generally uses a good part of the CPU. Not sure if there’s anything that can be done about it. It’s usually at around 20-25%, peaking at 35% here and there.

1721566 uuidd      20   0 2035M 1458M  1952 S 16.1  4.6 16h22:37 /usr/bin/redis-server *:6379
1721578 www-data   20   0  108M 69100 12516 S  6.7  0.2  6h32:27 nginx: worker process
2853756 1000       20   0 1355M  444M 19356 R 63.7  1.4  2h38:10 unicorn worker[0] -E production -c config/unicorn.conf.rb
2854380 1000       20   0 1267M  409M 18768 S 41.6  1.3  2h19:45 unicorn worker[4] -E production -c config/unicorn.conf.rb
1721598 1000       20   0  592M  285M  5192 S  1.3  0.9  2h08:53 unicorn master -E production -c config/unicorn.conf.rb
    575 root       20   0 1747M 20468  5040 S  0.7  0.1  2h01:02 /usr/bin/containerd
2854731 1000       20   0 1280M  399M 17880 S 36.9  1.3  1h57:52 unicorn worker[7] -E production -c config/unicorn.conf.rb
1721841 1000       20   0  592M  285M  5192 S  0.7  0.9  1h49:49 unicorn master -E production -c config/unicorn.conf.rb
2855284 1000       20   0 1287M  425M 18396 S 18.8  1.4  1h35:02 unicorn worker[3] -E production -c config/unicorn.conf.rb
2856414 1000       20   0 1223M  391M 19268 S 13.4  1.2  1h14:50 unicorn worker[2] -E production -c config/unicorn.conf.rb
2856478 1000       20   0 1207M  401M 21120 S  5.4  1.3 58:42.50 unicorn worker[5] -E production -c config/unicorn.conf.rb
2856503 1000       20   0 1215M  389M 18980 S  4.7  1.2 47:22.95 unicorn worker[1] -E production -c config/unicorn.conf.rb
1721581 www-data   20   0 69888 28636 13368 S  0.0  0.1 44:49.50 nginx: worker process
2857467 1000       20   0 1199M  385M 18112 S  4.0  1.2 39:23.87 unicorn worker[6] -E production -c config/unicorn.conf.rb
1721594 _apt       20   0 8479M 20036 18128 S  1.3  0.1 32:55.29 postgres: 13/main: walwriter
    580 root       20   0 1747M 20468  5040 S  0.0  0.1 32:15.27 /usr/bin/containerd

Load is averaging around 5 long-term now, but I still see it going up to 8 or 9 from time to time.
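On the Redis CPU share, a sketch of ways to see what it is spending time on (standard redis-cli commands; on the stock Docker install run them inside the container):

redis-cli info stats | grep instantaneous_ops_per_sec   # current command throughput
redis-cli info commandstats                             # per-command call counts and time
redis-cli slowlog get 10                                # ten slowest recent commands
redis-cli --latency                                     # sample round-trip latency (Ctrl-C to stop)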

Regarding mini profiler: I was talking about overall load time. Server response time (initial request) is usually below 10 ms on our end, sometimes a little above, but I haven’t seen >20 ms in my last checks. Overall load time sometimes goes as high as 800 ms, and very rarely above 1000 ms.

We’re only looking at one geo region, as we don’t have a significant number of users outside it.

I recommend looking into DISCOURSE_ENABLE_PERFORMANCE_HTTP_HEADERS which should also give you insight into where you can improve performance.
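A sketch of how that would typically be wired up, assuming the stock Docker install where it goes into the env section of containers/app.yml (the curl URL is a placeholder):

# containers/app.yml, env section:
#   DISCOURSE_ENABLE_PERFORMANCE_HTTP_HEADERS: true
cd /var/discourse && ./launcher rebuild app
# Then inspect the extra timing headers on a response:
curl -s -o /dev/null -D - https://your-forum.example/latest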
