I’d like to tap into your experience on how to continue scaling our hardware.
Here’s our cornerstone data:
2.2 million posts, growing by ~80k per month
35k users, growing by ~1.5k per month
We’re now running on a Hetzner VM with 8 dedicated cores and 32 GB of RAM. We first hit limitations in December with Nginx worker connections and raised them from the default of 768 to 1536. We hit that limit again yesterday and have now raised them to 3072. After these changes things are running smoothly again, but we’re nearing 100% server utilization.
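For anyone following along, the setting lives in the events { } block of nginx.conf (path below is from a standard package install; it may differ if nginx runs inside a container) and can be applied without dropping connections:
#~> grep -n 'worker_connections' /etc/nginx/nginx.conf   # sits inside events { }
#~> nginx -t && nginx -s reload                          # validate, then reload in place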
db_work_mem is set to 80MB and db_shared_buffers to 8192MB. We’re not using anywhere near all of our available memory, but I’m not sure whether there’s room to use more and actually benefit from it. Thoughts?
#~> free -m
total used free shared buff/cache available
Mem: 31360 6105 1241 8423 24013 16412
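A quick sanity check I can run on the memory question (assuming psql access as the postgres user; the second query only uses standard pg_stat_database counters) - if the buffer cache hit ratio is already around 0.99, more shared_buffers probably won’t buy much:
#~> sudo -u postgres psql -c "SHOW shared_buffers;" -c "SHOW work_mem;"
#~> sudo -u postgres psql -c "SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio FROM pg_stat_database;"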
The next easy scaling option with Hetzner would be 16 cores and 64 GB of RAM, which is fine for us cost-wise, but I wonder whether it still makes sense to scale vertically. I was wondering whether splitting the application and database onto separate servers would make more sense, or whether that introduces a lot more difficulty.
This is mostly about planning ahead. We might not need to scale right now, but given the growth we’re seeing and the increasing resource needs we’ve observed in the past, I want to think about possible next steps now.
Mainly CPU - after today’s update we’ve come down from a load of ~8-9 during peak hours to what looks like ~6 now.
12M per month, up from 6.5M per month a year ago.
I’ll get back to the sar output at a later point; I didn’t have that running.
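In case it helps anyone else following this thread, enabling the collector is quick (Debian/Ubuntu defaults assumed):
#~> apt-get install -y sysstat
#~> sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
#~> systemctl enable --now sysstat
#~> sar -u    # CPU report for today, once samples have accumulated
#~> sar -q    # run queue and load averages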
I was already thinking about that option. We used the non-dedicated CPUs earlier, and the “upgrade” to dedicated ones didn’t move us forward a whole lot.
Yes, I do, but nothing from peak hours yet. I’ll have those by tomorrow.
CPU usage is definitely reduced since I raised worker connections yesterday - not sure by how much, though. I’ll try to put a number on it once sar has more than a day of history (see below the table).
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
08:45:01 AM all 40.22 2.21 7.44 0.27 0.00 49.86
08:55:01 AM all 42.84 2.89 8.02 0.16 0.00 46.09
09:05:01 AM all 38.81 0.86 7.68 0.12 0.00 52.53
09:15:01 AM all 38.80 0.70 7.66 0.10 0.00 52.73
09:25:01 AM all 38.71 2.14 7.88 0.12 0.00 51.16
09:35:01 AM all 38.74 0.84 7.86 0.09 0.00 52.47
09:45:01 AM all 40.31 1.07 7.95 0.10 0.00 50.57
09:55:01 AM all 40.03 1.37 7.90 0.08 0.00 50.62
10:05:01 AM all 39.00 1.29 7.90 0.09 0.00 51.72
10:15:01 AM all 40.26 2.68 8.07 0.09 0.00 48.91
10:25:01 AM all 41.59 0.93 8.31 0.08 0.00 49.09
10:35:01 AM all 40.39 1.55 8.25 0.07 0.00 49.73
10:45:01 AM all 45.44 2.37 9.08 0.08 0.00 43.03
10:55:01 AM all 50.56 2.20 9.23 0.06 0.00 37.95
11:05:01 AM all 41.82 1.54 8.55 0.08 0.00 48.02
11:15:01 AM all 38.74 1.54 8.11 0.10 0.00 51.50
11:25:01 AM all 45.41 1.59 9.27 0.19 0.00 43.55
11:35:01 AM all 38.45 1.78 8.20 0.11 0.00 51.45
11:45:01 AM all 41.03 1.60 8.48 0.14 0.00 48.75
11:55:01 AM all 40.65 1.17 8.36 0.15 0.00 49.67
12:05:01 PM all 40.03 1.29 8.40 0.13 0.00 50.15
12:15:01 PM all 40.47 1.10 8.19 0.11 0.00 50.13
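Once there’s more than a day of history in the sysstat archive, the same window can be compared across days to quantify the improvement (Debian-style archive location assumed; saDD is a placeholder, replace DD with the day of month):
#~> sar -u -f /var/log/sysstat/saDD -s 08:45:00 -e 12:15:00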
I suspected that Postgres was not getting enough memory, which would result in a high %iowait.
But since the sar output above doesn’t indicate an overloaded system, we’ll have to wait for new stats.
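If the memory theory comes up again, two other quick checks (the query uses standard pg_stat_database fields): queries spilling to disk because work_mem is too small show up as temp files, and per-device I/O wait shows up in the sar block device report.
#~> sudo -u postgres psql -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written FROM pg_stat_database ORDER BY temp_bytes DESC LIMIT 5;"
#~> sar -d -p    # per-device utilization and wait times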
I’m trying to help someone else with performance issues. We increased the Postgres buffer size; I think it helped a bit, but mini profiler still shows over 300ms. CPU usage is about 30% across 16 CPUs.
Looks like that’s not going to change. I’ve watched it over two evenings now and didn’t see any higher CPU usage. Your assumption regarding Postgres might be right.
Regarding mini profiler: what numbers do we want to aim for there? We’re at around 300ms right now, down from 500ish before.
I think if we get below 50 ms in all cases, irrespective of peak traffic hours, I would call that a fast server response time. The rest of the site will be faster as well, since everything depends on the initial server response time.
And it would be ideal if it stayed below 50 ms for users in all geographic regions and remained consistent. As of now it keeps jumping around and is mostly higher, and that seems to be the big issue behind the slowness.
As a little review of the last few days: things have kept running smoothly except for about half an hour this afternoon. I wasn’t at the server myself at the time, but we had timeouts from the backend reported in the nginx log. sar shows slightly higher sys utilization but no generally high CPU load. Not sure what it was, but it apparently kept unicorn/redis/postgres from working smoothly.
03:55:01 PM all 34.99 1.87 6.67 0.12 0.00 56.35
04:05:01 PM all 33.99 0.35 6.52 0.31 0.00 58.82
04:15:01 PM all 35.24 1.17 7.14 0.13 0.00 56.31
04:25:02 PM all 36.45 0.63 7.15 0.13 0.00 55.65
> 04:35:01 PM all 39.09 0.71 16.78 0.11 0.00 43.32
> 04:45:01 PM all 35.53 0.95 20.16 0.08 0.00 43.27
> 04:55:01 PM all 41.64 4.29 15.44 0.24 0.00 38.39
05:05:01 PM all 36.75 2.47 7.78 0.13 0.00 52.87
05:15:01 PM all 35.96 1.29 7.81 0.10 0.00 54.85
05:25:01 PM all 38.69 1.35 8.00 0.09 0.00 51.87
05:35:01 PM all 37.01 4.53 7.92 0.07 0.00 50.46
It’s the > marked lines.
I can’t see any high traffic at that time, so I’m kind of clueless as to what it was, but it really only happened in that half hour.
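One way to narrow the window down further would be to bucket the timeout entries in the nginx error log per minute (log path is an assumption - adjust to wherever your nginx actually logs):
#~> grep 'upstream timed out' /var/log/nginx/error.log | awk '{ print $1, substr($2, 1, 5) }' | sort | uniq -c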
Overall, looking at CPU time, Redis has taken the lead and generally uses a good part of the CPU - usually around 20-25%, peaking at 35% here and there. Not sure if there’s anything that can be done about it (see the redis-cli sketch after the process list).
1721566 uuidd 20 0 2035M 1458M 1952 S 16.1 4.6 16h22:37 /usr/bin/redis-server *:6379
1721578 www-data 20 0 108M 69100 12516 S 6.7 0.2 6h32:27 nginx: worker process
2853756 1000 20 0 1355M 444M 19356 R 63.7 1.4 2h38:10 unicorn worker[0] -E production -c config/unicorn.conf.rb
2854380 1000 20 0 1267M 409M 18768 S 41.6 1.3 2h19:45 unicorn worker[4] -E production -c config/unicorn.conf.rb
1721598 1000 20 0 592M 285M 5192 S 1.3 0.9 2h08:53 unicorn master -E production -c config/unicorn.conf.rb
575 root 20 0 1747M 20468 5040 S 0.7 0.1 2h01:02 /usr/bin/containerd
2854731 1000 20 0 1280M 399M 17880 S 36.9 1.3 1h57:52 unicorn worker[7] -E production -c config/unicorn.conf.rb
1721841 1000 20 0 592M 285M 5192 S 0.7 0.9 1h49:49 unicorn master -E production -c config/unicorn.conf.rb
2855284 1000 20 0 1287M 425M 18396 S 18.8 1.4 1h35:02 unicorn worker[3] -E production -c config/unicorn.conf.rb
2856414 1000 20 0 1223M 391M 19268 S 13.4 1.2 1h14:50 unicorn worker[2] -E production -c config/unicorn.conf.rb
2856478 1000 20 0 1207M 401M 21120 S 5.4 1.3 58:42.50 unicorn worker[5] -E production -c config/unicorn.conf.rb
2856503 1000 20 0 1215M 389M 18980 S 4.7 1.2 47:22.95 unicorn worker[1] -E production -c config/unicorn.conf.rb
1721581 www-data 20 0 69888 28636 13368 S 0.0 0.1 44:49.50 nginx: worker process
2857467 1000 20 0 1199M 385M 18112 S 4.0 1.2 39:23.87 unicorn worker[6] -E production -c config/unicorn.conf.rb
1721594 _apt 20 0 8479M 20036 18128 S 1.3 0.1 32:55.29 postgres: 13/main: walwriter
580 root 20 0 1747M 20468 5040 S 0.0 0.1 32:15.27 /usr/bin/containerd
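For poking at what Redis is actually busy with, the standard redis-cli introspection commands should help (assuming the default port is reachable from wherever you run this):
#~> redis-cli INFO stats | grep -E 'instantaneous_ops_per_sec|total_commands_processed'
#~> redis-cli INFO commandstats | head -20    # which command types account for the calls and time
#~> redis-cli SLOWLOG GET 10                  # any individually slow commands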
Load is averaging around 5 long-term now, but I still see it going up to 8 or 9 from time to time.
Regarding mini profiler: I was talking about overall load time. Server response time (initial request) is usually below 10ms on our end - sometimes a little above, but I haven’t seen anything >20ms in my recent checks. Overall load time goes as far up as 800ms sometimes, very rarely above 1000ms.
We’re only looking at one geo region, as we don’t have a significant number of users outside it.
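If it helps to compare notes, a quick client-side way to separate initial response time from overall transfer time (the URL is a placeholder for your forum; client-side rendering isn’t included here, so the ~800ms overall load time won’t show up in these numbers):
#~> curl -s -o /dev/null -w 'ttfb: %{time_starttransfer}s  total: %{time_total}s\n' https://forum.example.com/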