I just hit my CPU cap on the Digital Ocean 2GB/2xCPU plan


(ljpp) #1

Today something I have been expecting for a while finally happened - things got busy on the forum, and I hit the CPU cap of the 2GB/2x CPU-core plan at Digital Ocean. Things slowed down considerably, but the site remained online.

Now the problem is that the next level in Digital Ocean’s plans offers 4GB RAM and more SSD storage, but still only 2 CPU cores. I am aware that the extra RAM will take some pressure off the CPU due to larger read-write buffers, but will that be enough? Their quad-core plan costs $80, which is more than I am willing to spend.

I would hate to move elsewhere as I have been happy with DO overall - it’s just that their next plan upgrade seems less than ideal for me. Any thoughts or advice on VPS providers with datacenters in Europe and high SLA?


(Matt Palmer) #2

Wow, hit the CPU limits… that’s pretty impressive. It’s usually RAM that gives out before CPU.

One thing you might want to consider is splitting your container into separate data/web containers, and then putting the separate containers onto separate $20/month droplets. That way, the CPUs of one droplet will be dedicated to doing database/redis stuff, and the CPUs of the other droplet will only be doing web work. Essentially, you get four cores for the price of two!
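
For reference, the standard discourse_docker repository ships sample configs for exactly this split (`samples/data.yml` for PostgreSQL/Redis, `samples/web_only.yml` for the web side). A minimal sketch of the web droplet’s config, where `10.0.0.2` is a placeholder for the data droplet’s (ideally private-network) IP:

```yaml
# containers/web_only.yml on the web droplet -- a sketch, not a full config.
env:
  DISCOURSE_DB_HOST: 10.0.0.2        # PostgreSQL lives on the data droplet
  DISCOURSE_DB_PASSWORD: "changeme"  # must match the password set in data.yml
  DISCOURSE_REDIS_HOST: 10.0.0.2     # Redis lives there too
```

You bootstrap and run each container with the usual `./launcher bootstrap <name>` / `./launcher start <name>` on its respective droplet.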

Yes, this is a bit more complicated than just whacking it all in one place. Welcome to systems administration. :grinning:


(ljpp) #3

I have actually thought about that, but I’m not sure I want the added complexity - two droplets to maintain, and so on.


(Sam Saffron) #4

Was stuff paging like crazy when you hit the CPU limit? Which processes had high CPU?


#5

I’d be looking to a dedi if you’re running into high CPU loads.

If you can maintain everything yourself and want to risk a long wait time for support in the event of hardware failure:

https://www.kimsufi.com/uk/


(Matt Palmer) #6

I like Hetzner, myself, for dedi hardware in Europe. I use their Server Bidding system to get stupid-cheap systems, and the support for hardware-level faults (like HDD failures) is entirely acceptable.


(Jeff Atwood) #7

I am with @mpalmer here - I think it’s extremely unlikely you hit real CPU limits with Discourse. I think something else is happening.

Believe me we do a lot of Discourse hosting and CPU is never the limiting factor for any of the hundreds of sites we host… CPU is important for response times, of course, but it’s rarely the bottleneck itself.

You should look at this a lot more closely before concluding CPU is the bottleneck.


(Michael - DiscourseHosting.com) #8

Are you really sure that the CPU cap was hit because your forum got busy?
We have never seen something like that. Like Matt says, RAM runs out way before the CPU on such a setup.
Can you quantify this a bit? How many active visits are you talking about?


(Stephen) #9

Indeed, this is typical for a DO 2/2 box running Discourse:


(ljpp) #10

Awesome that this topic has raised some interest, as I am not a back-end guru. Unfortunately I don’t run a proper server monitoring tool, so the data is limited. From my (non-guru) point of view hitting the CPU cap was logical: according to Digital Ocean’s panel the typical load hovers around 20-30%, yesterday we got hit by two big traffic spikes, and we definitely had 4-5x the usual activity from guests and members alike.

I want to emphasize that I am not putting any blame on Discourse. It was my estimate all along that I would need to upgrade at some point. I am just seeking counsel on hosting options, regarding CPU cores/power.

The first thing to understand is that this is a sports fan forum, and the user behavioral patterns are nothing like what you see on tech forums. People get emotional, things escalate. This is hockey, our national sport, so think of it as a Super Bowl on a miniature scale. Positive news raises interest at the national level, and lots of guests come to read. Negative news causes strong reactions from members, who come to the forum to share their frustration. Yesterday we had both. First there was a big player transaction in the morning (it knocked the team’s official site offline), which caused a massive spike in traffic (the first minor lag) and also increased guest traffic (2-3x) throughout the whole day. Then in the evening the team was destroyed in an away game, which activated the frustrated fan base and caused a more serious period of lag.

Google analytics reports for yesterday:

  • 9,000 sessions
  • 78,000 page views
  • At least 220+ concurrent active sessions. Could have been more at some point, when I wasn’t looking.

This is the 24h graph from the Discourse panel. On the left you see typical quiet evening traffic (20% load), then the night hours (near zero), and then the two events I mentioned above. Note that the 24h graph shows the load as an average over a time period; the real-time report showed spikes at 95-100%. The IO graph shows some spiking for reads, which to some extent correlates with the CPU load.

This `top` screenshot I took around 8:24PM, slightly after the evening spike and lag, which was at its worst around 8PM and a little after. At that time I saw load averages >3.50, and here things are starting to settle down. The RAM status always looks pretty much like that, no matter what the load.

Any thoughts? If RAM is the issue, then I am more than happy to just upgrade one step at Digital Ocean and stay there, even though they seem to be the only VPS provider offering just dual-core on a 4GB RAM plan, with no apparent intention of changing that.


(Jeff Atwood) #11

wait wait I think I know how this works

More seriously, are you running a CDN? If not, you need to set that up urgently! Serving assets directly from Discourse is a substantial drain on resources, when that work could be outsourced to a geolocated caching service much closer to the user and not hit your server at all.
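
For anyone reading along: in a standard Docker install this is the `DISCOURSE_CDN_URL` setting in `app.yml`, followed by a rebuild. A sketch, where `cdn.example.com` is a placeholder for a CDN hostname you have configured to pull from your forum’s origin:

```yaml
# app.yml -- asset URLs will be rewritten to point at the CDN.
env:
  DISCOURSE_CDN_URL: https://cdn.example.com
```

Then apply it with `./launcher rebuild app`.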


(ljpp) #12

Close enough, but we are not Canadians :slightly_smiling:

Yes, I run CloudFlare’s freebie and it helps a lot.

One more thing about user behavior. When something interesting happens, a game or news, the biggest fans seem to hang around online and use the forum in a chat-app fashion. Which is great - Discourse supports this kind of interaction very well, being dynamic and responsive, whereas our old SMF was neither. So we have (potentially) hundreds of people online, constantly receiving notifications and commenting in the threads.

So unlike many fellow Discourse users above, I am not surprised that the CPU is loaded. But as I said, I am not a back-end server guru - so please enlighten me, based on the data I was able to provide.

  • Should I go with Digital Ocean’s 4GB dual-core plan? Could the increased RAM reduce the CPU load, as more data would sit in the buffers? According to the graphs the disk writes are rather constant, but the disk read graph starts to jitter under heavy load (swapping?)
  • Or do you think a 4GB quad-core plan is worth the migration effort? There are tons of VPS options out there, as almost everyone else offers quad-core at this size/price range.
  • What about OVH.com? Their CLOUD3 offer is impressive:
    • Quad-core
    • 8GB RAM
    • 99.99% SLA
    • $33.49 / month

(Rafael dos Santos Silva) #13

Maybe your server is just RAM-hungry and paging a lot - and since Linux adds disk wait time to the load average, that would explain the increased load.
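
A quick way to sanity-check this hypothesis (Linux-only; reads the kernel’s memory counters):

```shell
# If SwapTotal is non-zero and SwapFree is much lower, the box has been
# paging at some point since boot. For live paging activity, watch the
# si/so columns of `vmstat 1` during a traffic spike instead.
grep -E 'SwapTotal|SwapFree' /proc/meminfo
```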

A good plan would be:

  1. Set up a CDN.
  2. If the load is still high, add more RAM.
  3. If the load is still high, search for a new VPS provider.

(ljpp) #14
  1. Yes, we have CloudFlare free plan in place.
  2. It is
  3. Yeah, maybe I should upgrade at DigitalOcean first and see how it goes. The problem is, there is no way of telling when the next “perfect storm” will take place. So it will be too late if it turns out that the extra RAM does not help.

(ljpp) #15

So, I went for the 4GB RAM upgrade using Digital Ocean’s flexible resize. Could an expert please advise whether I should modify these settings in app.yml, which are currently at their default values? As said above, the primary target is to maximize tolerance to large peaks in activity.

  ## Set db_shared_buffers to a max of 25% of the total memory.
  ##
  ## On 1GB installs set to 128MB (to leave room for other processes)
  ## on a 4GB instance you may raise to 1GB
  #db_shared_buffers: "256MB"
  #
  ## Set higher on large instances it defaults to 10MB, for a 3GB install 40MB is a good default
  ## this improves sorting performance, but adds memory usage per-connection
  #db_work_mem: "40MB"
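
Following the comments above literally, a 4GB droplet might end up with something like this - an illustration of my reading of those comments, not official guidance:

```yaml
## Uncommented values for a 4GB instance, per the guidance in the
## comments above (25% of total RAM for shared buffers).
db_shared_buffers: "1GB"
db_work_mem: "40MB"   # per-operation memory; see the caveat below
```

Changes to these take effect after a `./launcher rebuild app`.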

It seems that, at least initially, the CPU load has settled at a slightly lower level with 4GB RAM. The users are already back, based on Google Analytics.


(Rafael dos Santos Silva) #16

If my PostgreSQL-fu isn’t rusty, work_mem is allocated per sort/join operation, so it’s very tricky to get right. Discourse does a lot of joins, and Active Record isn’t afraid to generate big queries when needed, so I really recommend you stick to the recommended values for your RAM amount.


(Jeff Atwood) #17

This is correct. However, you may want to increase the number of Unicorn workers a bit now that you have more memory.
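
In a Docker install this is the `UNICORN_WORKERS` env setting in `app.yml`. A common rule of thumb (my assumption, not an official formula) is roughly one worker per CPU core plus one, memory permitting, since each worker is a separate Ruby process that serves one request at a time:

```yaml
env:
  UNICORN_WORKERS: 3   # e.g. cores + 1 on a 2-core box, if RAM allows
```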


(ljpp) #18

It does not look very promising. We reached 60% CPU load again today, even with the upgraded 4GB RAM. The disk IO graph has calmed down a bit, though. I’m going to be in trouble when the playoffs begin… :scream:


(ljpp) #19

How does the number of unicorns impact the CPU load? Currently I have the default value 3.


(Matt Palmer) #20

With the data you’ve provided, my recommendation would be to get more data. Install some sort of periodic stats-gathering software (hell, install sysstat and edit the crontab to run every minute; that’ll collect good enough stats). Then you can see all sorts of things, like whether CPU usage really is what’s causing you problems, or if you’re actually swapping, or what else is going on.
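
Concretely, on a Debian/Ubuntu droplet that could look something like this (package and file names are the Debian defaults; other distros differ):

```shell
# Install and enable the sysstat collector:
#   apt-get install -y sysstat
#   sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
#
# In /etc/cron.d/sysstat, change the default 10-minute interval to 1 minute:
#   * * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
#
# Then, after the next traffic spike, read the history back:
#   sar -u    # CPU utilisation (compare %user against %iowait)
#   sar -r    # memory usage over time
#   sar -W    # swapping -- nonzero pswpin/s or pswpout/s means you are paging
```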

Meh, for that money I’d get a dedicated server from Hetzner… 16GB RAM, i7-2600, 2x3TB HDDs for 31 euro (at the moment, via Server Bidding). Not a VPS, so there’s no chance of contention causing you problems, and quite frankly, by the time you max out that box you’ll have enough traffic to justify having someone manage it for you.