Performance, Scaling, and HA requirements

Hi all,

I am working on upgrading a busy Drupal site with a very active forum. This forum sees about 2 million unique pageviews per month, with peaks of about 3 to 4 million per hour for a few weeks a year. We have about 1,500 active forum participants who typically browse logged in, and we see a few hundred posts per day on average, which can reach thousands per hour during busy times.

Despite throwing tons of resources at it, the site invariably burns to the ground during peak times. The forum is one of the main culprits, so we restrict forum write access during busy periods, which obviously defeats the purpose somewhat.

I am considering moving the forums to Discourse, although the UI might be a bit too advanced for the regular forum audience. As we are on a tight hosting budget, and since we have high peak loads, “how much equipment” is pretty much our first question. I have browsed and searched the forum for similar posts, but have not found too much that aligns with our requirements and expectations.

Besides “don’t break when we have traffic”, we also have a “load under 1 sec” requirement for anonymous users. For logged-in users sub-second load times are desirable, but we’d be happy with under 5 seconds. What caching considerations do we need to look at? Do we need to do something else with the Docker image?

Finally, we are looking at HA. Is there a best-practice doc I have missed somewhere? Are there known best practices for running Discourse in HA?

Many thanks

2 Likes

[quote=“mdekkers, post:1, topic:60098”]
although the UI might be a bit too advanced for the regular forum audience
[/quote]

While I can’t help you with your server requirements, I will go out on a limb and say your current audience will adapt to Discourse with little fuss, since the UI is very intuitive. What forum software will you be migrating from, do you have a lot of custom plugins installed, and what hardware specs is it currently running on?

4 Likes

Given the highly variable (but, I assume, predictable) load spikes, an elastic service like AWS will probably be your best bet; you can chuck more app servers into the pool during load spikes to keep things going. There’s nothing special about Discourse from a scaling and HA perspective, it’s just a regular ol’ Docker-enabled Rails application with PostgreSQL and Redis; any guide out there dealing with scaling those technologies will give you what you need to know to make Discourse do well. That’s not to say it’s easy; reliably scaling any technology in the face of large load spikes takes a fair chunk of work, but there’s nothing Discourse-specific in there.

6 Likes

Hi Matt, thanks for your reply. I’m fairly familiar with scaling, load balancing, and implementing HA. I’m after some idea of what it will take, in terms of resources, to provide a fast user experience. I’m not a big fan of running interconnected Docker containers in prod: in my experience the networking has been flaky and poorly performing, and there have been issues with storage, especially for the DB. I’d much prefer to be able to deploy without Docker, but that appears to be “unsupported”.

As much as I like Discourse, I’m not yet convinced that it makes a good fit for us.

All our infrastructure runs on Docker, and we do a … lot… of traffic with the customers we have now.

To be honest, Docker has historically been a pretty bad citizen in terms of having lots of hideous bugs that cause real production problems. Ask @mpalmer about that; he has the scars. But the good news is that from 1.12.5 onward, Docker has gotten religion about fixing their stuff before moving on to brave-new-world features.

(and with some prodding from us, I might add…)

6 Likes

The only part you have to run in Docker is the application itself; Redis, PostgreSQL, etc. can all be run without Docker, and IMO that’s how it should be done at scale, if you already know how to configure and run those services. The all-in-one (“standalone”) and data/app split containers that we provide are made available for the convenience of people running small setups who don’t have the knowledge and experience to do all the setup themselves.

Running the app itself in Docker is a no-brainer, though, and I’d strongly recommend against trying to do it any other way. As @codinghorror said, I’ve got the scars from Docker’s many bugs and inane design decisions, I don’t trust Docker as far as I can comfortably spit a rat, and I’ve also got nearly a decade’s worth of experience deploying (and scaling) Rails apps. Even with all that behind me, if I were going out and running my own Discourse site, I’d still run the app servers in Docker. You couldn’t get me to put Redis and pg into containers if you had a gun to my head, but I wouldn’t dream of running the app itself “by hand”. There’s just too much twisty-passage craziness to unwind and reproduce, and because the official deployment artifact is the Docker image, nobody’s going to warn me in advance if something comes down the pipe that requires a change to my artisanal hand-crafted appserver environment.

10 Likes

Thanks, @codinghorror and @mpalmer, for taking the time to write me a thoughtful reply. I appreciate that your time is limited, and I’m grateful for the time you took and the assistance you offered.

@mpalmer your suggestion to run only the app in Docker is in line with my own thinking. I have some limited experience with running RoR apps in production, and it wasn’t all nice :smile: I don’t have an issue running Redis (we already have a cluster for a few other things), and pg also won’t be too much of a problem.

I would also like to get some kind of idea of the initial resources required. I do appreciate that this varies from site to site, and that I will have to find my own balance somewhere along the line, but I’d like some idea of how to spec things initially: I don’t really want to over- or under-provision too much. Let me rephrase my question. Assuming a 4GB DigitalOcean host for the DB, and a similar gaggle of hosts running Redis, what kind of appservers would I need to run the app, allowing for about 300 posts in a 16-hour window? I’d have at least 3 behind HAProxy.

1 Like

Number of posts made is about the least interesting metric, even more so over such a long time period. Page views are far, far more relevant for determining needed resources. It can be a bit tricky to compare the requests and traffic patterns of a “traditional”, mostly server-generated forum to a Discourse forum, because Discourse is very API-driven: we often serve multiple HTTP hits per “page view”, but we service each HTTP request a lot quicker, so the forum appears a lot more responsive to the user, as shown by this breakdown of just the dynamically-generated response times:

Most page-generation-oriented “traditional” forums would, for the same level of user behaviour, probably have a lower volume of requests, but they’d be pushed a lot further to the right; it’s rare for a traditional forum to be generating a majority of responses in under 100 milliseconds.

I’m not putting up that graph to brag about how good Discourse is (although it is rather impressive, IMBO), but rather to highlight that the way you think about provisioning capacity for a Discourse site can be a little different to how you’d figure out how many, say, php-fpm workers to keep in stock.

A typical, say, Magento site (which I dealt with a lot in my previous role) might take 1000ms or more to generate a page (I shit you not; Magento is a dog). You’d plan on having at least one php-fpm worker per page view per second, to guarantee no contention. As soon as your request rate exceeds your capacity, user experience goes straight to hell, because every queued request adds a full second to the TTFB while it waits behind another request that’s also taking a whole second to process.

Discourse, on the other hand, is making many smaller requests, so even if (and that’s a big “if”) it took a second’s worth of requests to render a page, with each of them taking somewhere around 100ms, the apparent responsiveness of the site is improved, because each request gets serviced quicker. This is the same principle at work as OS multitasking: keep the time slices as small as possible to improve interactive responsiveness, even if it costs a little more in context switch overhead.
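
To put some toy numbers on that (the per-request times are the ones discussed above; the queue depth of three is made up purely for illustration):

```python
# Toy model: one worker, several requests queued in front of you.
# The last request in the queue can't even start until everything
# ahead of it has been serviced.

def worst_case_wait_ms(queued_requests, service_time_ms):
    return queued_requests * service_time_ms

# "Traditional" forum: each page view is one ~1000ms server-generated page.
# Three page views stacked up on one worker means the last user waits:
print(worst_case_wait_ms(3, 1000))  # 3000ms before their page even starts

# Discourse-style: the work is sliced into ~100ms requests, so any single
# request only ever queues behind other ~100ms requests:
print(worst_case_wait_ms(3, 100))   # 300ms, and useful responses flow sooner
```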

Even then, though, most of the requests that a Discourse site processes via unicorn are purely “async”, tracking activity and so on. For example, here’s a relative breakdown of the routes that are most often hit:

(Y-axis scale deliberately filed off, because it isn’t the exact numbers that matter, it’s the relative weightings)

Leading the pack is topic/timings, which is a purely background (async) route that gets POSTed to in order to record “this user took this long to read these posts”. That counts towards both “how long has someone spent reading” (for trust level calcs, amongst other things) and the little “how long does it take someone to read this topic” figure that comes up when you load long topics.

The next route by request volume, showing avatars, is dynamic because avatars come in a ridiculous number of sizes, so we often have to generate new ones on the fly. Worst case, a single “show me some posts” request could result in 20 requests to the various avatar display routes, but that’s pretty rare, because usually most avatars have been seen before and are already cached.

It’s topics/show and topics/posts where we start to get into what would normally be considered “page views”, and even then, performance is pretty solid, with the majority of responses being made in under 100ms, as shown by this graph of the aggregate response times for topics/show and topics/posts:

(I split the 100-1000 group in half, just to show we weren’t cheating with a lot of nearly-one-second responses or anything)

One thing you’ll note isn’t on the list of frequently hit routes is posts/create. While draft/update gets hit a fair bit (pretty much any time someone updates a draft post they’re working on, they’ll hit that route in the background), actual post creation doesn’t happen very often, relative to reading. So, a metric of “we get N posts per day” doesn’t say much at all about actual site traffic. Attempts to extrapolate from number of posts made to total traffic volume are very sensitive to the read/write ratio used in the calculation, and since the read/write ratio varies greatly between different sites, you end up with some very wide ranges of estimated site traffic. You’re far better off just measuring it for your actual site and using those numbers for your scaling calculations.
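
Just to show how sensitive that extrapolation is, here are some made-up read/write ratios applied to the ~300 posts per day mentioned earlier in this topic:

```python
# "N posts per day" tells you almost nothing until you pick a read/write ratio,
# and the choice of ratio swings the implied traffic by an order of magnitude.

posts_per_day = 300  # the figure mentioned earlier in this topic

for reads_per_post in (50, 200, 1000):  # assumed ratios, purely for illustration
    print(f"ratio {reads_per_post:>4}: ~{posts_per_day * reads_per_post:,} page views/day")

# ratio   50: ~15,000 page views/day
# ratio  200: ~60,000 page views/day
# ratio 1000: ~300,000 page views/day
```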

The rule of thumb I would apply to figure out how big the Discourse app servers need to be, on a dedicated site, would be as follows:

  1. Determine how many page views per second I wanted to cater for, at absolute peak. My definition for “page view” would be something like “viewing a list of a subset of topics, or viewing a subset of posts in a topic”. How to determine that from an existing forum’s traffic data depends on exactly how the existing forum software works. Completely ignore all other requests, because they work very differently in Discourse, and will be accounted for in the rest of these calculations anyway.
  2. Divide your desired peak “page views per second” by two to get the total number of unicorns you need to service that volume of traffic. Looking at some ratios of “total time spent in unicorns to page view rate”, they seem to vary between about 0.29 and 0.35 on the ridonkulously fast CPUs we use, so on the slower CPUs you usually see at cloud providers, it’s a reasonable estimate that each unicorn can service about two page views per second worth of requests.
  3. Now you know how many unicorns you need, divide that by two to get how many CPU cores you need, and multiply it by 300MB to get your unicorn RAM requirements.
  4. Get as many machines of whatever size you need to satisfy those needs. Tack on maybe half a GB of RAM and half a CPU core per machine for “system overhead” and disk cache.

Et voila! App server capacity calculation done.
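
To make that concrete, here’s a rough worked version of those four steps; the peak page-view rate and droplet count are numbers made up purely for the example, so substitute your own measurements:

```python
import math

# Sketch of the rule of thumb above, with made-up example numbers.
peak_page_views_per_sec = 40   # step 1: assumed absolute-peak figure
unicorn_ram_mb = 300           # step 3: RAM per unicorn

# Step 2: each unicorn handles roughly two page views per second on cloud CPUs.
unicorns = math.ceil(peak_page_views_per_sec / 2)   # -> 20 unicorns

# Step 3: two unicorns per CPU core, 300MB of RAM per unicorn.
cpu_cores = math.ceil(unicorns / 2)                 # -> 10 cores
ram_gb = unicorns * unicorn_ram_mb / 1024           # -> ~5.9GB

# Step 4: spread across machines, adding ~0.5 core and ~0.5GB per machine
# for system overhead and disk cache. Droplet count is another assumption.
machines = 5
per_machine_cores = cpu_cores / machines + 0.5      # -> 2.5 cores each
per_machine_ram_gb = ram_gb / machines + 0.5        # -> ~1.7GB each

print(f"{unicorns} unicorns, {cpu_cores} cores, {ram_gb:.1f}GB RAM total")
print(f"per machine: {per_machine_cores:.1f} cores, {per_machine_ram_gb:.1f}GB RAM")
```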

Running those calculations, for a site with very peaky load, you’ll probably come out with a number that makes you go a bit pale. It’s probably a lot more droplet than you were expecting to need. That’s because you’re calculating based on absolute peak requests, and you only get those maybe 1% of the time. This is where cloud elasticity comes in handy. You don’t need to be paying for all those droplets all the time, so turn 'em on and off as you need to.

The big cloud players, like AWS, give you shiny autoscaling logic for “free” (which mostly seems to involve making the rulesets easy to screw up so you rapidly cycle instances up and down, which makes your bill bigger), but if you’ve got a sensible monitoring system like Prometheus (big plug: we use it here at CDCK and it is delightful) you can set up your own autoscaling triggers to fire up a new droplet when CPU usage starts to go bananas, and kill off a droplet when things slow down, pretty easily. You need to wire up service discovery and a few other bits and pieces to make it all work, but it can save you a bucketload of money and it’s fun to build.

Even if you don’t want to go that wild, if you know when your peaks are likely to be (say you’re running a forum for enthusiasts of a particular sport, and there are “off-season” traffic levels, “in-season” traffic levels, and “finals” traffic levels), you can set up more droplets when traffic is going to be predictably higher, and then turn 'em off after everyone goes away again. In each case, you work out your droplet requirements based on the peak page views in each group and the above calculations. It won’t save you as much money as doing it dynamically, and if you get a bigger surge of traffic than you expect at finals time you might get overloaded, have a poor user experience for a bit, and need to add some more emergency capacity (assuming your monitoring system lets you know that things Went Bad), but it’ll still be a lot cheaper than running peak-capacity droplets 24x7x365.
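
For what it’s worth, the kind of roll-your-own trigger described above doesn’t have to be much more than this sketch. The Prometheus address, the query, the thresholds, and the placeholder provider hooks are all assumptions made up for illustration, not anything we actually run:

```python
import time
import requests  # third-party 'requests' package

# Minimal autoscaling-trigger sketch: poll Prometheus for average app-server
# CPU usage and call placeholder functions where your cloud provider API and
# HAProxy/service-discovery plumbing would go.

PROMETHEUS = "http://prometheus.internal:9090"  # assumed address
QUERY = 'avg(1 - rate(node_cpu_seconds_total{mode="idle",job="discourse-app"}[5m]))'

SCALE_UP_AT = 0.80    # add a droplet when average CPU goes above 80%
SCALE_DOWN_AT = 0.30  # remove one when it drops below 30%

def current_cpu_usage():
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def add_droplet():
    # Placeholder: create an app-server droplet via your provider's API,
    # wait for it to boot, then register it with HAProxy / service discovery.
    pass

def remove_droplet():
    # Placeholder: drain a droplet out of HAProxy, then destroy it.
    pass

while True:
    cpu = current_cpu_usage()
    if cpu > SCALE_UP_AT:
        add_droplet()
    elif cpu < SCALE_DOWN_AT:
        remove_droplet()
    time.sleep(60)  # coarse polling; add hysteresis/cooldowns in real life
```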

Or, you just throw your hands in the air, figure you’ve got better things to do than fiddle around with all this stuff yourself, and <contractually-obligated-plug>just drop a small :moneybag: on our doorstep to have me and the rest of the CDCK ops team take care of all this for you. :grinning:</contractually-obligated-plug>

24 Likes
