Meta is moving to the Cloud 🌩

We might have to just live with massively delayed notifications for a while, as @sam is working on another much more urgent internal problem.

4 Likes

Well, this one is a doozie. But I can explain it and it makes perfect sense. Ever since we moved meta to AWS, AWS has been playing “silly buggers” with all our outbound mail. This is a known issue due to:

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-port-25-throttle/

We already opened an issue with AWS about this, but they are super slow to respond. To circumvent it we could use a different port (port 587, for example), however that would require extensive reconfiguration of some public infrastructure we have. So instead, @supermathie is just going to move outgoing mail from meta to a third party provider. We were able to get this working.

TLDR

  • Amazon clogged our mail by making mail jobs take forever.
  • Our mail is constantly retrying, clogging our Sidekiq
  • All jobs are massively delayed, including onebox, notifications and so on
  • Profit

Should be fixed soon.
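To make the TLDR concrete, here is a toy simulation of slow mail jobs starving everything else in a shared worker pool. It is only a sketch: it models one FIFO queue served by a fixed number of workers, which is a simplification and not how Sidekiq actually schedules its queues. The job rates, durations and worker count are made-up numbers chosen for illustration.

```python
import heapq


def simulate(jobs, workers):
    """Run jobs in arrival order on a fixed pool of workers.

    jobs: list of (arrival_time, duration, kind), sorted by arrival_time.
    Returns a dict mapping kind -> worst wait in seconds (start - arrival).
    """
    free_at = [0.0] * workers  # min-heap: when each worker is next free
    heapq.heapify(free_at)
    worst_wait = {}
    for arrival, duration, kind in jobs:
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)
        heapq.heappush(free_at, start + duration)
        worst_wait[kind] = max(worst_wait.get(kind, 0.0), start - arrival)
    return worst_wait


def workload(mail_duration, seconds=200):
    """One mail job and one notification job arrive every second (assumed)."""
    jobs = []
    for i in range(seconds):
        t = float(i)
        jobs.append((t, mail_duration, "mail"))
        jobs.append((t, 0.1, "notification"))
    return jobs


# Healthy: mail delivery takes a fraction of a second.
healthy = simulate(workload(mail_duration=0.2), workers=5)
# Throttled: every mail job hangs for 30 seconds before giving up.
throttled = simulate(workload(mail_duration=30.0), workers=5)

print("healthy worst waits:  ", healthy)
print("throttled worst waits:", throttled)
```

In the healthy run nothing waits. In the throttled run the quick notification jobs queue up behind the hanging mail jobs, and their delay keeps growing for as long as mail keeps arriving.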

Illustrated explanation:

image

17 Likes

Durrrrrrrrrrrrrrr oh man so obvious when you realize it.

Nice one!

3 Likes

That explains why the notification to a reply came three hours later :laughing: well done in sorting it Sam n co :clap:

Well actually not, it is peanuts, we are not even considering this style of hosting for our regular standard/business/enterprise customers. For them we will plan to continue hosting on bare metal.

We are investigating a new region with bare metal in Europe at the moment to service people who must be hosted in Europe. They will get the same excellent performance and reliability our current hosting in SF provides.

AWS customers are in the “super enterprise” level. They are the type of customer that MUST be hosted on AWS, cause of mantras such as “Nobody ever got fired for choosing IBM”. For them they must be on AWS cause they simply must be. They are used to the very high costs of AWS, it is a fact of life.

Our “super enterprise” customers get an isolated “rack” in the cloud. This includes extensive monitoring, extensive levels of failover (multiple AZs), large EC2 instances and large DB instances, logs forwarded to Elasticsearch, and the list goes on and on. This means that for this “modest” meta config we have something like 12-15 EC2 instances and dedicated database and ElastiCache instances.

Yes, there are economies of scale if we host a giant multisite in the cloud, at which point we get to share the monitoring aspect of our cloud infrastructure and cut costs. However, this is not in our plans for 2018. A rack in Europe is though.

22 Likes

Thanks to @supermathie mail is A-OK again and notifications will arrive nice and quick, the AWS :sneezing_face: is over.

11 Likes

As someone who has been following the “cloud vs. dedicated hardware” argument since Discourse’s early days, I’m seriously interested in a detailed discussion of the differences/costs/etc. What’s here is already instructive.

7 Likes

We’ve slowly been working costs down as we go, and we have hosting meta down to around $1000/month on AWS – that’s with multiple tweaks over the last 6-8 months. When we started this it was closer to $3000/month. Really!

Before: $2,717 / month

image

After: $1,030.06 / month

image

Note that meta is deployed as “super enterprisey” in our testing so it is somewhat overprovisioned. :wink:

We could do better long term by doing long term reserved instances which could cut the cost in half, essentially pre-paying for multiple years of service.
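For anyone who wants the numbers spelled out, here is the arithmetic behind the before/after figures as a small sketch. The two monthly totals come from the post. The reserved-instance line just applies the rough "cut the cost in half" estimate to the whole bill, which is an assumption and not a quote from AWS pricing.

```python
before = 2717.00  # USD per month, "Before" figure from the post
after = 1030.06   # USD per month, "After" figure from the post

monthly_saving = before - after
reduction = monthly_saving / before
annual_saving = monthly_saving * 12

# Assumption: long term reserved instances halve the whole bill,
# per the rough estimate in the post.
reserved_estimate = after / 2

print(f"monthly saving:    ${monthly_saving:,.2f}")
print(f"reduction:         {reduction:.1%}")
print(f"annual saving:     ${annual_saving:,.2f}")
print(f"reserved estimate: ${reserved_estimate:,.2f} / month")
```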

Any other thoughts this far in @sam?

15 Likes

I am curious what changes/tweaks you’ve been doing to cut the cost in half? Is it more paying attention to AWS requirements, or is it tweaking code? Combination of many factors? Would the tweaks you’re making be useful information for others who may be or may be thinking of hosting on AWS?

The main piece of advice I have is: don’t do it. Don’t take on a complex, “enterprisey” cloud install unless you have to. It’s extremely expensive for what you get. Compare to a simple monolithic Digital Ocean droplet running our standard Docker image, which can get you a very long way even at the $40 and $80 per month price points.

9 Likes

Fair, I never plan to use AWS for it, as I said was more of a curiosity for me. :slight_smile:

Not really, I think you nailed it. Note we do have 1 year reserved instances for a few of our EC2 VMs, so the cost is probably closer to $1300 a month once amortized.

We can probably reduce cost a bit by moving Redis to an EC2 instance and rolling our own vs using ElastiCache, which is a bit of a premium.

Overall, we have been very happy with our AWS experience, but it is certainly a bit pricey compared to our bare metal setup. We also squeeze a tiny bit more performance out of our bare metal setup than AWS, but we are not talking 2x difference, more like 5-30% difference on the server side.

Note it is important to have full perspective on costs here, cause even if you can do $80 on digital ocean, you miss out on:

  • Auto scaling, which helps us a fair bit on some super enterprise setups

  • Accounting for Prometheus based monitoring which we have. For context NewRelic would be sitting at say $100 per server and then you would also need DataDog which is another $50 or so

  • We also ship with ELK, so you would need something like Logit.io’s hosted ELK, which is yet more money

  • Our PG setup has automatic failover so you would need 2x instances on digital ocean to account for something like this plus a complicated setup.

  • Our Redis setup has automatic failover (so another 2x instances for that)

The bottom line is that $10-$80 is perfectly fine for an unmonitored monolithic setup. But once you need to start talking SLAs and need to know this thing will be rock solid and survive random failures, well, costs start mounting.
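As a rough back-of-the-envelope sketch of how those costs mount, the snippet below prices a small monitored, failover-capable setup. The $100 and $50 monitoring figures and the 2x instances for PG and Redis come from the list above. Everything else is an assumption for illustration: a single app server, every instance priced as the $80 droplet, both monitoring tools billed per server, and hosted ELK and auto scaling left out because no price is given for them.

```python
from dataclasses import dataclass


@dataclass
class Role:
    name: str
    instances: int


DROPLET = 80            # USD/month; assumption: every instance is an $80 droplet
APP_MONITORING = 100    # "say $100 per server" (New Relic), from the post
SERVER_MONITORING = 50  # "another $50 or so" (Datadog); per-server is an assumption

# Hypothetical topology: one app server, plus 2x PG and 2x Redis for failover.
roles = [Role("web/app", 1), Role("postgres", 2), Role("redis", 2)]

servers = sum(role.instances for role in roles)
hosting = servers * DROPLET
monitoring = servers * (APP_MONITORING + SERVER_MONITORING)
total = hosting + monitoring  # excludes hosted ELK and auto scaling

print(f"{servers} servers: ${hosting} hosting + ${monitoring} monitoring "
      f"= ${total} / month")
```

Under these assumptions the monitoring line ends up larger than the hosting line, which is the point being made: the droplet price is only a small part of a setup you can attach an SLA to.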

21 Likes

Does meta require all this power, or are you just using it to test this hosting option or something?

It’s mostly for testing. Meta is moderately busy, you can view its stats (or any discourse site stats) by going to /about

2 Likes

Note, our hosting infrastructure on AWS offers an economy of scale. Hosting the first site is fairly costly, but subsequent sites on the same virtual cloud get a substantial discount cause we reuse monitoring/access and logging infrastructure. Not “digital ocean” cheap, but adding one more site, say to the meta cloud, would be a few hundred dollars vs a thousand dollars.

9 Likes

<— Disclaimer: New Relic developer here.

What a great and insightful response!

Just out of curiosity, what aspects of Datadog make it a critical ingredient? I work on our Insights product. Would love to know if there’s a way I can help reduce costs. Also, I’m just generally curious.

5 Likes

Comparing insights to datadog is so out of scope of what I can do here. I have some experience around using the newrelic app monitoring and some around datadog server monitoring.

What I am digging at here is that in order to run a proper monitored service you need both application level monitoring and server level monitoring. Meaning, you want to know when a server dies or goes to 100% CPU. You also want to know when Discourse has a ton of web requests queueing or if somehow database time for /latest became 4x slower.

Apologies if both comprehensive server and application monitoring can be covered by Newrelic and I put some misinformation out there. Looking through your site it looks like you have enough coverage here.

11 Likes

Sam, how has AWS been for Meta, more than a year after you moved it there? Do you guys need to pay for overage bandwidth?

Meta is hosted in the same setup we use for our Enterprise customers, so it certainly doesn’t fit the free tier. See this previous post:

Only change so far is moving away from ElastiCache since the service had some rough edges. We run Redis ourselves.

9 Likes

Thank you Rafael. I don’t know your bare-metal setup, but it looks to me that any Dedicated Cloud colocated in any Tier 4 data center would work better in the long run than running all the services with any major cloud (AWS or any other).