You mention IPMI. How does that compare to a technology like Intel’s vPro (specifically AMT)? I’ve got experience with vPro and it was a fascinating technology, but I’ve never heard of IPMI.
We’ve been researching things like that to provide a lights-out server setup where the nodes are shut down when not in use, and we were going to use a Raspberry Pi to provide the VPN access. That TomatoUSB looks amazingly useful.
vPro includes a hardware KVM in a similar fashion, and even supports remote viewing of the BIOS (there’s a YouTube video demonstrating it).
Jeff, other than the cost of the actual servers, what are the costs involved with setting up the actual rack? You have a router, then your servers and db servers, but what do you need to connect them all together: switches, power supplies, etc.?
Gigabit Ethernet switches, Cat 6 cables, and a 1U power strip are all relatively inexpensive. I do recommend racking two switches with one as a hot spare, because if your switch dies, you are in big trouble!
Say a switch goes down: realistically, how long would it take you guys to drive down and fix it?
Also, from your experience running SO and SE, how often did you need physical access to the servers to fix a failed drive, etc.?
I’m just trying to compare this against something like EC2 or a managed dedicated box. Obviously you get all the benefits of simply buying a powerhouse server for $1.5-2.5K instead of paying $50/month for 1GB of RAM, but there is a real downside: when something does go wrong, you have to drive down and diagnose the issue yourself.
The datacenter, he.net, does offer remote hands for $100/hour. So if it were very urgent I would call them, and they would disconnect the network cables and reconnect them to the hot spare secondary switch in the same order. Pretty easy, since our live and hot spare switches are the exact same model, stacked right on top of each other.
If I had to drive down, it’s about an hour to get there (Berkeley to San Jose).
The main things that fail are hard drives and power supplies. Failure of new, burned-in server hardware is not that common… I never saw any failures at all on the ~10 servers we built in the 3 years after we deployed server hardware for Stack Exchange.
However, in my experience, while you are getting the servers initially set up and configured, you will need physical access a LOT in the beginning. Not because things are failing, but because you always forget something in the configuration. After racking the servers, plan on visiting the datacenter about once a week for the first few weeks. Once that is over, you’ll barely ever go back.
(And IPMI, a.k.a. remote KVM-over-IP, works amazingly well: you can reboot and edit the BIOS over the internet. As long as the server has power, it can be managed via IPMI, which is basically a dedicated little ARM computer with its own network interface inside the server.)
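For example, with the standard ipmitool CLI you can manage a box from anywhere; a rough sketch (the BMC address and credentials below are placeholders, not our actual setup):

    ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret chassis power status   # is it on?
    ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret chassis power cycle    # hard reboot
    ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret sol activate           # serial-over-LAN console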
Do you know of any good write-ups/tutorials where people outline exactly how they set up their co-location rack? A detailed account of exactly what they bought, plus tips and tricks, etc.?
@codinghorror I’m curious about your decision to configure your HAProxy servers, Tie Fighter 10 and 11, in a single chassis sharing one power supply. I understand having two HAProxy instances would allow for high-availability, but what about a scenario in which the power supply fails in that chassis? That seems to imply both servers will go down, and in your own words, “nothing will be accessible.” In choosing the Iris 1125, is downtime caused by PSU failure something you decided was acceptable? Or am I missing something from your configuration that makes this a non-issue?
We saw 20% to 40% performance loss running Discourse benchmarks under Xen and KVM on multiple servers. We tried and tried, and could not do better than that.
So, maybe this is obvious, but did you make sure that the guest CPU configuration (in KVM) is the same as the host CPU configuration? This isn’t the default because it reduces portability (that is, it breaks live migration between different CPU types), but leaving it generic can indeed cut performance by the percentages you’re talking about.
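In libvirt terms that means passing the host CPU straight through; a sketch, assuming a libvirt-managed guest (the domain name “discourse-app” is just an example):

    # edit the domain XML and set the CPU mode:
    virsh edit discourse-app
    #   <cpu mode='host-passthrough'/>
    # or, with plain QEMU/KVM on the command line:
    qemu-system-x86_64 -enable-kvm -cpu host ...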
We have a spare PSU on-site in the cage (we actually have a few spare PSUs and SSDs in the cage, as mentioned in the article). So the time it would take me to drive down there and install it is acceptable versus the likelihood of PSU failure.
How are those Samsung SSDs doing? Have any burnt out yet? I used Jeff’s server blueprints for building a couple of db servers (SQL Server), but with the 840 Pro disks. The performance was pretty damn good; I’m just wondering how long they’ll hold out.
This depends entirely on the I/O rate on the disks, which depends entirely on what you’re doing on that server. For “typical” server use, barring any random unlucky failures, I think it’s safe to expect ~3 years before I’d even remotely be worried.
However, it is a very good idea to get SSDs much larger than what you need, so the drive has lots of spare space to remap worn-out cells. I would never, ever run a server with a 128GB drive that is always near capacity, for example. (Drives do reserve some space internally that you can’t use, but the more reserved space a drive has, the more “enterprisey” it is, because it is more tolerant of the most common SSD failure mode: worn-out cells.)
Probably pretty good. Unfortunately they don’t support the standard SMART wear indicator, but I can get a fairly generic “sense” of how they’re doing, as they expose a different attribute:
server ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
live db 177 Wear_Leveling_Count 0x0013 092 092 000 Pre-fail Always - 286
live db 177 Wear_Leveling_Count 0x0013 085 085 000 Pre-fail Always - 535
back db 177 Wear_Leveling_Count 0x0013 093 093 000 Pre-fail Always - 247
back db 177 Wear_Leveling_Count 0x0013 084 084 000 Pre-fail Always - 550
webonly 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 12
webonly 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 14
webonly 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 11
webonly 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 15
So the live and replica database servers have more wear on them. No surprise there. I should graph this.
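Something like this quick-and-dirty loop would get the raw values into a graphable log (the device names here are just placeholders for our drives):

    for d in /dev/sda /dev/sdb; do
      printf '%s %s %s\n' "$(date +%s)" "$d" \
        "$(smartctl -A "$d" | awk '$2 == "Wear_Leveling_Count" {print $NF}')"
    done >> wear-leveling.log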
There’s another excellent reason for this, and that’s performance. For another customer, I was evaluating 128GB and 256GB “value” drives (i.e. not over-provisioned like the enterprise drives) as replacements for 50GB SSDs that had reached end of life.
The over-provisioned 50GB SSDs gave you VERY consistent performance on a workload; you knew you were getting the IOPS and latency you needed.
The “value” drives, on the other hand, let you use all of that space, but you have to manually enforce over-provisioning if you want to avoid high write-completion latency and maintain high IOPS.
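If it helps anyone: the simplest way I know to manually over-provision a value drive is to secure-erase it and then leave a chunk of it unpartitioned, so the controller can use that space as spare area. A rough sketch (the device and the 20% figure are placeholders, not a recommendation for any particular drive):

    parted /dev/sdb mklabel gpt
    parted /dev/sdb mkpart primary 1MiB 80%    # leave ~20% of the drive unpartitioned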
Why aren’t you using NIC teaming on all the servers to take advantage of the fact that you have two switches? You could just set it up as active/passive if you insist on only one switch being active at a time.
A single switch failure would then mean zero seconds of downtime, instead of driving down there or renting remote hands at the colo.
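On Linux this is just a standard active-backup bond; a minimal sketch with iproute2 (the interface names and address are placeholders):

    ip link add bond0 type bond mode active-backup miimon 100
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip addr add 10.0.0.10/24 dev bond0
    ip link set bond0 up
    # then plug eth0 into one switch and eth1 into the other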