OVH downtime incident


(ljpp) #1

A word of warning regarding OVH

Today shit has hit the fan in OVH’s Gravelines data center. Services went down in the morning, local French time, and several major websites are down. Nobody has been able to reach OVH’s staff, there is nothing on their status pages (all green :thinking:) and support tickets are not being updated.

They have now been down for about 7 hours and have not even been able to give a status report about what is going on. I know major customers that are really pissed off right now.


Migrate from Digital Ocean to OVH?
(ljpp) #2

Their Ceph storage is borked.

http://travaux.ovh.net/?do=details&id=20636


(Jeff Atwood) #3

:frowning: we had super bad experiences with ceph and glusterfs early on and decided to strongly avoid them.


(ljpp) #4

Technical problems are one thing, but the way they communicate the issue is below all imaginable levels.

  • Still no ETA.
  • Updates on the issue tracker are rare and in very poor English.
  • They have not given any explanation of what caused this.
    • Surely the fact that they had HD/SSD replacements in their maintenance schedule has nothing to do with this…
  • They are impossible to reach directly. I know a major newspaper site that is down due to this incident and they are screaming at their web technology provider/site manager hourly.

(Kane York) #5

Yep, that’s what “bargain basement service provider” implies…


(ljpp) #6

OVH and our site are recovering. What I am wondering now is: how do I know my data integrity is still 100%? Based on their reports, which they finally started publishing, the issue was pretty damn severe.


(Martial) #7

OVH will offer a free month for the downtime (Oles is the CEO)


(ljpp) #8

Well, if the customers settle for one month, OVH is getting out of this cheap. We lost the busiest day (a big game day) of the early fall season (August-October) and the user base was not happy.


(Gabor Meszaros) #9

I do not really understand the lack of communication in these situations. Communication is cheap, and it has a huge impact on customer experience.


(Martial) #10

Technical problems can happen, this is life.

You want no downtime at all, but you also chose the cheaper solution, so I don’t follow your complaint.
Go to DO and pay more for (I think) no downtime.

The question is: does the business day you lost cost you more than the money you save by going to OVH Cloud?


(Rafael dos Santos Silva) #11

I think I will put this in a frame. Well said.

Discourse offers a hosting service with uptime agreement. Also, you can self host on big names, AWS/Azure/GCloud, etc.


(ljpp) #12

Yes, and I have emphasized that in my earlier posts. What I am critical about is how they handled the corrective action and especially communication. In a company of such a significant size I find it interesting that the CEO steps down to run the war room and do the communication. Sure, it’s great that executives get their hands dirty, but what the heck are the hundreds of people below him doing if the top man has to come and lead the show in order to get something happening?

And regarding the losses - we did not lose money. The damage was social/mental only as our site is not a business. Unfortunately we lack a solid revenue stream, which is why we are not able to go premium in our hosting. I am now attempting to change that by trying to get a local VPS provider as a site sponsor, so that we could afford better and more resources.

Question back to you: When you buy a new car, whether it was a Dacia or a Mercedes, you expect it to take you from A to B. And if it does not, still under warranty, do you expect the mechanics to take care of it promptly, even if you bought a cheap car? If the Dacia mechs fix it slowly and keep you uninformed, do you think it is Ok since you did not buy a Mercedes in the first place?


#13

But the central issue is the quality of “warranty” (service) for the price you pay. This has nothing to do with your expectations or what you consider reasonable. You pay for a cheap service and what happened to you is the level of service provided for your cheaper “car”. That includes the poor quality of communication such as its lack of timeliness.

Another example would be if you expected to get the car towed to the service center when the car dealer requires you to get the car to the service center by your own means. It’s no use saying that they should tow the car in if they never provided that level of service for a cheap car.

You literally got what you paid for. The benefit from observing your predicament is that we have a clearer picture of the risks involved in accepting the service levels provided by cheap hosting at OVH. :relieved:


(Jeff Atwood) #14

Not a huge fan of this split @erlend_sh – why would this topic even exist here, what relevance does it have to Discourse at that point?


(Erlend Sogge Heggen) #15

I still consider it valid as a discussion about a Discourse hosting experience with a particular host, but I wanted to keep it separate from what had until recently been a strictly technical discussion about how to get Discourse up and running on OVH.


(Jeff Atwood) #16

What’s Discourse specific about downtime at any host? Why wouldn’t we have Discourse topics for downtime on any hosting service, anywhere?

It seems only relevant in the case of “this SUPER LOW COST hosting service seems to work OK…” and that causality is now destroyed. Low cost = risk. You’ve moved the risk out.


(ljpp) #17

Hosting experiences in general are a valuable topic of conversation and it’s great to hear first-hand experiences from fellow discoursers. I get questions regarding OVH and ScaleWay on a weekly basis.

BTW, the “low cost” argument regarding this specific incident is a big pile of :poop: and I don’t understand why people keep pushing it. The incident took place in their Public Cloud range of products, which comes with a 99.99% SLA and High Availability storage. They will not reach 99.99% in 2016.
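
For scale, here is a rough back-of-the-envelope sketch of what a 99.99% SLA allows over 2016. It assumes a single outage of about 7 hours, the figure reported earlier in this thread, and no other downtime that year. The function names are hypothetical, made up for illustration.

```python
# Rough SLA arithmetic. Assumptions: 2016 is a leap year (366 days) and the
# only downtime that year is one outage of about 7 hours.

def allowed_downtime_minutes(sla_percent, days):
    """Minutes of downtime an SLA of sla_percent permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

def availability_percent(downtime_minutes, days):
    """Availability actually achieved given the total downtime in minutes."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)

DAYS_2016 = 366
budget = allowed_downtime_minutes(99.99, DAYS_2016)
achieved = availability_percent(7 * 60, DAYS_2016)

print(f"Downtime budget at 99.99%: {budget:.1f} minutes")
print(f"Availability after a 7 hour outage: {achieved:.2f}%")
```

Under those assumptions the yearly budget is roughly 53 minutes, and one 7 hour outage alone lands at roughly 99.92%.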


(Matt Palmer) #18

People keep pushing the “low cost” argument because “you get what you pay for” is the truth. There are certain invariant costs that need to be paid in order to have a quality service – competent people being the big one – and at a couple of euro/month, you can’t do that. The reason the CEO had to step in and provide communication is probably because nobody at a lower level in the org has the level of incident response training or experience, or perhaps the ever-necessary givafuk factor, required to understand why customer communication is critical in a major incident.

And now you have valuable experience of why an SLA that isn’t backed by strong penalties against the provider (ie “every SLA ever”) isn’t worth :poop:. If you choose a provider based on an SLA, you’re going to get burnt to a crisp sooner or later.


#19

You’re missing the point. The level of service provided is the level of service that you received. The SLA does not determine the level of service provided.

The fact that a service provider failed to meet their published SLA is not unusual or even remarkable. It happens all the time. That is why it is not sufficient to choose a service provider based purely on ideal conditions and unsubstantiated claims. Actual performance must be considered.

A contract does not guarantee the stated level of service. It simply provides you with legal options for redress.


(ljpp) #20

Yes, yes guys - I know. This is not my first server that has fried. Been around since ’96, making websites, that is.

Well, it looks like I can’t make my point when writing in a foreign language. Or maybe it’s due to my background in a huge corporation, where the giveafck-factor was relatively high, whether we made $15 or $500 products. Cultural thing, maybe.

Anyway, bad incident, shit happens. Lots of good in OVH too. Network is solid and performance great. Google will tell you stories about their support.

Now it looks like I may soon be able to tell you stories about yet another host: UpCloud. But that belongs in another thread.