A bad day at Rackspace
Posted by brunnock, 07-24-2006, 09:55 PM |
This morning one of our hard drives stopped responding. The error messages indicated a hardware problem to me, but the technician who responded insisted that it was just a corrupt journal and an fsck would do the trick. It did. For 4 hours. Then the filesystem couldn't be fixed, so a new hard drive was installed. It took several hours to restore the data. Then I had to repeatedly explain to a technician that it didn't make sense that they backed up 5 GB of data on Sunday but could only restore 3.3 GB today. They managed to restore most of the data, but we're still missing files.
So our main server was out of commission for 13 hours today. As far as I know, Rackspace is still the best, but if I'm mistaken I'd sure like to find out.
|
Posted by ntfu2, 07-24-2006, 09:57 PM |
Softlayer.com will show you the light
|
Posted by submenu, 07-24-2006, 10:36 PM |
What kind of backup was it? A control panel backup, rsync, dd or something else?
Hard drive failures happen. Ultimately, unless you have a RAID setup, you just have to live with it.
|
Posted by brunnock, 07-24-2006, 10:43 PM |
Rackspace uses EMC Legato.
I don't think you understand: they backed up 5 GB of data and then restored 3.3 GB. Those two numbers shouldn't be different. I already had to explain that once today.
|
Posted by HL-Justin, 07-25-2006, 06:15 PM |
Stick with Rackspace; I continue to think they are the best. Softlayer.com is good too; I have a few friends with them. I wouldn't switch for a million bucks, but that's just me.
|
Posted by [inx]Olly, 07-25-2006, 06:20 PM |
Given their response times, you could do a lot worse...
|
Posted by Laws, 07-25-2006, 06:24 PM |
To the people suggesting SoftLayer: remember they are unmanaged. They would have replaced the drive and stopped at that; they are not there to back up and restore files.
|
Posted by WindowsMaster, 07-25-2006, 06:33 PM |
Stuff happens at ALL providers. Keep in mind that probably 100 percent of the extra cost you pay at Rackspace is devoted to two things: competent people and advertising that BIG NAME.
|
Posted by reiteration, 07-25-2006, 06:52 PM |
Personally, that's not good enough.
"I don't blame people for their mistakes, but I do expect them to pay for them"
Compensation HAS to be paid.
|
Posted by music, 07-25-2006, 07:51 PM |
It would be nice if everyone paid for their mistakes, but that's not the way a lot of the world works.
Rackspace is a great company; hope it all works out.
|
Posted by HostTitan, 07-25-2006, 08:23 PM |
Depending upon the config and the terms of your contract, upgrading to a server with RAID may be in your interest.
You might also want to think about buying a cheap server somewhere else as a failover; outsource your DNS to a competent company (there are a number of good companies that do this for $10-$15 a month with a strong track record) and they can handle the immediate failover.
|
Posted by sprintserve, 07-25-2006, 08:25 PM |
For 4-5 GB of data, 13 hours is a bit long. I would expect less than 5 hours, start to end, as reasonable. That is, from experience, how long it would take to replace a drive, reinstall the OS, reconfigure, and restore backups. If there's much more data (for example, 10-20 times what you have), about 8 hours.
|
Posted by NameSniper, 07-27-2006, 02:37 PM |
Exactly. For 4-5 GB, 13 hours is too long.
When my HDD failed, my hosting company restored it within 3 hours (10 GB backup).
|
Posted by brunnock, 07-27-2006, 03:00 PM |
I should point out that it didn't take Rackspace 13 hours to replace a drive. I notified them of the second failure at 12:30pm and it was restored at 7:30pm. They had to restore it twice because they botched it the first time.
Last edited by brunnock; 07-27-2006 at 03:03 PM.
|
Posted by reiteration, 07-27-2006, 04:04 PM |
So they've failed the "fanatical" support and you'll be compensated?
|
Posted by Dave W, 07-27-2006, 04:35 PM |
lol, reiteration you are a funny guy....
Rackspace is not giving anyone any money....
|
Posted by brunnock, 07-27-2006, 04:35 PM |
Yes. I've been with them for over 5 years and they've never quibbled over compensation.
|
Posted by Neosurge, 07-27-2006, 04:50 PM |
SoftLayer provides a quality service, but they are not playing in the same ballpark as Rackspace yet (especially in managed services). I disagree with your recommendation.
|
Posted by Neosurge, 07-27-2006, 04:54 PM |
Do you have direct experience with them that proves otherwise?
|
Posted by dollar, 07-27-2006, 04:59 PM |
I would also like to know if you have any direct experience with them. Not too long ago here on WHT there was a very interesting thread about the bandwidth prices at Rackspace, and IIRC the OP who started the thread was completely taken care of shortly after it started. Rackspace is not in the business of unhappy customers, and I highly doubt they wouldn't compensate a 5-year customer in this incident.
|
Posted by reiteration, 07-27-2006, 07:04 PM |
Notice "OP that had started the thread was completely taken care of shortly after it started"
So you have to make it public for them to pay out ?
|
Posted by dollar, 07-27-2006, 07:05 PM |
Well I can't comment on the non-public issues as they are... non-public
|
Posted by reiteration, 07-27-2006, 07:08 PM |
That's just Rackspace trying to protect their name when a customer posts negative information about them.
A company that stands by its contract doesn't have to wait for an issue to be made public before compensating.
Hardly inspires confidence...
|
Posted by brunnock, 07-27-2006, 07:09 PM |
No, you don't have to publicly gripe to get compensation from Rackspace.
|
Posted by dollar, 07-27-2006, 07:10 PM |
Have you had any experience with Rackspace?
|
Posted by reiteration, 07-27-2006, 07:14 PM |
Never had service from them; I've only experienced their sales.
I'm simply commenting on the post about a public gripe being needed to get Rackspace to honour their contract.
|
Posted by dollar, 07-27-2006, 07:16 PM |
Actually, Rackspace was honoring their contract; in fact, the OP pointed out in that thread that he was the one at fault for the situation he was in with bandwidth charges. The meat of the thread was based around the difference in cost when ordering additional bandwidth compared to the cost of bandwidth in the original contract.
|
Posted by reiteration, 07-27-2006, 07:20 PM |
Ahh, that's different then ;-)
|
Posted by Spyro, 07-27-2006, 09:02 PM |
Fanatical is not the same thing as flawless. With that said, I can't think of too many providers that would keep backups of your server for you anyway. I'll echo the post that said that SoftLayer is not in the same field as Rackspace, though perhaps SoftLayer plus some third-party server watchdog company could possibly be.
|
Posted by MrZillNet, 07-27-2006, 11:28 PM |
Personally, I think you should always see yourself as responsible for backups in any case.
For 5 GB of data, you should be using rsync or whatnot and copying the files down to a server you control; you could even use your cable modem and keep the server at your house or office.
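For illustration, a minimal sketch of such a pull-style backup. This is an assumption-laden sketch, not anyone's actual setup: the host, user, and paths are placeholders, and it assumes rsync is installed on both ends with SSH key authentication already configured.

```python
# Hypothetical nightly pull of a site to a machine you control.
# Placeholders throughout: adjust SRC/DEST for your own hosts.
import subprocess

SRC = "user@yourserver.example.com:/var/www/"  # hypothetical remote path
DEST = "/backups/www/"                         # local mirror directory

def pull_backup() -> None:
    # -a preserves permissions/ownership/timestamps, -z compresses in
    # transit, --delete mirrors remote deletions so DEST stays exact.
    subprocess.run(["rsync", "-az", "--delete", SRC, DEST], check=True)

if __name__ == "__main__":
    pull_backup()
```

Run from cron nightly; even over a cable modem, a mostly unchanged 5 GB transfers quickly, since rsync only sends the deltas.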
|
Posted by Mike J, 08-04-2006, 08:51 PM |
If you don't have a RAID-1+ array, this will happen again, be it at Rackspace or at ValueWeb.
|
Posted by brunnock, 08-04-2006, 10:07 PM |
Can you prove that? Every time salespeople try to get me to upgrade to RAID, I ask if they can prove that servers with RAID have fewer problems than servers without. No one has been able to prove it. One of my servers stayed up for over 2 years and handled over 2 billion SQL queries. No RAID.
To me, more moving parts equals more points of failure.
Last edited by brunnock; 08-04-2006 at 10:10 PM.
|
Posted by SaniX11, 08-04-2006, 11:20 PM |
Well, I'm doing business with RS right now, and as for overage charges on bandwidth, they have actually accommodated me very nicely in the contract (I can't specify). Also, even though it may have seemed a long time to replace the drives, their contract only has a 24-hour guarantee. I do think that after that 24-hour period you would receive some sort of compensation, and from the way they work with me now (they gave me a few free upgrades for my account), I am under the impression that if something of that nature happened and you bark at them a little, you may get the same type of treatment.
From my experience, there is no one that can compete with the amount of effort Rackspace puts into customer satisfaction and support.
Just my 2 cents.
|
Posted by brunnock, 08-05-2006, 05:31 AM |
24 hours?! Try 1 hour. See http://www.rackspace.com/solutions/.
|
Posted by sprintserve, 08-05-2006, 06:42 AM |
No one said RAID will have fewer problems as far as hard disk failures are concerned. If a hard disk is going to fail, it will fail, RAID or no RAID. The difference is that when it fails, if you have a redundant RAID mode, you have a much higher chance of surviving that failure without downtime (if hot-swap) and without loss of data (redundant RAID modes like RAID 1, 5, 1+0, etc.).
|
Posted by brunnock, 08-05-2006, 06:52 AM |
Can you prove that?
|
Posted by Bilco105, 08-05-2006, 07:01 AM |
Since you're the owner of a managed UK provider yourself, it would be stupid of you to comment on Rackspace's business ethics without first-hand experience, which I seriously doubt you have.
Same goes for me: I work for a direct competitor of Rackspace, so although I'll subscribe to the thread, I won't be making comments like the above.
|
Posted by JohnCrowley, 08-05-2006, 08:23 AM |
Prove what? That having 2 or more drives that have the same data is more reliable than 1 point of failure? That's like saying prove to me it's better if you have 10 fingers instead of 9.
With a properly set up RAID 1 or higher configuration, a single drive failure will cause *zero* downtime and *zero* loss of data. This requires RAID 1 or higher with a hot-swap backplane. A drive dies, the RAID array ignores it and alerts the monitoring system, the DC swaps in a new drive while the server is running, and the array rebuilds in the background seamlessly.
With one drive, the above scenario means downtime: put a new drive in, copy data from the backup drive, lose recent data, etc. RAID is not the be-all and end-all of servers, but when you need uptime and want to minimize a single point of failure, hot-swappable RAID is the way to go.
Can I prove it? Not to your level of satisfaction. We only manage around 100 servers, so my depth of experience is limited, but RAID has saved the day many a time. Having RAID does not make your server more fragile or susceptible to problems. Sure, another drive could fail, but this does not hurt the server or your uptime, as the array can easily compensate.
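To make the "alerts the monitoring system" step concrete, here is a minimal sketch of a degraded-array check. It assumes Linux software RAID (mdadm) and its /proc/mdstat status file; hardware RAID cards like those described above ship their own vendor alerting tools, so treat this as an illustration only, not the setup being described.

```python
# Minimal degraded-array check for Linux software RAID (mdadm).
# Illustrative sketch only; hardware RAID cards have their own tools.
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return md devices whose member status (e.g. [U_]) shows a dead disk."""
    with open(mdstat_path) as f:
        text = f.read()
    failed = []
    # Each array stanza starts like "md0 : active raid1 sdb1[1] sda1[0]".
    for name, body in re.findall(r"^(md\d+) : (.*?)(?=^md\d+ :|\Z)",
                                 text, re.S | re.M):
        # A healthy mirror reports [UU]; an underscore marks a failed member.
        status = re.search(r"\[[U_]+\]", body)
        if status and "_" in status.group(0):
            failed.append(name)
    return failed

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("DEGRADED:", ", ".join(bad))
        sys.exit(1)  # non-zero exit lets cron or a monitor raise the alarm
    print("all arrays healthy")
```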
- John C.
|
Posted by brunnock, 08-05-2006, 08:31 AM |
You manage 100 servers? That's fine. What is the average uptime for RAID vs non-RAID servers?
|
Posted by JohnCrowley, 08-05-2006, 08:36 AM |
Yes, we've been a host since 1995, so the servers add up over time.
We have never had a RAID-enabled server completely die due to hard drive issues. We have had drives fail in a RAID array, which just required swapping the drives out and rebuilding the array, which was done automatically in the background.
We have had many non-RAID servers suffer primary hard drive failures, requiring new drives and copying data back from backups.
Don't think of RAID as another point of failure. Think of it as increasing your ability to deal with a hard drive failure. Hard drives are the most common and most destructive failures for a server, so being able to mitigate the impact of a drive failing is a good thing.
- John C.
|
Posted by brunnock, 08-05-2006, 09:04 AM |
Average uptime for RAID vs non-RAID servers please. Thank you.
|
Posted by JohnCrowley, 08-05-2006, 09:18 AM |
It's too nice a day outside right now for me to dig through stats and logs to get that data. I know for a fact that in the past month we have had 1 primary SCSI drive fail on a non-RAID server, resulting in 2 hours of downtime, and 1 RAID-1 SCSI server had 1 drive fail, resulting in 10 minutes of downtime to swap the drive out (non-hot-swappable RAID array). I'll leave the bean counting to someone else today.
- John C.
|
Posted by brunnock, 08-05-2006, 09:31 AM |
Like I said: salespeople will talk your ear off about how a more expensive RAID solution is better, but when you ask for proof, they don't deliver.
|
Posted by richy, 08-05-2006, 09:42 AM |
You really cannot be serious?
I don't post often here, but you really make me chuckle. On one system vs. another you might not notice a huge difference, but compare 50 RAID systems vs. 50 single-drive systems and you will see superior reliability in the RAID systems, as the probability of a failure affecting a server's availability is lower.
Simple maths. Chance of failure for one drive = X.
Chance of a single server going down due to drive failure therefore = X.
Chance of a two-drive system having a drive failure is therefore 2X; yes, it is more likely. But for that failure to result in downtime, acts of God notwithstanding, BOTH drives would have to fail. The chance of two drives failing due to mechanical faults at precisely the same time? Statistically, X².
I respect that you may not wish to believe your salespeople; they have a vested interest in selling you more bells and whistles, and if we all believed salespeople then we would probably all be driving around in Ferraris and having our 15-bedroom houses repo'd. However, independent people are suggesting that a RAID array with hot-swap capability would have mitigated your mechanical failure.
Personally, I'm on a RAID 10 array, my host backs me up nightly, and because I'm a pedant I back it up myself as well, because I don't want to be the one whining on WHT about my host not backing stuff up properly. Sure, if they ballsed it up, you'd have a right to call them on it.
I'll bite on your uptime comment: my uptime is 61 days, therefore my uptime for this month is 100%. What's your non-RAID server's uptime for this month? There's your comparison. It won't make a huge difference 19 months out of 20; that one month when you get a drive failure, it will.
So please respond with your % uptime for this month so we can have a fair comparison between RAID and non-RAID.
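To put rough numbers on that argument, a back-of-the-envelope sketch. The per-drive failure probability X below is made up for illustration, and the two drives are assumed to fail independently (a real rebuild window complicates this):

```python
# Back-of-the-envelope version of the argument above.
X = 0.01  # assumed per-drive failure probability over some window (made up)

p_down_single = X                     # single drive dies -> server is down
p_any_fail_mirror = 1 - (1 - X) ** 2  # ~2X: a drive event, but not downtime
p_down_mirror = X ** 2                # downtime requires BOTH drives to die

print(f"single drive, downtime risk:  {p_down_single:.4%}")
print(f"mirror, some drive fails:     {p_any_fail_mirror:.4%}")
print(f"mirror, downtime (both die):  {p_down_mirror:.4%}")
```

With X = 1%, the mirror sees a drive event about twice as often (≈1.99%) but is down only 0.01% of the time.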
|
Posted by brunnock, 08-05-2006, 10:29 AM |
As I said previously, one of the servers we retired was up for 2+ years. The other had been running for over a year. Not a single crash or reboot for either server. Our servers are typically up for months at a time. I could point you to the graphs at Netcraft, but they seem to have disabled that feature.
|
Posted by dollar, 08-05-2006, 10:31 AM |
Have you ever had a hard drive failure?
|
Posted by Weppel, 08-05-2006, 10:34 AM |
A RAID solution is all about reliability and making sure single points of failure are kept to a bare minimum. If you're not able to see that for yourself and keep asking for proof and uptime graphs, then please, by all means, stick to single-drive servers, but don't come crying here when another drive dies or fails on you.
Last edited by Weppel; 08-05-2006 at 10:37 AM.
|
Posted by Spyro, 08-05-2006, 11:37 AM |
The primary purpose of RAID is not to increase uptime but to decrease the downtime in the event of a hard drive crash. The fact of the matter is that in hosting, 2 hours of downtime may translate to losing hundreds of customers, depending on the circumstances in which the downtime occurs. To me, it is not worth leaving that possibility open.
Now if you are not in the hosting business, where downtime of any length can be lethal, then I suppose I could understand where you are coming from.
|
Posted by brunnock, 08-05-2006, 02:17 PM |
I am not whining about a hard drive failure. I have dealt with many HD failures over the years. I complained about Rackspace's response to what should have been a routine situation.
A poster stated that a RAID would eliminate HD-related downtime. Obviously, that's not true. I have asked, repeatedly, for proof that RAID servers are more reliable than non-RAID servers. No one has been able to prove this to me. All I've been reading is marketing hype. If someone could offer some proof that RAID servers are more reliable, then great. So far, all I'm seeing is anecdotes and opinions.
|
Posted by dollar, 08-05-2006, 02:28 PM |
From as much as I can read, the person you say stated that RAID would eliminate HD-related downtime actually stated: "If you don't have a RAID-1+ array, this will happen again, be it at Rackspace or at ValueWeb."
Maybe it could have been worded "it will most likely happen again" or something of the sort, but I'm not the one to argue over minor semantics. You keep insisting that people give you statistics; maybe if I spout out some numbers you'll be happy?
Here at home I've lost two HDs recently. One ended up in a complete system loss and downtime; the other resulted in zero downtime save for a reboot (no hot swap on home machines).
Is it an opinion or a fact that if your server had been set up with a RAID-1 array you would not have experienced the 13 hours of downtime that you did?
|
Posted by brunnock, 08-05-2006, 02:36 PM |
I asked for proof that RAID servers are more reliable than non-RAID servers and I don't want anecdotes. You just gave me another anecdote.
|
Posted by richy, 08-05-2006, 04:11 PM |
Yes, a correctly configured redundant array would reduce the level of HD-related downtime. That's an irrefutable fact.
I assume that since you are at Rackspace, your hosting is valuable and uptime is important. The concept you are failing to grasp is redundancy. We are dealing with probabilities, and you may not see it because you haven't dealt with serious numbers of servers, where the issue is more pronounced. Hard drives WILL fail. It's as certain as death, taxes, and Bush rigging an election. Now, you may not see it because you may not have enough servers for it to affect you considerably, but the thing is, one day it will, and that is where RAID saves your ***.
Answer me this: when the second mechanical failure occurred, if there had been a duplicate drive that automatically took over, would you have had downtime? I think you'll find the answer is no. Seriously, answer me that question.
I also note you failed to provide the uptime for your server this month to compare with mine. That itself proves the point. You may have a server up for a few years without a failure. I hate to think how many drives I have running right now that are over 2 years old. However, all my data is backed up, in duplicate. All the servers have RAID and two backups. You simply don't take chances, and for someone willing to pony up the greenbacks for Rackspace, it seems incredible that you don't wish to sit down and work out that RAID really does work.
I installed about 120 PCs in a school. We built them ourselves, and we bought ExcelStor hard drives (this was about 4 years ago); of the 140 we bought, I reckon 10 never worked out of the box and another 20 didn't survive burn-in. They were poor drives. They also didn't have to work too well, and they could be cheaply replaced, the reason being that besides the OS, drivers, and apps, nothing was installed on them. Now, the servers that held all the work were SCSI RAID 10 all the way. I can say, hand on heart, 4 years later, despite several SCSI drive failures, the only times the server has been down have been reboots for security updates and its yearly power supply replacement. The cost of RAID in that system was fairly respectable. However, given the need for it to work no matter what (as we couldn't be on-site 24/7), it was negligible.
I find it amusing that you pay for Rackspace quality, yet are quibbling over a few dollars for RAID. As for doubting it would increase uptime: if you wish to go against the advice and logic provided by people who have made plenty of money in the industry for a long time, then you're welcome to, but whine elsewhere. As for hosts messing up: sure, if ServInt ballsed up a backup I would ream them, but it wouldn't screw me over, because I have my own backups elsewhere which are just as up to date, and my downtime would only be as long as it takes to skip over to EV1 and instant-order a server and pull the backups; on a bad day, 30 minutes.
Have fun
|
Posted by richy, 08-05-2006, 04:14 PM |
http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
among others.
It has been proven statistically; if the logic is too tough, then we can't help you.
|
Posted by brunnock, 08-05-2006, 04:16 PM |
The CMU paper was written in 1988. That's nearly 20 years ago.
Would you like to take a guess at what the MTBF for HDs is nowadays?
Last edited by brunnock; 08-05-2006 at 04:21 PM.
|
Posted by JohnCrowley, 08-05-2006, 04:24 PM |
OK, my stats "at a glance":
10 years of hosting experience, numerous hard drive failures, and extended 2-6 hour downtime periods per failed drive on single-drive servers.
Total downtime for all RAID-1 SCSI drive related failures: less than 45 minutes in total.
Out of approx. 100 servers, we see a hard drive failure every month or two.
But don't take my word for it, as I'm just a salesman at heart.
- John C.
|
Posted by Bilco105, 08-05-2006, 04:25 PM |
OK, so let's put the brakes on this thread for a minute, since you're evidently trying to prove a point. In one paragraph, what is it you're trying to tell us? That redundant RAID is what? Pointless?
|
Posted by brunnock, 08-05-2006, 04:28 PM |
Why do I have to keep repeating myself? A poster stated that RAID would eliminate HD-related downtime. I asked for proof to back up that statement. So far, I've seen responses ranging from "Duh! That's obvious" to anecdotes. That's not proof.
|
Posted by brunnock, 08-05-2006, 04:36 PM |
I skimmed the CMU paper. The article states that RAIDs offer better I/O than SLEDs. There is no mention in the paper about reliability.
If you have to resort to lying, then you should drop the argument.
|
Posted by Bilco105, 08-05-2006, 04:36 PM |
No one stated anything even remotely similiar to that, you've simply taken the words of one person reccommending raid and come up with that statement all by yourself.
Its common logic that there will be less downtime involved taking your server offline and swapping one of the drives in a rebuilable raid array, than it would to take the server offline, swap in a new drive, restore all your data, then bring your server back online.
|
Posted by brunnock, 08-05-2006, 04:42 PM |
Wrong. The poster stated "If you don't have a RAID-1+ array, this will happen again...". This translates to "If you have a RAID-1+ array, this won't happen again". What are you reading?
Common logic? If you have more drives, then you'll have more drive problems.
|
Posted by richy, 08-05-2006, 04:52 PM |
Sure, that is logical: if you have two drives, you have twice the chance of a failure. Now sit down and work with me on this.
Say a drive fails in a normal single-drive system: you're stuffed. Now, if you have two drives, you have twice the chance of a drive failing; that is correct. What you are missing is that when one drive fails, it doesn't take the server down or lose your data. You simply hot-swap it: an alarm goes off, the DC swaps the drive out, and you get zero downtime.
Your only downtime is if both drives fail. The chances of that happening are far lower.
Take a coin and flip it: it's 50/50, heads or tails, right? Now take two coins: it's far more likely you will get at least one head from the two, right? That's your point: a failure is twice as likely. What you're missing is that for a failure to result in downtime, you would need two heads at once. Chance of at least one head: 75%. Chance of two heads: 25%. The maths isn't perfect, but the point remains the same. You seem to be arguing statistics without comprehending them.
Yes, that paper is old and MTBFs may have risen; in reality they have risen a huge amount. Case in point: HDD warranties in some cases now exclude drives from 24/7 use, which is not something that used to happen.
Do you simply not get that for HDD-related downtime both drives must die at the same time, and that the chance of this is extremely low compared to the chance of a single failure? Just be a man and answer that.
The proof is that you won't give me your server's uptime this month.
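The coin arithmetic, formalized (assuming independent fair coins, and the same algebra for two drives with failure probability X):

\[
P(\text{at least one head}) = 1 - \left(\tfrac{1}{2}\right)^2 = 75\%,
\qquad
P(\text{two heads}) = \left(\tfrac{1}{2}\right)^2 = 25\%;
\]
\[
P(\text{some drive fails}) = 1 - (1 - X)^2 \approx 2X,
\qquad
P(\text{both fail}) = X^2 .
\]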
|
Posted by Bilco105, 08-05-2006, 05:00 PM |
Yep, and if you have more than one car, you're more likely to have an accident. Come back when you lose your one-dimensional outlook on life.
|
Posted by brunnock, 08-05-2006, 05:12 PM |
We did have 2 drives fail in a server at the same time last year. The primary and the backup. We had to restore from tape. That happens all the time.
I have already proven that you're a liar. The CMU paper says nothing about reliability.
You really should stop participating in this thread.
|
Posted by Bilco105, 08-05-2006, 05:16 PM |
Ditto. You seem to be under the constant impression that someone said RAID = reliability; they did not. You simply took it that way.
|
Posted by JohnCrowley, 08-05-2006, 05:21 PM |
Sean, the bottom line is that a properly set up RAID-1 or higher system is more likely to have more uptime than a single-drive system, everything else being equal. Numerous people have "alluded" to it, and I have stated my observations (stats, if you will) over 10 years of doing this for a living, but it seems you just want to argue for the sake of arguing.
We've had single-drive systems run fine for 5+ years continuously under heavy load, and then some that have drives fail like it's a monthly party. But with our RAID systems, even with hard drive failures, we've never had one where both drives failed at the same time, which, for us and our customers, is "proof enough" that RAID provides better uptime over the long run. YMMV.
- John C.
|
Posted by richy, 08-05-2006, 05:26 PM |
A liar? Jesus, you try and help some people.
OK, what's your uptime for your server this month? Come on, answer me, or are you afraid you'll look more than a little stupid?
Why on earth would I lie? Do you think I have shares in a RAID manufacturer?
I showed you the maths: the chance of two drives failing at once is lower than that of one drive failing. Why won't you own up to that?
I'll participate in the thread for as long as I'm having a laugh winding you up.
So, you asked me for proof. My RAID server has 100% uptime this month; what's your server's uptime this month?
Got an issue with answering that?
And where's your PROOF I'm a liar?
|
Posted by brunnock, 08-05-2006, 05:28 PM |
Your observations are anecdotal evidence, John. That's not an insult; it's just not proof that servers with RAID are more reliable.
I have pointed out that we had a server with 2 drives fail at the same time last year.
And I have been doing UNIX and Internet systems management for nearly 20 years at MIT, Apple, BBN, and a bunch of other companies.
Cheers.
|
Posted by JohnCrowley, 08-05-2006, 05:30 PM |
Sean, we can split hairs all day. I guess you'll want the head tech at Rackspace to post an Excel spreadsheet showing failure rates broken down by type of system? If I felt it would actually help you, I would put together the actual stats over the past 10 years as a host, but wasting that time here is not something I'm interested in this weekend. Good luck at Rackspace.
- John C.
(Just an old anecdotal fool)
|
Posted by richy, 08-05-2006, 05:31 PM |
So what's your uptime this month?
You have yet to provide any evidence yourself to back up your 'claims'.
If RAID were useless, why does it exist?
Nobody said two failures never happen at once, just that it's less likely. That's pure statistics, so care to provide any proof of why you're right and statistics and everybody else here are wrong?
|
Posted by richy, 08-05-2006, 06:02 PM |
You say you worked for Apple; they seem to side with the people here that RAID is better for uptime.
http://www.apple.com/xserve/raid/
http://images.apple.com/server/pdfs/...ve_RAID_TO.pdf
Seriously, I do understand your logic. You are perfectly correct that 2x the drives = 2x the chance of a drive failure, but in a RAID system the chance of enough drives failing to impact uptime is lower than in a single-drive system. It's pure, irrefutable maths; even your old employers think so.
|
Posted by dollar, 08-05-2006, 06:06 PM |
If you have a RAID-1+ array, this won't happen again.
"This" being a single HD failure causing downtime on your host.
Unless, that is, you would like to show me statistics showing that two disks failing at the same time is as common as a single disk failing, or that a quality RAID card is more likely to fail than a quality disk drive.
|
Posted by robinbalen, 08-06-2006, 05:11 AM |
"I asked for proof that RAID servers are more reliable than non-RAID servers and I don't want anecdotes. You just gave me another anecdote."
I'm going to take the risk of being called a liar and give you the following anecdote:
We have many hundreds of servers online, some with RAID, some without. Over the past 2 years, we have had approximately 10 hard disk drive failures, spread evenly between servers with RAID and servers without.
Total downtime resulting from HDD failure on servers with RAID: NONE
Total downtime resulting from HDD failure on servers without RAID: Between 2.5 and 8 hours per failure depending on the amount of data and type of backups used.
Every single customer who initially opted for a non-RAID system and has experienced an HDD failure has since upgraded to RAID. It's a no-brainer, as the cost of a RAID card and a few additional HDDs is so much less than the cost of having a server offline for even the shortest amount of time for most of our clients.
I think I can give this information from a fairly neutral point of view... as we actually take a slight short-term financial hit on our RAID systems in order to encourage our customers to use them. We believe that over time it will save us time and hassle as a provider as well.
Last edited by robinbalen; 08-06-2006 at 05:15 AM.
|
Posted by brunnock, 08-06-2006, 06:55 AM |
Robin, it sounds like it's more expensive to support non-RAID servers than RAID servers. Is that correct?
|
Posted by mrzippy, 08-06-2006, 06:58 AM |
Why are you guys trying to convince brunnock that RAID is "better"? He/she obviously has his/her opinion set in stone and will not be changing his/her mind.
There are hundreds of "is RAID really better?" threads and discussions on newsgroups and forums, from 20 years ago to the present day.
I'm sure most of you will agree that experience is the best teacher, and in this case it is likely the inevitable future for brunnock if he stays in the business long enough. Reality hits hard when a drive fails, customers start calling, and you start wondering if maybe... just maybe... a RAID setup might have been a good idea.
So why continue to waste your time?
|
Posted by cywkevin, 08-06-2006, 06:58 AM |
Expense is relative. If downtime is expensive for you, then yes, non-RAID systems are more expensive in the long run. If you couldn't care less about downtime, then RAID is more expensive.
|
Posted by richy, 08-06-2006, 07:19 AM |
It depends which way you look at the expense. On a provider level, it is cheaper to support RAID servers, because you have less intangible loss. A client's uptime is key to the provider as well as the client; a client won't grow if they get downtime, and RAID is a relatively cheap way of reducing one area of downtime. There is an initial cost to supply RAID, and a small increase in the time taken to deploy a RAID server, but in the long term, even including the tech time to swap out a drive when an alarm goes off, it is BETTER to have RAID systems as a provider. Redundant power supplies are also nice.
I know I simply wouldn't go back to a single-drive solution; that would be like driving a car without airbags. Sure, day to day they don't save your ***, and sure, they cost a bit more, but when you need them, and odds are you will at some point, they will save your ***.
In my spare time I shoot photos; my home machine has a RAID 5 setup and the backup has a RAID 6 array. There is a speed increase and also redundancy. Sure, it costs more to do it that way, but it's money well spent. For the same reason, the last time I was in HI I made sure I rented a spare camera body to take on a shoot, even though I had two digitals and a film body; I'd have looked pretty foolish if they had failed and I was down to a single body, and I could have been sued. The few hundred bucks to rent a spare body was nothing compared to the cost of even one body breaking on me and the time taken to switch lenses.
You are perfectly right in saying more drives = more failures; nobody is disputing that. A drive mechanically failing (not due to a power surge) is a rare event, but when it does happen, if it takes your server down, you're down.
If you're at Rackspace and paying a premium for their support, what's the issue with paying a small amount more for hardware to match the support?
Also, I note you still haven't supplied your uptime for this month; I wonder why. Just to jog your memory, my RAID server is at 100%.
|
Posted by Bilco105, 08-06-2006, 08:07 AM |
I don't mean to sound rude, but could you please paragraph your responses? I'm sure you're making great points, but they're lost amongst everything else when not set out properly.
|
Posted by richy, 08-06-2006, 08:39 AM |
Sorry.
I'm still waiting on his uptime so we can have a fair comparison between a RAID and a non-RAID system's uptime.
|
Posted by Karl Austin, 08-06-2006, 11:04 AM |
Well, here's what the numbers say for the chance of failure in any given 24-hour period:
2 Drives + Raid Card: 0.0009230913%
Single Drive: 0.0012000000%
That's based on 1m hour MTBF drives and 1.3m hour MTBF for the RAID card.
Now, of course, that doesn't mean you won't get two drive failures at the same time, but RAID certainly, statistically (and in our experience), reduces the chance of downtime caused by drive failure.
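A sketch of how figures like these can be derived from MTBF ratings. The exact assumptions behind the numbers above aren't stated, so this sketch won't necessarily reproduce them to the digit; it assumes a constant failure rate (exponential model) and independent failures:

```python
# Rough 24-hour failure probabilities from MTBF ratings (sketch only).
import math

HOURS = 24.0
MTBF_DRIVE = 1_000_000.0  # hours, from the post above
MTBF_CARD = 1_300_000.0   # hours, from the post above

def p_fail(mtbf: float, hours: float = HOURS) -> float:
    # P(failure within t) = 1 - exp(-t / MTBF), roughly t / MTBF for small t.
    return 1.0 - math.exp(-hours / mtbf)

p_single = p_fail(MTBF_DRIVE)
# Mirrored pair + card: downtime needs both drives to die in the window,
# or the card itself to fail (the card term dominates by far).
p_raid = p_fail(MTBF_DRIVE) ** 2 + p_fail(MTBF_CARD)

print(f"single drive:       {p_single:.7%}")
print(f"RAID-1 pair + card: {p_raid:.7%}")
```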
|
Posted by brunnock, 08-06-2006, 11:21 AM |
Karl,
Your numbers indicate that a RAID card is slightly more reliable than a hard drive.
I think I would be better off leasing a backup server.
|
Posted by richy, 08-06-2006, 11:40 AM |
Or throw in a load balancer and a second front-end server? If your budget allows, that would add redundancy at the server level.
|
Posted by Karl Austin, 08-06-2006, 11:43 AM |
Well, it's an extra 299,980 hours on the MTBF of a single drive; that's about the same as the MTBF of a standard desktop drive. That was also just the first RAID card I plucked out of the air; some have MTBFs over 2.5m hours.
At the end of the day, it depends how much downtime costs you. Backups take time to restore: even over GigE, 100 GB of files will at best take around 14-15 minutes, and unless that is a bare-metal backup, you've also got to bring a server online, etc.
Backups are great for recovering from complete disaster, but they're not the solution if you stand to lose money on downtime.
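The 14-15 minute figure is consistent with the raw arithmetic for gigabit Ethernet, before protocol overhead:

\[
t \approx \frac{100\,\text{GB} \times 8\,\text{bit/byte}}{1\,\text{Gbit/s}} = 800\,\text{s} \approx 13.3\,\text{min}.
\]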
|
Posted by brunnock, 08-06-2006, 06:12 PM |
Karl,
1 million hours is over 100 years.
Do you really care if another vendor claims that their card will last 200 years?
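The arithmetic behind that figure:

\[
\frac{1{,}000{,}000\,\text{h}}{8{,}760\,\text{h/yr}} \approx 114\,\text{yr}.
\]

(Though, as the next replies note, MTBF is a fleet average, not a promise that any one unit lasts that long.)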
|
Posted by ddosguru, 08-06-2006, 08:18 PM |
I happen to be an Operations major, and unless something has changed, you can't actually declare an MTBF unless you have actually tested it to be accurate.
|
Posted by Karl Austin, 08-07-2006, 04:30 AM |
Yes, because it is Mean Time Between Failures; it doesn't mean every one will last 1m or 2m hours. Basically, the higher the number, the higher the chance of it not failing within the 3-to-5-year service window it'll have with us. So yes, I do care; without reliability you are nothing.
Jeff, I believe you're correct: a particular number of samples has to be tested in an accepted way before an MTBF can be declared, or at least that is my understanding of it. I know some manufacturers, like Emacs/Zippy, actually publish details of the samples used to find the MTBF.
|
Posted by KarlZimmer, 08-07-2006, 04:49 AM |
Correct: if you had a RAID 1 array, you would likely never face this issue of needing to shut down the box, back up your data, and restore it, so what he said is correct. I don't believe he was saying that if you go with RAID 1 you will somehow magically never have a disk die on you.
In a RAID 1 array, if one disk fails, you simply remove it, hot-swap in another disk, and depending on the card used, it will either rebuild automatically or simply require a reboot. That is, worst case, 5-10 minutes of downtime. Here you're stating that you were down for many HOURS because of a drive failure. So are you saying that you would rather have a single multi-hour downtime every 2 years or so, instead of a single 0-10 minute downtime every year? Yes, it is more drives, meaning your chances of a drive failing go up, but with the redundant disk it will take a MUCH shorter time to resolve the issue.
|
Posted by KarlZimmer, 08-07-2006, 05:02 AM |
Your point is? The debate was single drive vs. RAID, right?
Yes, you can get a backup server, but why not spend a lot less and just add redundancy to your current system? RAID will be a lot cheaper than getting a whole new system, and would provide roughly the same reliability.
Note: no one is saying not to keep off-site backups in addition to using RAID. RAID can save you in a failure, but even RAID can fail; that doesn't mean it is useless.
|
Posted by robinbalen, 08-07-2006, 05:24 AM |
"In a RAID 1 array, if one disk fails, you simply remove it, hot-swap in another disk, and depending on the card used, it will either rebuild automatically or simply require a reboot. That is, worst case, 5-10 minutes of downtime."
Reboots really ought to be unnecessary. We've done one once in our whole history, because of an unrelated issue; every other time, the rebuild has worked perfectly whilst the server remained fully online and functional. With 3ware (and others, I presume), you can even set the rebuild priority on a sliding scale, trading off rebuild speed against server I/O speed.
"Note: no one is saying not to keep off-site backups in addition to using RAID. RAID can save you in a failure, but even RAID can fail; that doesn't mean it is useless."
Yes, this is an important point. RAID protects from disk failures, but not from disasters affecting the whole server / rack / facility / city, etc. Also, it doesn't help if someone has accidentally deleted a file and needs it restored from 3 weeks ago.
|
Posted by brunnock, 08-07-2006, 08:58 AM |
My server was down for a long time because the hosting company misdiagnosed a hardware issue and then botched a restoration. A RAID won't make hosting companies better at their jobs.
And how easy is it to recover from a RAID crash?
20 years ago, hard drives were much more unreliable and there was a good case for RAID. Drives nowadays are much more reliable. They're actually approaching RAIDs in terms of MTBF.
I think that RAIDs have outlived their usefulness. Your arguments in favor of RAIDs point to redundant servers rather than redundant drives. It appears that Google learned that years ago.
|
Posted by Karl Austin, 08-07-2006, 09:16 AM |
Redundant servers pose issues all of their own; most applications just won't work across redundant servers.
|
Posted by richy, 08-07-2006, 09:18 AM |
Google do use RAID, for three reasons: redundancy, speed, and capacity. The reason they use multiple servers is that a single server powerful enough to serve all their content is not economically viable compared to multiple server farms.
Also, in this case it would have helped the DC staff: when the drive first exhibited issues, the RAID card would have emitted an alarm; the tech would have simply removed the drive caddy, replaced the drive, and let the RAID card rebuild the array. There would have been no downtime involved at all if it was done correctly, and it would have removed the need for the tech to diagnose the issue in the first place.
If drives these days are so much more reliable, then why did yours fail? More reliable, possibly, but not significantly so. The technological boundaries are pushed further these days, and warranties are tighter because drives are more prone to failure.
Last edited by richy; 08-07-2006 at 09:23 AM.
|
Posted by Devin-VST, 08-07-2006, 09:22 AM |
Another thread on WHT that will never end, but I might as well chime in.
I worked in an organization with ~7 TB or so of actively accessed data in a normal work day. If a hard drive goes down and people can't access their data, operations can't proceed as normal. Downtime is definitely NOT an option in any scenario.
Do you think that they would have chosen RAID over non-RAID for the storage setup, even if it cost a few hundred, or even a few thousand, more dollars for a controller, extra drives, etc.?
How easy is it to recover from a drive crash?
/me pushes switch, takes out hard drive, puts in new hard drive
/me goes home, takes a nap
It doesn't matter how reliable hard drives are becoming; it matters that using RAID WILL give you hard drive redundancy in case of failure. Unless I missed something about hard drives that never fail, RAID is the way to go if you really care about eliminating at least one point of failure and maintaining your business.
I don't think you'd want to tell your clients, "Well, we have one really reliable hard drive with all your data on it!! But if it fails, you're screwed. Hope the backups are recent!"
If you did have RAID when this hardware failure occurred, your web host's technician (no matter how stupid) would only have had to swap out a drive, not restore it. Yet another point of failure eliminated.
Richy is right too, but server redundancy is a whole different category; we're just talking about hard drives here, right?
Devin
|
Posted by richy, 08-07-2006, 09:33 AM |
Surely a second server would mean twice the number of moving parts, and therefore be twice as likely to fail, by your logic, brunnock.
BTW, you may wish to invest a small amount in an offsite backup somewhere like bqbackup.com. If you're backing up 5 GB, do a weekly backup yourself. It won't cost you much in transit, the account itself would be very cheap to maintain ($5 USD per month), and one day it really will save you.
|
Posted by Russ Foster, 08-07-2006, 10:12 AM |
Let me give you a situation that is happening right now... One of our older Virtuozzo boxes has just lost a disk in a RAID. Now, if this box didn't have RAID, at this exact moment there would be a number of customers, and their customers, screaming "why is our server down?" Instead, the scenario is that I'm copying the VPSes to a spare node for redundancy, and then the current node will have the hard disk replaced and the software upgraded. It was scheduled for maintenance in a few weeks, so I've just brought it forward.
So, no RAID: loss of business, SLA credits, reputation. Rough cost? £2-3k? Who knows?
With RAID: no downtime, no SLA credits, reputation still intact. Cost? £350 for the RAID card and hard disk, plus £75-ish for a new HDD.
Also, there are backups, just in case. As such, I always spec RAID.
|
Posted by lockbull, 08-07-2006, 10:56 AM |
brunnock, I highly suggest you read "Blueprints for High Availability" by Marcus & Stern; it is an excellent overview of the subject and lays out some fairly simple formulas you can use to calculate the availability of devices, subsystems, and entire systems. It also contains some nice real-world examples to reinforce the more academic treatment of the subject.
I would add that the market seems to disagree with you: more and more chipsets/motherboards are incorporating RAID, even on consumer models. I think you'll find a lot more articles on RAID on enthusiast websites now than, say, a few years back. And I'm not sure I understand your point about having a backup server; having a hot-spare machine and doing real-time or near-real-time replication is much more expensive and more difficult than setting up RAID. And of course the two are not mutually exclusive; anyone with important data to protect should be doing both.
My understanding is that Google is doing RAID, though perhaps not always at the drive level in the same server. They wrote their own file system that essentially includes RAID functionality; multiple single-drive machines hold the same data for redundancy purposes, and if one fails they just chuck the whole server and put in a new one.
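For reference, the basic formulas of that kind, stated here from standard reliability theory rather than quoted from the book: availability of a repairable component, of components in series (all must work), and of redundant components in parallel (at least one must work):

\[
A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}},
\qquad
A_{\text{series}} = \prod_i A_i,
\qquad
A_{\text{parallel}} = 1 - \prod_i \left(1 - A_i\right).
\]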
|
Posted by brunnock, 08-07-2006, 11:11 AM |
So what RAID standard have the server motherboard manufacturers settled on? I'd be curious to find out which flavor of RAID Apple will include with the new iPods.
That's not RAID. That's RAIS.
|
Posted by sirius, 08-07-2006, 11:14 AM |
When iPods become a critical part of the business world, maybe they will think it through.
I think this thread has run its course. There are some really sharp minds here, some of the smartest in the business. They've stopped to offer you some common-sense assistance and solutions, and that has resulted in them being called liars and argued with over what is, in fact, common sense.
Best of luck.
Sirius
|