

WebFaction ~48 hour down time plus data loss




Posted by gerdemb, 07-13-2011, 09:24 AM
Just a warning to everyone out there about WebFaction. My server was down for almost 48 hours and now it's back up, but with some of the recent data missing (files and database). During the outage, I was never contacted directly by WebFaction, but they did keep their status blog updated. I'm not a system administrator so I hesitate to criticize too much, but they wasted the first 24 hours trying to FSCK a failing disk twice, and then they finally resorted to restoring from what appears to be an old backup. I don't know why they couldn't just restore from the RAID array which should have been much faster.

I sent the following message to support a few minutes ago and I am waiting for their response:

Can you confirm if any data was lost on server web194 after it was restored? I've already had the following problems:

1. My Postgres password was reset and I had to change it back to the password I used before
2. My database seems to be missing data that was added recently
3. My SVN repository is missing a branch that I created a few days ago
4. A new Django application I created through the panel is missing.

I am extremely concerned because it appears that some of my recent data has been lost. Can you confirm how old the backup was that was used to restore the server? Items #2 and #3 were added a couple days ago which implies that the backup that was used was also a couple days old.

Finally, is this the appropriate area to raise questions about billing? I would like a credit for the downtime which was almost 48 hours. Additionally, with the data loss I'm very seriously considering switching to another host.

Posted by BeZazz, 07-13-2011, 09:41 AM
Quote:
Originally Posted by freezebox
That's unacceptable.
Without knowing all the facts I think it is a bit harsh to say that.

Nobody likes downtime, but it is a fact of life. 48 hours does seem too long, but like I said, without all the facts it is hard to comment.

Posted by cd/home, 07-13-2011, 09:43 AM
Didn't you have your own off-site remote backups?

Posted by gerdemb, 07-13-2011, 09:48 AM
Quote:
Originally Posted by cd/home
Didn't you have your own off-site remote backups?
No, but obviously I should have. If you have to start rolling your own backups, though, then you have to wonder what exactly you are paying your host for...

Poking through my database, I can see that the last entry was on 7/8, and since today is 7/13, that means about 5 DAYS of data lost!
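
The arithmetic behind that estimate can be sketched as follows. This is a minimal illustration using only the two dates given in the post; the function name is hypothetical. Note that the result is an upper bound on the loss: nothing could be written while the server was down, so part of the gap is downtime rather than missing data.

```python
from datetime import date

def data_loss_window_days(last_entry: date, checked_on: date) -> int:
    """Days between the newest surviving record and the day it was checked."""
    return (checked_on - last_entry).days

# Dates from the post: last database entry on 7/8, checked on 7/13 (2011).
last_entry = date(2011, 7, 8)
checked_on = date(2011, 7, 13)
print(data_loss_window_days(last_entry, checked_on))  # prints 5
```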

Posted by cd/home, 07-13-2011, 10:08 AM
Quote:
Originally Posted by gerdemb
No, but obviously I should have. If you have to start rolling your own backups, though, then you have to wonder what exactly you are paying your host for...

Poking through my database, I can see that the last entry was on 7/8, and since today is 7/13, that means about 5 DAYS of data lost!
You're paying your webhost to host the content, not to keep additional backups for you. It's YOUR data, so it's your responsibility to keep some off-site remote backups.

Judging by the dates, it appears your host takes weekly backups and not daily ones.

I find taking daily, weekly, and monthly backups to be best, as it gives better data retention.
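
A daily/weekly/monthly scheme like this could be sketched as a pruning rule. This is only an illustration under assumed retention counts (7 daily, 4 weekly, 6 monthly); the function name and the defaults are hypothetical, not anything a particular host or tool uses.

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=6):
    """Return the set of backup dates to retain.

    Keeps the newest `daily` backups, plus the newest backup from each of the
    most recent `weekly` ISO weeks and `monthly` calendar months.
    """
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])

    def keep_newest_per_bucket(bucket_of, limit):
        seen = []
        for d in ordered:
            bucket = bucket_of(d)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == limit:
                break     # enough buckets covered
            seen.append(bucket)
            keep.add(d)

    keep_newest_per_bucket(lambda d: d.isocalendar()[:2], weekly)   # (year, week)
    keep_newest_per_bucket(lambda d: (d.year, d.month), monthly)    # (year, month)
    return keep

# Example: 60 consecutive daily backups ending on 2011-07-13.
newest = date(2011, 7, 13)
all_backups = [newest - timedelta(days=n) for n in range(60)]
kept = backups_to_keep(all_backups)
for d in sorted(kept):
    print(d.isoformat())
```

Everything not in the returned set can be deleted, so storage stays bounded while older restore points remain available at a coarser granularity.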

Posted by JLHC, 07-13-2011, 10:17 AM
It is always recommended to keep your own backup copy, even if your hosting provider promises to keep backups as well. You can't go wrong with multiple copies of your backups.

Their website does state that they perform backups daily. You may want to ask them why they restored a backup from 5 days ago rather than something more recent.
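
For the kind of data mentioned in this thread (a Postgres database and an SVN repository), keeping your own copy might look roughly like the sketch below. It assumes the standard `pg_dump` and `svnadmin dump` command-line tools are available and that database credentials are already configured; the database name, repository path, and backup directory are placeholders, and copying the resulting files off the host is left out.

```python
import subprocess
from datetime import date
from pathlib import Path

def plan_backups(db_name, svn_repo, dest_dir, today=None):
    """Return (command, output_file) pairs; both tools write the dump to stdout."""
    stamp = (today or date.today()).isoformat()
    dest = Path(dest_dir)
    return [
        (["pg_dump", "--format=custom", db_name],
         dest / f"{db_name}-{stamp}.pgdump"),
        (["svnadmin", "dump", "--quiet", str(svn_repo)],
         dest / f"svn-{stamp}.svndump"),
    ]

def run_backups(plan):
    """Run each command, redirecting its stdout into the planned file."""
    for command, output_file in plan:
        output_file.parent.mkdir(parents=True, exist_ok=True)
        with open(output_file, "wb") as fh:
            subprocess.run(command, stdout=fh, check=True)

# Placeholder names; this only prints the plan and does not run anything.
plan = plan_backups("mydb", "/home/me/svn/repo", "/home/me/backups",
                    today=date(2011, 7, 13))
for command, output_file in plan:
    print(" ".join(command), ">", output_file)
```

Calling `run_backups(plan)` from a daily scheduled job, followed by a copy to a machine outside the host's data center, would give the off-site copy discussed above.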

Posted by gerdemb, 07-13-2011, 10:23 AM
Quote:
Originally Posted by cd/home
You're paying your webhost to host the content, not to keep additional backups for you. It's YOUR data, so it's your responsibility to keep some off-site remote backups.

Judging by the dates, it appears your host takes weekly backups and not daily ones.

I find taking daily, weekly, and monthly backups to be best, as it gives better data retention.
It seems to me there are two kinds of backups. The first are backups that are "live" or as close to live as possible. These backups are used in emergencies to restore a server when it goes down. The second kind of backup is a daily/weekly/monthly backup, which is really more like an archive.

I think the "live" backups are the responsibility of the host. It's not feasible or reasonable to expect that customers keep a continuously updated copy of their site offsite. The archival backups can and should be done by the customer.

Just my opinion and of course the hosting provider may have a different idea, but I've never seen a shared host advertise that they only "host files" and if the server goes down you're responsible for restoring everything...

Posted by cd/home, 07-13-2011, 11:48 AM
Quote:
Originally Posted by gerdemb
Just my opinion and of course the hosting provider may have a different idea, but I've never seen a shared host advertise that they only "host files" and if the server goes down you're responsible for restoring everything...
I never said the client would be responsible for restoring everything. I said the client is responsible for their OWN data, so keeping their own backups is critical. A webhosting provider is merely putting your site on the internet.

I just wish people would stop expecting webhosts to do everything for them. I mean, do you expect your webhost to keep your local machine virus-free and cook you breakfast in the morning?

Posted by gerdemb, 07-13-2011, 12:10 PM
Quote:
Originally Posted by cd/home
I never said the client would be responsible for restoring everything. I said the client is responsible for their OWN data, so keeping their own backups is critical. A webhosting provider is merely putting your site on the internet.

I just wish people would stop expecting webhosts to do everything for them. I mean, do you expect your webhost to keep your local machine virus-free and cook you breakfast in the morning?
Naturally a client should keep backups of their data, but I don't think it's unreasonable to expect a provider to have a recent backup of the server that includes client data. Without a backup in the data center, the host could lose all the client data on a server and expect their clients to restore everything manually?! I doubt many clients would call that reasonable.

Anyway, in this case the backup the host restored appeared to be 2-3 days old, which probably isn't much better than I would've had if I'd been doing it manually. I'll repeat what I said before, which is that I expect my hosting provider to handle server problems without losing my data. Obviously "stuff happens", but a 48 hour downtime with 1-2 days of data lost is unacceptable in my opinion.

Still waiting for WebFaction to get back to me BTW. So far they have not admitted/confirmed that data was lost.

Posted by quantumphysics, 07-13-2011, 12:12 PM
Your data is your responsibility. Hardware fails, **** happens.

Also, minute-by-minute updates are already a hell of a lot more notification than most hosts give.

Posted by cd/home, 07-13-2011, 12:59 PM
Quote:
Originally Posted by gerdemb
Naturally a client should keep backups of their data, but I don't think it's unreasonable to expect a provider to have a recent backup of the server that includes client data. Without a backup in the data center, the host could lose all the client data on a server and expect their clients to restore everything manually?! I doubt many clients would call that reasonable.

Anyway, in this case the backup the host restored appeared to be 2-3 days old, which probably isn't much better than I would've had if I'd been doing it manually. I'll repeat what I said before, which is that I expect my hosting provider to handle server problems without losing my data. Obviously "stuff happens", but a 48 hour downtime with 1-2 days of data lost is unacceptable in my opinion.

Still waiting for WebFaction to get back to me BTW. So far they have not admitted/confirmed that data was lost.
Unfortunately, restoring servers can be very painful and take a lot of time, sometimes up to 72 hours...

If downtime because of data loss is important to you, you need to get your own dedicated server.

Posted by gerdemb, 07-13-2011, 05:05 PM
WebFaction support got back to me and confirmed that data was lost during the server outage. They are claiming that the backup they restored from was from the same day the disk died, but I have some circumstantial evidence that seems to indicate that more data was lost. Anyway, they offered me 3 months free as compensation for the problem, and I think I'm going to stick with them for the time being.

Throughout the outage, WebFaction gave regular status updates on their blog, and they got back to me quickly after I opened a ticket. I'll give them kudos for good support, and I'm hopeful this really was a "once in 8 years of business" kind of event.

--Ben

Posted by jamesgrey, 07-13-2011, 05:13 PM
Quote:
Originally Posted by gerdemb
I'm not a system administrator so I hesitate to criticize too much, but they wasted the first 24 hours trying to FSCK a failing disk twice, and then they finally resorted to restoring from what appears to be an old backup. I don't know why they couldn't just restore from the RAID array which should have been much faster.
I am a System Administrator, so that skews my view a little here. I've read over all the status posts, and it looks like the failing disk was replaced before they ever did the FSCK. So that was the right step. I'm assuming that once the new drive was inserted and the RAID started rebuilding, the filesystem went read-only and required an FSCK to boot.

That usually means that there is a problem with the file system. I'm sure that any sysadmin worth his/her salt would try to skip the FSCK just to get the server back online to check it out. I think you are wrong about the FSCK'ing of the failing disk, since they said they had the disk replaced before the FSCK posts went up.

I can't say anything about the old backup since I don't know about it.

A RAID array doesn't work that way. It works like this (depending on the type; for this one I'm pretty sure this is how it works): there is a RAID controller and 2 hard drives, and the RAID controller writes all data to both hard drives. When one hard drive is replaced, the data from the one remaining hard drive is transferred to the new hard drive. ALL of the data.

So, since the drive was replaced, only 1 drive had any data on it. That data was then copied to the new drive. An FSCK is used to fix file system errors, and since ALL of the data was copied to the new drive, any file system errors would have been copied as well. If this is how their RAID is set up, there isn't an option to "restore from the RAID" because the errors were in the filesystem.
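
The mirroring behaviour described above can be illustrated with a toy model. This is only a sketch of a two-disk mirror, assuming the setup guessed at in this post; the class, its methods, and the block contents are all made up for illustration and do not reflect how any real RAID controller is implemented.

```python
class ToyMirror:
    """Toy two-disk mirror: each disk is a dict of block number -> bytes."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, block, data):
        # Every write goes to all working disks, good data and bad alike.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def fail(self, index):
        self.disks[index] = None

    def replace_and_rebuild(self, index):
        # The new disk is filled by copying EVERYTHING from the survivor.
        survivor = self.disks[1 - index]
        self.disks[index] = dict(survivor)

array = ToyMirror()
array.write(0, b"filesystem metadata")
array.write(1, b"customer data")

# Something damages the filesystem metadata; the mirror faithfully stores it.
array.write(0, b"corrupted metadata")

array.fail(1)                 # one disk dies
array.replace_and_rebuild(1)  # new disk is rebuilt from the surviving one

print(array.disks[0][0])  # b'corrupted metadata'
print(array.disks[1][0])  # b'corrupted metadata'
```

Both disks end up holding the same damaged block, which is why the rebuilt array still needs a filesystem check or a restore from a separate backup.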

I know other people already commented about having your own backups, but no matter the host, if it is your data, you are responsible for its availability. If one of my servers goes down, I'm responsible, even though it may be the datacenter that took it offline. If the machine catches on fire, I'm responsible for having backups hot and ready to go. Looking at their pricing, you could easily get another server for less than $20 a month total and "load balance" your sites.

Posted by gerdemb, 07-13-2011, 07:10 PM
Quote:
Originally Posted by jamesgrey
I am a System Administrator, so that skews my view a little here. I've read over all the status posts, and it looks like the failing disk was replaced before they ever did the FSCK. So that was the right step. I'm assuming that once the new drive was inserted and the RAID started rebuilding, the filesystem went read-only and required an FSCK to boot.

That usually means that there is a problem with the file system. I'm sure that any sysadmin worth his/her salt would try to skip the FSCK just to get the server back online to check it out. I think you are wrong about the FSCK'ing of the failing disk, since they said they had the disk replaced before the FSCK posts went up.

I can't say anything about the old backup since I don't know about it.

A RAID array doesn't work that way. It works like this (depending on the type; for this one I'm pretty sure this is how it works): there is a RAID controller and 2 hard drives, and the RAID controller writes all data to both hard drives. When one hard drive is replaced, the data from the one remaining hard drive is transferred to the new hard drive. ALL of the data.

So, since the drive was replaced, only 1 drive had any data on it. That data was then copied to the new drive. An FSCK is used to fix file system errors, and since ALL of the data was copied to the new drive, any file system errors would have been copied as well. If this is how their RAID is set up, there isn't an option to "restore from the RAID" because the errors were in the filesystem.

I know other people already commented about having your own backups, but no matter the host, if it is your data, you are responsible for its availability. If one of my servers goes down, I'm responsible, even though it may be the datacenter that took it offline. If the machine catches on fire, I'm responsible for having backups hot and ready to go. Looking at their pricing, you could easily get another server for less than $20 a month total and "load balance" your sites.
Thanks very much for your detailed interpretation! Like I said, I'm not a system administrator, and your explanation helped clarify for me what probably happened. It seems like recovering from a crashed disk can take a long time--I thought with a RAID system you could just mirror from the good drive, but if the file system is corrupted then I guess you're just screwed?

How do the "big websites" handle this? I mean you very rarely hear about them being down for long or losing data, but they must be just as susceptible to disk crashes as anyone else. Redundant servers with redundant copies of the data I guess?

Also, it seems like a backup could be restored at the same time the file system repair was going. Once it became obvious that the file system was dead the backup could have been rolled out quickly. That's probably what they're talking about when they say they are "revising procedure". Oh well, live and learn...
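
The idea of running the repair and the restore side by side could be sketched like this. It is purely illustrative: the two task functions are stand-ins that sleep instead of doing real work, and it assumes the backup can be staged onto separate hardware while the repair runs, which may not match WebFaction's actual setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def repair_filesystem():
    """Stand-in for a long filesystem check; returns True if the repair worked."""
    time.sleep(0.2)
    return False  # in this scenario the repair fails

def stage_restore():
    """Stand-in for restoring the latest backup onto spare hardware."""
    time.sleep(0.1)
    return "restored-from-backup"

def recover(repair, stage):
    """Start both jobs at once; fall back to the staged restore if repair fails."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        repair_job = pool.submit(repair)
        restore_job = pool.submit(stage)
        if repair_job.result():
            return "repaired-in-place"
        # The restore has been running the whole time, so little extra wait here.
        return restore_job.result()

print(recover(repair_filesystem, stage_restore))  # prints restored-from-backup
```

In this sketch the total recovery time is roughly the longer of the two jobs rather than their sum, which is the saving being suggested.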

Posted by quantumphysics, 07-13-2011, 07:11 PM
Quote:
Originally Posted by gerdemb
Thanks very much for your detailed interpretation! Like I said, I'm not a system administrator, and your explanation helped clarify for me what probably happened. It seems like recovering from a crashed disk can take a long time--I thought with a RAID system you could just mirror from the good drive, but if the file system is corrupted then I guess you're just screwed?

How do the "big websites" handle this? I mean you very rarely hear about them being down for long or losing data, but they must be just as susceptible to disk crashes as anyone else. Redundant servers with redundant copies of the data I guess?

Also, it seems like a backup could be restored at the same time the file system repair was going. Once it became obvious that the file system was dead the backup could have been rolled out quickly. That's probably what they're talking about when they say they are "revising procedure". Oh well, live and learn...
Redundant backups
Redundant servers
Redundant datacenters
Redundant copies of copies of copies
Redundant redundancy
Massive load balancing and clustering

Posted by Webbyhost, 04-16-2013, 10:46 AM
When I read this post, it was déjà vu.

They write: Daily backup of all data to a remote location*

And the star means: * Please note that WebFaction cannot guarantee the existence or completeness of any backups. Customers are responsible for backing up their data.

So why do they say they do a daily backup?

So the situation is: server down for about 36 hours, and some sites are up again but missing pics.

Posted by 0x01-Security, 04-18-2013, 02:52 PM
I think if a provider advertises daily backups as a feature or product perk to the customer, then they should follow through with that. It is also a good idea for customers to set up their own data retention for safety in the event of multiple failures by the provider (hardware + backup(s)).


