Liquid Web Down?
Posted by griffe, 09-29-2010, 01:35 AM |
Seems that all our servers are down.
The ones on Storm on Demand are intermittent...
When calling their support# I get "Automated Attendant-no one can be reached.. Transferring to backup attendant, backup attendant cannot be reached."
Their main site works.. and their PIMS Sonar says my server is OK!
What???? |
Posted by Dustin Cisneros, 09-29-2010, 01:38 AM |
My box is up and running. |
Posted by Ackk, 09-29-2010, 01:38 AM |
All of our servers are inaccessible as well.
I get the same error message when trying to call the tech support. No response from their email support system or live chat either.
Everything has been down for ~15 minutes now, and still no idea what is going on. |
Posted by Ackk, 09-29-2010, 01:39 AM |
When I try running a trace to the servers, it seems to die here:
12 ae-1-100.ebr2.Denver1.Level3.net (4.69.132.38) 72.201 ms 73.761 ms 72.606 ms
13 ae-3-3.ebr1.Chicago2.Level3.net (4.69.132.62) 81.713 ms 97.504 ms 82.216 ms
14 ae-1-51.edge2.Chicago2.Level3.net (4.69.138.131) 80.763 ms 81.555 ms 81.824 ms
15 GLOBAL-INTE.edge2.Chicago2.Level3.net (4.59.29.78) 90.572 ms 86.687 ms 97.689 ms
*
*
*
* |
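For anyone who wants to pin down where a trace like the one above stops, here is a minimal sketch. It assumes Unix-style traceroute output of the form "hop hostname (ip) times", as pasted in the post; the function name and the SAMPLE constant are made up for illustration and are not part of any tool.

```python
import re

# Matches a hop line such as:
#   "15 GLOBAL-INTE.edge2.Chicago2.Level3.net (4.59.29.78) 90.572 ms ..."
# Lines consisting only of "*" (no reply) do not match.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)")


def last_responding_hop(trace_text):
    """Return (hop_number, hostname, ip) of the last hop that replied, or None."""
    last = None
    for line in trace_text.splitlines():
        match = HOP_RE.match(line)
        if match:
            last = (int(match.group(1)), match.group(2), match.group(3))
    return last


# The trace pasted above, copied verbatim.
SAMPLE = """\
12 ae-1-100.ebr2.Denver1.Level3.net (4.69.132.38) 72.201 ms 73.761 ms 72.606 ms
13 ae-3-3.ebr1.Chicago2.Level3.net (4.69.132.62) 81.713 ms 97.504 ms 82.216 ms
14 ae-1-51.edge2.Chicago2.Level3.net (4.69.138.131) 80.763 ms 81.555 ms 81.824 ms
15 GLOBAL-INTE.edge2.Chicago2.Level3.net (4.59.29.78) 90.572 ms 86.687 ms 97.689 ms
*
*
*
*
"""

hop = last_responding_hop(SAMPLE)
```

The last hop that replies only shows where replies stopped being seen. It does not by itself prove which device failed, since some routers simply do not answer probes.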
Posted by Jedito, 09-29-2010, 01:39 AM |
half of our servers are down too. |
Posted by griffe, 09-29-2010, 01:40 AM |
This is crazy!
What Data centers are both of you hosted in?
We are in Datacenter 2: https://www.liquidweb.com/datacenter/datacenter2.html
Seems that the whole DC is down and their phone system is hosted there too... Their web site must be in Datacenter 1 or 3. |
Posted by litobelo, 09-29-2010, 01:42 AM |
All our servers down too |
Posted by Dustin Cisneros, 09-29-2010, 01:43 AM |
WHT is also up :-p |
Posted by griffe, 09-29-2010, 01:45 AM |
Quote:
Originally Posted by semoweb
WHT is also up :-p
|
Yes - it's probably only DataCenter 2 that's down.. They have three... WHT is probably on DC 1 or 3. |
Posted by LiquidWebTravis, 09-29-2010, 01:46 AM |
There is an event that is affecting a small subset of customers in one of our datacenters. I will be updating people as more information becomes available. We will also be contacting affected customers. Thank you |
Posted by griffe, 09-29-2010, 01:48 AM |
Quote:
Originally Posted by LiquidWebTravis
There is an event that is affecting a small subset of customers in one of our datacenters. I will be updating people as more information becomes available. We will also be contacting affected customers. Thank you
|
You call your Phone & Ticketing Systems being down a "small subset of customers"?
Be honest with us here.. |
Posted by Ackk, 09-29-2010, 01:48 AM |
Hi Travis -- thanks for the response. Is it just a connectivity issue? Are our servers/data safe? |
Posted by roccol, 09-29-2010, 01:54 AM |
I'm down and am in DC1.
"We will also be contacting affected customers."
What does that mean? Doesn't sound good. |
Posted by WebSavvyGuy, 09-29-2010, 01:57 AM |
Half our servers are down also.
Hope it gets up soon |
Posted by griffe, 09-29-2010, 02:01 AM |
I wish this would end. |
Posted by thinkaholic, 09-29-2010, 02:03 AM |
Sounds like a connectivity issue to the DC. I have a server that's down and one that's up, so they must be in different DCs. It sucks that it's down and I was working on a website when it did, but I'm confident my server didn't crash since others seem to be affected. At least WHT is up so we can get some answers! |
Posted by LiquidWebTravis, 09-29-2010, 02:03 AM |
I assure everyone that we will provide complete details. We are in the process of identifying the issue and getting systems back online. Things are coming back up now and I guarantee you I will update this thread as soon as we have more detailed information. |
Posted by griffe, 09-29-2010, 02:04 AM |
One of our servers just came back online & their phones are back online.
Ok thanks Travis. |
Posted by thinkaholic, 09-29-2010, 02:05 AM |
Yep, my site is loading now. Yay! Thanks for the fast response Trav! |
Posted by roccol, 09-29-2010, 02:07 AM |
Well my servers were rebooted. It must be an ip problem, as I have some sites working on the same server and others that are not. |
Posted by thinkaholic, 09-29-2010, 02:07 AM |
Seems to be intermittent. Down again. |
Posted by knkk, 09-29-2010, 02:11 AM |
I host with them, too. My server has been down for about an hour now, I get no responses on the support ticket I have raised, and I am unable to reach phone support (get the same message that "griffe" right on top mentioned). |
Posted by LiquidWebTravis, 09-29-2010, 02:13 AM |
Quote:
Originally Posted by knkk
I host with them, too. My server has been down for about an hour now, I get no responses on the support ticket I have raised, and I am unable to reach phone support (get the same message that "griffe" right on top mentioned).
|
I am very sorry for this issue and I assure you we are doing everything we can to fix it. The support status page has been updated: http://www.liquidweb.com/support/
I will follow up with more details as soon as possible. |
Posted by knkk, 09-29-2010, 02:14 AM |
Yes, appears to be back. After downtime of about an hour... |
Posted by LiquidWebTravis, 09-29-2010, 02:15 AM |
We have not finished completely investigating this issue, but here is an update:
The support status page has been updated http://www.liquidweb.com/support/
"DC2 Power Event
A subset of customers were affected by an apparent power surge which took a UPS system offline and tripped the breaker protecting that system. Power has been restored to those affected. We are working to completely restore all services. We will provide regular updates via this support page. As always please feel free to contact us via ticket or by calling our toll-free support number." |
Posted by CretaForce, 09-29-2010, 02:16 AM |
One server is rebooted. Looks like a power outage. |
Posted by litobelo, 09-29-2010, 02:16 AM |
Quote:
Originally Posted by knkk
I host with them, too. My server has been down for about an hour now, I get no responses on the support ticket I have raised, and I am unable to reach phone support (get the same message that "griffe" right on top mentioned).
|
Now we all know what "Heroic Support" means actually |
Posted by roccol, 09-29-2010, 02:17 AM |
Quote:
Originally Posted by LiquidWebTravis
We have not finished completely investigating this issue, but here is an update:
The support status page has been updated http://www.liquidweb.com/support/
"DC2 Power Event
A subset of customers were affected by an apparent power surge which took a UPS system offline and tripped the breaker protecting that system. Power has been restored to those affected. We are working to completely restore all services. We will provide regular updates via this support page. As always please feel free to contact us via ticket or by calling our toll-free support number."
|
There has to be more than this. My servers are up, yet some sites resolve and others do not. |
Posted by Dustin Cisneros, 09-29-2010, 02:18 AM |
Yup,
DC2 Power Event
A subset of customers were affected by an apparent power surge which took a UPS system offline and tripped the breaker protecting that system. Power has been restored to those affected. We are working to completely restore all services. We will provide regular updates via this support page. As always please feel free to contact us via ticket or by calling our toll-free support number.
www.liquidweb.com/netstatus/ |
Posted by knkk, 09-29-2010, 02:20 AM |
Thanks, Travis. It happens to the best of us |
Posted by griffe, 09-29-2010, 02:21 AM |
All our servers are back online now. |
Posted by Dustin Cisneros, 09-29-2010, 02:21 AM |
Well, at least the breaker did its job to prevent damage. Also a good thing the breaker didn't go bad; that could have taken time to replace. |
Posted by roccol, 09-29-2010, 02:21 AM |
Yes but are all your sites online? I have most sites routing, but plenty are not. |
Posted by legrenzi, 09-29-2010, 02:31 AM |
WHT is the only thing keeping me sane at the moment! My server has been down for just over an hour now. I'm not getting any sites back yet like some are reporting, I also haven't had any response from my ticket I issued. |
Posted by LiquidWebTravis, 09-29-2010, 03:01 AM |
Quote:
Originally Posted by legrenzi
WHT is the only thing keeping me sane at the moment! My server has been down for just over an hour now. I'm not getting any sites back yet like some are reporting, I also haven't had any response from my ticket I issued.
|
I am sorry to hear that. Most services are coming back online but I would be happy to look at your ticket and try to help. Could you please provide me your ticket number? Thank you and I apologize for this hassle today. |
Posted by legrenzi, 09-29-2010, 03:08 AM |
Thanks Travis, my ticket number is 2325308, I have had a response just after I posted my message here. Although it contained the information already given on the support page. |
Posted by LiquidWebTravis, 09-29-2010, 03:10 AM |
A quick note: if your server is still having trouble please make sure to submit a ticket or call us at 1-800-580-4985. We have technicians standing by to restore any systems that didn't restart completely after the power cycle. Thank you |
Posted by Ackk, 09-29-2010, 03:56 AM |
Hi Travis -- any chance you can look into ticket #2325748. One of our critical servers is still having networking issues, and it's been ~40 minutes since the ticket was opened without any updates. Thanks. |
Posted by LiquidWebTravis, 09-29-2010, 03:58 AM |
Quote:
Originally Posted by Ackk
Hi Travis -- any chance you can look into ticket #2325748. One of our critical servers is still having networking issues, and it's been ~40 minutes since the ticket was opened without any updates. Thanks.
|
I will have a supervisor take a look as soon as possible. |
Posted by ellinas, 09-29-2010, 04:17 AM |
My server just came up after opening a ticket through PIMS (ticket was answered in 3 minutes!!). |
Posted by Ackk, 09-29-2010, 04:24 AM |
Quote:
Originally Posted by LiquidWebTravis
I will have a supervisor take a look as soon as possible.
|
Thanks, issue has been fixed now. Now the fun begins of bringing up all the software that was unexpectedly shut down. |
Posted by roccol, 09-29-2010, 05:46 AM |
I have one server down again for 40 minutes. Nobody has responded to my ticket. |
Posted by roccol, 09-29-2010, 06:03 AM |
I have to say I am terribly disappointed with LW. One hour and not even a response to my ticket. On hold over phone for 20 minutes. |
Posted by DianeV, 09-29-2010, 06:25 AM |
One of our clients' VPSes was brought back up, but is now down. Just FYI, I posted this in PIMS -- a tracert shows that the request goes into LiquidWeb but never makes it to the client server. Here's the tracert (my IP address edited out)
3 xx.xx.xxx.xxx 7ms 7ms 8ms TTL: 0 (cpe-xxx-xxx-xxx-xxx.carolina.res.rr.com ok)
4 24.93.64.171 11ms 13ms 11ms TTL: 0 (ge-2-2-0.rlghncrdc-pop1.southeast.rr.com fraudulent rDNS)
5 66.109.6.80 18ms 19ms 18ms TTL: 0 (ae-3-0.cr0.dca10.tbone.rr.com fraudulent rDNS)
6 66.109.6.169 20ms 23ms 19ms TTL: 0 (ae-2-0.pr0.dca10.tbone.rr.com fraudulent rDNS)
7 64.209.111.137 19ms 19ms 19ms TTL: 0 (No rDNS)
8 208.49.135.162 48ms 47ms 49ms TTL: 0 (GIGLINX.TenGigabitEthernet3-1.ar5.CHI2.gblx.net fraudulent rDNS)
9 209.59.157.224 55ms 58ms 55ms TTL: 0 (lw-dc2-core3-te9-1.rtr.liquidweb.com fraudulent rDNS)
10 209.59.157.216 57ms 56ms 58ms TTL: 0 (lw-dc2-sec1-dist1-po1.rtr.liquidweb.com fraudulent rDNS)
11 No Response * * *
12 No Response * * *
13 No Response * * *
14 No Response * * *
15 No Response * * *
16 No Response * * *
17 No Response * * *
18 No Response * * *
19 No Response * * *
20 No Response * * *
21 No Response * * *
22 No Response * * *
23 No Response * * *
24 No Response * * *
25 No Response * * *
26 No Response * * *
27 No Response * * *
28 No Response * * *
29 No Response * * *
Hope this helps. |
Posted by Vaya, 09-29-2010, 07:06 AM |
1:45 of downtime so far for me. I initially submitted a reboot request which was responded to within 20 minutes (no mention whatsoever of network issues) but after pointing out that everything is still down I've heard nothing at all, over an hour later. Not enjoying this. |
Posted by Jedito, 09-29-2010, 07:19 AM |
I have a server still down, 6 hours already. |
Posted by BIG DUANE, 09-29-2010, 07:22 AM |
I'm in the same boat as Jedito.
Noticed sites not responding around 1:35am Eastern, submitted a ticket at 1:41, got a reply at 2:14, but still nothing back up and running yet.
Resubmitting a ticket seems to have worked for some, maybe I'll try that. |
Posted by roccol, 09-29-2010, 07:26 AM |
My server went down, came back up for a while and then went down again for a couple of hours. It is now back up again. Hopefully for good. I don't know what the problem was the second time around, but the server didn't need to be rebooted, so it was network related. |
Posted by Jedito, 09-29-2010, 07:26 AM |
I got a canned response from somebody who didn't even read the ticket. |
Posted by badboyx, 09-29-2010, 07:30 AM |
I'm wondering what the benefit of Sonar Monitoring is if our servers have been down for 4+ hours |
Posted by Vaya, 09-29-2010, 07:31 AM |
Mine's back up after 2 hours 7 minutes. I had another 1 hour 20 of downtime earlier today, presumably connected, so that's about 3.5 hours today. I just hope that's the last of it. |
Posted by SpaceWalker, 09-29-2010, 07:33 AM |
Shouldn't they know about offline VPSs through their monitoring system?
Mine has been down for 4 hours now.
Opened a ticket 30+ minutes ago, without any response yet (I guess they're running low on support staff). |
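On the monitoring question raised in the posts above: one option for customers is to run their own check from a machine outside the affected datacenter. The following is a minimal sketch of such a check, not a description of how Sonar Monitoring works. The function name is hypothetical, and a TCP connect is assumed to be a good enough signal that the server is up.

```python
import socket


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage, run from a host in a different datacenter or network:
#   if not is_reachable("www.example.com", 80):
#       print("server unreachable from outside")
```

A check like this only reports reachability from wherever it runs, so it is most useful on a network that does not share power or upstream links with the server being watched.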
Posted by SpaceWalker, 09-29-2010, 07:55 AM |
It's been an hour since I submitted my ticket, I've got no response to the ticket & my VPS is still down.
That's not a good sign, especially while they claim to be Heroic support!!! |
Posted by jtice, 09-29-2010, 08:10 AM |
My VPS is still completely down and I called in asked for an estimate so I'll have something to tell my clients... sometime today was the response. I sure as hell hope it's sooner than sometime today. It's been 7 hours already. |
Posted by badboyx, 09-29-2010, 09:10 AM |
storm by liquidweb |
Posted by SpaceWalker, 09-29-2010, 09:25 AM |
After being down for 7 hours now, with almost 2 hours with no response whatsoever to my ticket.
I'm moving out to another host as this is unacceptable.
They have no disaster recovery plan + lack of staff + No updates or ETAs. |
Posted by hostmaniac, 09-29-2010, 09:57 AM |
One of our servers has been down for 9 hours now.. I think you need to let us know if there is no quick fix in sight so we know what to expect and tell our customers. Also there's only been one update at liquidweb.com/support/ about 7 hours ago... |
Posted by sirius, 09-29-2010, 10:14 AM |
Yeah, we've been down since last night... |
Posted by jtice, 09-29-2010, 10:24 AM |
Quote:
Originally Posted by dwalters
This appears to be anything but a typical situation.
|
It certainly is atypical. In my years with LW I think this is the first time down more than a few minutes. The thing that unnerves me is knowing that the potential for this existed while my impression was quite the contrary. And I have assured my clients that they are on one of the most reliable systems in the biz. |
Posted by hostmaniac, 09-29-2010, 10:53 AM |
What surprises me is the lack of updates and when I ask for an ETA or why it's taking so long, I really don't get an answer except "we're sorry we're working on it". It doesn't sound good but at least it would be nice to know how bad the situation is... |
Posted by LiquidWebTravis, 09-29-2010, 11:16 AM |
Hi Everyone, I apologize for any delays and I will be updating everyone with additional information soon. If anyone still has a ticket number that they would like me to look at please Private Message me and I will follow up ASAP. |
Posted by BIG DUANE, 09-29-2010, 11:19 AM |
Quote:
Originally Posted by jtice
It certainly is atypical. In my years with LW I think this is the first time down more than a few minutes. The thing that unnerves me is knowing that the potential for this existed while my impression was quite the contrary. And I have assured my clients that they are on one of the most reliable systems in the biz.
|
Agreed. This is the first major outage I've experienced in over three years. |
Posted by Dustin Cisneros, 09-29-2010, 11:21 AM |
My stuff went offline last night as well, but it held up throughout the outage quite well till about 2am Pacific time. It's now up, but I wonder what is really going on; there is no way this is a breaker which tripped. |
Posted by wonderpoint, 09-29-2010, 11:24 AM |
My server went down for 30 mins and then came online. It again went down for 4 hours before coming back online again for few minutes. In total the server has been down for more than 6 hours. I opened a ticket and the only response I got was "we are working on the issue". No ETAs, no details. I know there has been a power outage but I want to know if my data is safe and by when can I expect the services to be fully up. If you don't have exact timeline, at least let us know approximate number of hours that it will take to fix the issue? |
Posted by BIG DUANE, 09-29-2010, 11:24 AM |
Quote:
Originally Posted by LiquidWebTravis
Hi Everyone, I apologize for any delays and I will be updating everyone with additional information soon. If anyone still has a ticket number that they would like me to look at please Private Message me and I will follow up ASAP.
|
Unfortunately I can't PM you due to low post count. Do you receive messages on Facebook? Thanks. |
Posted by LiquidWebTravis, 09-29-2010, 11:25 AM |
Quote:
Originally Posted by BIG DUANE
Unfortunately I can't PM you due to low post count. Do you receive messages on Facebook? Thanks.
|
Feel free to post your ticket number here and I will look into it. |
Posted by BIG DUANE, 09-29-2010, 11:31 AM |
Wasn't sure if it was kosher to post that publicly.
2325237
Appreciated. |
Posted by BIG DUANE, 09-29-2010, 02:12 PM |
Back up and running fine now.
It's been a long 13 hours, but what can you do?
Thanks to everyone at Liquidweb for their hard work. |
Posted by wonderpoint, 09-29-2010, 02:28 PM |
My server is back online as well (ping, http, etc). Although some databases have crashed and we are repairing them. Close to 12 hours of downtime. The server came up intermittently a few times in this 12-hour window. I think it's definitely more than a simple "power surge" issue. Will wait to hear from Liquidweb about what actually went wrong and why it took 12 hours to fix it. After sending my ticket number to Travis via PM, I received regular updates on the server status via helpdesk. Thanks everyone at LiquidWeb for resolving the issue. |
Posted by LiquidWebTravis, 09-29-2010, 02:37 PM |
Again I am sorry for the delay. We are still correcting individual server issues and will reply with complete details as soon as things have settled down. Our first priority is getting everyone up and running again.
If you have any tickets that you would like looked at, please feel free to call, email or PM me. |
Posted by jtice, 09-29-2010, 07:35 PM |
Eighteen hours and counting, server still down. PMs sent hours ago. |
Posted by chrisciardi, 09-29-2010, 07:44 PM |
also down since early this morning. Travis, I cannot get a straight answer from support on ETA. i am not feeling like things have been very clearly communicated today man. I send a lot of clients to you based on my experiences with support and uptime. i am afraid a few of them will be migrating away.
things happen that cause a data center to go down, it is how well your support staff communicates with customers that forgive such long and difficult outages. |
Posted by Dustin Cisneros, 09-29-2010, 11:26 PM |
Liquid Web's phones have been back up, so it's going to be better to contact them that way, as Travis will only get to them at certain times, whereas LW will be there 24x7 via phone unless their phone systems go down again. |
Posted by jtice, 09-29-2010, 11:44 PM |
Twenty-two hours. Talked to support by phone but there's nothing to report beyond that they are aware of the problem, working to restore the server. I really don't want to face the clients again if it's not up in the morning. |
Posted by Dustin Cisneros, 09-30-2010, 01:07 AM |
Jtice, Also shoot an email to benny@liquidweb.com
She goes above and beyond. However, don't give up with the support techs on the phone; keep pushing it, as I understand how frustrating downtime can be. |
Posted by jtice, 09-30-2010, 08:01 AM |
Thirty hours. I'm starting to worry that whatever is going on with my server... well, too painful to verbalize.
I emailed Benny. |
Posted by Dustin Cisneros, 09-30-2010, 12:06 PM |
Quote:
Originally Posted by jtice
Thirty hours. I'm starting to worry that whatever is going on with my server... well, too painful to verbalize.
I emailed Benny.
|
Perfect, normally they are quite active here. I'm sure they will resolve your issue and give you an explanation as to why. They are truly a great provider. I hope your issue gets solved quickly. |
Posted by MD Hosting, 09-30-2010, 01:19 PM |
I was told by support that when they got the servers back online, they tried to reboot with a bunch of software updates that are apparently giving the Sys Admins fits on the restore team.
So, they're working on downgrading the software on those servers to a working state.
Also, they said they are working on multiple parent servers, trying to get to the ones most easily fixed first.
I am starting to think my server is one of the worst... |
Posted by jtice, 09-30-2010, 04:34 PM |
Mine came back online about 2pm today. Looks like it was restored from a 2 day old backup- no big deal; only lost a handful of forum posts. Sure am glad it's finally over. It's not just the downtime; I don't get anything done during an outage except watch, worry and answer phone calls. Sure am glad I got the 100% uptime guarantee ;-) Seriously, though I do appreciate what the staff there must have been dealing with, and their persistence in getting it back up with no significant data loss. |
Posted by LiquidWebTravis, 09-30-2010, 06:31 PM |
Status Update:
In working with our UPS equipment provider it has been determined that
at 1:27 AM EDT on 9/29/10 a power surge caused significant equipment
damage to one of the UPS units in DC2 and tripped the breaker protecting
customer equipment connected to that UPS. This resulted in power loss to
some customer equipment in that datacenter. Our datacenters are designed
to allow this equipment to be bypassed, and is normally bypassed in the
event of equipment failure without power interruption, however the
nature of the surge caused the protection breaker to sever power to all
downstream equipment. We immediately began investigating why the failure
occurred and whether we could safely engage the bypass equipment to
restore power to the affected customer equipment. At 2:02 AM EDT it was
determined that we could safely engage bypass power and we restored
power to the affected area of the datacenter.
As a result of the power surge the only equipment that was damaged was
the UPS system, however sudden power loss can cause significant problems
with active servers. While we were able to restore power to the affected
area of the datacenter by 2:02 AM EDT many servers needed filesystem
consistency checks and quite a few were affected by filesystem
corruption or data loss due to the sudden loss of power during write
operations. This extended the duration of the outage experienced by many
customers to a longer period than the 35 minutes during which power was
unavailable.
In addition to customer equipment being affected by the power loss
Liquid Web's phone systems were also taken offline by the power loss
which resulted in the inability of our clients to contact us during this
emergency. We have already determined a plan to correct this single
point of failure in this system and will be implementing a solution to
that in the coming weeks. |
Posted by LiquidWebTravis, 09-30-2010, 06:38 PM |
More information is coming shortly... |
Posted by Dustin Cisneros, 09-30-2010, 06:41 PM |
Quote:
Originally Posted by LiquidWebTravis
Status Update:
In working with our UPS equipment provider it has been determined that
at 1:27 AM EDT on 9/29/10 a power surge caused significant equipment
damage to one of the UPS units in DC2 and tripped the breaker protecting
customer equipment connected to that UPS. This resulted in power loss to
some customer equipment in that datacenter. Our datacenters are designed
to allow this equipment to be bypassed, and is normally bypassed in the
event of equipment failure without power interruption, however the
nature of the surge caused the protection breaker to sever power to all
downstream equipment. We immediately began investigating why the failure
occurred and whether we could safely engage the bypass equipment to
restore power to the affected customer equipment. At 2:02 AM EDT it was
determined that we could safely engage bypass power and we restored
power to the affected area of the datacenter.
As a result of the power surge the only equipment that was damaged was
the UPS system, however sudden power loss can cause significant problems
with active servers. While we were able to restore power to the affected
area of the datacenter by 2:02 AM EDT many servers needed filesystem
consistency checks and quite a few were affected by filesystem
corruption or data loss due to the sudden loss of power during write
operations. This extended the duration of the outage experienced by many
customers to a longer period than the 35 minutes during which power was
unavailable.
In addition to customer equipment being affected by the power loss
Liquid Web's phone systems were also taken offline by the power loss
which resulted in the inability of our clients to contact us during this
emergency. We have already determined a plan to correct this single
point of failure in this system and will be implementing a solution to
that in the coming weeks.
|
Very detailed update Travis, keep up the good work. |
Posted by MD Hosting, 09-30-2010, 06:41 PM |
What he said. |
Posted by LiquidWebTravis, 09-30-2010, 07:37 PM |
Here is the complete update with resolution: (The first few paragraphs are the same as above)
In working with our UPS equipment provider it has been determined that
at 1:27 AM EDT on 9/29/10 a power surge caused significant equipment
damage to one of the UPS units in DC2 and tripped the breaker protecting
customer equipment connected to that UPS. This resulted in power loss to
some customer equipment in that datacenter. Our datacenters are designed
to allow this equipment to be bypassed, and is normally bypassed in the
event of equipment failure without power interruption, however the
nature of the surge caused the protection breaker to sever power to all
downstream equipment. We immediately began investigating why the failure
occurred and whether we could safely engage the bypass equipment to
restore power to the affected customer equipment. At 2:02 AM EDT it was
determined that we could safely engage bypass power and we restored
power to the affected area of the datacenter.
As a result of the power surge the only equipment that was damaged was
the UPS system, however sudden power loss can cause significant problems
with active servers. While we were able to restore power to the affected
area of the datacenter by 2:02 AM EDT many servers needed filesystem
consistency checks and quite a few were affected by filesystem
corruption or data loss due to the sudden loss of power during write
operations. This extended the duration of the outage experienced by many
customers to a longer period than the 35 minutes during which power was
unavailable.
In addition to customer equipment being affected by the power loss
Liquid Web's phone systems were also taken offline by the power loss
which resulted in the inability of our clients to contact us during this
emergency. We have already determined a plan to correct this single
point of failure in this system and will be implementing a solution to
that in the coming weeks.
We have at this time replaced the damaged components in the UPS system
and have done load bank testing on the affected unit to ensure no
unidentified problems remained prior to putting the unit back in
production with customer equipment. We have resumed normal operations
with all power gear at this time.
Liquid Web stands behind our commitment to 100% uptime protected by our
industry leading Service Level Agreement. We guarantee 10 times the
downtime will be credited to you should such problems arise. If you were
affected and would like to receive SLA credit please open a support
request stating you would like SLA credit for the outage, our staff will
see to it you are well cared for. |
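To put numbers on the figures in the update above, here is a minimal sketch of the arithmetic. The timestamps and the "10 times the downtime" multiplier come from the post. How Liquid Web actually measures each customer's downtime and applies the credit is not stated, so the `sla_credit` function is a hypothetical illustration that assumes the credit is simply the multiplier times the downtime.

```python
from datetime import datetime, timedelta

SLA_MULTIPLIER = 10  # "10 times the downtime", as stated in the post


def sla_credit(downtime, multiplier=SLA_MULTIPLIER):
    """Return credited time, assuming credit = multiplier * downtime (an assumption)."""
    return downtime * multiplier


# Times from the status update (EDT, 9/29/10).
power_lost = datetime(2010, 9, 29, 1, 27)
power_restored = datetime(2010, 9, 29, 2, 2)

# Duration of the power loss itself: 35 minutes, matching the update.
power_outage = power_restored - power_lost

# Credit for the power loss alone: 350 minutes (5 hours 50 minutes).
credit_power_only = sla_credit(power_outage)

# Credit for a customer who was down about 13 hours, as one poster reported:
# 130 hours under the same assumption.
credit_13_hours = sla_credit(timedelta(hours=13))
```

Since many servers stayed down well beyond the 35 minutes without power, the downtime figure used in an SLA request would presumably be the customer's own outage window rather than the power-loss window, though that is for Liquid Web support to confirm.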
Posted by MD Hosting, 09-30-2010, 08:17 PM |
After hearing the full story, it all makes a lot more sense.
Should have fessed up with that full back story yesterday, though.
Heck I can't blame you for a faulty piece of equipment - stuff fails, no matter how technically advanced it is.
But, if I would have heard that story yesterday, I'd have been a lot more understanding (although the techs I spoke with said I was one of the more reasonable people they had to deal with - your folks sound frazzled - someone better take them out and treat them to a night out after this...)
Besides, this is the age of new media. Everyone knows everything, and talks about it at length. You can't keep things quiet anymore.
Sounds like a teachable moment for the upper management at Liquidweb - you have a great rep, so don't ruin it by going all BP when things go FUBAR on you. Instead, let your clients know for cripes sakes. Open those lines of communication, wide freaking open. Doesn't hurt to tell the truth, because reasonable people will understand.
Just being told over and over and over and over again that "the restore team is working on it and there's no ETA on when it will be back up" ultimately leaves your clients feeling helpless. That is serious bad juju.
And when it's our income on the line... in the middle of a horrible recession... when you see people with MBAs delivering pizza to your front door...
Well, you know what I'm getting at.
Heck, I've been with you guys for 6 years. You can assume that I want to give you the benefit of a doubt. But when I don't have much info to go on, and keep getting absolutely meaningless responses from support, it makes it really hard to do just that.
'Nuff said. Some of my sites have finally started coming back online, and the tech in charge of my ticket explained why and is working on getting the rest up... and he has been apologizing profusely, even though he doesn't have to.
Result is that I'm fairly happy right now, even though it's not completely resolved... proving my point that a little honest communication is all it takes. |
Posted by LiquidWebTravis, 09-30-2010, 08:24 PM |
I appreciate the feedback and I'm sorry that you feel we didn't communicate soon enough or fully enough. We take your comments very seriously and we are committed to clear early communication with our customers. I posted this complete description of events literally minutes after the completion of repair on the UPS system. |
Posted by MD Hosting, 09-30-2010, 08:28 PM |
If only I'd heard it from my tech support people... it would have meant a lot more to me getting it here.
Communication has to trickle down to line employees - give your people the information they need to make clients feel comfortable, and trust them to do just that with it. |
Posted by LiquidWebTravis, 09-30-2010, 08:30 PM |
Quote:
Originally Posted by MD Hosting
If only I'd heard it from my tech support people... it would have meant a lot more to me getting it here.
Communication has to trickle down to line employees - give your people the information they need to make clients feel comfortable, and trust them to do just that with it.
|
I can assure you that all information that was on http://liquidweb.com/support and WHT was relayed to all LiquidWeb employees. |
Posted by MD Hosting, 09-30-2010, 08:35 PM |
Not trying to get into a p***ing contest with you over this, but the info on your support page was sparse and vague.
Just be a man and eat some crow, will you? |
Posted by liquidweb, 09-30-2010, 08:53 PM |
I think everyone is on the same page here in that we are glad to be completely resolved with a full explanation of events. The difficulty in communicating information as a resolution is being formulated is that the true cause and ultimate solution are not known until certain thresholds of resolution are crossed. We have a systematic approach to resolving failures of any sort through procedures developed to handle catastrophic events; however, as evaluations are made during the resolution, there are many possible reasons for everything.
A repair such as this involved everyone from Liquid Web data center engineers, to electrical contractors, to UPS vendor engineers and technicians all on site working towards a common solution. Without certainty of a final resolution timeline (which we consider the equipment entering an again nominal operating condition) it is often destructive to attempt to report status when true status is unknown until a resolution is met. In short, until a final resolve is met, much of what we would be reporting would be speculation both of timelines and final resolve. Speculation very easily becomes misinformation if the practical application of a solution proves initial assumptions wrong. Misinformation is to the benefit of none involved. With that said, we report what we do know and are certain of.
Regardless of this we will be certain to do a careful postmortem of both the event and our response to ensure that communications protocols are improved where deficiencies can be identified.
Thank you everyone for your patience and understanding.
Regards,
Matthew Hill
CEO, Liquid Web Inc. |
Posted by MD Hosting, 09-30-2010, 08:55 PM |
Now there's a complete and honest response. Thank you.
By the way, you guys all talk (write) like engineers. |
Posted by jtice, 09-30-2010, 09:55 PM |
Quote:
Originally Posted by liquidweb
Speculation very easily becomes misinformation if the practical application of a solution proves initial assumptions wrong. Misinformation is to the benefit of none involved. With that said, we report what we do know and are certain of.
Regardless of this we will be certain to do a careful postmortem of both the event and our response to ensure that communications protocols are improved where deficiencies can be identified.
|
First, I want to thank you all for your hard work and diligence in getting everything fixed and the servers back online. As others have said, more transparency in terms of what is actually going on and the expectation we should have or not have, even if somewhat speculative, would be better than not having any information. Even though you are dealing in hardware, data and procedures, you must understand that humans and their emotions, both direct clients (us) and our clients, are ultimately what we are dealing with in such a situation. When I talk to my clients I empathize and do my best to make them feel like we are partners... all in this together. Given the unfortunate situation and assuming the tech side is being handled with competence, whether they decide to go looking for a new provider or become more confident than before that we are their best alternative, is largely determined by how we handle their feelings throughout the crisis. Taking care of their emotional response is every bit as important to the business relationships as getting their sites back up.
Thanks again for all your hard work. |
Posted by MD Hosting, 10-01-2010, 03:19 AM |
Quote:
Originally Posted by liquidweb
I think everyone is on the same page here in that we are glad to be completely resolved with a full explanation of events. The difficulty in communicating information as a resolution is being formulated is that the true cause and ultimate solution are not known until certain thresholds of resolution are crossed. We have a systematic approach to resolving failures of any sort through procedures developed to handle catastrophic events; however, as evaluations are made during the resolution, there are many possible reasons for everything.
A repair such as this involved everyone from Liquid Web data center engineers, to electrical contractors, to UPS vendor engineers and technicians all on site working towards a common solution. Without certainty of a final resolution timeline (which we consider the equipment entering an again nominal operating condition) it is often destructive to attempt to report status when true status is unknown until a resolution is met. In short, until a final resolve is met, much of what we would be reporting would be speculation both of timelines and final resolve. Speculation very easily becomes misinformation if the practical application of a solution proves initial assumptions wrong. Misinformation is to the benefit of none involved. With that said, we report what we do know and are certain of.
Regardless of this we will be certain to do a careful postmortem of both the event and our response to ensure that communications protocols are improved where deficiencies can be identified.
Thank you everyone for your patience and understanding.
Regards,
Matthew Hill
CEO, Liquid Web Inc.
|
It's 2 AM - I check in on my ticket and find that all my databases only exist on the backup server and have yet to be restored to the parent.
All my clients' sites run on databases... all my orders go through web apps that run on databases... thus, basically I am still at square one until my databases are restored.
So, to say you're "glad to be completely resolved" is just more spin and corporate speak.
I'm not here to kiss up to your upper management (as it seems others here on this forum are doing - what's the deal?).
No, I am here to let you know that I am a pissed off end user and longtime client who is severely disappointed in how this situation has "resolved" as you put it.
At least your support staff has been speaking in plain language to me... good gravy, my wife works with lawyers and they speak more clearly than you guys do.
I sincerely hope my clients do not wake up to a THIRD DAY of their sites being down.
And if I have to hear about your SLA one more time, I will honestly puke The Dictionary of Corporate Bulls4!t. |
Posted by MD Hosting, 10-01-2010, 03:41 AM |
Quote:
Originally Posted by MD Hosting
Now there's a complete and honest response. Thank you.
|
*sarcasm*
Liquidweb #fail |
Posted by MD Hosting, 10-01-2010, 10:17 AM |
So, after monitoring my ticket status through the night, I awake bleary-eyed to yet another day in the saga I am titling "Blackhawk Server Down"...
No sites online. No databases restored. No hope in sight.
No responses from my multiple requests for updates from the LW support team.
Stay tuned for more... |
Posted by LiquidWebTravis, 10-01-2010, 04:40 PM |
Please note that this status update only relates to 1 VPS parent server and those that were hosted on that machine. All other servers and VPS parents have been online and active for some time.
---- Status Update for Individual VPS Parent Server ----
As a result of the power loss, the parent server suffered catastrophic filesystem corruption resulting in unrecoverable data loss.
Unfortunately, we did not have a standby Virtuozzo server prepared to restore customer accounts to; at that point we built and began imaging one. The Virtuozzo install process is extremely time-consuming, particularly with the way in which it handles the formatting of the areas of the disk in which the server instances reside. At this time we began restoring customer accounts to the newly prepared server; however, during this process the new server also displayed filesystem corruption, requiring us to begin anew rather than risk restoring customer instances to a server which was unstable. This added yet another significant delay prior to even being able to begin restoring customer instances onto an active parent server.
After the restoration of customer data was completed on the new parent server, we encountered further technical problems with the Virtuozzo copy-on-write implementation, which required an offline filesystem optimization to correct. As a result of this operation, a very large amount of extraneous data was written to the disk areas for each customer instance and required our administrators to perform clean-up operations prior to being able to restore customer service.
------
We are terribly sorry for the delay in restoration of this server. We have proactively given affected customers on this server a 3-month SLA credit. That means that we are essentially giving 1 free month of hosting per day of downtime as our way of saying we are sorry. We are in the very final stages of bringing the last customer's database online and we will be following up with each customer individually. I currently have 2 supervisors monitoring this issue personally. |
Posted by MD Hosting, 10-02-2010, 07:17 AM |
"Lo, and in those days businesses relied on their hosting companies to provide them with excellent customer service, persistent uptimes, and rapid response during downtime events.
But in that day, the god of lightning (for he is a fickle and capricious god) saw fit to destroy the server the people relied on for their daily sustenance.
And the leader of the people beseeched the gods of hosting, Matt and Travis for succor; and was denied. For they spoke in riddles and strange tongues whilst they mocked and scoffed at his words.
For the greater gods' hearts remained cold and hardened by the winds and great amounts of cash that blew into their stony towers.
And there was a great weeping and gnashing of teeth among the peoples. For they had been in drought, yet, they survived through their wiles and labors; but, without the VPS parent server, they had no means to provide for their families.
But the leader of the people was not deterred. He sent prayers, and offerings, and multiple support ticket requests, and phone calls to the lesser gods in hopes that one would hear his pleas and avenge his loss.
And his prayers and cries were answered; yea, but not until evening of the third day.
On the evening of the third day, the lesser gods of System Admin and Support Supervision heard his pleas, and were greatly distraught. 'For his people have no websites, nor any hosting, nor any means of sustenance during this time of drought, and although the greater gods Matt and Travis have scoffed and mocked him in his time of need, we will do what we can in secret to help him.'
And so, the lesser gods brought him a great burning brand that they called 'New Server' - and magically transferred all his data, and his people's data, and the data thereof to the 'New Server'. And although the trickster MySQL caused them much mischief, they were able to restore the power of hosting to the leader and his people.
And on that night the people rejoiced.
But, the leader of the people was perplexed, and greatly troubled. 'Why was this not done three days ago?'
And there was no answer. So the lesser gods restored a fraction of his losses according to their SLA, which amounted to a pittance in his eyes.
But the lesser gods' powers were limited by the dominion of the greater gods.
And yet, the leader and his people remained, perhaps to seek new lands and new hosting servers, and to tell others through tales and fables of the poor treatment they received at the hands of the greater gods.
For the leader of the people was defiant, and not one to easily forget." |