RackSRV inaccessible due to synergyworks fault




Posted by lemon09, 11-24-2009, 11:38 PM
I cannot access my dedicated servers at RackSRV.
[root@koala ~]# tracepath -n racksrv.com
1: 125.214.74.18 0.163ms pmtu 1500
1: 125.214.67.244 0.038ms
2: 216.14.207.88 asymm 3 0.708ms
3: 114.31.197.17 asymm 4 2.109ms
4: 114.31.196.33 asymm 9 163.922ms
5: 114.31.192.82 asymm 9 164.439ms
6: 114.31.192.103 asymm 7 166.879ms
7: 12.126.40.41 asymm 12 167.193ms
8: 12.122.136.58 asymm 14 168.855ms
9: 12.123.15.165 asymm 12 166.439ms
10: 192.205.33.178 asymm 12 166.989ms
11: 129.250.3.76 asymm 17 217.644ms
12: 129.250.3.97 asymm 17 219.502ms
13: 129.250.6.14 asymm 15 237.441ms
14: 129.250.2.25 asymm 15 242.477ms
15: 129.250.3.255 asymm 17 319.412ms
16: 129.250.2.78 asymm 17 316.401ms
17: 83.231.148.19 314.467ms
18: 92.48.64.27 asymm 17 312.907ms
19: 92.48.64.27 asymm 17 395.289ms !H
Resume: pmtu 1500
[root@koala ~]#
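
For anyone who wants to check a trace like this automatically, here is a minimal sketch that parses `tracepath -n` output and reports the hop where an error flag such as `!H` (host unreachable) appears. The regular expression and the `parse_tracepath` helper are assumptions based only on the line layout shown above, not an official parser for the tool.

```python
import re

# Assumed line layout, taken from the trace above:
#   "<hop>: <address> [asymm <n>] <rtt>ms [note]"
HOP = re.compile(
    r"^\s*(\d+):\s+(\S+)\s+(?:asymm\s+\d+\s+)?([\d.]+)ms(?:\s+(\S.*))?$"
)

def parse_tracepath(text):
    """Return a list of (hop, address, rtt_ms, note) tuples."""
    hops = []
    for line in text.splitlines():
        m = HOP.match(line)
        if m:  # prompt lines and "Resume:" do not match and are skipped
            hops.append((int(m.group(1)), m.group(2), float(m.group(3)), m.group(4)))
    return hops

# Last three hops copied from the trace above.
SAMPLE = """\
17: 83.231.148.19 314.467ms
18: 92.48.64.27 asymm 17 312.907ms
19: 92.48.64.27 asymm 17 395.289ms !H
Resume: pmtu 1500
"""

hops = parse_tracepath(SAMPLE)
# Notes beginning with "!" are error flags; "!H" means host unreachable.
failed = [h for h in hops if h[3] and h[3].startswith("!")]
for hop, addr, rtt, note in failed:
    print(f"hop {hop} ({addr}) reported {note} after {rtt} ms")
```

In this trace the flag appears at hop 19, so the path is fine up to the far end and the failure is at or just in front of the destination.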

Posted by squirrelhost, 11-24-2009, 11:46 PM
more info here:

http://www.synergyworks.co.uk/support/

Posted by lemon09, 11-25-2009, 12:17 AM
It took them more than four hours to recover.
Quote:
25/11/2009 @ 12:05am: We are aware of a major service outage affecting KSP. Engineers are onsite and updates should follow here.

25/11/2009 @ 01:08am: There has been a major failure to the UPS system.

25/11/2009 @ 01:47am: Distribution from one of the N+1 UPS units onto the common building distribution has burnt out (suspected manufacturing fault) and has shorted itself out on the distro, causing the remaining UPS to short also. This is currently being removed and replaced.

25/11/2009 @ 02:33am: The Distro tracks have been cleared of debris from the unit that fried itself, thus removing the short. The remaining units are now being reconnected and we hope to restore power very shortly.

25/11/2009 @ 03:30am: UPS units powered back on. Currently testing before returning load.

25/11/2009 @ 04:07am: All power has been restored to customer services. If customers still cannot access their equipment past 04:30 please telephone 01622 808 420.
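
As a quick check of the "more than four hours" figure, here is a small sketch that computes the time between the first and last status updates quoted above. The format string is an assumption matching how the timestamps are written in the quote, and the result is only a lower bound, since the first update acknowledges the outage rather than marking its start.

```python
from datetime import datetime

# Assumed format of the status page timestamps, e.g. "25/11/2009 @ 12:05am".
FMT = "%d/%m/%Y @ %I:%M%p"

def elapsed(first, last):
    """Time between two status-page timestamps."""
    return datetime.strptime(last, FMT) - datetime.strptime(first, FMT)

# First acknowledgement and the "all power restored" update from the quote.
outage = elapsed("25/11/2009 @ 12:05am", "25/11/2009 @ 04:07am")
print(outage)  # 4:02:00
```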

Posted by squirrelhost, 11-25-2009, 12:23 AM
http://www.racksrv.com/

is now back up. The staff are probably still asleep,
unaware that everything was down for 4+ hours!

Posted by squirrelhost, 11-25-2009, 12:26 AM
Strange that their own website was down, as their 'network' info page says:

"Purely for disaster recovery purposes we have a POP located within a private server room in Essex where key internal services such as our web server, offsite backup storage and internal network monitoring services are hosted from. After all, what’s the logic in hosting our mail server from within the same facility as the core network because if there was a critical outage, there would be no means to contact clients to keep them informed!"

Posted by PhoneSupport, 11-25-2009, 12:40 AM
Hi,

Jon was aware of the issue at around 1am, though there was not much he or Racksrv could do as it was a DC issue. I was able to email Jon and received a prompt response.

It is extremely frustrating that this has occurred during a marketing campaign I was running; I've received a number of PMs saying no one can get on to my site. RackSRV is generally a great company who I will always recommend, and I understand that incidents like these can occur (and they have done in the past), but really, why should I be held liable for the faults of others?

From what I read above it was due to a "manufacturing fault" - so who should this make liable for the lack of service I received and loss of income?

I guess that's another loss for me then!

Posted by Jon-RackSRV, 11-27-2009, 12:56 PM
All customers by now should have bookmarked our offsite status website; we've been emailing out about it for the last 2-3 months:

http://status.racksrv.com

Quote:
Originally Posted by squirrelhost
strange their own website was down, as their 'network' info page says....
Thanks for reminding me, I will get that updated!

Quote:
Originally Posted by PhoneSupport
From what I read above it was due to a "manufacturing fault" - so who should this make liable for the lack of service I received and loss of income?
This is an issue we have open at the moment and nobody has yet accepted liability. From an in-depth conversation I had with our supplier earlier, it would seem that the fault lies with the copper in what's effectively a junction box as it disintegrated for no reason under normal load (half its designed workload).

Regardless, my apologies go out to anyone affected by this freakish and unpredictable outage.

Posted by PhoneSupport, 12-02-2009, 11:58 PM
Quote:
Originally Posted by aXeR
All customers by now should have bookmarked our offsite status website, we've been emailing out about it for the last 2-3 months

http://status.racksrv.com

Thanks for reminding me, I will get that updated!

This is an issue we have open at the moment and nobody has yet accepted liability. From an in-depth conversation I had with our supplier earlier, it would seem that the fault lies with the copper in what's effectively a junction box as it disintegrated for no reason under normal load (half its designed workload).

Regardless, my apologies go out to anyone affected by this freakish and unpredictable outage
Any further update on this?

Best,

Posted by PhoneSupport, 12-09-2009, 08:36 AM
Second major downtime in the past fortnight now.

Servers are offline... again.

Best,

Posted by lemon09, 12-09-2009, 09:05 AM
I can confirm, all of my servers are unavailable again. What a shame!

Posted by vord, 12-09-2009, 09:13 AM
The outage started at about 12:15pm UK time - appears to be the whole datacenter (again).

A properly designed UPS should not fail by shorting out and catching fire. They are designed to fail safely. What's the betting they are using cheap Chinese ones?

Wonder how many more they have fitted.

Posted by BobLawley, 12-09-2009, 09:32 AM
Just noticed this after starting my own thread. Seems like I'm not the only one having problems.

Has anyone had any contact with Vooservers?

Posted by holy-war, 12-09-2009, 09:55 AM
Well...
I have been with them for about 4 months, and four days ago I got another server...
It seems that it was a bad idea...

MY CLIENTS ARE DRIVING ME CRAZY!!!

How can we report the problem if their website is down as well?!

Posted by lemon09, 12-09-2009, 10:01 AM
Quote:
Originally Posted by holy-war
How can we report the problem if their website is down as well?!
You can monitor their network issues here status [dot] racksrv [dot] com

Posted by holy-war, 12-09-2009, 10:27 AM
Quote:
Originally Posted by lemon09
You can monitor their network issues here status [dot] racksrv [dot] com
Well, at least they know what the problem is... it is a matter of time now.

Posted by vord, 12-09-2009, 10:50 AM
Whohoooooo - back up. I just downloaded a spam mail from my server.

Posted by Jon-RackSRV, 12-09-2009, 03:42 PM
Hi Guys,

I can't apologise enough for those inconvenienced by a re-occurrence of what was described to me by the KSP as a freak occurrence which was highly unlikely to ever be repeated.

The problem itself, as was notified to clients last time, has nothing to do with the UPS units themselves but with a non-mechanical part: the 'tap off', which (in my understanding) is nothing more than a large-scale junction box. Consequently I can only assume there has been some confusion/crossed wires with regard to vord's assumption that the facility is using 'cheap Chinese ones', as this really is not the case (KSP uses the same brand of UPS equipment that BSQ, amongst others, use).

It is my understanding that a senior electrical engineer specialising in data centre power systems will be attending site tomorrow to draw up a revised power distribution solution that does not rely on the usage of any tap-offs, which seem to be the part causing us such catastrophic issues.

After this visit has been completed I should be in a position to accurately write and publish an official RFO and once this is ready I will put a link to it here.

Regards, Jon

Posted by SynergyWorks, 12-09-2009, 08:18 PM
Hello All.

I figured I should make an appearance here to try and shed a little bit of light with regards to what has happened - and what will happen to stop this recurring.

The UPS units in the facility are PowerWave 9000 (the model up from the UPS system Bluesquare use).

In the affected suites (Suite A and B), there are 3 PowerWave UPS modules in Suite A and 3 PowerWave UPS modules in Suite B. Just over two UPS modules are required to sustain building load, so installing 6 gives N+1.

The problem is occurring on the protected side of the critical load, on the common busbar that parallels the units. The tapoff units which take rack power from this N+1 protected busbar are failing in a freakish manner and the fragments of what is left are making their way into the busbar. Naturally this creates an output short and all the UPS units have to shut down to prevent a fire or damage to customer equipment.

The first failure two weeks ago was with a tapoff that feeds some racks in Suite A. This was put down to a potential freak manufacturing fault in the tapoff and it was replaced. Today's failure was the same, in a tapoff that feeds some racks in Suite B.

Given that these are proving unreliable (despite 2.5 years' service), as a quick fix today Suite A and B have been split into their own respective systems. The UPS stacks are no longer in parallel. Suite B load has been hard-wired to the Suite B UPS stack and, for speed of resolution, Suite A remains on its now non-paralleled busbar using the previously replaced tapoffs. Unfortunately this means we lose N+1 until further modules can be installed in the next day or so.
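
To illustrate why splitting the stacks costs the N+1 margin, here is a rough sketch of the redundancy arithmetic. The load figure is hypothetical: the post gives no exact numbers, so the sketch assumes each suite draws about 2.1 modules' worth of load, a value picked only because it is consistent with "just over two modules" and with N+1 being lost after the split.

```python
import math

def spare_modules(installed, load_in_modules):
    """Modules left over once enough are reserved to carry the load.

    N is the number of whole modules needed for the load, so a result
    of 1 corresponds to N+1 and a result of 0 to no redundancy.
    """
    needed = math.ceil(load_in_modules)
    return installed - needed

# HYPOTHETICAL figure: load per suite, expressed in UPS-module equivalents.
SUITE_LOAD = 2.1

# Before: all six modules paralleled on one busbar carrying both suites.
paralleled = spare_modules(6, 2 * SUITE_LOAD)

# After: each suite runs on its own stack of three modules.
split = spare_modules(3, SUITE_LOAD)

print(f"paralleled: N+{paralleled}, split: N+{split}")
```

Under these assumed figures the paralleled system has one spare module, while each split stack has none until an extra module is added.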

Going forwards Suite A will need to be re-engineered in a similar way to Suite B.

I can assure you that everybody involved is 100% committed to re-engineering the infrastructure within KSP to ensure such outages cannot recur. This will be done as quickly as feasibly possible and irrespective of the cost. The intention is to restore total customer confidence in the KSP operation as rapidly as possible.

I hope that helps shed some light...


Kind Regards,


