Outage Report - 11/10/2004
Summary:
Earlier today, The Planet experienced a malfunction in the DWDM system that connects the DLLSTX4 and DLLSTX2 datacenters. Planet engineers were able to bring the affected links back up, and systems appeared to be functioning normally. Shortly thereafter, it was reported that customers in the DLLSTX4 datacenter were experiencing random packet drops and disconnections. On first inspection, the cause was believed to be an overloaded supervisor engine on one of the distribution switches in DLLSTX4. A replacement supervisor engine was located and installed at approximately 5:30pm. After the replacement came online, the problems continued. It was then determined that the original DWDM issue was causing packets to be lost on the wire, and that several links were taking input errors and failing. Engineers made several attempts to normalize the faulty links without taking any more equipment out of service. When those attempts proved insufficient, it became necessary to remove several cards from the transport equipment and rewire the affected routers, which caused a loss of connectivity for a period of time. At approximately 9:30pm, normal service was restored.
Future Mitigation:
We will work with our vendor in the morning to determine whether the equipment as a whole or only a specific component needs to be replaced. Future downtime may be required to fix the issue permanently. In addition to any work on the existing systems, The Planet is installing a secondary system that will link the DLLSTX4 datacenter directly to the DLLSTX3 datacenter on separate fiber and transport systems, eliminating a single point of failure going forward. These plans had already been announced and were in the works prior to this issue; they are now going to be accelerated. We anticipate lighting this ring within the next 2-4 weeks, or sooner.