Feb 23 2010

KNS Platform Outage Report 2/23/2010

Date: 02/23/2010

Start Time: 1740 MST

Stop Time: 1745 MST

Duration: 00:00:05 (DD:hh:mm)

Severity: Sev1 - Platform Down


Incident Summary:

The KNS platform experienced a service interruption at 1740 MST that lasted five (5) minutes. The outage was caused by memory starvation on the master load balancer in the load balancing cluster, which triggered a failover of the cluster. The failover succeeded, but took longer than it had in testing because of the memory issue.

Services Impacted:

  • Evaluation Servers (cs.kobj.net)
  • Initialization Servers (init.kobj.net)
  • Callback Servers (log.kobj.net)

Root Cause Analysis:

The root cause was identified as a memory starvation issue on lb1.kobj.net. The load balancer cluster has been very stable since installation, and there is no reason to suspect that there will be any further service issues.

Recovery Steps:

Once alerted by monitoring and customer reports, the Kynetx IT Operations team performed a manual forced failover/failback. This action restored service and returned control of the cluster to the master, which had recovered from the memory starvation issue. Going forward, the Kynetx IT Operations team will schedule a maintenance window to increase the amount of memory allocated to the load balancer processes. Service was restored in less than five (5) minutes from first alert.
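
For readers who want to catch this class of problem before it forces a failover, here is a minimal sketch of a low-memory check. It is an illustration only and does not describe the monitoring Kynetx runs. It assumes a Linux host that exposes /proc/meminfo, and the function names and the 10% threshold are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(info):
    """Return the fraction of total memory that is still usable."""
    available = info.get("MemAvailable")
    if available is None:
        # Older kernels do not report MemAvailable, so approximate it.
        available = (info.get("MemFree", 0)
                     + info.get("Buffers", 0)
                     + info.get("Cached", 0))
    return available / info["MemTotal"]


def is_memory_starved(info, min_fraction=0.10):
    """Hypothetical alert rule: flag hosts below min_fraction usable memory."""
    return available_fraction(info) < min_fraction


if __name__ == "__main__":
    # Assumes a Linux host; /proc/meminfo does not exist elsewhere.
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print("starved" if is_memory_starved(info) else "ok")
```

A check like this would normally run on a schedule and feed an existing alerting system, with the threshold tuned to the host's normal memory profile.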
