Jan 02 2010

Kynetx KNS Server Outage Report 12/31/2009

Date: 12/31/2009
Start Time: 1606
Stop Time: 1635
Duration: 00:00:29 (DD:hh:mm)
Severity: Sev2 (Degraded Performance)



Incident Summary:

The KNS platform experienced degraded performance for 29 minutes on 12/31/2009 due to high system load. The load was caused by an ad hoc software update job scheduled by the Kynetx IT Operations team.

Kynetx uses Puppet from Reductive Labs for configuration management. The system works well, with one caveat: the daemon that communicates with the "puppet master" server(s) does so on a set schedule. If the daemons are all restarted within a short time of one another, they all check in at roughly the same time, which can cause resource starvation in a virtualized environment such as the one Kynetx runs.
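To illustrate why synchronized check-ins matter, here is a minimal sketch of one common way to spread scheduled work across hosts: each host derives a stable delay from its own name, so agents restarted at the same moment still contact the master at different times. The function name `splay_offset` and the 30-minute interval are assumptions made for the example; this is not Puppet's code or the configuration Kynetx used.

```python
import hashlib

def splay_offset(hostname: str, interval_s: int = 1800) -> int:
    """Return a stable per-host delay in [0, interval_s) seconds.

    Hypothetical helper: hashing the hostname gives every host its own
    offset, and the same host always gets the same offset.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_s

# Hostnames taken from the impacted-services list, used here only as sample input.
hosts = ["cs.kobj.net", "init.kobj.net", "log.kobj.net", "frag.kobj.net"]
for host in hosts:
    print(f"{host}: first check-in delayed by {splay_offset(host)} s")
```

With offsets like these, a mass restart results in check-ins scattered over the whole interval rather than a single burst against shared host resources.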

Services Impacted:

  • Evaluation Servers (cs.kobj.net)
  • Initialization Servers (init.kobj.net)
  • Callback Servers (log.kobj.net)
  • Code Fragment Server (frag.kobj.net)
  • Kynetx Rule Language Server (krl.kobj.net)
  • Kynetx Corporate Server (corp.kynetx.com)
    • Mail Server (mail.kynetx.com)
    • Corporate Web Server (www.kynetx.com, code.kynetx.com, news.kynetx.com, developer.kynetx.com)
  • Kynetx Application Server (demo.kynetx.com)
    • AppBuilder
    • AppDirectory
    • Accounts

Root Cause Analysis:

The root cause was identified as an ad hoc software update job controlled through the Puppet system. The update was deployed to the platform within a ten (10) minute period and caused resource starvation on one of the Xen host servers. The resource starvation manifested itself as a temporary inability of the guest servers to communicate with the virtual Xen network. This made the guest servers appear to be down when in fact they were just busy.

Recovery Steps:

The Kynetx IT Operations team was notified of the issue within seconds of its inception by the network of monitoring agents. Once the Kynetx IT Operations team was made aware, a Severity Two (2) incident was declared, and engineers were immediately engaged to triage and resolve the issue.

Because of the load-balanced architecture of the KNS platform, the impacted servers were taken out of rotation automatically and isolated so as not to poison the entire platform.
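The report does not describe how the load balancer decides to pull a server, so the following is only a sketch of the general idea under assumed rules: a server leaves rotation after a fixed number of consecutive failed health checks and returns as soon as a check succeeds. The threshold, class, and server names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: failed checks in a row before a server is pulled.
MAX_FAILURES = 3

@dataclass
class Backend:
    name: str
    failures: int = 0  # consecutive failed health checks

    @property
    def in_rotation(self) -> bool:
        return self.failures < MAX_FAILURES

def record_check(backend: Backend, healthy: bool) -> None:
    """Reset the counter on success, increment it on failure."""
    backend.failures = 0 if healthy else backend.failures + 1

def active(pool: list[Backend]) -> list[str]:
    """Names of the servers that should still receive traffic."""
    return [b.name for b in pool if b.in_rotation]

pool = [Backend("eval-1"), Backend("eval-2")]
for _ in range(MAX_FAILURES):
    record_check(pool[0], healthy=False)  # eval-1 is too busy to answer
    record_check(pool[1], healthy=True)
print(active(pool))  # ['eval-2']

record_check(pool[0], healthy=True)  # load subsides, eval-1 answers again
print(active(pool))  # ['eval-1', 'eval-2']
```

A rule of this kind matches the behavior described above: a guest that is merely busy looks down to the health check, is removed from rotation, and rejoins once it responds again.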

Once the root cause was identified, steps were immediately taken to cancel the update job and relieve the resource starvation. The platform was fully recovered within 29 minutes of the first notification.
