Category: IT Operations

Jul 09 2010

KNS Platform Outage Report 7/9/2010

Date: 7/9/2010

Start Time: 1134 MST on 7/9/2010

Stop Time: 1222 MST on 7/9/2010

Duration: 00:00:48 (DD:hh:mm)

Severity: Sev2 - Degraded performance processing ruleset evaluations on the KNS platform with periods of unavailability (<15 minutes) of the evaluation servers.

Incident Summary:

During a routine code deployment, requests for ruleset parsing were negatively impacted by the inability of the KRL parser to keep up with demand. During this period, requests for ruleset evaluations were either delayed or blocked.

Services Impacted:

  • Evaluation Servers (cs.kobj.net)
  • KRL Parser (krl.kobj.net)
  • AppBuilder (appbuilder.kynetx.com)

Root Cause Analysis:

The root cause was determined to be a single ruleset that took more than 60 seconds on average to parse. Because of the volume of parse requests for this ruleset, the requests never completed and were retried. As the retried parse requests compounded, requests for other rulesets were delayed or dropped.
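
The compounding retries described above are the kind of failure a per-ruleset circuit breaker is meant to contain. The following is a minimal sketch only, not the KNS implementation: the class name `ParseBreaker`, the threshold of three timeouts, and the ruleset ids are all hypothetical.

```python
from collections import defaultdict


class ParseBreaker:
    """Stop dispatching parse requests for a ruleset after repeated timeouts."""

    def __init__(self, max_timeouts=3):
        # Hypothetical threshold: consecutive timeouts tolerated per ruleset.
        self.max_timeouts = max_timeouts
        self._timeouts = defaultdict(int)  # ruleset id -> consecutive timeouts

    def allow(self, ruleset_id):
        """Return True if a parse request for this ruleset may be dispatched."""
        return self._timeouts[ruleset_id] < self.max_timeouts

    def record_timeout(self, ruleset_id):
        """Count one more consecutive timeout for this ruleset."""
        self._timeouts[ruleset_id] += 1

    def record_success(self, ruleset_id):
        """A completed parse resets the counter for this ruleset."""
        self._timeouts[ruleset_id] = 0


breaker = ParseBreaker(max_timeouts=3)
for _ in range(3):
    breaker.record_timeout("slow_ruleset")  # hypothetical ruleset id

print(breaker.allow("slow_ruleset"))   # False
print(breaker.allow("other_ruleset"))  # True
```

With a guard like this, retries for the slow ruleset would be rejected early instead of occupying the parser, so requests for other rulesets could still be served.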

Recovery Steps:

Once the root cause was identified, the offending ruleset was deactivated and the platform returned to normal operation within minutes. The author of the ruleset has been notified and the Kynetx Engineering team is reviewing the ruleset's syntax.

Feb 23 2010

KNS Platform Outage Report 2/23/2010

Date: 02/23/2010

Start Time: 1740 MST

Stop Time: 1745 MST

Duration: 00:00:05 (DD:hh:mm)

Severity: Sev1 - Platform Down


Incident Summary:

The KNS platform experienced a service interruption at 1740 MST, which lasted five (5) minutes. The outage was attributed to memory starvation on the master load balancer in the load balancing cluster, which caused a failover of the cluster. The failover was successful, but took longer than in testing because of the memory issue.

Services Impacted:

  • Evaluation Servers (cs.kobj.net)
  • Initialization Servers (init.kobj.net)
  • Callback Servers (log.kobj.net)

Root Cause Analysis:

The root cause was identified as a memory starvation issue on lb1.kobj.net. The load balancer cluster has been very stable since installation, and there is no reason to suspect that there will be any further service issues.

Recovery Steps:

Once alerted by monitoring and customer reports, the Kynetx IT Operations team performed a manual forced failover/failback. This action restored service and returned control of the cluster to the master, which had recovered from the memory starvation issue. Going forward, the Kynetx IT Operations team will schedule a maintenance window to increase the amount of memory allocated to the load balancer processes. Service was restored in less than five (5) minutes from first alert.

Jan 04 2010

Kynetx Maintenance Window 1/9/2010

Date: 1/9/2010

Start Time: 0000 MST

End Time: 0500 MST

Duration: 00:05:00 (DD:HH:MM)

Outcome: Successful


Maintenance Summary:

Kynetx will be upgrading the OS versions on all of its servers from Fedora Core 8 to CentOS 5.4_final. In order to perform the upgrade, the host servers will need to be taken offline in turn and upgraded. This maintenance will be one of many over the course of the next ten (10) days.

Impact Statement:

The following sites and services will be unavailable during the maintenance window due to OS upgrade and rebooting activities.

  • Corporate website – www.kynetx.com
  • Corporate blog – code.kynetx.com
  • Corporate news – news.kynetx.com
  • Developer website – developer.kynetx.com
  • Corporate email – mail.kynetx.com (POP/IMAP/SMTP)
  • Appbuilder – appbuilder.kynetx.com
  • Accounts – accounts.kynetx.com
  • Appdirectory – appdirectory.kynetx.com

Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.

Maintenance Plan:

  1. Shut down virtual or physical server being worked on
  2. Create backup copy of VM images on server
  3. Upgrade OS on host servers
  4. Restore VM images to host servers
  5. Start guest images and test

Roll Back Plan:

  1. Restore system from backup
  2. Test system
  3. Close maintenance window

Jan 02 2010

Kynetx KNS Server Outage Report 12/31/2009

Date: 12/31/2009
Start Time: 1606
Stop Time: 1635
Duration: 00:00:29 (DD:hh:mm)
Severity: Sev2 (Degraded Performance)



Incident Summary:

The KNS platform experienced degraded performance for a period of 29 minutes on 12/31/2009 due to high system load. This load was due to an ad-hoc software update job which was scheduled by the Kynetx IT Operations team.

Kynetx uses Puppet from Reductive Labs for configuration management. The system works well, with one caveat: the daemon on each node communicates with the "puppet master" server(s) on a set schedule. If all the daemons are restarted at close to the same time, their scheduled runs coincide, which can cause resource starvation in a virtualized environment such as the one Kynetx runs.
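
One common way to avoid synchronized runs is to give each host a stable, host-specific start offset within the run interval. The sketch below illustrates that idea only. The function name `splay_offset`, the 1800-second interval, and the hostnames are assumptions, not details from this incident. Puppet also provides a `splay` agent setting intended for this purpose; this report does not state whether it was in use.

```python
import hashlib


def splay_offset(hostname, interval_seconds=1800):
    """Map a hostname to a stable start offset in [0, interval_seconds)."""
    # Hashing the hostname gives the same offset on every restart,
    # while spreading different hosts across the interval.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


# Hypothetical hostnames, for illustration only.
hosts = ["cs1.example.net", "cs2.example.net", "init1.example.net"]
for host in hosts:
    print(host, splay_offset(host))
```

Because the offset depends only on the hostname, restarting every daemon at once does not move their runs onto the same schedule.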

Services Impacted:

  • Evaluation Servers (cs.kobj.net)
  • Initialization Servers (init.kobj.net)
  • Callback Servers (log.kobj.net)
  • Code Fragment Server (frag.kobj.net)
  • Kynetx Rule Language Server (krl.kobj.net)
  • Kynetx Corporate Server (corp.kynetx.com)
    • Mail Server (mail.kynetx.com)
    • Corporate Web Server (www.kynetx.com, code.kynetx.com, news.kynetx.com, developer.kynetx.com)
  • Kynetx Application Server (demo.kynetx.com)
    • AppBuilder
    • AppDirectory
    • Accounts

Root Cause Analysis:

The root cause was identified as an ad-hoc software update job controlled through the Puppet system. The update was deployed to the platform within a ten (10) minute period, and caused resource starvation on one of the Xen host servers. The resource starvation manifested itself as a temporary inability of the guest servers to communicate with the virtual Xen network. This caused the guest servers to appear to be down, when in fact they were just busy.

Recovery Steps:

The Kynetx IT Operations team was notified of the issue within seconds of its inception by the network of monitoring agents. Once the Kynetx IT Operations team was made aware, a Severity Two (2) incident was declared, and engineers were immediately engaged to triage and resolve the issue.

Due to the load-balanced architecture of the KNS platform, the impacted servers were taken out of rotation automatically and isolated so as not to poison the entire platform.

Once the root cause was identified, steps were immediately taken to cancel the update job and relieve the resource starvation. The platform was fully recovered within 29 minutes of the first notification.
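
The automatic removal from rotation described above can be sketched as a health tracker that counts consecutive check results. This is an illustration under stated assumptions, not the load balancer's actual logic: the class name `HealthTracker` and both thresholds are hypothetical.

```python
class HealthTracker:
    """Track consecutive check results and decide rotation membership."""

    def __init__(self, fail_threshold=3, pass_threshold=2):
        # Hypothetical thresholds: failures before removal, passes before return.
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.in_rotation = True
        self._fails = 0
        self._passes = 0

    def record(self, healthy):
        """Record one check result and return the current rotation state."""
        if healthy:
            self._passes += 1
            self._fails = 0
            if not self.in_rotation and self._passes >= self.pass_threshold:
                self.in_rotation = True
        else:
            self._fails += 1
            self._passes = 0
            if self.in_rotation and self._fails >= self.fail_threshold:
                self.in_rotation = False
        return self.in_rotation


tracker = HealthTracker()
results = [tracker.record(ok) for ok in [False, False, False, True, True]]
print(results)  # [True, True, False, False, True]
```

Requiring several consecutive failures before removal helps when a server is busy rather than down, as the guests were in this incident, at the cost of slower detection of a real failure.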

Dec 20 2009

Kynetx KNS Platform Maintenance Window 12/26/2009

Date: 12/26/2009

Start Time: 0000 MST

End Time: 0500 MST

Duration: 00:05:00 (DD:HH:MM)


Impact Statement:

The following sites and services will be unavailable during the maintenance window due to system patching, network migration and rebooting activities.

  • Corporate website – www.kynetx.com
  • Corporate blog – code.kynetx.com
  • Corporate news – news.kynetx.com
  • Developer website – developer.kynetx.com
  • Corporate email – mail.kynetx.com (POP/IMAP/SMTP)
  • Appbuilder – appbuilder.kynetx.com
  • Accounts – accounts.kynetx.com
  • Appdirectory – appdirectory.kynetx.com

Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.

Maintenance Plan:

  1. Shut down physical/virtual server being worked on
  2. Create backup copy of VM image
  3. Apply application and/or OS patches
  4. Restart image
  5. Start services and test

Roll Back Plan:

  1. Restore system from backup
  2. Test system
  3. Close maintenance window

Dec 12 2009

Kynetx maintenance window 12/19/2009

Date: 12/19/2009
Start Time: 0000 MST
End Time: 0500 MST
Duration: 00:05:00 (DD:HH:MM)


Impact Statement:

The following sites and services will be unavailable during the maintenance window due to system patching, network migration and rebooting activities.

  • Corporate website - www.kynetx.com
  • Corporate blog - code.kynetx.com
  • Corporate news - news.kynetx.com
  • Developer website - developer.kynetx.com
  • Corporate email - mail.kynetx.com (POP/IMAP/SMTP)
  • Appbuilder - appbuilder.kynetx.com
  • Accounts - accounts.kynetx.com
  • Appdirectory - appdirectory.kynetx.com

Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.

Maintenance Plan:

  1. Shut down physical/virtual server being worked on
  2. Create backup copy of VM image
  3. Apply application and/or OS patches
  4. Restart image
  5. Start services and test

Roll Back Plan:

  1. Restore system from backup
  2. Test system
  3. Close maintenance window

Dec 09 2009

Kynetx corporate server maintenance 12/9/2009

Date: 12/9/2009
Start Time: 2000 MST
End Time: 2100 MST
Duration: 00:01:00 (DD:HH:MM)


Impact Statement:

The following sites and services will be unavailable during the maintenance window due to system reboot and disk space allocation activities.

  • Corporate website - www.kynetx.com
  • Corporate blog - code.kynetx.com
  • Corporate news - news.kynetx.com
  • Developer website - developer.kynetx.com
  • Corporate email - mail.kynetx.com (POP/IMAP/SMTP)

Maintenance Plan:

  1. Shut down corporate server
  2. Create backup copy of VM image
  3. Change VM configuration to add additional disk space
  4. Restart image
  5. Format new disk space for use
  6. Migrate email message store to new disk space
  7. Start email system and test

Roll Back Plan:

  1. Restore system from backup
  2. Test system
  3. Close maintenance window

Dec 05 2009

Kynetx Corporate Server Outage Report – 12/5/2009 – Final

Date: 12/05/2009

Start Time: 2126 MST

Stop Time: 2221 MST

Duration: 00:00:55 (DD:HH:MM)



Summary: The corporate server, which serves multiple corporate websites and the corporate email system, experienced an unplanned outage this evening.

Impact: The outage interrupted the following sites and services:

  • www.kynetx.com
  • code.kynetx.com
  • news.kynetx.com
  • developer.kynetx.com
  • mail.kynetx.com (Web client and SMTP/POP/IMAP)

NOTE: At no time were the KNS services impacted.

Root Cause: The root cause was a driver issue on the host server, which caused the VM to go into an unrecoverable "paused" state.

Remediation: The VM was migrated to another host server and was brought back online without incident.

If you should experience any further issues or interruptions of service, please contact the support desk at support@kynetx.com immediately.

Regards,

Kynetx IT Operations

Dec 04 2009

Kynetx email system maintenance 12/4/2009

Date: 12/4/2009
Start Time: 2000 MST
End Time: 2130 MST
Duration: 00:01:30 (DD:HH:MM)


Impact Statement:

Kynetx email system will be unavailable during the maintenance window due to system patching and email system software updates.

Change Plan:

  1. Place email system into maintenance mode
  2. Back up current configuration
  3. Apply OS patches
  4. Reboot server
  5. Apply email system patches
  6. Apply custom configuration to email system
  7. Test email system
  8. Take email system out of maintenance mode
  9. Close maintenance window

Roll Back Plan:

  1. Restore system from backup
  2. Test system
  3. Close maintenance window

Dec 03 2009

KNS Deployment Window 12/3/2009

Date: 12/3/2009

Start Time: 1400 MST

End Time: 1500 MST

Duration: 00:01:00 (DD:HH:MM)


Impact Statement:

Medium impact deployment due to multiple changes to the KNS engine configuration. This will be a rolling deployment in which all included servers/services will be taken out of load balancer rotation in order to allow for isolated deployment and testing.

Change Plan:

  1. Send "deployment commencing" notification to interested parties
  2. Remove server from load balancer pool
  3. Back up current configuration
  4. Deploy new revision of KNS engine code
  5. Test
    1. If successful, move to step 6
    2. If unsuccessful, move to roll back plan
  6. Return server to load balancer pool
  7. Monitor server for anomalous behavior
  8. Move to next server/service until all servers have been updated, tested and certified
  9. Send "deployment complete" notification to interested parties
  10. Monitor platform for anomalies during 48-hour soak-in period
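
The per-server loop in the Change Plan can be sketched as follows. This is a simplified illustration, not the deployment tooling used here: `rolling_deploy` and the callables passed to it are hypothetical, the notification and monitoring steps are omitted, and the server names are made up.

```python
def rolling_deploy(servers, deploy, test, rollback, pool):
    """Deploy to one server at a time; stop at the first failed test.

    `pool` is the set of servers currently in load balancer rotation.
    `deploy`, `test`, and `rollback` are caller-supplied callables.
    Returns the servers that were updated and returned to the pool.
    """
    updated = []
    for server in servers:
        pool.discard(server)      # step 2: remove from rotation
        deploy(server)            # steps 3-4: back up, then deploy
        if not test(server):      # step 5: test in isolation
            rollback(server)      # step 5.2: hand off to the roll back plan
            break                 # do not continue to the next server
        pool.add(server)          # step 6: return to rotation
        updated.append(server)
    return updated


pool = {"eval1", "eval2", "eval3"}  # hypothetical server names
done = rolling_deploy(
    sorted(pool),
    deploy=lambda s: None,
    test=lambda s: s != "eval2",    # simulate a failed test on eval2
    rollback=lambda s: None,
    pool=pool,
)
print(done)          # ['eval1']
print(sorted(pool))  # ['eval1', 'eval3']
```

In this sketch a failed server stays out of the pool until the roll back plan returns it, and servers not yet touched keep serving traffic on the previous revision.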

Roll Back Plan:

*If deployment testing fails, or a severity one (1) anomaly is detected during the soak-in period, the roll back plan may be activated*

  1. Send "roll back commencing" notification to interested parties
  2. If not already excluded from the load balancer pool, remove the affected server from the load balancer pool
  3. Restore configuration backup taken in step 3 of the "Change Plan"
  4. Test
    1. If successful, move to step 5
    2. If unsuccessful, develop and communicate recovery plan to interested parties
  5. Return server to load balancer pool
  6. Monitor platform for anomalies during 48-hour soak-in period
  7. Send "roll back complete" notification to interested parties