Feb
23
2010
Date: 02/23/2010
Start Time: 1740 MST
Stop Time: 1745 MST
Duration: 00:00:05 (DD:hh:mm)
Severity: Sev1 - Platform Down
Incident Summary:
The KNS platform experienced a five (5) minute service interruption beginning at 1740 MST. The outage was attributed to memory starvation on the master load balancer, which caused the load balancing cluster to fail over. The failover succeeded, but took longer than in testing because of the memory issue.
Services Impacted:
- Evaluation Servers (cs.kobj.net)
- Initialization Servers (init.kobj.net)
- Callback Servers (log.kobj.net)
Root Cause Analysis:
The root cause was identified as a memory starvation issue on lb1.kobj.net. The load balancer cluster has been very stable since installation, and there is no reason to suspect that there will be any further service issues.
Recovery Steps:
Once alerted by monitoring and customer reports, the Kynetx IT Operations team performed a manual forced failover/failback. This action restored service and returned control of the cluster to the master, which had recovered from the memory starvation issue. Going forward, the Kynetx IT Operations team will schedule a maintenance window to increase the amount of memory allocated to the load balancer processes. Service was restored in less than five (5) minutes from first alert.
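As an illustration only, a memory check of the following kind could warn before a load balancer reaches starvation. This is a minimal sketch, not the monitoring Kynetx actually runs: it assumes a Linux host exposing /proc/meminfo-style text, and the function names, the 10% threshold and the sample figures are all hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(meminfo, warn_ratio=0.10):
    """Return True when reclaimable memory falls below warn_ratio of total.

    warn_ratio is a hypothetical threshold, not a value from the incident.
    """
    total = meminfo["MemTotal"]
    # MemFree + Buffers + Cached is a rough estimate of memory that can be reused.
    available = (
        meminfo.get("MemFree", 0)
        + meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
    )
    return available / total < warn_ratio


# Made-up sample values, used only to exercise the functions.
STARVED_SAMPLE = """MemTotal:  1048576 kB
MemFree:     20480 kB
Buffers:     10240 kB
Cached:      30720 kB
"""

HEALTHY_SAMPLE = """MemTotal:  1048576 kB
MemFree:    524288 kB
Buffers:     10240 kB
Cached:      30720 kB
"""

if __name__ == "__main__":
    print(memory_pressure(parse_meminfo(STARVED_SAMPLE)))
    print(memory_pressure(parse_meminfo(HEALTHY_SAMPLE)))
```

On a real host the text would come from reading /proc/meminfo, and the threshold would need tuning against the load balancer's normal memory profile.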
Jan
04
2010
Date: 1/9/2010
Start Time: 0000 MST
End Time: 0500 MST
Duration: 00:05:00 (DD:HH:MM)
Outcome: Successful
Maintenance Summary:
Kynetx will be upgrading the OS versions on all of its servers from Fedora Core 8 to CentOS 5.4_final. In order to perform the upgrade, the host servers will need to be taken offline in turn and upgraded. This maintenance will be one of many over the course of the next ten (10) days.
Impact Statement:
The following sites and services will be unavailable during the maintenance window due to OS upgrade and rebooting activities.
- Corporate website – www.kynetx.com
- Corporate blog – code.kynetx.com
- Corporate news – news.kynetx.com
- Developer website – developer.kynetx.com
- Corporate email – mail.kynetx.com (POP/IMAP/SMTP)
- Appbuilder – appbuilder.kynetx.com
- Accounts – accounts.kynetx.com
- Appdirectory – appdirectory.kynetx.com
Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.
Maintenance Plan:
- Shut down virtual or physical server being worked on
- Create backup copy of VM images on server
- Upgrade OS on host servers
- Restore VM images to host servers
- Start guest images and test
Roll Back Plan:
- Restore system from backup
- Test system
- Close maintenance window
Jan
02
2010
Date: 12/31/2009
Start Time: 1606
Stop Time: 1635
Duration: 00:00:29 (DD:hh:mm)
Severity: Sev2 (Degraded Performance)
Incident Summary:
The KNS platform experienced degraded performance for a period of 29 minutes on 12/31/2009 due to high system load. This load was due to an ad-hoc software update job which was scheduled by the Kynetx IT Operations team.
Kynetx uses Puppet from Reductive Labs for configuration management. It works well, with one caveat: the daemon that communicates with the "puppet master" server(s) does so on a set schedule. If the daemons are all restarted at nearly the same time, their scheduled runs coincide, which can cause resource starvation in a virtualized environment such as the one Kynetx runs.
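One common way to avoid synchronized runs is to give each host a stable, host-specific delay before its daemon is restarted. The sketch below illustrates that idea only; it is not how Kynetx or Puppet schedules runs. The function names and the 30-minute window are hypothetical, and the hostnames are borrowed from the service list purely as sample input.

```python
import hashlib


def restart_offset_seconds(hostname, window_seconds=1800):
    """Map a hostname to a stable offset in [0, window_seconds).

    Hashing the name means a host always gets the same offset, while
    different hosts are spread across the window.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def restart_schedule(hostnames, window_seconds=1800):
    """Return (offset, hostname) pairs sorted by restart time."""
    return sorted(
        (restart_offset_seconds(name, window_seconds), name) for name in hostnames
    )


if __name__ == "__main__":
    hosts = ["cs.kobj.net", "init.kobj.net", "log.kobj.net"]
    for offset, host in restart_schedule(hosts):
        print(f"{host}: restart after {offset}s")
```

A hash-based offset needs no coordination between hosts, but it does not guarantee that two hosts never land close together; a larger window reduces the odds.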
Services Impacted:
- Evaluation Servers (cs.kobj.net)
- Initialization Servers (init.kobj.net)
- Callback Servers (log.kobj.net)
- Code Fragment Server (frag.kobj.net)
- Kynetx Rule Language Server (krl.kobj.net)
- Kynetx Corporate Server (corp.kynetx.com)
- Mail Server (mail.kynetx.com)
- Corporate Web Server (www.kynetx.com, code.kynetx.com, news.kynetx.com, developer.kynetx.com)
- Kynetx Application Server (demo.kynetx.com)
- AppBuilder
- AppDirectory
- Accounts
Root Cause Analysis:
The root cause was identified as an ad-hoc software update job controlled through the Puppet system. The update was deployed to the platform within a ten (10) minute period, and caused resource starvation on one of the Xen host servers. The resource starvation left the guest servers temporarily unable to communicate with the virtual Xen network. This caused the guest servers to appear to be down, when in fact they were just busy.
Recovery Steps:
The Kynetx IT Operations team was notified of the issue within seconds of its inception by the network of monitoring agents. Once the Kynetx IT Operations team was made aware, a Severity Two (2) incident was declared, and engineers were immediately engaged to triage and resolve the issue.
Due to the load-balanced architecture of the KNS platform, the impacted servers were taken out of rotation automatically and isolated so as not to poison the entire platform.
Once the root cause was identified, steps were immediately taken to cancel the update job and relieve the resource starvation. The platform was fully recovered within 29 minutes of the first notification.
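Because the starvation occurred on a single Xen host, one possible safeguard is to roll updates out in waves that cap how many guests on any one host update at the same time. The following is a minimal sketch under that assumption. The guest and host names, the function name and the default cap of one guest per host are hypothetical and do not describe the actual remediation.

```python
from collections import defaultdict


def update_waves(guest_to_host, max_per_host=1):
    """Group guests into waves with at most max_per_host guests per host in each wave."""
    by_host = defaultdict(list)
    for guest, host in sorted(guest_to_host.items()):
        by_host[host].append(guest)

    waves = []
    index = 0
    while True:
        wave = []
        for host in sorted(by_host):
            # Take the next slice of guests for this host, if any remain.
            wave.extend(by_host[host][index:index + max_per_host])
        if not wave:
            break
        waves.append(wave)
        index += max_per_host
    return waves


if __name__ == "__main__":
    # Hypothetical placement of guests on Xen hosts.
    placement = {"guest-a": "xen1", "guest-b": "xen1", "guest-c": "xen2"}
    for number, wave in enumerate(update_waves(placement), start=1):
        print(f"wave {number}: {wave}")
```

Each wave would be allowed to finish before the next begins, so the load on any single host stays bounded by the cap.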
Dec
20
2009
Date: 12/26/2009
Start Time: 0000 MST
End Time: 0500 MST
Duration: 00:05:00 (DD:HH:MM)
Impact Statement:
The following sites and services will be unavailable during the maintenance window due to system patching, network migration and rebooting activities.
- Corporate website – www.kynetx.com
- Corporate blog – code.kynetx.com
- Corporate news – news.kynetx.com
- Developer website – developer.kynetx.com
- Corporate email – mail.kynetx.com (POP/IMAP/SMTP)
- Appbuilder – appbuilder.kynetx.com
- Accounts – accounts.kynetx.com
- Appdirectory – appdirectory.kynetx.com
Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.
Maintenance Plan:
- Shut down physical/virtual server being worked on
- Create backup copy of VM image
- Apply application and/or OS patches
- Restart image
- Start services and test
Roll Back Plan:
- Restore system from backup
- Test system
- Close maintenance window
Dec
12
2009
Date: 12/19/2009
Start Time: 0000 MST
End Time: 0500 MST
Duration: 00:05:00 (DD:HH:MM)
Impact Statement:
The following sites and services will be unavailable during the maintenance window due to system patching, network migration and rebooting activities.
- Corporate website - www.kynetx.com
- Corporate blog - code.kynetx.com
- Corporate news - news.kynetx.com
- Developer website - developer.kynetx.com
- Corporate email - mail.kynetx.com (POP/IMAP/SMTP)
- Appbuilder - appbuilder.kynetx.com
- Accounts - accounts.kynetx.com
- Appdirectory - appdirectory.kynetx.com
Due to load balancing, the impact to the core Kynetx Network Services (init, eval and callback servers) will be minimal.
Maintenance Plan:
- Shut down physical/virtual server being worked on
- Create backup copy of VM image
- Apply application and/or OS patches
- Restart image
- Start services and test
Roll Back Plan:
- Restore system from backup
- Test system
- Close maintenance window
Dec
09
2009
Date: 12/9/2009
Start Time: 2000 MST
End Time: 2100 MST
Duration: 00:01:00 (DD:HH:MM)
Impact Statement:
The following sites and services will be unavailable during the maintenance window due to system reboot and disk space allocation activities:
- Corporate website - www.kynetx.com
- Corporate blog - code.kynetx.com
- Corporate news - news.kynetx.com
- Developer website - developer.kynetx.com
- Corporate email - mail.kynetx.com (POP/IMAP/SMTP)
Maintenance Plan:
- Shut down corporate server
- Create backup copy of VM image
- Change VM configuration to add additional disk space
- Restart image
- Format new disk space for use
- Migrate email message store to new disk space
- Start email system and test
Roll Back Plan:
- Restore system from backup
- Test system
- Close maintenance window
Dec
05
2009
Date: 12/05/2009
Start Time: 2126 MST
Stop Time: 2221 MST
Duration: 00:00:55 (DD:HH:MM)
Summary: The corporate server, which serves multiple corporate websites and the corporate email system, experienced an unplanned outage this evening.
Impact: The outage interrupted the following sites and services:
- www.kynetx.com
- code.kynetx.com
- news.kynetx.com
- developer.kynetx.com
- mail.kynetx.com (Web client and SMTP/POP/IMAP)
NOTE: At no time were the KNS services impacted.
Root Cause: The root cause was a driver issue on the host server which caused the VM to go into an unrecoverable "paused" state.
Remediation: The VM was migrated to another host server and was brought back online without incident.
If you should experience any further issues or interruptions of service, please contact the support desk at support@kynetx.com immediately.
Regards,
Kynetx IT Operations
Dec
04
2009
Date: 12/4/2009
Start Time: 2000 MST
End Time: 2130 MST
Duration: 00:01:30 (DD:HH:MM)
Impact Statement:
The Kynetx email system will be unavailable during the maintenance window due to system patching and email system software updates.
Change Plan:
- Place email system into maintenance mode
- Back up current configuration
- Apply OS patches
- Reboot server
- Apply email system patches
- Apply custom configuration to email system
- Test email system
- Take email system out of maintenance mode
- Close maintenance window
Roll Back Plan:
- Restore system from backup
- Test system
- Close maintenance window
Dec
03
2009
Date: 12/3/2009
Start Time: 1400 MST
End Time: 1500 MST
Duration: 00:01:00 (DD:HH:MM)
Impact Statement:
Medium impact deployment due to multiple changes to the KNS engine configuration. This will be a rolling deployment in which all included servers/services will be taken out of load balancer rotation in order to allow for isolated deployment and testing.
Change Plan:
1. Send "deployment commencing" notification to interested parties
2. Remove server from load balancer pool
3. Back up current configuration
4. Deploy new revision of KNS engine code
5. Test
   - If successful, move to step 6
   - If unsuccessful, move to roll back plan
6. Return server to load balancer pool
7. Monitor server for anomalous behavior
8. Move to next server/service until all servers have been updated, tested and certified
9. Send "deployment complete" notification to interested parties
10. Monitor platform for anomalies during 48-hour soak-in period
Roll Back Plan:
*If deployment testing fails, or a severity one (1) anomaly is detected during the soak-in period, the roll back plan may be activated.*
1. Send "roll back commencing" notification to interested parties
2. If not already excluded from the load balancer pool, remove the affected server from the load balancer pool
3. Restore configuration backup taken in step 3 of the Change Plan
4. Test
   - If successful, move to step 5
   - If unsuccessful, develop and communicate recovery plan to interested parties
5. Return server to load balancer pool
6. Monitor platform for anomalies during 48-hour soak-in period
7. Send "roll back complete" notification to interested parties
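The change and roll back plans above can be read as a single loop over the servers. The sketch below is one hedged way to express that control flow, not the tooling Kynetx used. The function name and every callable passed in (backup, deploy, test, restore, notify) are hypothetical placeholders, and the load balancer pool is modeled as a plain Python set. The monitoring and soak-in steps are left out because they happen outside the loop.

```python
def rolling_deploy(servers, pool, backup, deploy, test, restore, notify):
    """Deploy to one server at a time; roll back and stop on the first failed test.

    Returns True if every server was updated, False if a roll back was triggered.
    """
    notify("deployment commencing")
    for server in servers:
        pool.discard(server)        # remove server from load balancer pool
        snapshot = backup(server)   # back up current configuration
        deploy(server)              # deploy new revision
        if not test(server):
            notify("roll back commencing")
            restore(server, snapshot)
            if not test(server):
                # Roll back did not recover the server: keep it out of the pool.
                notify("roll back failed: recovery plan required")
                return False
            pool.add(server)        # return server to load balancer pool
            notify("roll back complete")
            return False
        pool.add(server)            # return server to load balancer pool
    notify("deployment complete")
    return True


if __name__ == "__main__":
    # Stub callables standing in for real operations.
    active_pool = {"server-1", "server-2"}
    result = rolling_deploy(
        ["server-1", "server-2"],
        active_pool,
        backup=lambda s: f"config-of-{s}",
        deploy=lambda s: None,
        test=lambda s: True,
        restore=lambda s, snap: None,
        notify=print,
    )
    print(result, sorted(active_pool))
```

Stopping at the first failure keeps the remaining servers on the known-good revision, which matches the intent of deploying in isolation one server at a time.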
Dec
02
2009
After a long battle with obscurity, the Kynetx forums, aka codebb.kynetx.com, have been put out of their misery.
In order to work through the stages of grief as quickly as possible, we have replaced the long-ignored, and now defunct, codebb with a shiny new site called devex, short for Developer Exchange.
This site is powered by StackExchange, and unlike the forums, is on fire with activity. I expect that devex will enjoy a much more robust and active life and I invite all Kynetx Developers, whether you are contemplating an application or actively working on one, to join in the conversation.