Reliability, Availability, and Scalability Lessons for Holiday Peak Processing

Operational performance is always paramount for applications, websites, and order systems during the busy Christmas holiday shopping season. Keeping holiday peak processing running with good performance is a painstaking ordeal. Even with the best-laid plans of continuous improvement builds, application code freezes, massive end-to-end testing, and continuous monitoring, systems, applications, and programs still fail, causing critical outages.

There are many ways to address these failures. Year after year, I help clients improve their response to these inevitable situations. Here are the first five of the top 10 ways that can help make your systems bulletproof for the holiday season’s reliability, availability, and scalability demands.

  • Understand and document program dependencies. With nightly website processing application code, there can be a tremendous amount of code reuse, complexity, and code dependencies between processes. Understanding the process modules and their dependencies on base software versions, Java method coding restrictions, and distinct interface requirements helps you quickly understand the scope of an outage and determine the resources affected. Research and documentation should be readily available to show how methods, interfaces, and versions are interdependent.

    This program dependency information is critical for quickly diagnosing, debugging, or fixing errors or performance problems. Having a cross-reference of program dependencies readily available can save huge amounts of time, especially during an outage crisis.

  • Know your application history. Systems and applications evolve over their lifetime. Research previous years’ history of system, application, and processing outages and gather relevant performance and error documentation. If problems occurred last year at peak, the same problem or a variation may happen this year.

    Outages and errors occur in every system and application, so understanding where the problems occurred over the past few years can be a great indication of future problems. If the application is new, uncover where the most testing was done and where it had the most problems. These history lessons can help you quickly pinpoint problems or understand where to expect them.

  • Know your crisis communication contacts. When the crisis occurs at the most inappropriate time (as it usually does), what are your escalation procedures, software remediation techniques, internal management, and external public communication protocols? Each of these areas needs distinct processes and messages to immediately capture the problem’s documentation for debugging, understand any workaround methodologies, escalate through the proper communication channels, and engage the appropriate technology subject matter experts to resolve the issues. Documenting all of these procedures and areas ahead of time, so everyone knows their responsibilities and roles for resolution, will help minimize the impact, scope, and depth of the problem.

  • Implement automated recovery remediation technology. Humans cause the majority of the errors in complex technology that impact reliability, availability, and scalability. This article states that humans cause the majority of security issues, but when surveyed, we humans rank ourselves as the lowest cause of issues.

    Obviously, we are in denial. Having technology take care of issues such as deploying additional servers for scalability, automated workload balancing, and automated error and recovery software is critical for quick problem resolution. Across systems and applications, automated responses to issues are usually quicker and more comprehensive than manual actions. Better yet, prior testing of these automated responses makes them quicker and more comprehensive than any human reaction to a problem.
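
As a rough illustration of the automated remediation idea in the last suggestion, here is a minimal Python sketch of a rule that turns monitoring readings into a recovery action. Everything in it is an assumption made for illustration: the Metrics fields, the thresholds, and the action names are hypothetical, and a real system would hand the chosen action to its own provisioning or restart tooling. Keeping the decision in a plain function means the response can be tested well before the peak arrives.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """One monitoring sample (hypothetical fields, for illustration only)."""
    cpu_percent: float    # average CPU use across the server pool
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    healthy_servers: int  # servers currently passing health checks


def choose_action(m, min_servers=2, cpu_limit=80.0, error_limit=0.05):
    """Pick one remediation action for a sample; thresholds are assumed values."""
    if m.healthy_servers < min_servers:
        return "add_server"       # restore minimum capacity first
    if m.error_rate > error_limit:
        return "restart_service"  # errors are high even though capacity is adequate
    if m.cpu_percent > cpu_limit:
        return "add_server"       # scale out under heavy load
    return "none"


# Rehearse the automated responses before the holiday peak.
print(choose_action(Metrics(cpu_percent=92.0, error_rate=0.01, healthy_servers=4)))
print(choose_action(Metrics(cpu_percent=40.0, error_rate=0.20, healthy_servers=4)))
print(choose_action(Metrics(cpu_percent=40.0, error_rate=0.01, healthy_servers=1)))
```

The order of the checks is a design choice: lost capacity is handled before error rates, and error rates before load, so each sample yields exactly one action.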
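
The dependency cross-reference described in the first suggestion can also be kept in a form that a script can query during an outage. The Python sketch below is only one possible approach: the module names and the DEPENDS_ON table are hypothetical, and in practice the table would be generated from build files or existing documentation. Given a failed module, it lists everything that depends on it directly or indirectly, which is the scope of the outage.

```python
from collections import defaultdict, deque

# Hypothetical cross-reference: each module maps to the modules it depends on.
DEPENDS_ON = {
    "checkout": ["pricing", "inventory"],
    "pricing": ["tax_rules"],
    "inventory": ["warehouse_feed"],
    "nightly_report": ["pricing"],
}


def build_reverse_index(depends_on):
    """Map each module to the set of modules that depend on it directly."""
    used_by = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(module)
    return used_by


def impacted_by(failed, depends_on):
    """Return every module that directly or transitively depends on `failed`."""
    used_by = build_reverse_index(depends_on)
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(sorted(impacted_by("tax_rules", DEPENDS_ON)))
# prints ['checkout', 'nightly_report', 'pricing']
```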

These are only five suggestions for improving your system and application reliability, availability, and scalability. These are easy-to-understand concepts, but they can be very involved to implement. Start on these five issues now to help improve your application uptime over last year’s.

Dave Beulke is a system strategist, application architect, and performance expert specializing in Big Data, data warehouses, and high performance internet business solutions. He is an IBM Gold Consultant, Information Champion, President of DAMA-NCR, former President of International DB2 User Group, and frequent speaker at national and international conferences. His architectures, designs, and performance tuning techniques help organizations better leverage their information assets, saving millions in processing costs. Follow him on Twitter or connect through LinkedIn.