Hypercare Support for the Holidays
With the winter holidays fast approaching, many retail businesses are turning their focus to hypercare as they prepare to move goods and services at peak levels. But what is hypercare? Here at PagerDuty, we use the following working definition:
Hypercare is the period of time during which an elevated level of support is available to ensure the seamless adoption or operation of a system.
The key concept here is that hypercare is a period of planned support—in other words, it’s for Black Fridays and Cyber Mondays, but not DDoS attacks. It’s also important to know that hypercare isn’t just for retail; it applies to any business that may go through a period of elevated support, such as major product launches, game releases, news cycles, and beyond. (To avoid over-attaching hypercare to Black Friday, I’ll be using the industry standard terms Go Live Day or Release Day.)
With such a broad reach, how can businesses support hypercare? Simple: support the teams providing the technical support, and do that with concepts from incident management, observability, and chaos engineering.
Incident response is probably what most people think of first when they start to consider hypercare. After all, hypercare is elevated support, and part of that is responding quickly to situations as they arise. To improve mean time to acknowledge (MTTA) and mean time to resolve (MTTR), organizations should define as many terms and processes as possible, as far in advance as possible.
For example, outlining what differentiates an “incident” that requires a response process from any other hiccup in your systems will help technical responders prioritize which alerts to respond to, reducing time to resolution for bigger incidents. At PagerDuty, we define an incident as “any unplanned disruption or degradation of service that is actively affecting customers’ ability to use our products or services.”
Incidents have severities and priorities. In terms of human response, they also require people to be on call to respond if they occur outside of normal business hours, and they need defined escalation paths in case they worsen. Just like the incident itself, these all need to be defined in advance so you don’t lose precious time on Release Day if a situation needs to be handled. Our incident response guide can help you explore all of these concepts in greater depth, and on top of that, I recommend practicing faux incidents prior to Going Live (more on this later). This is especially important if you make changes to your processes or definitions, so that teams have practiced them and understand them well in advance.
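To make that concrete, here’s a minimal sketch of what codifying these definitions up front might look like. The severity names, escalation timers, and the customer-impact rule are illustrative assumptions, not prescribed values—adapt them to your own definitions.

```python
# A minimal sketch of codifying incident definitions in advance.
# Severity names, timers, and the impact rule are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class SeverityPolicy:
    description: str
    page_on_call: bool            # wake someone up outside business hours?
    escalate_after_minutes: int   # how long before escalating to the next tier


SEVERITIES = {
    "sev1": SeverityPolicy("Customer-facing outage", page_on_call=True, escalate_after_minutes=5),
    "sev2": SeverityPolicy("Degraded service, customers impacted", page_on_call=True, escalate_after_minutes=15),
    "sev3": SeverityPolicy("Internal impact only", page_on_call=False, escalate_after_minutes=60),
}


def is_incident(customer_impact: bool, service_degraded: bool) -> bool:
    """Mirrors the working definition above: an unplanned disruption or
    degradation that is actively affecting customers' ability to use the product."""
    return customer_impact and service_degraded
```

Writing these rules down (wherever you keep them) means responders aren’t debating severity levels in the middle of a Release Day incident.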
In order to find the incidents to manage, you must be sending data into your incident management and alerting platform(s). To do that, you’ll need to have an observable system. What is that? “A system is observable if and only if you can determine the behavior of the system based on its outputs.” (From Greg Poirier’s talk at Monitorama 2016.)
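As a sketch of what “sending data in” looks like, here’s a minimal example that pushes an alert event to PagerDuty’s Events API v2. The routing key, summary, and source values are placeholders—in practice, your monitoring tools would emit these for you.

```python
# A minimal sketch of sending an alert event to PagerDuty's Events API v2.
# The routing key and payload values below are placeholders.
import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def trigger_alert(routing_key: str, summary: str, source: str, severity: str = "error") -> str:
    """Send a trigger event; returns the dedup key assigned to the alert."""
    response = requests.post(
        EVENTS_API_URL,
        json={
            "routing_key": routing_key,   # integration key for the target service
            "event_action": "trigger",
            "payload": {
                "summary": summary,       # human-readable description of the problem
                "source": source,         # hostname or component that detected it
                "severity": severity,     # one of: critical, error, warning, info
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]


# Example usage (placeholder values):
# trigger_alert("YOUR_INTEGRATION_KEY", "Checkout latency above 2s", "checkout-api")
```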
When people discuss observability, they’re typically referring to telemetry, commonly grouped into the “three pillars” of observability: logging, metrics/monitoring, and tracing. Having lots of usable data is crucial to supporting hypercare, as that data is what enables your team to successfully triage and troubleshoot what is going wrong when they receive a notification from your incident management platform.
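One small habit that makes that data usable is emitting structured logs that can later be correlated with metrics and traces. Here’s a minimal sketch using Python’s standard logging module; the field names (trace_id, order_id) are illustrative assumptions about what you might want to record.

```python
# A minimal sketch of structured (JSON) logging with correlation fields.
# The trace_id and order_id fields are illustrative assumptions.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Attach any extra context passed via `extra=...`
        for key in ("trace_id", "order_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"trace_id": "abc123", "order_id": "ORD-42"})
```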
If you’re just getting started in one or more of these areas, don’t panic! There are several “getting started” guides available. What’s most important in the beginning phases is not which tool, but what data. The more you know about your systems, the more you can adapt the best practices you find for, say, monitoring Kubernetes to fit your specific deployments.
One of our partners, Datadog, has an excellent three-part series on effective monitoring, and this TechBeacon post is resource-rich, with links to articles about various applications and systems that can be monitored and logged, such as network logging, the differences between the pillars, how to use OWASP for secure logging, and how to choose a tracer.
Once you’re feeling comfortable with the essentials of the three pillars, take a look at Honeycomb.io CTO Charity Majors’ article, “A 3-Year Retrospective,” which highlights some of the shortcomings of observability, as well as what can be done to overcome them.
And now for the final piece: chaos engineering. Chaos engineering is the practice of running controlled experiments against your systems to uncover weaknesses before they cause real outages. This ties into what I mentioned earlier about practicing incidents before Release Day to help your team prepare for hypercare. The more practiced they are at handling situations that can and will arise, the more adeptly they will move through any unplanned incidents.
If you’re new to chaos experiments, definitely run them in non-production first. In addition to giving humans the opportunity to practice the incident management process, this is a great opportunity to verify that your tools are behaving as intended, providing the correct information at the correct scope. For additional guidance, check out this Gremlin post about how to run your first chaos experiment.
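A first experiment can be very small: state a hypothesis, confirm steady state, inject a fault, and check again. The sketch below assumes a hypothetical staging health-check endpoint, and the fault-injection step is a placeholder for whatever chaos tooling you actually use.

```python
# A minimal sketch of a first chaos experiment in non-production.
# The health-check URL and fault-injection step are placeholders.
import time

import requests

HEALTH_URL = "https://staging.example.com/healthz"  # placeholder endpoint


def steady_state_ok() -> bool:
    """Hypothesis: the service answers health checks within 500 ms."""
    try:
        response = requests.get(HEALTH_URL, timeout=0.5)
        return response.status_code == 200
    except requests.RequestException:
        return False


def inject_fault() -> None:
    """Placeholder: trigger your chaos tooling here (kill a pod, add latency, etc.)."""
    print("Injecting fault via your chaos tooling of choice")


def run_experiment() -> None:
    assert steady_state_ok(), "System not healthy before experiment; aborting"
    inject_fault()
    time.sleep(30)  # give monitors and on-call alerts time to fire
    if steady_state_ok():
        print("Hypothesis held: the service tolerated the fault")
    else:
        print("Hypothesis failed: follow your incident process and capture findings")


if __name__ == "__main__":
    run_experiment()
```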
As you become more proficient with chaos experiments, you can move them from your non-production environment of choice to production. By the time you do this, you should be able to build hypotheses, test them, and resolve the resulting breakages. You should also focus on your most critical applications and services, since these are the ones you will need to understand best on Go Live Day.
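Production experiments deserve explicit guardrails. Here’s a small sketch of writing down the hypothesis, blast radius, and abort condition before you start; the service name and thresholds are illustrative assumptions.

```python
# A minimal sketch of guardrails for a production chaos experiment.
# Service names and thresholds are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ProductionExperiment:
    service: str
    hypothesis: str
    blast_radius_pct: float   # share of traffic or hosts exposed to the fault
    abort_error_rate: float   # stop immediately above this error rate

    def should_abort(self, observed_error_rate: float) -> bool:
        return observed_error_rate >= self.abort_error_rate


checkout_experiment = ProductionExperiment(
    service="checkout-api",
    hypothesis="Losing one cache node does not raise checkout error rate above 1%",
    blast_radius_pct=5.0,
    abort_error_rate=0.01,
)
```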
Additionally, make use of the stakeholder communications you developed to ensure that relevant parties know that experiments will be running in production so no one is caught off guard. (This is also partly to prevent anyone from panicking when they see bursts of alerts.) You should not mute alerts, because they’re part of the test and will help you see whether they are 1) actionable, 2) routed correctly, and 3) populated with the correct information. When in doubt, look to the experience of others for guidance. We have a pair of blog posts detailing our Failure Fridays model. You can also see how New Relic applied their chaos experiments to security, here.
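Part of that alert verification can even be scripted. As a sketch, the example below lists the incidents opened on a service during the experiment window via PagerDuty’s REST API so you can review them against those three questions; the API key, service ID, and time window are placeholders.

```python
# A minimal sketch of checking which incidents fired during an experiment window.
# The API key, service ID, and timestamps are placeholders.
import requests

API_URL = "https://api.pagerduty.com/incidents"
API_KEY = "YOUR_REST_API_KEY"   # placeholder
SERVICE_ID = "PABC123"          # placeholder: the service targeted by the experiment


def incidents_during(since_iso: str, until_iso: str) -> list:
    response = requests.get(
        API_URL,
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={"since": since_iso, "until": until_iso, "service_ids[]": SERVICE_ID},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["incidents"]


for incident in incidents_during("2019-11-22T14:00:00Z", "2019-11-22T15:00:00Z"):
    # Review each one: was it actionable, routed to the right team, and did its
    # title and details carry the information responders needed?
    print(incident["title"], incident["urgency"], incident["status"])
```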
All of that probably feels like a lot, and it is, but the key takeaway to remember is that the goal of hypercare is to eliminate surprise. Each of the topics discussed is an engineering practice that works toward reducing surprises and their impact. While you’re getting ready for your hypercare scenario, you might want to keep our handy Hypercare Readiness Checklist available to help track your progress. If you have any questions, please drop by our Community Forums—we’re happy to help!