IMS is an all-hazard, all-risk framework for managing incidents. Over 1.5 million fires per year in the U.S. are managed using IMS. It’s been battle-tested under the most extreme conditions. It works for fires and definitely works in IT.
The fire service has a strong culture, strong personalities and strong opinions, and agreeing on a way of doing anything, much less adopting IMS as “the way” of managing incidents, was a herculean feat. As a testament to how efficient and useful IMS is, it overcame all the political inertia and resistance to change that could have killed it off as a fad or “something that just won’t work for us,” as many fire departments are prone to say. IT is much like the fire service. It also has a strong culture, strong personalities and strong opinions. Any company can argue that their environment, company size or complexity is so unique as to preclude them from adopting IMS as a way to manage incidents. We know this is not true.
The Blackrock 3 Partners have a unique view into the domain of Incident Management. Collectively we bring over 100 years of Fire Service and Critical Infrastructure experience (we have literally published books on the subjects), and we pioneered the adaptation of the Incident Management System (IMS) from public safety into corporate IT environments.
We have helped our clients successfully implement IMS in enterprises, service providers, and DevOps shops, across multiple industries and around the world! Each adopted IMS concepts and methods, aligned with common sense and intuition, in order to build excellent incident response programs.
You may not think that a building on fire and an IT incident have much in common, but from an IMS perspective, they are fundamentally the same. Both fire and IT incidents occur without warning, are dynamic (i.e., both are in progress and not under control), create a negative impact of some type, and require a coordinated effort of the right people performing the right tasks at the right time to return systems to normal (i.e., a building that is not on fire or an IT environment that is not in a degraded state). The burning building and the IT incident both create downtime, and the incident responders are there to bring the environment back to uptime.
When working as an organized team under strong leadership, incident responders with technical skills can assess a dynamic and evolving situation, develop plans to resolve the issue, communicate those plans, and work together to return to uptime in the shortest amount of time possible. To that end, there is a difference between responding to an incident and reacting to it. Responders are trained, organized, and disciplined in their approach to resolving an incident. They bring their experience and skill to the incident with focus and direction. Reactors, on the other hand, tend to be emotional and undisciplined, whether as individuals or as a team. Each reactor generally has a different viewpoint on what’s important to resolving the situation. There is likely no coordination among reactors, no recognition of the importance of a team, no delegation of tasks, no sharing of ideas or development of solutions in an organized fashion, and no focused effort by the group as a whole.
Responders are calm, cool, and collected and can think clearly under pressure. They arrive and direct the events that ultimately resolve an incident. Reactors get emotional and irrational and cannot stay focused or organized. They arrive and see an emergency, not an incident. Which one are you?
Clearly, incident response is best accomplished by responders. Perhaps a good way to get your head around being one is by adopting this viewpoint from the fire service: “Fire is not an emergency to the fire department. It’s what we do.” When you dial 911 for the fire department, you expect a rapid response from a group of professionals, skilled in the art of solving whatever issue you are having on your “bad day.” IT responders, regardless of whether they are using DevOps practices, ITIL, or homegrown systems, are similar to firefighters and should think of themselves in the same way. IT responders reduce the impact of an IT issue and restore the environment to uptime.
Wartime is an urgent, degraded mode of operation that occurs when any application or infrastructure element experiences an issue outside the normal course of business. Wartime is downtime.
This doesn’t mean, however, that responders are frantic or hysterical. It means that the group understands the need to assemble quickly, get organized, stay on task, and get on with the business of resolution with urgency (not emergency!) and intensity. If you come from an agile development environment, think of incident response as a really fast and compressed sprint!
Having a responder (Wartime) mentality, however, is just the tip of the iceberg when it comes to resolving IT incidents. An excellent group of technical experts without a strong leader and a framework to organize themselves cannot resolve incidents with maximum efficiency in minimum time. Conversely, strong leadership and a framework to organize people, without the right technical expertise, will not solve any issue quickly or efficiently. Resolving incidents requires the right mix of expertise and leadership.
Monitoring and alerting tools for the technology stack are important to incident response: they provide the initial information for the responders and help them size up the incident and identify a severity (SEV) level.
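To make the size-up step concrete, here is a minimal sketch of how alert data might be mapped to a SEV level. Everything in it is an assumption made for illustration: the `Alert` fields, the three-level `Sev` scale, and the 50% error-rate threshold are hypothetical and are not defined by IMS or by any particular monitoring tool. Each organization would set its own criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sev(IntEnum):
    """Hypothetical three-level severity scale (lower number = more severe)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Alert:
    """Hypothetical initial information supplied by a monitoring tool."""
    service: str
    customer_facing: bool
    error_rate: float  # fraction of failing requests, 0.0 to 1.0


# Assumed threshold for "major" degradation; purely illustrative.
MAJOR_ERROR_RATE = 0.5


def size_up(alert: Alert) -> Sev:
    """Assign an initial SEV level from the alert's impact indicators."""
    major = alert.error_rate >= MAJOR_ERROR_RATE
    if alert.customer_facing and major:
        return Sev.SEV1  # customers are affected and the degradation is major
    if alert.customer_facing or major:
        return Sev.SEV2  # only one of the two impact indicators is present
    return Sev.SEV3      # minor, internal-only issue


if __name__ == "__main__":
    alert = Alert(service="checkout", customer_facing=True, error_rate=0.8)
    print(alert.service, size_up(alert).name)
```

The SEV level assigned here is only a starting point. Responders would revise it as the size-up continues and more information arrives.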