Network Operations Center Best Practices
What is a NOC?
A network operations center (NOC) is the centralized location (in-office or virtual) for an organization’s network team. This team typically monitors networks, servers, application infrastructure, cloud usage, and more, for events that can result in service degradation or disruption to customers and users, signifying risk and cost to the business.
NOC duties often include:
- Being available 24×7 and typically working in 2-3 shifts depending on business hours
- Watching for anomalies, service disruptions, and outages
- Managing a queue of tickets from people reporting issues
- Working through runbooks to triage or resolve issues
- Notifying the team responsible for a given service when an outage occurs and the NOC cannot solve it
NOCs are often staffed with tiers of increasing expertise. The goal is to escalate as few incidents as possible to subject matter experts (SMEs), which keeps costs down while still preventing incidents from impacting customers.
Common NOC roles are:
- L0: “Role” is a bit of a misnomer for L0. This tier should be entirely automated and serve as the first line of defense before a human is ever bothered with an issue. Event-driven automation makes this possible: it reduces redundant incidents from the start and auto-remediates well-understood issues.
- L1: This is the first human an event should come into contact with. These engineers are responsible for working through attached, predetermined runbooks to resolve routine and documented issues.
- L2: If an L1 can’t resolve a problem via a runbook, this person is the one who does further triage and troubleshooting, ideally looking to resolve the issue with their additional expertise. They escalate the issue to a SME only if they cannot resolve it themselves. Additionally, L2s may be responsible for calling a major incident if the issue meets the criteria.
- Director, VP, or CIO: This is NOC leadership. The director is responsible for ensuring that the NOC is staffed appropriately, uses resources wisely, and meets all its goals and KPIs. The VP or CIO will have a larger scope than just the NOC. However, the NOC is a prime opportunity to modernize for better customer experiences, less risk to the business, and lower operating costs.
Let’s cover why this team and its duties are so important to a modern enterprise.
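The tiered hand-off described in the roles above can be sketched as a simple chain: each tier tries its own playbook and passes the event up only when it has no answer. This is a minimal illustration, not a real product's API; the `Event` type, the three playbook dictionaries, and the symptom and action names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    service: str
    symptom: str
    trail: list = field(default_factory=list)  # tiers that looked at the event


# Hypothetical playbooks: symptom -> action each tier knows how to take.
AUTO_REMEDIATIONS = {"disk_full": "rotate_logs"}        # L0: fully automated
L1_RUNBOOKS = {"high_latency": "restart_cache"}         # L1: documented runbooks
L2_KNOWN_FIXES = {"replication_lag": "resync_replica"}  # L2: deeper troubleshooting

TIERS = [("L0", AUTO_REMEDIATIONS), ("L1", L1_RUNBOOKS), ("L2", L2_KNOWN_FIXES)]


def handle(event: Event) -> str:
    """Walk the tiers in order and escalate to an SME only as a last resort."""
    for tier, playbook in TIERS:
        event.trail.append(tier)
        action = playbook.get(event.symptom)
        if action is not None:
            return f"{tier} resolved via {action}"
    event.trail.append("SME")
    return "escalated to SME"


if __name__ == "__main__":
    for symptom in ("disk_full", "high_latency", "unknown_crash"):
        event = Event(service="checkout", symptom=symptom)
        print(symptom, "->", handle(event), event.trail)
```

Keeping the tiers in an ordered list makes the escalation path explicit and easy to audit, and the `trail` records how far each event had to travel before it was resolved.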
Importance of a NOC
In today’s digital landscape, the NOC is under immense pressure to continuously monitor and maintain a growing number of business-critical services as innovation moves at a breakneck pace. Given this increase in complexity and the sheer volume of data, common NOC practices, such as the traditional “command and control” approach coupled with sequential, often manual workflows, are no longer fit for today’s real-time world. Under a mandate to do more with less, many organizations are evolving their NOCs to centralize and standardize the incident management process and gain efficiency across tech and teams. The benefits include fewer SLA penalties, less risk and lost revenue, fewer interruptions for SMEs, and a better overall brand reputation, to name a few.
But in order to reach that future state, legacy NOCs must address certain technical, people, and process challenges that stand in the way.
Top NOC challenges
Faced with unprecedented levels of complexity, the NOC must transform in order to avoid customer impact and reduce cost and risk to the business. Here are the top three challenges that stand in its way:
Increase in events/data: Because of the increased complexity of applications and services, the amount of data coming into the NOC has grown significantly. It’s no longer possible to maintain eyes-on-glass methods and have NOC engineers watch screens all day to pinpoint issues. There are simply too many data points to synthesize. Problems will fly under the radar. The result is that customers find out about disruptions faster than technical teams do, costing the organization money and hurting the brand reputation.
Manual processes: From running diagnostics, to finding the right runbook, to routing incidents via catch-and-dispatch, the NOC is full of processes that automation can handle. Without automation, NOC engineers spend too much time repeating the same routine steps for each issue, delaying a fix. With this time spent on toil, organizations are less able to get proactive about incident response, and more issues reach SMEs as the NOC struggles with the volume of tasks to complete.
Expertise: NOC personnel contact SMEs or system/service owners when a specific application or service experiences an incident. They are not typically deep-domain experts in the affected systems or services, largely because they didn’t develop the application or service themselves. So when an incident arises, it can be challenging and costly to manually navigate complex escalation paths to find and contact the SME responsible for a specific service or application. This manual escalation process extends the time it takes to diagnose and resolve issues that are actively affecting customers.
Although traditional NOCs face these challenges, there are plenty of ways to reimagine how they operate to support today’s real-time needs.
NOC best practices
For organizations looking to uplevel their NOC, teams must shift their mindset and use automation as their first line of defense and throughout the incident management process. Here are some tried-and-true tactics for doing this at each stage of incident management, leading to a healthier, more productive team and a more reliable system.
Detect: Ensure that all your monitoring systems are feeding events into a central system. Without this central system, the picture of an issue’s severity will be fragmented, leaving responders unsure how to categorize it. With monitoring flowing through a single source of truth, NOCs can immediately determine what needs to be prioritized from all across the organization, no matter who owns the service being monitored. Worried about the extra noise? Use automation to suppress and deduplicate redundant alerts, even across different tooling, for a clear signal.
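As a rough illustration of suppression and deduplication across tools, the sketch below groups alerts by service and check (ignoring which monitoring tool raised them) and drops repeats that arrive within a time window. The `Alert` fields, the 300-second window, and the suppression list are assumptions chosen for the example, not settings from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical names)
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts, window_seconds=300.0, suppressed_services=frozenset()):
    """Return alerts worth a responder's attention, in time order.

    Alerts for suppressed services are dropped. Alerts sharing a
    (service, check) key are treated as duplicates when they arrive within
    `window_seconds` of the previous alert with that key, regardless of source.
    """
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.service in suppressed_services:
            continue
        key = (alert.service, alert.check)
        previous = last_seen.get(key)
        last_seen[key] = alert.timestamp  # sliding window: every repeat extends it
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue
        kept.append(alert)
    return kept


if __name__ == "__main__":
    incoming = [
        Alert("tool-a", "checkout", "latency", 0.0),
        Alert("tool-b", "checkout", "latency", 60.0),   # same issue, other tool
        Alert("tool-a", "staging", "cpu", 10.0),        # suppressed service
        Alert("tool-a", "checkout", "latency", 400.0),  # quiet gap, new signal
    ]
    for alert in deduplicate(incoming, suppressed_services=frozenset({"staging"})):
        print(alert)
```

The sliding window means a continuous stream of identical alerts stays collapsed into one signal; a fixed window that re-alerts periodically would be an equally reasonable choice.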
Mobilize: The right response requires the right approach, whether this is solely via auto-remediation, L1/2 NOC response, or escalation to SMEs or major incident management (MIM) teams. Create automation that categorizes and routes issues to the right person immediately. Establishing clear routing rules and escalation policies can help bridge processes without letting issues fall through the cracks.
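One way to picture routing rules is as an ordered list where the first matching rule decides who responds, with a default target so nothing falls through the cracks. The sketch below is illustrative only: the `Incident` fields, the rule predicates, the `db-` naming convention, and the target names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    service: str
    severity: int               # 1 is the most severe
    has_auto_fix: bool = False  # a known auto-remediation exists


@dataclass
class Rule:
    name: str
    matches: Callable[[Incident], bool]
    target: str


# Ordered from most to least urgent; the first match wins.
RULES = [
    Rule("major incident", lambda i: i.severity == 1, "MIM team"),
    Rule("auto-remediation", lambda i: i.has_auto_fix, "automation"),
    Rule("database services", lambda i: i.service.startswith("db-"), "L2 NOC"),
]
DEFAULT_TARGET = "L1 NOC"  # catch-all so every incident has an owner


def route(incident: Incident, rules=RULES) -> str:
    """Return the responder for an incident based on the first matching rule."""
    for rule in rules:
        if rule.matches(incident):
            return rule.target
    return DEFAULT_TARGET


if __name__ == "__main__":
    print(route(Incident("db-orders", severity=1, has_auto_fix=True)))
    print(route(Incident("web", severity=3, has_auto_fix=True)))
    print(route(Incident("web", severity=4)))
```

Putting the major-incident rule first means a severe issue reaches the MIM team even when an automated fix exists, which is a design choice made for this example rather than a universal policy.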
Mitigate: Distinguish between high and low priority issues. Some services are not customer-facing or have few dependencies and can wait for a response since there’s less risk and impact to the business. Allow auto-remediation to prevent customer-facing problems for well-understood issues. Create runbooks for routine issues that do require NOC response so that fixing the issue is as streamlined as possible. Ensure these runbooks are automatically populated and cover MIM criteria and practices (or better yet, automate these criteria) so that major incident management kicks off immediately if needed.
Resolve: Let machines handle what they can and only loop humans in to resolve problems when automation cannot. Use runbooks to keep escalations low and have set criteria for when an escalation is required. Arm your responders with the right information immediately, using automation to pull diagnostics and update them continuously, and Artificial Intelligence (AI) to pull relevant historical data such as related or past incidents and how they were resolved.
Document: Integrate with the rest of the organization’s technical systems, such as systems of record like Jira or ServiceNow. These are common tools that NOCs rely on, either to pull data from or to transfer data to. Ensure that these integrations are in place for the services the NOC is responsible for so all data is available when needed, and craft automation to update them without requiring a human to do data entry.
Learn: Conduct post-mortems to recognize what went well during response and what could have gone better. Use those learning opportunities to create new automation that helps responders act faster and reduces toil. This feedback cycle is important to implement because it has a direct impact on the customer and the business. Reducing repeat incidents materially decreases risk and cost to the business and improves morale.
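The priority distinction and automated MIM criteria can be sketched as two small functions. Everything here is assumed for illustration: the `Service` fields, the threshold of three dependent services, and the rule that two or more affected high-priority services constitute a major incident. Real criteria would come from the organization's own policy.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    customer_facing: bool
    dependents: int  # number of other services that depend on this one


def priority(service: Service, high_dependents: int = 3) -> str:
    """Customer-facing or heavily depended-on services are high priority."""
    if service.customer_facing or service.dependents >= high_dependents:
        return "high"
    return "low"


def is_major_incident(affected, min_high_priority: int = 2) -> bool:
    """Hypothetical MIM criterion: several high-priority services are affected."""
    high = [s for s in affected if priority(s) == "high"]
    return len(high) >= min_high_priority


if __name__ == "__main__":
    checkout = Service("checkout", customer_facing=True, dependents=1)
    auth = Service("auth", customer_facing=False, dependents=5)
    reports = Service("internal-reports", customer_facing=False, dependents=0)

    print(priority(reports))                    # low: can wait for response
    print(is_major_incident([checkout, auth]))  # True under these assumptions
```

Encoding the criteria as code lets the major incident process start as soon as the condition is met, instead of waiting for a responder to look it up in a runbook.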
Keep in mind that these best practices can’t be implemented overnight. While these recommendations will improve the NOC’s ability to react, troubleshoot, and resolve incidents holistically, they should be considered with intent and formalized via documentation.
Additional resources
- PagerDuty University Training: PagerDuty 101
- Webinar: Improve Efficiency of Incident Response with Automated Diagnostics for AWS in PagerDuty