10 Common Ops Mistakes
Updated 7/24/2014: This blog post was updated to more accurately reflect Arup’s talk.
Arup Chakrabarti, PagerDuty’s operations engineering manager, stopped by Heavybit Industries’ HQ to discuss the biggest mistakes an operations team can make and how to head them off. To check out the full video, visit the Heavybit Video Library.
1. Getting It Wrong in Infrastructure Setup
Creating Accounts
A lot of people use personal accounts when setting up enterprise infrastructure deployments. Instead, create new accounts using corporate addresses to enforce consistency.
Be wary of how you store passwords. Keeping them in your git repo could require you to wipe out your entire git history at a later date. It’s better to save passwords within configuration management so they can be plugged in as needed.
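As a minimal sketch of what this looks like in practice, an app can read credentials from its environment at run time and leave it to configuration management to inject them on each host. The variable names below are hypothetical:

```python
import os

# Minimal sketch: read credentials from the environment at runtime instead of
# committing them to the repository. DB_USER and DB_PASSWORD are hypothetical
# names -- use whatever your configuration management tool injects.
def get_db_credentials():
    try:
        return {
            "user": os.environ["DB_USER"],
            "password": os.environ["DB_PASSWORD"],
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing credential in environment: {missing}")

if __name__ == "__main__":
    creds = get_db_credentials()
    print(f"Connecting as {creds['user']} (password not logged)")
```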
Selecting Tools
Another good move with new deployments: select your tools wisely. For example, leverage PaaS tools as long as possible – that way, you can focus on acquiring customers instead of building infrastructure. And don’t be afraid to employ “boring” products like Java. Well-established, tried-and-true tech can let you do some really cool stuff.
2. Poorly Designed Test Environments
Keep Test and Production Separate
You don’t want to risk your test and production environments mingling in any way. Be sure to set up test environments with different hosting and provider accounts from what you use in production.
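One hedged way to keep the two from mingling in application code is to key every provider account and endpoint off an environment setting, so test code physically can’t reach production resources. The account names and hosts below are made up:

```python
import os

# Hypothetical per-environment settings: each environment points at its own
# provider account and hosts, so test code can never accidentally touch
# production resources.
ENVIRONMENTS = {
    "test": {
        "aws_profile": "acme-test",   # separate AWS account for test
        "db_host": "db.test.internal",
    },
    "production": {
        "aws_profile": "acme-prod",
        "db_host": "db.prod.internal",
    },
}

def load_settings():
    env = os.environ.get("APP_ENV", "test")  # default to the safe side
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env}")
    return ENVIRONMENTS[env]
```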
Virtual Machines
Performing local development? There’s no way around it: applications will run differently on local machines and in production. To simulate a production environment as closely as possible, create VMs with a tool like Vagrant.
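If you want to script this, the python-vagrant package (a thin wrapper around the Vagrant CLI) can bring the VM up as part of your local workflow. This sketch assumes python-vagrant is installed and that a Vagrantfile already exists in the project directory:

```python
import vagrant  # the python-vagrant package, a thin wrapper around the vagrant CLI

# Bring up the VM described by the project's Vagrantfile so local code runs
# against something shaped like production.
v = vagrant.Vagrant(root=".")
v.up()
print(v.status())   # confirm the machine is running
# ... run your integration tests against the VM here ...
v.halt()            # stop the VM when you're done
```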
3. Incorrect Configuration Management
Both Ansible and Salt are tools that are really easy to learn. Ansible in particular makes infrastructure-as-code deployment super-simple for ops teams.
What is infrastructure-as-code? Essentially, it’s the process of building infrastructure in such a way that it can be spun up or down quickly and consistently. Server configurations are going to get screwed up regardless of where your infrastructure is running, so you have to be prepared to restore your servers in as little time as possible.
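Tools like Ansible and Salt handle this declaratively, but the underlying idea – declare the desired state and converge to it idempotently, so the same run is safe on a fresh or half-configured server – can be sketched in a few lines. The package names and Debian-style commands here are purely illustrative:

```python
import subprocess

# Conceptual sketch of infrastructure-as-code: declare the desired state,
# check the current state, and only act when they differ, so the same run
# can be repeated safely on any host.
DESIRED_PACKAGES = ["nginx", "ntp"]

def is_installed(package):
    # `dpkg -s` exits non-zero when the package is absent (Debian/Ubuntu assumed)
    return subprocess.run(
        ["dpkg", "-s", package],
        capture_output=True,
    ).returncode == 0

def converge():
    for package in DESIRED_PACKAGES:
        if not is_installed(package):
            # would need root / sudo in practice
            subprocess.run(["apt-get", "install", "-y", package], check=True)

if __name__ == "__main__":
    converge()
```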
Whatever tool you use, as a rule of thumb it’s best to limit the number of automation tools you’re using. Each one is a source of truth in your infrastructure, which means it’s also a point of failure.
4. Deploying the Wrong Way
Consistency matters
Every piece of code must be deployed in as similar a fashion as possible. But getting all of your engineers to practice consistency can be a challenge.
Orchestrate your efforts
Powerful automation software can certainly help enforce consistency. But heavyweight automation tools are really only appropriate for big deployments – so when you’re getting started, Arup suggests deploying from git and employing a lightweight orchestration tool: for example, Capistrano for Rails, Fabric for Python, or Ansible and Salt for both orchestration and configuration management.
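As a rough sketch of what a consistent, repeatable deploy looks like, the following uses Fabric 2 to run the exact same steps on every host. The host names, paths and restart script are made up:

```python
from fabric import Connection  # Fabric 2.x

# Hedged sketch of a consistent deploy: every host gets the same steps in the
# same order. Host names, paths and the restart script are placeholders.
HOSTS = ["app1.example.com", "app2.example.com"]
RELEASE = "v1.4.2"

def deploy(host):
    c = Connection(host)
    c.run(f"cd /srv/app && git fetch && git checkout {RELEASE}")
    c.run("cd /srv/app && ./restart.sh")

if __name__ == "__main__":
    for host in HOSTS:
        deploy(host)
```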
5. Not Handling Incidents Correctly
Have a process in place
Creating and documenting an incident management process is absolutely necessary, even if the process isn’t perfect.
You should be prepared to review the incident-management document on an ongoing basis, too. If you aren’t experiencing much downtime, those reviews won’t need to happen very often.
Put everyone on-call
It’s becoming less and less common for companies to have dedicated on-call teams – instead, everyone who touches production code is expected to be reachable in the event of downtime.
This requires a platform (like PagerDuty) that can notify different people in different ways. What really matters is getting a hold of the right people at the right time.
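For illustration, here is a minimal sketch that triggers an incident through PagerDuty’s Events API (v2). The routing key is a placeholder for a real integration key:

```python
import json
import urllib.request

# Hedged sketch: trigger an incident via PagerDuty's Events API v2.
# "YOUR_INTEGRATION_KEY" is a placeholder for a real routing key.
def trigger_incident(summary, source, severity="critical"):
    event = {
        "routing_key": "YOUR_INTEGRATION_KEY",
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# trigger_incident("Checkout latency above 2s", "web-01")
```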
6. Neglecting Monitoring and Alerting
Start anywhere
The specific tool you use for monitoring is less important than just putting something in place. PagerDuty uses StatsD in concert with Datadog; open-source tools like Nagios can be just as effective.
If you have the money, an application performance management tool like New Relic might be a good fit. But what matters most is that you have a monitoring tool on deck.
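As a hedged example of how little code “something in place” requires, this sketch uses the statsd Python package to emit a counter and a timer to a local StatsD daemon (which could forward to Datadog, Graphite or anything else). The metric names are made up:

```python
import time
import statsd  # the `statsd` package from PyPI

# Minimal sketch: emit a counter and a timer to a local StatsD daemon.
# Metric names and the "myapp" prefix are placeholders.
client = statsd.StatsClient("localhost", 8125, prefix="myapp")

def handle_request():
    start = time.time()
    client.incr("requests")                                  # count every request
    # ... do the actual work ...
    client.timing("request_ms", (time.time() - start) * 1000)  # report latency in ms
```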
“You have no excuse to not have any monitoring and alerting on your app, even when you first launch.” – Arup Chakrabarti, Engineering Manager, PagerDuty
7. Failing to Maintain Backups
Systematizing backups and restores
Just like monitoring and alerting, backing up your data is non-negotiable. Scheduling regular backups to S3 is a standard industry practice today.
At least once a month, try restoring your production dataset into a test environment to confirm that your backups are working as designed.
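A minimal sketch of that backup-and-restore loop, assuming boto3 and a hypothetical bucket name, might look like this:

```python
import datetime
import boto3

# Hedged sketch: push a database dump to S3, and pull a named backup back down
# for a monthly restore test. Bucket name and file paths are made up.
s3 = boto3.client("s3")
BUCKET = "acme-db-backups"

def backup(dump_path="/var/backups/db.dump"):
    key = f"prod/{datetime.date.today().isoformat()}.dump"
    s3.upload_file(dump_path, BUCKET, key)
    return key

def restore_to_test(key, target_path="/tmp/restore-test.dump"):
    s3.download_file(BUCKET, key, target_path)
    # ... load target_path into the test database and sanity-check it ...
```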
8. Ignoring High Availability Principles
‘Multiple’ is the watchword
Having multiple servers at every layer – multiple stateless app servers, multiple load balancers – is a no-brainer. Only with multiple failover options can you truly say you’ve optimized for HA.
Datastore design matters, too
Clustered, multi-master datastores (like Cassandra) are essential because individual nodes can be taken out with absolutely no customer-facing impact. That property makes clustered datastores ideal in fast-moving deployment environments.
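As a small illustration of the multi-master idea, the DataStax Python driver accepts several contact points, so losing any single node doesn’t take the client down. The addresses below are placeholders:

```python
from cassandra.cluster import Cluster  # DataStax cassandra-driver

# Sketch of the multi-master idea: the client is given several contact points,
# so losing any single node doesn't take the application down.
cluster = Cluster(["10.0.1.10", "10.0.1.11", "10.0.1.12"])
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
```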
9. Falling Into Common Security Traps
Relying solely on SSH
Rather than exposing SSH directly on your database servers and load balancers, use gateway (bastion) boxes. You can run proxies through these gateways and lock traffic down quickly if you suspect an incursion.
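A hedged sketch of that pattern with paramiko: the only SSH connection from the outside goes to the gateway, and traffic to the internal database host is tunneled through it. Hostnames and usernames are placeholders:

```python
import paramiko

# Hedged sketch: reach an internal database host only through a gateway
# (bastion) box, so the internal host never exposes SSH directly.
gateway = paramiko.SSHClient()
gateway.set_missing_host_key_policy(paramiko.AutoAddPolicy())
gateway.connect("gateway.example.com", username="ops")

# Open a tunnel from the gateway to the internal host's SSH port.
tunnel = gateway.get_transport().open_channel(
    "direct-tcpip", ("db01.internal", 22), ("127.0.0.1", 0)
)

internal = paramiko.SSHClient()
internal.set_missing_host_key_policy(paramiko.AutoAddPolicy())
internal.connect("db01.internal", username="ops", sock=tunnel)
stdin, stdout, stderr = internal.exec_command("uptime")
print(stdout.read().decode())
```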
Not configuring individual user accounts
When an employee leaves your organization, it’s nice to be able to revoke his or her access quickly. But there are other reasons to set people up with individual user accounts for your various tools. Someone’s laptop may get lost. An individual might need a password reset. It’s a lot easier, Arup notes, to revoke or reset one user’s password than a master account password.
Failing to activate encryption in dev
Making encryption a part of the development cycle helps you catch security-related bugs early in development. Plus, forcing devs to think constantly about security is simply a good practice.
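One lightweight way to keep encryption in the development loop is to run the local dev server over HTTPS from day one. This sketch assumes Flask with pyOpenSSL installed, which lets the dev server generate a throwaway self-signed certificate:

```python
from flask import Flask  # assumes Flask and pyOpenSSL are installed

app = Flask(__name__)

@app.route("/")
def index():
    return "hello over TLS"

if __name__ == "__main__":
    # "adhoc" generates a throwaway self-signed certificate, so the app is
    # exercised over HTTPS even on a laptop and TLS bugs surface early.
    app.run(ssl_context="adhoc")
```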
10. Ignoring Internal IT Needs
Not strictly an operations problem, but…
IT isn’t always ops’ concern. But on certain issues, both teams are stakeholders. For example:
- Commonality in equipment: If an engineer loses her custom-built laptop, how long will it take to get her a replacement? Strive for consistency in hardware to streamline machine deployments.
- Granting access to the right tools: On-boarding documents are a good way to share login information with new hires.
- Imaging local machines: With disk images stored on USB, provisioning or reprovisioning equipment is a snap.
- Turning on disk encryption: With encryption, no need to worry if a machine gets lost.
There are millions more mistakes that operations teams can make. But these 10 tend to be the most commonly seen, even at companies like Amazon, Netflix and PagerDuty.
Have your own Ops mistake you’d like to share? Let us know in the comments below.