Automate Fast & Win: 11 Event-Driven Automation Tasks for Enterprise DevOps Teams

by Justyn Roberts December 17, 2024 | 7 min read

Event-driven automation is a powerful approach to managing enterprise IT environments. It lets systems react automatically to enterprise events (observability, monitoring, security, social, and machine events), reducing or removing the need for manual intervention.

This post discusses 11 common automation tasks that are ideal for enterprise DevOps teams looking to enhance operational efficiency, reduce downtime, and ensure business continuity. 

Struggling with ideas for where to start? These examples cover a range of scenarios, from security patching to resource optimization, and are paired with real code samples to help you get started.

Automation Tasks

1. Kubernetes Pod Actions

Description: Although Kubernetes usually maintains the desired state well, restarting pods is occasionally necessary to refresh the application state or apply new configurations. This automation task restarts pods so that they connect to the most up-to-date environment. Guardrails can easily be added to prevent accidental overscaling.

Trigger: Incident/Event

Plugin/Technology: Kubernetes plugin

Kubernetes plugin showing Pod Deletion and keystore functionality

Explanation: This plugin restarts Kubernetes pods for a specific deployment in a given namespace, ensuring the application runs with the latest configurations or patches. Data such as the deployment name or namespace can be passed dynamically from the triggering event, and Runbook Automation includes a selection of plugins to streamline this process.

This can easily be extended to any activity within the Kubernetes ecosystem: 23 plugins are available for tasks such as maintaining PVs, deploying services, grabbing logs, or running internal jobs.

Benefit: Ensures application availability and reliability by keeping pods running with the latest configurations and patches, reducing downtime from misconfigurations.
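For teams that prefer a script over the plugin, the same idea can be sketched in Python. This is a minimal illustration, not what the Kubernetes plugin does internally: it assumes `kubectl` is installed and authenticated on the host, it swaps pod deletion for `kubectl rollout restart`, and the `deployment` and `namespace` event fields are hypothetical names.

```python
import re
import subprocess

# Conservative guard: accept only lowercase alphanumerics and hyphens,
# so values taken from an event cannot smuggle in extra arguments.
_NAME = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def pod_restart_command(event: dict) -> list[str]:
    """Build a rollout-restart command from an event payload.

    The "deployment" and "namespace" keys are hypothetical field names.
    """
    deployment = event["deployment"]
    namespace = event.get("namespace", "default")
    for value in (deployment, namespace):
        if not _NAME.match(value):
            raise ValueError(f"invalid Kubernetes name: {value!r}")
    return ["kubectl", "rollout", "restart",
            f"deployment/{deployment}", "--namespace", namespace]


def restart_pods(event: dict) -> None:
    """Run the restart; needs kubectl and cluster credentials on the host."""
    subprocess.run(pod_restart_command(event), check=True)
```

Keeping command construction separate from execution makes the guardrail easy to test without a cluster.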

2. Optimize Disk Resource

Description: Running out of disk space can lead to application crashes, degraded performance, and system instability. Manual monitoring and cleaning of disk space can be time-consuming and error-prone. Automated disk cleanup ensures that the system remains stable by removing unnecessary files.

Trigger: Incident/Event/Human-initiated

Plugin/Technology: Bash inline script

Inline bash script containing an automation of the manual tasks to optimize the local storage

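Explanation: This script checks the disk usage of the root partition and, when usage exceeds 80%, initiates cleanup actions such as deleting old log files and clearing package caches.

Benefit: Prevents application crashes and performance degradation by automating disk space management, improving system stability, and reducing manual intervention costs.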
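The post's version is an inline Bash script; the logic could look roughly like this in Python. The 80% threshold comes from the description above, while the `*.log` pattern and the 7-day age limit are assumptions for illustration. Package cache clearing is left out of this sketch.

```python
import shutil
import time
from pathlib import Path


def usage_percent(path: str = "/") -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def remove_old_logs(log_dir: str, max_age_days: int = 7) -> list[str]:
    """Delete *.log files older than `max_age_days`; return their names."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for file in Path(log_dir).glob("*.log"):
        if file.is_file() and file.stat().st_mtime < cutoff:
            file.unlink()
            removed.append(file.name)
    return sorted(removed)


def cleanup_if_needed(log_dir: str, threshold: float = 80.0) -> list[str]:
    """Clean up only when root partition usage exceeds the threshold."""
    if usage_percent("/") <= threshold:
        return []
    return remove_old_logs(log_dir)
```

Returning the list of removed files gives the job something concrete to log or attach to the incident.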

3. Patch Deployment

Description: Vulnerabilities in Linux systems need to be patched promptly to prevent exploitation. This automation task automatically applies security patches when a vulnerability is detected.

Trigger: Scheduled/Event-driven/Human-initiated

Plugin/Technology: Ansible Inline

Ansible inline playbook to apply security patches.

Explanation: This playbook updates all packages on Linux systems. It can be triggered when a vulnerability is detected or scheduled to run periodically.
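The playbook itself is not reproduced here. As a rough Python stand-in for the same "update all packages" step on a single host, the sketch below detects the package manager and returns the upgrade commands. The set of supported managers and the dry-run default are assumptions, and this is not a substitute for the Ansible playbook.

```python
import shutil
import subprocess

# Upgrade commands per package manager; each inner list is one argv.
UPGRADE_COMMANDS = {
    "apt-get": [["apt-get", "update"], ["apt-get", "-y", "upgrade"]],
    "dnf": [["dnf", "-y", "upgrade"]],
    "yum": [["yum", "-y", "update"]],
}


def detect_manager(which=shutil.which):
    """Return the first supported package manager found on PATH, or None."""
    for name in UPGRADE_COMMANDS:
        if which(name):
            return name
    return None


def patch_host(dry_run: bool = True) -> list[list[str]]:
    """Return the upgrade commands; execute them only when dry_run is False."""
    manager = detect_manager()
    if manager is None:
        raise RuntimeError("no supported package manager found")
    commands = UPGRADE_COMMANDS[manager]
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, check=True)  # requires root privileges
    return commands
```

Defaulting to a dry run means an event-triggered job reports what it would do until someone deliberately enables the upgrade.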

Benefit: Enhances security posture by applying security patches promptly, minimizing vulnerability windows, and protecting against potential exploits.

4. Kubernetes Scaling

Description: In Kubernetes environments, scaling a deployment up or down can be crucial to managing workloads effectively, especially during peak usage or when resource usage drops. This automation task scales a deployment to match current demand, with a defined maximum number of instances to ensure optimal resource usage.

Trigger: Human-driven/Event-driven

Plugin/Technology: Kubernetes plugin

Inline Kubernetes pod scaling.

Explanation: This script checks the current number of replicas of a deployment and scales it up to the maximum defined number if more resources are required or scales it down during periods of lower demand.
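The decision logic described above can be sketched as a small clamped function. The CPU thresholds, the one-replica step down, and the use of CPU as the demand signal are all assumptions; only the "scale up to a defined maximum" behavior comes from the description.

```python
def target_replicas(current: int, cpu_percent: float, max_replicas: int,
                    min_replicas: int = 1, scale_up_at: float = 80.0,
                    scale_down_at: float = 30.0) -> int:
    """Pick a replica count and clamp it to [min_replicas, max_replicas]."""
    if cpu_percent >= scale_up_at:
        wanted = max_replicas      # high demand: go to the defined maximum
    elif cpu_percent <= scale_down_at:
        wanted = current - 1       # low demand: step down by one replica
    else:
        wanted = current           # inside the band: leave as-is
    return max(min_replicas, min(max_replicas, wanted))


def scale_command(deployment: str, namespace: str, replicas: int) -> list[str]:
    """Build the kubectl command that applies the chosen replica count."""
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "--namespace", namespace]
```

The clamp is the guardrail: whatever the event payload says, the result never exceeds the configured maximum or drops below the minimum.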

Benefit: Optimizes resource usage by dynamically scaling deployments based on demand, reducing infrastructure costs while maintaining performance during peak times.

5. Security Incident Response

Description: Security incidents such as unauthorized access attempts require immediate action. Automating the response to detected incidents leads to a better security posture. There are dedicated SIEM tools for these purposes, but Runbook Automation can be used to enhance the block or quarantine process.

Trigger: Incident/Event 

Plugin/Technology: Lambda Invoke

Lambda invocation plugin, with keystore and parameter configuration.

Explanation: This Lambda function takes a malicious IP address as input and adds a security group rule to block the IP address.
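The Lambda source is not shown in the post, so the following is only a sketch of the blocking step. Because AWS security groups carry allow rules only, this version swaps in a network ACL deny entry instead; that is a substitution, not the post's method. It assumes `ec2_client` is a `boto3.client("ec2")` passed in by the caller, and the ACL ID and rule number are hypothetical values.

```python
import ipaddress


def block_ip(ec2_client, network_acl_id: str, ip: str, rule_number: int) -> dict:
    """Add an inbound deny entry for a single IPv4 address to a network ACL.

    `ec2_client` is assumed to be boto3.client("ec2"); it is injected so the
    function can be exercised without AWS credentials.
    """
    address = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    entry = {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,      # lower numbers are evaluated first
        "Protocol": "-1",               # all protocols
        "RuleAction": "deny",
        "Egress": False,                # inbound traffic
        "CidrBlock": f"{address}/32",
    }
    ec2_client.create_network_acl_entry(**entry)
    return entry
```

Validating the address before calling AWS matters here, since the IP arrives from an event payload rather than from a trusted operator.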

Benefit: Improves security response times by automating incident handling, reducing the risk of breaches, and limiting potential damage from malicious activity.

6. Database Maintenance

Description: As an example of maintaining database health, PostgreSQL requires periodic vacuuming to clean up unnecessary data and reclaim storage. This helps keep database performance optimal.

Trigger: Human-initiated/Event-driven/Scheduled

Plugin/Technology: SQL Run Step plugin

SQL Query Run plugin. Allows reusable queries to be provisioned.

Explanation: This script performs a vacuum operation on a PostgreSQL database to optimize performance by reclaiming storage and cleaning up unnecessary data.
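As an illustration of what the reusable query might contain, this sketch builds the `VACUUM` statement in Python. The table name is a hypothetical parameter and the identifier check is a deliberately strict assumption. Note that PostgreSQL does not allow `VACUUM` inside a transaction block, so a driver connection would need autocommit enabled.

```python
import re
from typing import Optional

# Strict identifier check so a table name from an event cannot inject SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def vacuum_statement(table: Optional[str] = None, analyze: bool = True) -> str:
    """Build a VACUUM statement for one table, or the whole database."""
    options = "VERBOSE, ANALYZE" if analyze else "VERBOSE"
    if table is None:
        return f"VACUUM ({options});"
    if not _IDENT.match(table):
        raise ValueError(f"invalid table name: {table!r}")
    return f"VACUUM ({options}) {table};"
```

Adding `ANALYZE` refreshes planner statistics in the same pass, which is usually what a scheduled maintenance job wants.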

Benefit: Ensures optimal database performance and longevity by automating routine maintenance tasks, reducing manual effort, and preventing performance issues.

7. IaC Drift Remediation with Terraform

Description: Cloud-native environments require consistent configuration to ensure stability. This automation task helps apply corrective actions when configuration drifts from the desired state.

Trigger: Incident/Event-driven

Plugin/Technology: Terraform

Inline Terraform file, with application and approval

Explanation: This Terraform script defines an AWS EC2 instance. Any drift from this configuration can be corrected by re-applying the Terraform configuration.
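Drift detection and the approval gate can be sketched as a thin Python wrapper around the Terraform CLI. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when changes are pending. The `approved` flag is a hypothetical stand-in for whatever approval step the job uses.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to a status; unknown codes count as errors."""
    return PLAN_STATUS.get(exit_code, "error")


def detect_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report in_sync, drift, or error."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True)
    return classify_plan(result.returncode)


def remediate(workdir: str, approved: bool) -> bool:
    """Re-apply the configuration only when drift exists and was approved."""
    if detect_drift(workdir) != "drift" or not approved:
        return False
    subprocess.run(["terraform", "apply", "-auto-approve", "-input=false"],
                   cwd=workdir, check=True)
    return True
```

Treating unknown exit codes as errors keeps the job from applying changes after a failed or interrupted plan.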

Benefit: Maintains cloud infrastructure consistency, minimizing the risk of configuration drift, which can lead to unexpected outages or security vulnerabilities.

8. Automated Backup and Recovery

Description: Regular backups are critical for business continuity. Automated backups ensure that data is always recoverable.

Trigger: Scheduled

Plugin/Technology: Command step

Packaged AWSCLI Commands

Explanation: This script creates a daily snapshot of an RDS instance, ensuring that data can be recovered if needed. Credentials can use IAM or be passed securely from a key store.
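A minimal sketch of the packaged command, built in Python so the snapshot name can carry the date. The `<instance>-daily-<date>` naming scheme is an assumption; the `aws rds create-db-snapshot` command and its two options are standard AWS CLI.

```python
import subprocess
from datetime import date


def snapshot_command(instance_id: str, day: date) -> list[str]:
    """Build the AWS CLI call for a dated snapshot of one RDS instance."""
    snapshot_id = f"{instance_id}-daily-{day:%Y-%m-%d}"  # hypothetical scheme
    return ["aws", "rds", "create-db-snapshot",
            "--db-instance-identifier", instance_id,
            "--db-snapshot-identifier", snapshot_id]


def create_snapshot(instance_id: str) -> None:
    """Take today's snapshot; needs the AWS CLI and valid credentials."""
    subprocess.run(snapshot_command(instance_id, date.today()), check=True)
```

Because the date is part of the identifier, running the job twice on the same day fails on the second call instead of silently creating a duplicate snapshot.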

Benefit: Reduces the risk of data loss by ensuring regular backups, improving disaster recovery capabilities, and minimizing potential business disruption.

9. Resource Optimization and Cost Management

Description: Inefficient resources lead to unnecessary costs. Automated optimization helps cut costs.

Trigger: Scheduled/Event-driven/Human-initiated

Plugin/Technology: Python script

Inline Python scripts can be scheduled and executed

Explanation: This Python code stops EC2 instances that have been running for over 24 hours and are tagged for automatic stopping, optimizing resource use.
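The original script is not included in the post; the selection logic it describes might look like the sketch below. It assumes `ec2_client` is a `boto3.client("ec2")`, uses a hypothetical `AutoStop=true` tag, treats `LaunchTime` as the start of the running period, and skips pagination of `describe_instances` for brevity.

```python
from datetime import datetime, timedelta, timezone


def instances_to_stop(response: dict, now: datetime,
                      max_age: timedelta = timedelta(hours=24),
                      tag_key: str = "AutoStop",
                      tag_value: str = "true") -> list[str]:
    """Pick running, tagged instances launched more than `max_age` ago.

    `response` has the shape returned by EC2 describe_instances.
    """
    ids = []
    for reservation in response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            running = inst["State"]["Name"] == "running"
            old = now - inst["LaunchTime"] > max_age
            if running and old and tags.get(tag_key) == tag_value:
                ids.append(inst["InstanceId"])
    return ids


def stop_tagged_instances(ec2_client) -> list[str]:
    """Stop the selected instances; `ec2_client` is assumed to be boto3's."""
    now = datetime.now(timezone.utc)
    ids = instances_to_stop(ec2_client.describe_instances(), now)
    if ids:
        ec2_client.stop_instances(InstanceIds=ids)
    return ids
```

Requiring an explicit opt-in tag keeps the job from stopping long-running production instances that simply happen to be old.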

Benefit: Reduces cloud costs by automatically stopping unused resources, ensuring that unnecessary expenses are minimized and resource utilization is optimized.

10. Check SSL Certificate Expiry

Description: Ensuring SSL certificates are up-to-date is crucial to maintaining secure communication between users and services. This automation task checks the expiry date of an SSL certificate for a given URL and provides a warning if it is about to expire within a configured number of days.

Trigger: Scheduled/Event-driven/Human-initiated

Plugin/Technology: Bash Script plugin

SSL Checking Script and notification

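Explanation: This script checks the SSL certificate expiry date for a given URL. If the certificate is set to expire within the configured number of warning days, it prints a warning message.

Benefit: Ensures uninterrupted, secure communication and prevents service outages due to expired SSL certificates, safeguarding customer trust and service reliability.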
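The post uses a Bash script for this check; an equivalent sketch using only the Python standard library is shown below. It takes a hostname rather than a full URL, and the 30-day default for the warning window is an assumption.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and the notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def check_certificate(host: str, warning_days: int = 30) -> int:
    """Print a warning when the certificate is inside the warning window."""
    remaining = days_until_expiry(fetch_not_after(host),
                                  datetime.now(timezone.utc))
    if remaining <= warning_days:
        print(f"WARNING: certificate for {host} expires in {remaining} days")
    return remaining
```

A certificate that has already expired fails the TLS handshake in `fetch_not_after`, so the job should treat a connection error as an alert in its own right.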

11. Windows Server Restart Remediation

Description: Restarting a Windows server can be necessary to apply patches, resolve performance issues, or implement configuration changes. This automation task uses PowerShell to remotely restart a Windows server in an event-driven manner.

Trigger: Incident/Event-driven

Plugin/Technology: PowerShell script

PowerShell scripts can be executed locally or remotely via runner architecture

Explanation: This PowerShell script remotely restarts a Windows server specified by $ServerName. The -Force flag ensures the restart proceeds even if users are logged in, and -Wait blocks until the server has restarted, with a timeout of 300 seconds.
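When the restart is driven from an event payload, it helps to validate the server name before it reaches PowerShell. The sketch below builds the `Restart-Computer` invocation in Python; the hostname pattern and the `powershell` executable name are assumptions, while `-Force`, `-Wait`, and the 300-second timeout follow the description above.

```python
import re

# Conservative hostname check: letters, digits, dots, and hyphens only.
_HOST = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9.-]{0,253}[A-Za-z0-9])?$")


def windows_restart_command(server_name: str,
                            timeout_seconds: int = 300) -> list[str]:
    """Build a PowerShell call that restarts a remote Windows server."""
    if not _HOST.match(server_name):
        raise ValueError(f"invalid server name: {server_name!r}")
    script = (f"Restart-Computer -ComputerName '{server_name}' "
              f"-Force -Wait -Timeout {int(timeout_seconds)}")
    return ["powershell", "-NoProfile", "-NonInteractive", "-Command", script]
```

The resulting list can be handed to `subprocess.run(..., check=True)` on a Windows runner that has remoting access to the target server.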

Benefit: Enhances system availability by automating server restarts for patching or performance improvements, minimizing downtime and manual maintenance efforts.

Conclusion

Event-driven automation transforms how organizations manage their IT environments, enabling proactive and efficient remediation. By implementing these automation tasks, businesses can enhance their operational resilience, security, and cost-effectiveness, allowing teams to focus more on strategic initiatives.

PagerDuty Runbook Automation helps organizations standardize on a common approach for both existing and future state automation across cloud/hybrid and self-hosted platforms, with plugins for both contemporary and traditional architectures.

Automation Content Library

To make things easier for those just getting started, an automation content library is being launched at https://www.pagerduty.com/automation/.

The library enables multiple automation standardization approaches, including:

  • Bring your own code
  • Build from existing content
  • Automate GenAI job creation in Runbook Automation

About the Author
Justyn is a member of the Solution Consulting team at PagerDuty. Passionate about automation and infrastructure as code, Justyn helps PagerDuty customers streamline their operations, embrace modern technologies to achieve scalability and efficiency, and remove low-value tasks.