Site Reliability Engineer – Compute Operations
As a Site Reliability Engineer focused on Compute Operations at IBM, you will ensure the reliability, availability, and performance of large-scale compute infrastructure. You will design and implement automation to reduce manual toil, manage infrastructure as code, and build monitoring and alerting systems for proactive issue detection. The role involves troubleshooting complex production issues, performing capacity planning, and driving continuous improvements in system uptime and operational efficiency. You will collaborate with development and operations teams to define SLOs and SLIs and to implement best practices for incident management, change management, and disaster recovery across cloud and on-premises compute environments.