Senior Site Reliability Engineer, DGX Cloud
This role focuses on operating and scaling high-performance DGX Cloud platforms for AI workloads across major cloud providers. Responsibilities include building and supporting large-scale Kubernetes clusters, defining and monitoring SLOs and error budgets, operating GPU workloads, improving observability, and leading incident response and root-cause analysis. The position emphasizes automation, reliability engineering best practices, and collaboration to ensure highly available, secure, and performant cloud services for enterprise and research customers.