Software Engineering

Senior Site Reliability Engineer, DGX Cloud ⚡ Urgent

NVIDIA | Remote, India
Full Time | 8–12 years experience | 🏠 Remote
About the Role

This role focuses on operating and scaling high-performance DGX Cloud platforms for AI workloads across major cloud providers. Responsibilities include building and supporting large-scale Kubernetes clusters, defining and monitoring SLOs and error budgets, operating GPU workloads, improving observability, and leading incident response and root-cause analysis. The position emphasizes automation, reliability engineering best practices, and collaboration to ensure highly available, secure, and performant cloud services for enterprise and research customers.

Similar Jobs You Might Like

eDiscovery System Administrator

NVIDIA

Bengaluru, India
eDiscovery Platforms | System Administration | Data Privacy | Legal Technology | Automation

The eDiscovery System Administrator role supports global legal and compliance initiatives by managing and operating enterprise eDiscovery platforms an...

IT & Infrastructure | Full Time | 5-9 years experience