Senior Cloud Site Reliability, AI/ML Infrastructure

2 days ago • Sunnyvale, CA

Company Description

It started with a simple idea: what if surgery could be less invasive and recovery less painful? Nearly 30 years later, that question still fuels everything we do at Intuitive. As a global leader in robotic-assisted surgery and minimally invasive care, our technologies—like the da Vinci surgical system and Ion—have transformed how care is delivered for millions of patients worldwide.

We’re a team of engineers, clinicians, and innovators united by one purpose: to make surgery smarter, safer, and more human. Every day, our work helps care teams perform with greater precision and patients recover faster, improving outcomes around the world.

The problems we solve demand creativity, rigor, and collaboration. The work is challenging, but deeply meaningful—because every improvement we make has the potential to change a life.

If you’re ready to contribute to something bigger than yourself and help transform the future of healthcare, you’ll find your purpose here.

Job Description

We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. This role will focus on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems, using platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and networking optimization.

Key Responsibilities

Contribute to the deployment and maintenance of infrastructure across AWS, GCP, and Azure, as well as on-prem NVIDIA DGX systems.
Implement and manage Infrastructure as Code (IaC) using Terraform and Ansible for automated provisioning and configuration.
Support cloud and on-prem networking solutions for secure, high-performance connectivity.
Manage and optimize Domino Data Lab workflows and Slurm clusters for distributed training and inference.
Integrate and support NVIDIA Base Command for GPU-based compute environments.
Develop automation scripts and tools in Python to streamline operations and improve reliability.
Support CI/CD pipelines using GitLab, ensuring smooth deployments to UAT and production environments.
Implement and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, and cloud-native services.
Deploy and manage Kubernetes clusters (EKS, GKE) for scalable containerized workloads.
Troubleshoot complex workflows and ensure high availability of critical systems.
Collaborate with data science and engineering teams to optimize resource utilization and workflow efficiency.
Drive best practices for incident response, capacity planning, and system reliability in multi-cloud and HPC environments.

Additional Responsibilities

Administer and optimize ITSM platforms (e.g., Jira Service Management, ServiceNow) for release/change/incident workflows.
Support tooling across CI/CD, monitoring, and ticketing systems to ensure traceability and automation.
Maintain documentation and evidence for audits related to release/change/incident processes.
Partner with Compliance and InfoSec teams to ensure controls meet HIPAA, HITRUST, FDA GxP, and ISO 27001 standards.
Act as the primary liaison between engineering, product, support, and compliance teams for operational readiness.
Facilitate regular status updates, incident reviews, RCAs, and change planning sessions with stakeholders.
Help update onboarding materials and training sessions for engineers and product managers on release/change/incident protocols.
Promote a culture of ownership and reliability through education and process transparency.
Support retrospectives for major releases and incidents to identify process gaps and improvement opportunities.
Track and report on KPIs such as change success rate, incident recurrence, and release velocity.
Identify operational risks and escalate proactively to leadership.
Maintain escalation matrices and ensure readiness for high-severity incidents.

Qualifications

Required Qualifications

5+ years of experience in Site Reliability Engineering or Cloud Infrastructure Engineering.
Strong proficiency in AWS and GCP; working knowledge of Azure.
Expertise in Terraform, Ansible, and IaC principles.
Solid understanding of networking fundamentals, VPC design, and security best practices.
Hands-on experience managing AI/ML workloads, including Domino Data Lab, Slurm, and GPU-based environments.
Advanced scripting and automation skills in Python.
Experience with CI/CD pipelines and release management using GitLab.
Strong troubleshooting skills and experience with observability tools (Prometheus, Grafana, ELK).
Hands-on experience with Kubernetes in AWS (EKS) and GCP (GKE).
Proficiency with NFS and NetApp Data ONTAP.
Strong Linux systems knowledge, including familiarity with file systems, kernel internals, cgroups, and environment variables.
Experience using debugging tools to analyze and troubleshoot complex systems.
Excellent communication and collaboration skills in cross-functional environments.

Preferred Qualifications

Familiarity with NVIDIA Base Command and GPU orchestration.
Knowledge of container tooling beyond Kubernetes (Docker, Helm).
Understanding of data security and compliance for AI/ML workloads.
Exposure to MLOps best practices and ML lifecycle management.

Minimum Education and Experience requirements

Education: Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field required. Master’s degree or certifications in ITIL, DevOps, or regulatory compliance preferred.
Experience: 7+ years in technical operations, SRE, or IT service management roles. Proven experience supporting release cycles, change governance, and incident response in regulated environments (e.g., healthcare, life sciences, financial services).

Additional Information

Due to the nature of our business and the role, please note that Intuitive and/or your customer(s) may require that you show current proof of vaccination against certain diseases including COVID-19. Details can vary by role.

Intuitive is an Equal Opportunity Employer. We provide equal employment opportunities to all qualified applicants and employees, and prohibit discrimination and harassment of any type, without regard to race, sex, pregnancy, sexual orientation, gender identity, national origin, color, age, religion, protected veteran or disability status, genetic information or any other status protected under federal, state, or local applicable laws.

Mandatory Notices

U.S. Export Controls Disclaimer: In accordance with the U.S. Export Administration Regulations (15 CFR §743.13(b)), some roles at Intuitive Surgical may be subject to U.S. export controls for prospective employees who are nationals from countries currently on embargo or sanctions status.

Certain information you provide as part of the application will be used for purposes of determining whether Intuitive Surgical will need to (i) obtain an export license from the U.S. Government on your behalf (note: the government’s licensing process can take 3 to 6+ months) or (ii) implement a Technology Control Plan (“TCP”) (note: typically adds 2 weeks to the hiring process).

For any Intuitive role subject to export controls, final offers are contingent upon obtaining an approved export license and/or an executed TCP prior to the prospective employee’s start date, which may or may not be flexible, and within a timeframe that does not unreasonably impede the hiring need. If applicable, candidates will be notified and instructed on any requirements for these purposes.

We will consider for employment qualified applicants with arrest and conviction records in accordance with fair chance laws.

Preference will be given to qualified candidates who do not reside, or plan to reside, in Alabama, Arkansas, Delaware, Florida, Indiana, Iowa, Louisiana, Maryland, Mississippi, Missouri, Oklahoma, Pennsylvania, South Carolina, or Tennessee.

We provide market-competitive compensation packages, inclusive of base pay, incentives, benefits, and equity. It would not be typical for someone to be hired at the top end of range for the role, as actual pay will be determined based on several factors, including experience, skills, and qualifications. The target compensation ranges are listed.

Client-provided location(s): Sunnyvale, CA

Job ID: dc7b58a9-0e4d-444f-896c-7a514aa78f29

Employment Type: OTHER

Posted: 2025-12-12T23:06:18

Perks and Benefits

Health and Wellness
- Health Insurance
- Health Reimbursement Account
- Dental Insurance
- Vision Insurance
- Life Insurance
- FSA
- HSA
- Mental Health Benefits
Parental Benefits
- Birth Parent or Maternity Leave
- Non-Birth Parent or Paternity Leave
- Family Support Resources
Work Flexibility
- Flexible Work Hours
- Remote Work Opportunities
- Hybrid Work Opportunities
Office Life and Perks
- Casual Dress
- Company Outings
- On-Site Cafeteria
Vacation and Time Off
- Paid Vacation
- Paid Holidays
- Personal/Sick Days
- Leave of Absence
Financial and Retirement
- 401(K) With Company Matching
- Company Equity
- Stock Purchase Program
Professional Development
- Internship Program
- Leadership Training Program
- Tuition Reimbursement
- Promote From Within
- Lunch and Learns
Diversity and Inclusion
- Employee Resource Groups (ERG)
- Diversity, Equity, and Inclusion Program
