Principal Cloud Site Reliability Engineer - United Kingdom

Today • Flexible / Remote

We are seeking a Principal Cloud Site Reliability Engineer with strong Incident Management, Kubernetes, and Terraform expertise to ensure the reliability, scalability, and operational excellence of our production platforms.

The ideal candidate will combine software engineering, infrastructure automation, and disciplined operations to maintain highly available systems while leading and coordinating the response to critical production incidents.

This role requires someone comfortable operating in high-availability cloud environments, managing large-scale distributed systems, and driving incident response, post-incident analysis, and reliability improvements.

In this role you will...

Site Reliability Engineering

Maintain and improve system reliability, scalability, and performance for production environments.
Implement Infrastructure as Code (IaC) using Terraform to manage and automate cloud infrastructure.
Design, deploy, and operate Kubernetes clusters and containerized workloads.
Build and maintain observability frameworks including monitoring, logging, and alerting.
Automate operational tasks to reduce manual intervention and improve system resilience.

Incident Management

Lead and coordinate Major Incident Management (MIM) during production outages.
Act as Incident Commander or technical lead during high-severity incidents.
Facilitate incident triage, mitigation, communication, and resolution across engineering teams.
Drive Root Cause Analysis (RCA) and ensure corrective and preventive actions are implemented.
Develop and improve runbooks, playbooks, and operational procedures.

Platform & Cloud Operations

Manage cloud infrastructure on platforms such as AWS, Azure, or GCP.
Optimize cluster performance, scaling, and availability in Kubernetes environments.
Implement high availability and disaster recovery strategies.
Support CI/CD pipelines and deployment automation.

Reliability & Engineering Excellence

Define and monitor SLIs, SLOs, and error budgets.
Implement proactive reliability improvements and capacity planning.
Collaborate with development teams to improve application resilience and observability.
Advocate for DevOps and SRE best practices across engineering teams.

You've got what it takes if you have...

5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure.
Strong experience with Terraform (Infrastructure as Code).
Hands-on experience with Kubernetes (EKS, AKS, GKE, or self-managed clusters).
Experience with Major Incident Management and production incident response.
Strong knowledge of Linux systems and networking fundamentals.
Experience with cloud platforms (AWS preferred).
Familiarity with monitoring tools such as Prometheus, Grafana, Datadog, or ELK.
Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or similar.
Strong scripting skills in Python, Bash, or Go.

Preferred Qualifications

Experience managing large-scale distributed systems in production.
Experience implementing chaos engineering or resilience testing.
Knowledge of security best practices in cloud-native environments.


Client-provided location(s): Flexible / Remote

Job ID: CornerstoneOnDemand-req11148

Employment Type: OTHER

Posted: 2026-04-11T00:23:51

Perks and Benefits

Health and Wellness
- Health Insurance
- Health Reimbursement Account
- Dental Insurance
- Vision Insurance
- Life Insurance
- Short-Term Disability
- Long-Term Disability
- FSA
- HSA
- HSA With Employer Contribution
- Pet Insurance
- Mental Health Benefits
Parental Benefits
- Birth Parent or Maternity Leave
- Non-Birth Parent or Paternity Leave
- Fertility Benefits
- Family Support Resources
- Adoption Leave
Work Flexibility
- Flexible Work Hours
- Remote Work Opportunities
- Hybrid Work Opportunities
Office Life and Perks
- Casual Dress
- Snacks
- Company Outings
- On-Site Cafeteria
- Holiday Events
Vacation and Time Off
- Paid Vacation
- Unlimited Paid Time Off
- Paid Holidays
- Personal/Sick Days
- Leave of Absence
- Summer Fridays
Financial and Retirement
- 401(K) With Company Matching
- Stock Purchase Program
- Performance Bonus
- Relocation Assistance
- Financial Counseling
- Profit Sharing
Professional Development
- Tuition Reimbursement
- Promote From Within
- Work Visa Sponsorship
- Leadership Training Program
- Internship Program
- Shadowing Opportunities
- Access to Online Courses
Diversity and Inclusion
- Employee Resource Groups (ERG)
- Unconscious Bias Training
- Diversity, Equity, and Inclusion Program
