Senior Site Reliability Engineer
- Bengaluru, India
Changing the world through digital experiences is what Adobe's all about. We give everyone, from emerging artists to global brands, everything they need to design and deliver exceptional digital experiences! We're passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen.
We're on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity. We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!
Are you comfortable with dev, comfortable with ops, and looking for a job that doesn't have "DevOps" in the title?
Do you have an intimate understanding of the operational challenges of running services at scale, and are you also committed to overcoming those challenges with software instead of manpower?
Adobe needs a Site Reliability Engineer (SRE) who knows how to balance going fast and going big with operating safely. Our mission is to progress, protect, and provide for the software and systems behind all of Marketo: An Adobe Company, with an ever-watchful eye on system availability, latency, performance, and capacity. SRE is an engineering mindset that focuses on building highly reliable systems and eliminating toil through automation.
We hire people from both systems and software backgrounds. Strong candidates will have experience with both. The engineer role within SRE is at the heart of fulfilling SRE's mission: build a highly reliable, scalable, and measurable customer experience for the continued growth of Marketo's infrastructure. We run in both multi-cloud (Azure/AWS/GCP) and on-premise environments. We are looking for someone who is ambitious, has a passion for quality, and wants to help critical services succeed without compromising security.
Our SRE and Engineering teams are distributed, split between Denver, Colorado; San Mateo, California; and Bucharest, Romania. We rely heavily on tools like Slack, JIRA and video conferencing to collaborate. Flexibility to join meetings with colleagues around the world is expected. The successful candidate must be able to prioritize tasks and work independently.
What you'll do
Proactively engage with product and engineering to drive and improve the whole lifecycle of operational readiness, from inception and design through deployment, operation, and refinement.
Write software layers, scripts, deployment frameworks, tracers, monitors, and self-healing/auto-remediation tools, and automate processes.
Build and maintain software modules for use and re-use in cloud and on-premise systems automation.
Maintain business continuity by identifying and driving opportunities to make systems highly resilient and human-free.
Work closely with the software engineering team to ensure accurate monitoring and metrics are built into applications before they go to production.
Maintain up-to-date documentation on deployments, processes, and standard operating procedures/runbooks, with the goal of minimizing runbooks through automation.
When complex issues arise despite the self-healing and automation you have built, get involved in troubleshooting and root-cause analysis across the stack: hardware, software, database, network, and so on.
Participate in a shared on-call schedule (follow-the-sun model) managed across SRE and Engineering.
Be an evangelist and promote a lean-ops culture by applying self-service, self-healing, and automation.
Work with the product management team to define SLAs and SLOs and implement SLIs for core capabilities.
Improve observability of software by implementing the right monitoring, tracing, and logging.
What you need to succeed
At least 7 years of experience designing for and dealing with a large production environment.
A Bachelor's or Master's degree in computer science engineering or a related field.
Experience developing, running, and/or consuming cloud technologies such as AWS, Azure, and Google Cloud Platform, and related tooling: Terraform, configuration management, etc.
Recent large-scale experience developing, running, and/or consuming on-premise platforms and related tooling: VMware, Ansible, Chef or Puppet, configuration management, etc.
Programming (Python and Bash are our preferred scripting/shell languages) and automation skills.
Troubleshooting and system engineering exposure in Linux production environments. Experience with Linux, internet protocols, and large-scale operations.
Experience with CI/CD tooling: Jenkins, Spinnaker, GitLab runners, Azure DevOps, etc.
Experience designing, deploying, and maintaining monitoring solutions such as Splunk, Prometheus, Checkmk, etc.
Familiarity with the AWS/Azure Well-Architected Frameworks and practical experience applying resiliency and reliability patterns such as Circuit Breaker, Bulkhead, etc.
Great communication, interpersonal, and teamwork skills.
Ability to work independently and own problem statements end-to-end.
Experience with relational databases such as MySQL, Postgres, and document stores such as MongoDB.
Experience deploying applications in containers using Docker and Kubernetes.
Strong intuition about system design, robustness, and scalability.
Decent experience with Windows.
At Adobe, you will be immersed in an exceptional work environment that is recognized throughout the world on Best Companies lists. You will also be surrounded by colleagues who are committed to helping each other grow through our unique Check-In approach where ongoing feedback flows freely.
If you're looking to make an impact, Adobe's the place for you. Discover what our employees are saying about their career experiences on the Adobe Life blog and explore the meaningful benefits we offer.
Adobe is an equal opportunity employer. We welcome and encourage diversity in the workplace regardless of race, gender, religion, age, sexual orientation, gender identity, disability or veteran status.