Site Reliability Engineer
At Lyft, our mission is to improve people’s lives with the world’s best transportation. To do this, we start with our own community by creating an open, inclusive, and diverse organization.
Passengers rely on Lyft to get to work, to go to the doctor, or to get home safely when public transit has stopped running. Drivers use Lyft for income and flexibility. Building a stable and reliable application for our passengers and drivers is a responsibility we take very seriously, and we are building out a team of Software Engineers focused on reliability, to deliver a consistent and highly reliable user experience.
Every engineering team at Lyft is responsible for running and operating the software that they build. Reliability Engineers work toward standardizing and supporting all of the rapidly growing teams throughout our organization by assessing their architecture, helping them design scalable services, and fostering excellent operational practices. It's a mission-critical role: ensuring that our systems are always healthy, monitored, automated, and designed to scale.
What makes Reliability Engineering different at Lyft?
- It is engineering! We resolve problems with a mindset of ensuring they don't happen again. We are looking to automate ourselves out of our jobs.
- Our day-to-day is driven by helping our product teams create robust software faster.
- We don't sit on the other side of a fence that work gets tossed over. We are first-class engineering citizens, embedded in specific development teams where we drive engineering improvements from the bottom up.
Examples of Reliability Engineering projects:
- We automated Kafka topics management by building a declarative service that prevents abuse before capacity changes are shipped.
- We built a rate limiting system for our Wavefront proxy.
- We rolled out Kubernetes as a core component of Lyft infrastructure.
- We built Horizon, a cubism-inspired system to visualize faults across our various services.
- We revamped our incident management process and tools. This created a safe culture to understand outages and focus on preventing future ones.
Responsibilities:
- Define roadmap and architecture based on technology and business needs.
- Build holistic visibility into SLIs, SLOs, SLAs, dependency graphs, and the past performance of software, networks, and systems, so that we can continue to scale without increasing operational burden or toil.
- Share your knowledge by giving brown bags, tech talks, and evangelizing appropriate tech and engineering best practices.
- Build infrastructure and drive projects that deliberately break things in order to improve the robustness of production systems.
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform.
- Step back to observe patterns and develop innovative tools and automation to minimize toil. Use those learnings to drive the best operational practices.
- Partner with the broader Lyft organization to build a culture of rigorously learning from incidents.
- Unblock, support, and effectively communicate across teams to achieve results.
Experience and skills:
- 2+ years of software engineering experience
- Experience with high level programming languages (Python, Go, Java, etc.)
- Experience designing, debugging and running fault tolerant large-scale distributed systems
- Experience working with public cloud platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure, etc.)
- Strong troubleshooting and debugging skills
- Experience bringing software to production at high scale
- Strong cross-team collaboration skills
- Good communication skills
The nature of the work is interdisciplinary, and our teammates come from varying backgrounds (e.g., Site Reliability Engineer (SRE), Systems Engineer, Software Engineer, DevOps Engineer, Infrastructure Engineer, Production Engineer). We urge you to apply even if you feel uncertain that you have the exact background.
Benefits:
- Great medical, dental, and vision insurance options.
- In addition to 11 observed holidays, salaried team members have unlimited paid time off; hourly team members have 15 days of paid time off.
- 401(k) plan to help save for your future.
- 18 weeks of paid parental leave. Biological, adoptive, and foster parents are all eligible.
- Pre-tax commuter benefits
- Lyft Pink - Lyft team members get an exclusive opportunity to test new benefits of our Ridership Program
Lyft is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Lyft does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender-identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Lyft also strives for a healthy and safe workplace and strictly prohibits harassment of any kind. Pursuant to the San Francisco Fair Chance Ordinance and other similar state laws and local ordinances, and its internal policy, Lyft will also consider for employment qualified applicants with arrest and conviction records.