Senior Site Reliability Engineer, Site Operations
The Site Operations team owns overall site health for Credit Karma’s suite of products, ensuring that members are always able to access our services. We accomplish our mission by acting as a force multiplier for engineers across Credit Karma who own production services and features. We do this by making operational risk visible to the org, leading incident management, and providing tools to facilitate both.
We are looking for a Senior Site Reliability Engineer with a strong operations and software engineering background to help drive overall site reliability to a world-class level. This is a high-visibility role at Credit Karma where you will be empowered to drive operational best practices across the Engineering org for production web applications at scale.
In Site Operations, minutes matter. Since we are always striving for the fastest time to repair during member-impacting incidents, we are looking for collaborators who enjoy quickly solving technical puzzles. You will be dedicated to ensuring the stability and resiliency of the Credit Karma experience, visited by over 80 million members.
What you’ll do:
- Partner inside the Platform Engineering org on developing and operating resilient platform services, from edge services to CI/CD and observability
- Partner with Product Engineering teams to improve operational best practices
- Build and maintain tooling for site operations and incident management for the Engineering org
- Lead incident management: coordinate investigation of production issues, drive post-incident follow-up, and ensure that lessons learned are implemented as improvements
- Coordinate with Platform and Product Engineering orgs to develop, deploy, evangelize, and enforce best practice processes
- Be an active leader in the Change Advisory Board
- Participate in an incident manager on-call rotation
What we expect:
- 5+ years of experience in production operations, ideally in a microservices environment
- 5+ years of experience building tooling (apps, scripts, monitoring, processes, documents) to support production operations
- Ability to quickly solve technical puzzles to reduce mean time to restoration
- Experience using core metrics tools (Splunk, Grafana, New Relic)
- Passion for peeling back the layers of a technology stack and solving reliability issues
- Ability and experience root-causing complex issues in distributed systems at scale
- Strong cross-functional communicator and collaborator who can lead technical triage in ad-hoc situations
What’s great about this opportunity:
- Be a significant contributor to building a strong availability and reliability framework for production operations at Credit Karma
- Opportunity to collaborate not just across the Engineering org but also with the business and marketing orgs to drive the availability and reliability mission end to end