The Core Resilience team is part of our SRE organization and is responsible for partnering with groups across Datadog to improve our technical and organizational resiliency. We steward the post-mortem and incident response processes across the company, constantly iterating and seeking improvements through the lessons we learn from production. We run training sessions for on-call and incident management and occasionally embed in product groups to ensure we remain aligned and can offer practical solutions to reliability problems.
- Blamelessness in our processes. Our primary goal in incident reviews is to learn from and adapt our mental models of how our systems run in production.
- A people-centered approach: ensuring that automation and systems support engineers doing work, not vice versa.
- An understanding that systems are inherently complex and failure is inevitable. What we can control is how resilient our systems and organization are when responding to these inevitable events.
- The idea that safety and risk are emergent properties in a socio-technical system and that they arise from a complex interaction of factors that constitute normal work. Resilience is a dynamic process of steering rather than a static quality.
At Datadog, we value our office culture: the relationships it builds, the creativity it brings to the table, and the collaboration that comes from being together. We operate as a hybrid workplace so that our employees can create the work-life harmony that best fits them.
What You’ll Do:
- Help run the post-mortem process for the company, partner with teams on writing post-mortems, and identify and implement opportunities to reduce friction and maximize learning value for the organization.
- Define how we respond to incidents as a company and write software to streamline that process, partnering with our product teams where necessary. Our goal is to support our incident responders as much as possible to deal with complexity.
- Train our on-callers in our incident and post-mortem processes. This involves both introducing newcomers to on-call responsibilities and refreshing the knowledge of existing engineers.
- Perform cross-functional engagements with different teams across the organization, embedding in their group for a few weeks in order to either learn about how work is performed or to solve a specific reliability problem.
- Facilitate incident reviews in a way that emphasizes learning and blamelessness.
- Write reliability bulletins, blog posts, and other forms of documentation that identify systemic risks to the company, provide actionable remediations, and promote best reliability practices.
Who You Are:
Somebody who has experience with or is interested in the following:
- Writing software that solves real user problems, as well as reviewing others’ code in an empathetic and collaborative way. We mainly use Go and Python.
- Analyzing incidents, identifying broader risk patterns, and sharing your findings in an engaging way that other people can understand and learn from.
- Responding to incidents as an incident commander or responder (preferably high-impact incidents), and iteratively improving incident response processes.
- Teaching and training other engineers on best practices.
- Working with Kubernetes and distributed systems, and understanding their potential failure scenarios.
Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you’re passionate about technology and want to grow your skills, we encourage you to apply.
Benefits and Growth:
- New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
- Continuous professional development, product training, and career pathing
- Intradepartmental mentor and buddy program for in-house networking
- An inclusive company culture, with the ability to join our Community Guilds (Datadog employee resource groups)
- Access to Inclusion Talks, our internal panel discussions
- Free, global mental health benefits for employees and dependents age 6+
- Competitive global benefits
Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog.
Equal Opportunity at Datadog:
Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Any information you submit to Datadog as part of your application will be processed in accordance with Datadog’s Applicant and Candidate Privacy Notice.