Team Lead, Engineering - Site Reliability

3+ months ago · New York, NY

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—trillions of data points per day—allowing for seamless collaboration and problem-solving among Dev, Ops and Security teams globally for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.


The Team: 

The Site Reliability teams at Datadog are responsible for ensuring that our high-volume, low-latency environments continue to perform around the clock. These teams collaborate closely with our product engineers so that Datadog can monitor millions of servers and containers and our customers always have dependable and actionable data at their fingertips. You’ll be responsible for shaping the infrastructure of our data-intensive, real-time services as we continue to grow at petabyte scale.


The Opportunity:

As a Team Lead, you’ll be responsible for the people management of a small group of dedicated, pragmatic engineers working to build reliable systems at petabyte scale. Team Leads are our first layer of management at Datadog: they are both technical leaders and people managers.



We are a globally distributed team with US offices in New York (HQ), Boston, and Denver and international offices in Paris, Dublin, London, Madrid, the Netherlands, and Singapore. About 33% of our engineering team is remote.

Datadog values people from all walks of life. We understand that not everyone will meet these requirements on day one. If you’re passionate about reliability engineering and want to grow these skills but don’t meet all of these qualifications, we encourage you to apply.


You Will:

  • Manage a team of 3-6 engineers.
  • Guide projects to create reliable and resilient systems that can scale up by orders of magnitude as our company continues to grow.
  • Define and set priorities for your team, unblocking your direct reports when needed.
  • Build tools and production frameworks to make our engineering team’s lives easier.
  • Respond to, investigate, and fix issues, whether they’re deep in the database code or in the client application.


You Are:

  • You have 5+ years of experience in software engineering and 1+ year of management or technical leadership
  • You have architected, built, and operated distributed systems to solve problems at high scale
  • You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
  • You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
  • You have production experience with distributed compute/storage tools; we use Kubernetes, Cassandra, Postgres, Kafka, Elasticsearch, and Redis


Bonus Points:

  • You’ve worked at a company with large-scale systems that handle large amounts of data
  • You have created tooling for, or submitted contributions to, an open-source datastore
  • You are fluent in Python or Golang


This is a remote position.


Equal Opportunity at Datadog:

Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.


Your Privacy:

Any information you submit to Datadog as part of your application will be processed in accordance with Datadog’s Applicant and Candidate Privacy Notice.

Job ID: 1826086