Systems Reliability Engineer - Multicloud

    • New York, NY
    • Flexible / Remote

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—tens of trillions of data points per day—providing always-on alerting, metrics visualization, logs, application tracing, synthetics and more for thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems.


The opportunity:

We’re looking for Systems Reliability Engineers to join our new Multicloud Systems Reliability Engineering team. Today Datadog runs across a few vendors in a handful of regions. As we move toward becoming the first-choice telemetry platform no matter where our customers run, we need to greatly expand the footprint of our infrastructure. Each cloud provider presents enough specific challenges that we are building a focused core reliability team for each one.

At Datadog, Systems Reliability Engineers are our systems-focused generalists, blending deep and practical knowledge of Linux, open source, cloud vendors, and system design. They are on the front line, maintaining and expanding the capabilities of our many and varied systems. They fill the gap between traditional systems administration and development, merging the capabilities of both disciplines to run reliable systems at massive scale.

One of the first region builds we will support is for U.S. FedRAMP Moderate customers. This requires candidates to be U.S. citizens or nationals, U.S. permanent residents (i.e., current Green Card holders), or individuals lawfully admitted into the U.S. as refugees or granted asylum.


Requirements:

  • Bachelor’s degree in Computer Science or related field, or relevant work experience
  • Experience as a software engineer
  • Experience working with AWS services (S3, DynamoDB, EC2)
  • Experience in 24x7 production environments
  • 2+ years Linux experience
  • 2+ years DevOps, reliability, technical support, operations, or development experience


Bonus points:

  • Strong Linux skills
  • In-depth Python/Go programming ability, with a focus on automation
  • Java/JVM operations experience
  • Experience managing large server/container fleets
  • 2+ years working in a software-as-a-service (SaaS) environment
  • Excellent problem-solving skills with strong attention to detail
  • Ability to dive deep into complex technical problems
  • Active TS/SCI security clearance
