Site Reliability Engineer
How do you keep a data-intensive, real-time service that monitors tens of thousands of servers up and running around the clock?
How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?
What should the infrastructure look like when Datadog monitors a million servers? If you think you have the answers, join us as a Site Reliability Engineer.
What you will do
- Keep our service reliable, available and fast as a member of the operations team.
- Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
Who you must be
- You have a BS/MS/PhD in a scientific field
- You have a track record as an engineer in the operations of a large site
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
- You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
- You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis
- You have submitted bug fixes to the aforementioned projects
- You are fully fluent in Python, Ruby and Go
Is this you? Tell us why, and apply now. Include links to your GitHub, Stack Overflow or other online projects.