Site Reliability Engineer - Observability

Outbrain's recommendation system is powered by large-scale infrastructure that spans both cloud infrastructure and bare-metal servers managed in our own data centers.
Outbrain's server fleet consists of over 6,000 physical servers and produces over 200 billion personalised recommendations each month, reaching over 550M unique users.

At Outbrain, Observability is more than just metrics and logs. It's an integral part of the business, providing tools used by engineers, SREs, account managers, data scientists and upper management. We develop and innovate on a massive Prometheus environment (80 million metrics per minute), multiple ELK stacks handling 23 million docs per minute, in-code metrics and logging libraries, real-time log querying, and long-term storage of metrics.

Our team is made up of individuals who are resourceful, bright, proactive, and who work well both independently and as a team.
Outbrain is seeking a motivated, observability-minded Senior Site Reliability Engineer, ready to make a difference, to join our Platform group. We work in a challenging environment, leveraging state-of-the-art open source tools and developing our own. We are big believers in automation, data-driven development, and extreme visibility.

We are looking for an SRE to join our existing SRE team and help us develop more automated and robust solutions to some of the industry's biggest observability problems, such as event correlation, debuggability, time series and log collection, and graphing.

Your Impact:
  • You'll help to innovate and support a massive metrics and log data flow
  • You'll build CLI tools that developers will use to interact with our systems
  • You'll introduce new technologies to address age-old problems like latency tracing, event correlation and debuggability
  • You'll update a critical alerting system

Tech we use (and you will too):

  • Prometheus
  • Elasticsearch
  • Logstash
  • Kibana
  • Grafana
  • Kubernetes
  • Docker
  • Kafka
  • Consul
  • Chef
  • Jenkins
  • MySQL
  • Ruby
  • Go

Basic Qualifications:
  • You have a strong understanding of Linux (we use Ubuntu)
  • You have an understanding of and a passion for Site Reliability Engineering and Observability
  • You are proficient with Go, Ruby and/or Bash
  • You are familiar with Git, version control, Jenkins, Chef and other "devops" tools
  • You are familiar with popular message queues like Kafka
  • You are able to architect scalable and robust solutions to complex problems by pragmatically blending system design, open source software, and software engineering
  • You want to learn ... a lot

Preferred Qualifications:
  • You have practical experience with Prometheus
  • You have good knowledge of Elasticsearch
  • You have good knowledge of Consul and Kafka
  • You have strong knowledge of Kubernetes and Docker
  • You can build complex infrastructure automation
  • You can write complex tools using Golang, Ruby, and Bash

About Outbrain
Outbrain is the brains, and the technology, behind the recommendations seen throughout online articles, helping people digitally discover the content most interesting to them. Headquartered in New York City, with a global presence across 16 offices, Outbrain was founded to narrate the web. Top publishers, including CNN, Fox News, and ESPN, use Outbrain's technology to redefine their digital landscapes, while top marketers use the platform to interact with untapped audiences. Outbrain currently serves more than 308 billion monthly recommendations to readers, and is paving the way in native advertising innovations.

The brainy culture that ignites the halls of Outbrain has even been industry-recognized and awarded 'Best Tech Work Culture' in 2018 by Tech In Motion.

Interested in discovering what's next for your career at Outbrain?
