Booking.com

Site Reliability Engineer - Core Infrastructure - Data Infrastructure

3+ months ago - Amsterdam, Netherlands

At Booking.com, our mission is to make it easier for everyone to experience the world. And while that world might feel a little farther away right now, we're busy preparing for when the world is ready to travel once more. With strategic long-term investments into what we believe the future of travel can be, we are opening career opportunities that will have a strong impact on our mission.

Core Infrastructure

Do you want to build software that impacts millions of customers around the world, tackling some of the world's most complex e-commerce challenges? We are looking for talented infrastructure backend developers to join our Core Infrastructure department in our Amsterdam HQ.

In Core Infrastructure we design, build and operate all the technology that our Booking.com product development teams need in order to deliver great travel products to our customers.

This includes, for instance, our on-premises data centers, our cloud-hosted Kubernetes clusters, MySQL/Cassandra/Elasticsearch database environment, HAProxy load balancers, Envoy service mesh, Apigee gateway, Kafka streaming service, Hadoop big data storage, Graphite time series, Grafana dashboard platform, monitoring & alerting tools, CI/CD tooling, Perl/Java/Node.js language frameworks and more...

Application Data Services

In Application Data Services we operate a fleet of thousands of database instances in hundreds of replication hierarchies, some with hundreds of members, some with sizes of up to a hundred terabytes, and some with a transaction rate that pushes the boundaries of what the hardware can do. We also take care of developer needs, providing services to automate grant management, data ownership, online schema management, and monitoring and alerting. We use Python and Go, CI/CD in GitLab, Puppet, and some bits and pieces in other languages and systems.

Of course, this is only possible because the provisioning, maintenance and operation of these servers are automated. Maintaining, improving and refactoring this automation is what the SRE job is about: we code our way out of problems where operations are concerned, addressing availability, scalability, latency, and efficiency challenges within the vast infrastructure here at Booking.

  • You will impact millions of people all over the globe with your creative solutions
  • You will be working in one of the biggest e-commerce companies in the world
  • You will solve interesting problems at scale by writing and deploying code across tens of thousands of servers
  • You will have the opportunity to collaborate with many of the world's leading SREs
  • You will be free to launch your own ideas and solutions within our complex production environment

Our automation is written in Python and Go, and interfaces with a number of systems, among them our Puppetry, OpenStack, Kubernetes, PowerDNS, Graphite, Prometheus, ZooKeeper, and many more.

B.RESPONSIBLE

Important aspects of the job include:
  • It's MySQL, thousands of instances in hundreds of replication hierarchies, some of them seeing substantial load, the foundation of our Application Data Infrastructure.
  • It's automated. But as our systems are evolving, this automation needs improvement, extension and refactoring to meet the changing requirements of a different environment.
  • It's Python and Go. And being at the center of most, if not all, applications, it is literally talking to everything else.
  • It's moving to all the platforms, including OpenStack, Kubernetes and the public cloud.
  • It's dynamic. With automated capacity testing, restore testing, failover testing and disaster recovery testing, it needs to be able to adapt to planned and unplanned changes in the production conditions and environments.
  • Sometimes it has problems. Sometimes our customers make problems. Good monitoring and alerting are required to be aware of problems as they develop, or ideally before they develop.
  • It's in multiple data centers, both our own and in the public cloud. Replication and communication over long distances pose their own scaling and performance problems.

As an SRE in the Data Infrastructure team, you will be responsible for planning, building, improving and refactoring solutions that solve these problems. You will also share the on-call rotation and be an escalation contact for incidents. You will be working in close collaboration with multi-functional teams in Core Infrastructure and in the Application Teams.

B.SKILLED

What will you bring to the role?
  • A solid understanding of MySQL operations, scalability and performance, with a focus on replication, large-scale environments and InnoDB.
  • The will and skill to completely automate the database lifecycle at scale.
  • Experience building and maintaining a complex database environment at scale, in a distributed setting with a variety of underlying technologies.
  • Solid understanding of Linux administration and networking as a foundation of operating applications in a Linux environment.
  • The ability to define the metrics that capture the health and success of the production environment you build.
  • Creative approach to problem-solving
  • Fluency in English, both spoken and written
  • Additional experience in networking, security or storage is an advantage

B.OFFERED

We are a performance-based company that offers career advancement and lucrative compensation, including bonuses and stock potential. We also offer what we call the "Booking Deal" with other competitive perks and benefits. The Technology department holds monthly hackathons, offers training, and attends and speaks at global conferences.

This position is open to candidates worldwide. In the case of relocation, we will assist you with a generous relocation package, ensuring a smooth transition to working and living in Amsterdam. We have successfully relocated 300+ Technology professionals to Amsterdam in the last year!

Job ID: booking-BOOKUS2948535EXTERNAL