Site Reliability Engineer
- Vancouver, Canada
Who We Are
Take-Two develops and publishes some of the world's biggest games. Our Rockstar label creates Grand Theft Auto and Red Dead Redemption, two of the most critically acclaimed gaming franchises in history. Our 2K label creates games like NBA 2K, WWE 2K, BioShock, Borderlands, Evolve, XCOM and the beloved Sid Meier's Civilization. Our Private Division label publishes Kerbal Space Program and The Outer Worlds, and will publish upcoming titles with Obsidian Entertainment, Panache Digital Games and more.
Take-Two Direct to Consumer
The Direct to Consumer (D2C) team is a (well-funded) startup within Take-Two. We have offices in San Francisco and Vancouver and have built a culture that enables remote work. We're building a commerce and distribution platform for our game labels, partnering directly with our studios to bring value company-wide. Our team is small and agile: we release to our users quickly and iterate constantly to raise our product's quality. We seek regular feedback from our users and labels to make sure we are meeting and exceeding expectations. We believe in giving our studios the flexibility they need to create the world's greatest games, so we plan to offer a variety of interfaces using modern technology and best practices. Our success is measured by our impact on gamers and developers, not presentations or promises.
The Role Defined:
A Site Reliability Engineer (SRE) on the D2C team supports our infrastructure, monitoring, and tooling needs. The role calls for proven systems and analytical skills: working alongside a group of top-notch engineers, you will help build and maintain a production environment that serves gamers and game development studios worldwide.
As a member of the D2C SRE team, you will work directly with engineers, architects, operations, and the Take-Two SRE team to ensure highly performant, highly available services across a broad range of technologies and products.
- Develop and automate highly scalable infrastructure in the cloud using modern infrastructure-as-code principles.
- Build in performance and operational monitoring to ensure scalability and allow swift diagnosis and resolution of service degradation or disruption.
- Diagnose and resolve technical issues from both internal and external customers.
- Develop tooling to automate and simplify common tasks such as building and deploying applications, and assist with integration into CI/CD pipelines.
- Document processes and procedures relating to the deployment, monitoring, and administration of D2C infrastructure and applications.
- Participate in a rotating on-call team to triage, diagnose, and resolve live service issues.
- Collaborate closely with fellow engineers and team members, and maintain a strong working relationship based on communication, respect, and trust.
Qualifications:
- 3+ years of professional experience, with a proven track record of managing highly scalable, robust, large-scale distributed infrastructure
- Experience scaling web applications and microservices using container orchestration systems such as Kubernetes
- Experience implementing monitoring, reporting and alerting on large production systems with tools such as Grafana, Prometheus, and Splunk
- Experience building and managing infrastructure and services on AWS
- Experience supporting live production systems, maintaining high availability and responding swiftly to issues as they appear
- Experience with CI/CD practices, using tools such as Jenkins, GitHub Actions, and Docker, and source control systems like Perforce and Git
- Experience provisioning cloud infrastructure using tools such as CloudFormation, Pulumi, or Terraform
- Expertise in Linux operating systems, with user-level experience in other operating systems
- Ability to develop operational tools using Python, Ruby, Bash, and/or Node.js
- Drive to proactively identify opportunities for improvement in our systems and propose solutions
- Strong written and verbal communication skills
- Desire to automate everything possible
- An obsession with performance and providing a great end-user experience
Nice to Have:
- Experience with Azure, GCP, or other cloud providers
- Experience administering databases at scale
- Experience leveraging enterprise third-party monitoring solutions such as Datadog or New Relic
- Working knowledge of configuration management tools such as Puppet, Chef, or Ansible