Lead Site Reliability Engineer--Machine Learning

7900 Westpark Drive (12131), United States of America, Tysons, Virginia

At Capital One, we're building a leading information-based technology company. Still founder-led by Chairman and Chief Executive Officer Richard Fairbank, Capital One is on a mission to help our customers succeed by bringing ingenuity, simplicity, and humanity to banking. We measure our efforts by the success our customers enjoy and the advocacy they exhibit. We are succeeding because they are succeeding.

Guided by our shared values, we thrive in an environment where collaboration and openness are valued. We believe that innovation is powered by perspective and that teamwork and respect for each other lead to superior results. We elevate each other and obsess about doing the right thing. Our associates serve with humility and a deep respect for their responsibility in helping our customers achieve their goals and realize their dreams. Together, we are on a quest to change banking for good.

Investing in the right information security capabilities is essential to protecting Capital One's customers and employees. As part of that mission, the CyberML team collects and analyzes vast quantities of data to help detect malware, prevent fraud, and protect customers.

As a Lead Site Reliability Engineer on the CyberML team, you will lead an SRE team dedicated to building and operating petabyte-scale, distributed, fault-tolerant systems that are essential to Capital One's cyberdefense capabilities. You will provide technical leadership to the engineers on your team and be the driving force behind building an SRE culture at Capital One based on analytical problem solving, continuous process improvement, and openness.

Who You Are

  • You have a solid background in operations and software engineering.
  • You are interested in working on challenging problems involving scalability and performance.
  • You can effectively collaborate with other teams to work on high-profile initiatives.
  • You enjoy learning new technologies and picking up new skills.
  • You are interested in and proficient at automating tasks, deployments, monitoring, and testing.

What The Role Is
  • Participate in a 24-hour on-call rotation, practicing sustainable incident response.
  • Participate in architecture design and review, capacity planning, launch planning, and other activities prior to an application going live.
  • Maintain applications after they launch to production by monitoring availability, latency, and application health.
  • Scale up applications and modify application architecture to meet the evolving needs of the customer.
  • Conduct blameless postmortems and retrospectives as part of continuous process improvement.

Basic Qualifications
  • Bachelor's Degree or Military Experience
  • At least 5 years of programming experience in Java, Scala, Python, C++, or Golang
  • At least 5 years of programming experience in scripting languages such as Shell, Python, or Perl
  • At least 5 years of experience working with Ansible, Puppet, SaltStack, CloudFormation, or Terraform
  • At least 5 years of experience working with Linux-based OSes
  • At least 3 years of experience working within cloud environments
  • At least 1 year of experience working with ELK stack, TICK stack, Prometheus, Graphite, or Grafana
  • At least 1 year of experience working with Jenkins or Artifactory

Preferred Qualifications
  • M.S. or Ph.D. in Computer Science or related technical discipline.
  • Experience working with Elasticsearch and Lucene-based search.
  • Experience working with Snowflake data warehouse.
  • Experience working with container runtimes (Docker, rkt, CRI-O, etc.).
  • Experience working with container frameworks (Kubernetes, Mesosphere, etc.).
  • Experience with building Machine Learning (ML) applications or implementing ML algorithms.


At this time, Capital One will not sponsor a new applicant for employment authorization for this position.

