Lead Site Reliability Engineer

Towers Crescent (12066), United States of America, Vienna, Virginia

At Capital One, we're building a leading information-based technology company. Still founder-led by Chairman and Chief Executive Officer Richard Fairbank, Capital One is on a mission to help our customers succeed by bringing ingenuity, simplicity, and humanity to banking. We measure our efforts by the success our customers enjoy and the advocacy they exhibit. We are succeeding because they are succeeding.

Guided by our shared values, we thrive in an environment where collaboration and openness are valued. We believe that innovation is powered by perspective and that teamwork and respect for each other lead to superior results. We elevate each other and obsess about doing the right thing. Our associates serve with humility and a deep respect for their responsibility in helping our customers achieve their goals and realize their dreams. Together, we are on a quest to change banking for good.

This position is for a dynamic and strategic leader who can join teams that are already performing and start making a difference from day one. In this role as a Senior Manager-Lead Software Engineer, you will provide technology leadership, guiding architectural and design decisions. You must be able to provide career coaching and support for team members to develop their technical, business and interpersonal skills. You'll identify and remove impediments while successfully driving key initiatives. Building and maintaining strong relationships, and earning and keeping the confidence of both customers and business leaders, are critically important responsibilities. A clear and crisp communication style will help you establish realistic expectations for all stakeholders.


As an Engineering Manager, you will provide technical guidance and leadership to Agile teams while being hands-on in Site Reliability Engineering work as needed. Removing technical impediments, designing platform and application implementations with cutting-edge technologies, and leading a successful team are core responsibilities. This position requires working in a fast-paced environment where priorities are realigned and strategies change in order to implement business requirements successfully.

- Maintain performance of critical applications in production with continuous monitoring, detection and response
- Create best practices and determine KPIs to measure latency, stability, availability, system health and overall business functionality health
- Understand integration patterns of complex distributed systems, design chaos scenarios and implement automated chaos execution in pre-production systems
- Roll out controlled chaos tests into production and measure reliability with minimal customer disruption
- Help development teams with the design, capacity planning and deployment of large-scale distributed systems
- Map and maintain dependencies and understand implications of service disruptions on the overall system health
- Lead incident response to quickly bring services back online with minimal disruption
- Effectively and readily assess customer impact post-incident
- Partner with chaos engineering teams to design and create controlled chaos in production systems
- Partner with development teams to create fallback experiences for critical scenarios
- Work with downstream/upstream dependency teams to create business continuity plans
- Partner with client teams to enforce agreed-upon SLAs by employing appropriate throttling mechanisms
- Analyze existing services, identify technical debt and propose solutions to increase production reliability
- Review complex production deployments, including debugging code and reviewing configurations, and provide feedback to development teams to eliminate potential production issues
- Use monitoring tools extensively to measure application/system health and trigger timely action
- Automate production validations post-release
- Lead production resiliency exercises for the department

Basic Qualifications:

- Bachelor's Degree or Military Experience
- At least 3 years of experience developing applications utilizing Agile principles
- At least 3 years of experience with open source applications

Preferred Qualifications:

- 5 years of experience with enterprise-level infrastructure design, implementation and support
- 3 years of experience with cloud-based hosting solutions (AWS EC2/S3, Azure or Google Cloud)
- 2 years of experience working on UNIX/Linux platforms
- 2 years of experience with UNIX scripting (ksh) or Perl
- 2 years of experience with AWS resource stack (EC2, ELB) via CloudFormation
- 3 years of experience with Ansible and scripting languages
- 2 years of experience with S3 bucket administration (ACLs, policies, lifecycle and replication)
- 2 years of experience managing Security Groups for EC2 instances
- 2 years of experience with application monitoring tools (New Relic, Splunk or ELK)
- Active AWS certification

Capital One will consider sponsoring a new qualified applicant for employment authorization for this position.
