Principal System Site Reliability Engineer - Chaos Engineering
- Dallas, TX
At AT&T, we're connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide an intuitive and integrated experience for customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world. You'll drive how we deliver a seamless and fast customer experience with digital at the center of AT&T's distribution channels. We're offering an opportunity to revolutionize the digital space and the chance to create a career that will propel your future.
Sr. System/Site Reliability Engineer - Chaos Engineering
This position is responsible for implementing and managing proactive Chaos Engineering and Chaos Testing practices to discover system behaviors, properties, and performance. These practices enable improvements that keep the production site experience and operations at their best even during higher-than-expected site traffic, network outages, security attacks, hardware/memory failures, or software defects. This position also implements new capabilities to drive scale, resilience, performance and reliability at all times.
- Evaluating & implementing best practices for Chaos Engineering and Chaos Testing to enable industry-leading reliability and resiliency for mission-critical customer experiences and back-end systems.
- Applying software engineering to automate all aspects of the software release and operations process from build/test/deploy, monitoring and alerting, service level reporting, to automatic failover and capacity management.
- Defining steady state that represents normal behavior of the site/system, hypothesizing expected outcomes when something goes wrong, and designing experiments with variables to reflect real-world events like dependency failures, server failures, network or memory malfunctions, etc.
- Measuring the impact of tests and observing differences in steady state across test groups.
- Using what is learned to develop results and architecture designs in which individual components can fail without affecting the availability of the entire system.
- Partnering with SREs, Architects, and Product Managers to ensure the software they produce meets the reliability, serviceability, and resiliency standards our customers deserve.
- Driving best practices and patterns that will contribute to AT&T's reputation as an industry leader for running highly reliable Digital applications and experiences.
- Solving the hard problems of running large-scale services at the highest levels of reliability and resiliency.
- Establishing great rapport with other DevOps teams, Product Managers, and Operations teams to maintain high levels of visibility, efficiency, and collaboration.
- 8+ years related experience with a bachelor's degree in Computer Science, Information Systems or related field.
- 6+ years of progressive experience in one or more of the following areas: application delivery, or subject matter expertise in building Java-based high-volume/high-transaction e-commerce applications
- 3+ years of experience working with front end frameworks such as React or Angular
- 4+ years of experience in architecture and design of systems using microservices architecture
- 4+ years of experience in a leadership capacity, coaching and mentoring engineers and developers
- 2+ years of experience working with SPA/PWA architectures
- 2+ years of experience with server-side rendering technologies and architectures
- 2+ years of experience in cloud technologies: AWS, Azure, OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform
- 2+ years of experience in build and CI/CD technologies: GitHub, Maven, Jenkins, Nexus or Sonar
- 4+ years of experience in unit and functional testing using JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or Postman
- Proficiency in Unix/Linux command line
- Expert knowledge and experience working with asynchronous message processing, stream processing and event driven computing.
- Experience working within an Agile/Scrum/Kanban development team
- Excellent written and verbal communication skills with demonstrated ability to present complex technical information in a clear manner to peers, developers, and senior leaders