Principal Software Site Reliability Engineer - Problem Management & RCA
- Dallas, TX
At AT&T, we're connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide an intuitive and integrated experience for customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world. You'll drive how we deliver a seamless and fast customer experience with digital at the center of AT&T's distribution channels. We're offering an opportunity to revolutionize the digital space and the chance to create a career that will propel your future.
Principal Software/Site Reliability Engineer - Problem Mgmt & RCA
This position is responsible for driving 24x7 Problem/Incident Management impact assessment, RCA assessment, and communication for Consumer online Sales, Account Management, and Support websites and mobile apps. This position will also define Service Level Objectives (SLOs), track and drive availability and service metrics, and drive accomplishment of operational SLOs.
- Analysis of GTOC enterprise Incidents including implementing automated tracking and reporting of system, customer & business impacts from site outages, incidents, and critical defects.
- Weekly and monthly analysis of progress & accomplishment against Service Level Objectives (SLOs) and identifying/driving gap closures where necessary.
- Coordinating with GTOC, Digital Product Delivery (PO/PM, Dev, QA), Operations, Site Reliability Engineers, Infrastructure/Network & 3rd Party vendors to drive resolution of reported problems.
- Leading Root-Cause Analysis (RCA) for complex outages, incidents, and critical/major defects, and tracking resolution through completion.
- Providing training to teams and auditing RCAs to ensure blameless post-mortems are conducted per established principles and the resulting information is actionable, so that the same problems do not occur more than once.
- Developing tools, scripts, and queries, and performing data analysis of weekly/monthly/YTD incidents/problems to determine chronic/recurring root causers and applications with a high frequency of incidents.
- Partnering with Site Reliability Engineers (SREs), DevOps teams, Network, Infrastructure, Security & Fraud services to establish proactive and automated monitoring/alerting for chronic root causers, establish get-well/improvement plans, and drive those plans through to resolution.
- 8+ years related experience with a bachelor's degree in Computer Science, Information Systems or related field.
- 6+ years of progressive experience in one or more of the following areas: application delivery; subject matter expertise in building Java-based high-volume/high-transaction e-commerce applications
- 3+ years of experience working with front-end frameworks such as React or Angular
- 4+ years of experience in architecture and design of systems using microservices architecture
- 4+ years of experience in a leadership capacity, coaching and mentoring engineers and developers
- 2+ years of experience working with SPA/PWA architectures
- 2+ years of experience with server-side rendering technologies and architectures
- 2+ years of experience in cloud technologies: AWS, Azure, OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform
- 2+ years of experience in build and CI/CD technologies: GitHub, Maven, Jenkins, Nexus or Sonar
- 4+ years of experience in unit and functional testing using JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or Postman
- Proficiency in Unix/Linux command line
- Expert knowledge and experience working with asynchronous message processing, stream processing and event-driven computing
- Experience working within an Agile/Scrum/Kanban development team
- Excellent written and verbal communication skills with demonstrated ability to present complex technical information in a clear manner to peers, developers, and senior leaders