Skip to main contentA logo with &quat;the muse&quat; in dark blue text.
Walmart

Principal Software Engineer, Site Reliability

Sunnyvale, CA

Position Summary...

What you'll do...

Principal Site Reliability Engineer:

This position is responsible for the operation of a department. An individual in this position will be expected to perform additional job related responsibilities and duties as assigned and/or necessary.

Performance and Optimization : Requires knowledge of: Unix/Linux performance optimization tuning; Java/NodeJS/Tomcat/Apache tuning and optimization; Opensource Chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, ToxiProxy) To evaluate appropriate reliability models to evaluate and estimate complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes.

Want more jobs like this?

Get Software Engineering jobs in Sunnyvale, CA delivered to your inbox every week.

By signing up, you agree to our Terms of Service & Privacy Policy.


Integrates the business goals of site reliability engineering and site safety engineering. Trains team members on the development and implementation of tools and applications for reliability predictions and improvements. Decides criteria selection and evaluation for site reliability analysis and assessment. Facilitates Opensource Chaos experiments to test and validate the resiliency of applications.

Solution Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product- MVP; Non-Functional Requirements; Telemetry To create simple, modular, extensible and functional design in adherence to the requirements for multiple products/solutions within a domain.

Understand Customer requirements and analyze the gaps between existing architecture and customer requirements. Analyze system performance impacting the complete product for non-functional requirements like reliability, operability, performance efficiency and security.

Create detailed design using mock screens, pseudo codes and detailed functional logic of the modules for an entire product. Finalize the tech stack (For example MEAN, LAMP etc.) - for products/systems based on the business needs. Review the MVP to uncover risks and check for performance and usability; guide the team during MVP creation. Drive design of software, production and preproduction environments and deployment pipeline to continuously generate records for telemetry.

Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech

Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies (System Security Plan -SSP, Security Risk and

Compliance Review- SRCR etc.)

To assist in creation of simple, modular, extensible and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert HLD to create detailed design for specific modules / components of a product/system. Understand nuances of designing for disaster recovery. Undertake infrastructure coding automation.

Coding : Requires knowledge of: Coding standards and guidelines; Coding languages (E.g. JavaScript, Python, C# etc.), frameworks(E.g. ActiveX, .Net, Cocoa, Android application framework etc.), tools(E.g. Monday.com, Linx, Embold etc.) and Platforms (E.g. Microsoft Azure, AWS , Apple IOS etc.); Quality, Safety and Security (PCI etc) standards; Emerging tools and technologies; Telemetry.

To create/configure minimalistic code for entire component/application and ensure the components are meeting business/technical requirements, non-functional requirements, low-maintainability, high-availability and high-scalability needs.

Assist in the selection of appropriate languages (E.g. JavaScript, Python, C# etc.), development standards and tools (E.g. Monday.com, Linx, Embold etc.)for software coding/configuration. Take initiative to learn the fundamentals of different coding languages and frameworks that would be useful for future scope of work. Build scripts for automation of repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), Testing or any other process(as applicable).

Implement telemetry features as required independently. Ensure security policy requirements are properly applied to components/application during code development/configuration.

Triaging and Troubleshooting : Requires knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA) To guide team members in RCA and RCCA to identify the origins of and prevent defects/performance gaps. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.

Assesses immediate restoration versus root cause based on consequences and resource requirements. Analyzes the issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or the addition of application components. Analyzes trends to proactively prevent incidents and provide historical summary reports.

Disaster Recovery Planning : Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To coordinate partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for assigned area of responsibility. Creates and tests plans for operating in a remote back-up environment. Coordinates the day-to-day activities of control measures used in recovery plans.

Monitoring and Alerting : Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic.

To establish metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems/services. Sets altering priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviews system and application logs, and verifies the completion of scheduled jobs.

Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.

Writes advanced "Splunk" queries to join multiple indices to stitch data.

Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.

Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.

Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.

Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events.

At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.





You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.


For information about PTO, see https://one.walmart.com/notices .




Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.


For information about benefits and eligibility, see One.Walmart .


Bellevue, Washington US-11075:The annual salary range for this position is $132,000.00-$264,000.00

SUNNYVALE, California US-04396:The annual salary range for this position is $143,000.00-$286,000.00



















Additional compensation includes annual or quarterly performance bonuses.

Additional compensation for certain positions may also include:



- Stock




Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and5 years' experience in software engineering or related area.
Option 2: 7 years' experience in software engineering or related area.

Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years' experience in software engineering or related area., We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart's accessibility standards and guidelines for supporting an inclusive culture.

Primary Location...

840 W CALIFORNIA AVE, SUNNYVALE, CA 94086-4828, United States of America

Client-provided location(s): Sunnyvale, CA, USA
Job ID: Walmart-8261_R-1609483
Employment Type: Other