
Lead Site Reliability Engineer

Salesforce

San Francisco, CA

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category
Software Engineering

Job Details

About Salesforce

We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And, we empower you to be a Trailblazer, too - driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good - you've come to the right place.

The Marketing Automation Platform & Data Operations team operates within the Marketing Technology organization and is instrumental in ensuring the trusted, transparent, and reliable platforms that empower our company to achieve its innovation goals. We are focused on proactively addressing challenges related to platform reliability and operational efficiency, particularly concerning our critical Marketing Technology ecosystem.

Given the importance of incident management and the criticality of our technology, our team requires an experienced and self-motivated Site Reliability Engineer to ensure the highest standards of Trust and Security. This role will collaborate with Platform Operations and Platform Engineering to improve the reliability, performance, and scalability of our systems by implementing and maintaining automated solutions for monitoring, incident response, and system optimization, as well as contributing to strategic planning and technology decisions.


What Are We Looking For?

Role Overview: As a Lead Site Reliability Engineer, you will play a pivotal role in ensuring the reliability, performance, and scalability of our critical software systems and infrastructure within an enterprise IT environment. You will serve as a technical leader, bridging software engineering and system administration, with a particular emphasis on monitoring, visualization, and alerting tools such as Datadog, Splunk, Grafana, New Relic, Tableau, and PagerDuty. You will take ownership of service reliability, lead incident investigations, and drive automation initiatives to enhance system stability and operational efficiency.

Technical Expertise:

  • Monitoring and Visualization Platforms: Deep expertise in Datadog, Splunk, Grafana, New Relic, and Tableau for proactive monitoring, alerting, and comprehensive visualization of system performance and reliability metrics.
  • Salesforce Ecosystem: Experience managing reliability and performance within the Salesforce ecosystem, including the Salesforce Platform, Slack, Data Cloud, Tableau and Heroku.
  • Cloud Infrastructure: Extensive experience with cloud platforms (AWS, Azure, Google Cloud) for infrastructure management and monitoring.
  • Coding & Scripting: Advanced proficiency in programming and scripting languages such as Python, Go, Java, or equivalent, focused on automation and monitoring integration.
  • Infrastructure as Code (IaC): Proven capability using tools such as Terraform, Ansible, and Kubernetes for infrastructure automation and provisioning.
  • CI/CD Pipelines: Comprehensive knowledge of CI/CD processes to ensure reliable and efficient software deployment (Jenkins, Copado, Gearset).

Operational Skills:

  • Incident Response: Leadership in incident investigations, driving swift resolutions using incident management tools, particularly PagerDuty.
  • Service Level Objectives (SLOs): Expertise in defining and managing SLOs and SLAs, utilizing monitoring tools (Datadog, Splunk, New Relic) for accurate tracking and reporting.
  • Documentation and Knowledge Sharing: Strong skills in documenting best practices, incident responses, and operational procedures using tools such as Wikis, Notion, and Tableau dashboards.

Problem Solving:

  • Root Cause Analysis: Expertise in conducting detailed root cause analyses using data from monitoring and visualization tools like Splunk, Datadog, Grafana, and New Relic.
  • Troubleshooting: Advanced troubleshooting capabilities across infrastructure, leveraging insights from comprehensive monitoring systems.
  • Process Improvement: Proven ability to identify and implement automation and process improvements to enhance reliability and reduce manual efforts.

Vendor and Relationship Management:

  • Collaboration: Excellent ability to collaborate across teams including developers, platform engineers, architects, QA, and operations, maintaining alignment and effective communication.
  • Stakeholder Engagement: Act as liaison among product, engineering, and operations teams, emphasizing reliability insights derived from platforms like Tableau.

Disaster Recovery and Incident Management:

  • Escalation Management: Primary point of contact for escalations related to reliability, utilizing PagerDuty to ensure rapid and structured incident responses.
  • Disaster Recovery: Active participation in developing and executing disaster recovery plans with continuous monitoring and alerting using the above-mentioned tools.

Communication and Leadership:

  • Effective Communication: Strong verbal and written communication, especially in presenting complex technical insights through platforms such as Tableau.
  • Mentorship: Demonstrated capability in mentoring junior engineers, fostering a high-performance culture focused on proactive monitoring and reliability.

Innovation and Continuous Learning:

  • Industry Trends: Continuous learning on industry advancements, especially relating to monitoring, observability, and visualization technologies.
  • Knowledge Management: Contribution to internal training, documentation, and knowledge-sharing practices, leveraging detailed analytics from monitoring and visualization platforms.

Flexibility:

  • Adaptability: Capability to manage shifting priorities in fast-paced development cycles while maintaining operational excellence and composure.

Minimum Qualifications:

  • 8+ years of relevant industry experience, emphasizing monitoring, alerting, and visualization systems.
  • Advanced expertise with Datadog, Splunk, Grafana, Tableau, New Relic, and PagerDuty.
  • Deep knowledge of cloud infrastructures (AWS, Azure, GCP).
  • Experience managing reliability within the Salesforce ecosystem.
  • Proven ability in incident escalation and disaster recovery management.
  • Strong relationship-building skills across technical and business teams.
  • Excellent verbal, written, and interpersonal skills.

Preferred Qualifications:

  • Experience in Enterprise-scale environments, particularly with Salesforce technology (Heroku, SF Platform, Data Cloud).
  • Familiarity with configuration management tools (Ansible, Puppet, Chef) and log management (Elastic, Logstash, Kibana).
  • Relevant industry certifications (AWS, GCP, Kubernetes).

Accommodations

If you require assistance due to a disability when applying for open positions, please submit a request via this Accommodations Request Form.

Posting Statement

Salesforce is an equal opportunity employer and maintains a policy of non-discrimination with all employees and applicants for employment. What does that mean exactly? It means that at Salesforce, we believe in equality for all. And we believe we can lead the path to equality in part by creating a workplace that's inclusive, and free from discrimination. Know your rights: workplace discrimination is illegal. Any employee or potential employee will be assessed on the basis of merit, competence and qualifications - without regard to race, religion, color, national origin, sex, sexual orientation, gender expression or identity, transgender status, age, disability, veteran or marital status, political viewpoint, or other classifications protected by law. This policy applies to current and prospective employees, no matter where they are in their Salesforce employment journey. It also applies to recruiting, hiring, job assignment, compensation, promotion, benefits, training, assessment of job performance, discipline, termination, and everything in between. Recruiting, hiring, and promotion decisions at Salesforce are fair and based on merit. The same goes for compensation, benefits, promotions, transfers, reduction in workforce, recall, training, and education.

In the United States, compensation offered will be determined by factors such as location, job level, job-related knowledge, skills, and experience. Certain roles may be eligible for incentive compensation, equity, and benefits. Salesforce offers a variety of benefits to help you live well including: time off programs, medical, dental, vision, mental health support, paid parental leave, life and disability insurance, 401(k), and an employee stock purchasing program. More details about company benefits can be found at the following link: https://www.salesforcebenefits.com. Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.

For California-based roles, the base salary hiring range for this position is $200,800 to $276,100.

Client-provided location(s): San Francisco, CA, USA
Job ID: Salesforce-JR300984
Employment Type: Full Time

Perks and Benefits

  • Health and Wellness

    • Health Insurance
    • Health Reimbursement Account
    • Dental Insurance
    • Vision Insurance
    • Life Insurance
    • Short-Term Disability
    • Long-Term Disability
    • FSA
    • FSA With Employer Contribution
    • HSA
    • HSA With Employer Contribution
    • Fitness Subsidies
    • On-Site Gym
    • Mental Health Benefits
  • Parental Benefits

    • Adoption Leave
    • Return-to-Work Program
    • Birth Parent or Maternity Leave
    • Non-Birth Parent or Paternity Leave
    • Fertility Benefits
    • Adoption Assistance Program
    • Family Support Resources
  • Work Flexibility

    • Flexible Work Hours
    • Remote Work Opportunities
    • Hybrid Work Opportunities
  • Office Life and Perks

    • Casual Dress
    • Happy Hours
    • Snacks
    • Some Meals Provided
    • Company Outings
  • Vacation and Time Off

    • Paid Vacation
    • Unlimited Paid Time Off
    • Paid Holidays
    • Personal/Sick Days
    • Leave of Absence
    • Sabbatical
    • Volunteer Time Off
  • Financial and Retirement

    • 401(K)
    • 401(K) With Company Matching
    • Company Equity
    • Stock Purchase Program
    • Performance Bonus
    • Relocation Assistance
    • Financial Counseling
  • Professional Development

    • Tuition Reimbursement
    • Learning and Development Stipend
    • Promote From Within
    • Mentor Program
    • Shadowing Opportunities
    • Access to Online Courses
    • Lunch and Learns
    • Internship Program
    • Leadership Training Program
    • Professional Coaching
    • Work Visa Sponsorship
  • Diversity and Inclusion

    • Employee Resource Groups (ERG)
    • Unconscious Bias Training
    • Diversity, Equity, and Inclusion Program
