Software Engineer - Reliability Foundations
- San Francisco, CA
About the Team
The Reliability Foundations team is composed of both systems and software engineers who build programs and services that make Slack more reliable. Our customers include SRE teams, product engineering teams, and various (non-technical) business owners. We develop and lead Slack’s Incident Response business processes, as well as the software automation that enables workflows, collects data, and drives the insights needed to improve our reliability and customer experience. We also take care of service ownership tools that enable engineering teams at Slack to catalog, monitor, and alert on the services they build. In addition, we help design and extend Slack’s internal infrastructure management components (our internal PaaS platforms). More than just a “tools team” focused on productivity, we craft the mechanisms that make our platforms secure, performant, and reliable by default.
Slack has a positive, diverse, and supportive culture — we look for people who are curious, inventive, and work to be a little better every single day. In our work together we aim to be smart, humble, hardworking and, above all, collaborative. If this sounds like a good fit for you, why not say hello?
What you will be doing
- You will lead large software projects, from start to finish, where the scope is mostly understood
- You will design and develop new software which extends the features of our internal “Platform-as-a-service” (PaaS) infrastructure
- You will develop features (mostly in Python), fix bugs, and guide the code and service health of internal service ownership products within Slack.
- You will add to, improve and maintain integration of internal tools with external/3rd party applications and APIs (using multiple languages).
- You will help sustain and improve Slack’s internal Incident Response and incident automation software
- You will integrate internal software applications with Slack products as they evolve.
- You will write, review, and provide feedback on technical design proposals.
- You will define SLAs/SLOs for your services, manage code deployments, fixes and software updates, and automate our operational processes as needed.
- You will participate in the team’s on-call rotation and assist with triaging and addressing production issues.
- You will review code and get your code reviewed; mentor and be mentored by other engineers. Teamwork is what makes the dream work.
What you should have
- Curiosity about how things work and a love of sharing that knowledge with others
- A positive approach that embraces standard methodologies for software management and reliability, including unit testing, code review, design documentation, debugging, and troubleshooting.
- A passion for reliability, scaling patterns, up-time, and availability.
- A demonstrable history of thriving within a software development team, even if your roles have included traditional operations and/or infrastructure management duties.
- Professional experience with functional or imperative programming languages, e.g., PHP, Python, Ruby, Go, C, or Java (used without frameworks)
- Strong command of computer science fundamentals: data structures, algorithms, programming languages, distributed systems, and information retrieval
- Bachelor’s degree in Computer Science, Engineering or related field, or equivalent training or work experience
- Experience developing and managing modern public cloud infrastructure, especially AWS
- Experience as a Site Reliability Engineer (SRE), or as a platform or infrastructure engineer building and managing reliability mechanisms on distributed infrastructure
- Experience deploying, operating and debugging software on Linux at scale
- Hands-on background managing full-stack infrastructure, i.e., networking, storage, virtualization and/or host hardware, configuration management, and packaging
- Experience using deployment automation/configuration management, such as Chef, Puppet, Ansible or Salt
- Experience with Incident Response programs and processes
- Experience triaging, troubleshooting and resolving production incidents at an organization with challenging SLAs and customer expectations