Network Site Reliability Engineer
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for the availability and reliability of our firm's most critical platform services, and ensures they meet the requirements of our internal and external users. We look for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems, which can evolve and adapt to changes in our fast-paced, global business environment.
WHAT WE DO
At Goldman Sachs, our Engineers don't just make things - we make things possible. Change the world by connecting people and capital with ideas. Solve the most challenging and pressing engineering problems for our clients. Join our engineering teams that build massively scalable software and systems, architect low latency infrastructure solutions, proactively guard against cyber threats, and leverage machine learning alongside financial engineering to continuously turn data into action. Create new businesses, transform finance, and explore a world of opportunity at the speed of markets.
Engineering, which comprises our Technology Division and global strategists groups, is at the critical center of our business, and our dynamic environment requires innovative strategic thinking and immediate, real solutions. Want to push the limit of digital possibilities? Start here.
At Goldman Sachs, our culture is one of teamwork, innovation and meritocracy. We often say our people are our greatest asset and we take pride in supporting each colleague both professionally and personally. From collaborative work spaces and mindfulness classes to working from home and flexible work options, we offer our people the support they need to reach their goals in and outside the office.
RESPONSIBILITIES AND QUALIFICATIONS
HOW YOU WILL FULFILL YOUR POTENTIAL
- Balance feature development velocity and reliability with well-defined network-level SLOs.
- Run the Production environment by monitoring availability and taking a holistic view of system health.
- Analyze data to diagnose and identify root causes of network-specific events.
- Develop tools and services to automate the mitigation and remediation of network-specific events.
- Manage end-to-end availability and performance of mission critical services and build automation to prevent problem recurrence.
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of our global network.
- Drive the incident management process and support a blameless post-mortem culture.
- Participate in system design consulting, platform management, and capacity planning.
SKILLS AND EXPERIENCE WE ARE LOOKING FOR
- BS degree in Computer Science or related technical field involving coding and/or systems engineering.
- Proficiency in one or more of the following: Go, Python, C, C++, Java, Perl, Ruby or shell scripting.
- Experience with algorithms, data structures and software design.
- Experience with Network systems and Unix operating systems.
- Experience with distributed systems design, maintenance, and troubleshooting.
- Hands-on experience with debugging and optimizing code, as well as automation.
- Strong interpersonal skills, drive, and ownership.
- Coding beyond simple scripts.
- Solving novel problems from first principles.
ABOUT GOLDMAN SACHS
The Goldman Sachs Group, Inc. is a leading global investment banking, securities and investment management firm that provides a wide range of financial services to a substantial and diversified client base that includes corporations, financial institutions, governments and individuals. Founded in 1869, the firm is headquartered in New York and maintains offices in all major financial centers around the world.
© The Goldman Sachs Group, Inc., 2020. All rights reserved. Goldman Sachs is an equal employment/affirmative action employer Female/Minority/Disability/Vet.