Site Reliability Engineer - DevOps II
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
Site Reliability Engineering at F5 Cloud Services is responsible for the performance, reliability, and scalability of all F5 Cloud Services in production. We build software to automate, optimize, manage, and maintain those services, driving down technical debt, operational cost, and toil every step of the way.
The team is seeking an experienced Site Reliability Engineer to support our product engineering teams. SREs supporting these teams will be dedicated to scaling our infrastructure, improving developer productivity, and building the automation and tooling that power F5 Cloud Services such as aDNS, GSLB, WAF, LTM, and more.
Attractions to the job
Our SRE is a valued contributor to the F5 Cloud Services team, a dedicated, global team that leads all site reliability aspects of the product.
You will be well versed in a wide variety of automation, standardization, operations, and incident handling methodologies, and in building tools that meet world-class, production-grade quality standards. You'll learn the internals of some of our core services, such as aDNS, WAF, GSLB, and LTM, and use this knowledge to meet the desired reliability of the services in production. Our team follows incident handling procedures to drive mitigation of production incidents, and you will be called on to perform troubleshooting, root cause analysis, and configuration, and to offer design suggestions or automation. You will work with some of the most cutting-edge technologies in Cloud Services, including Kubernetes, microservices architecture, CI/CD, and many more. Every single day, you are an advocate for improving the reliability of F5 Cloud Services products and services.
What You'll Do
- You will help us identify and develop solutions that improve scalability and performance and simplify our platform and processes
- Advocate for reliability and develop standards across services and teams
- You will drive standardization efforts across multiple disciplines, systems, software, and teams
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Support Continuous Integration and Deployments across various environments
- You will analyze the performance of various services and optimize them
- Ensure F5 Cloud Services infrastructure and services maintain the required level of availability, reliability, scalability, and performance
- Ensure proper security, monitoring, alerting, and reporting for F5 Cloud Services infrastructure and services
- Additional responsibilities might include design and build-outs, SLA management, operations management, and projects to continually improve service quality and performance
Who you are
- You have a solid understanding of systems and application design, including the operational trade-offs of various designs
- You have a systematic problem-solving approach
- You have strong interpersonal skills and a sense of ownership and drive
- You are adaptable and able to focus while working with large, complex services owned by multiple teams
- You can take initiative on tasks and work well in a remote team environment
- You have an interest in working with distributed systems, microservices, and modern container orchestration platforms and related systems (e.g. Kubernetes, Prometheus, Docker)
- At least 3 years of experience handling services in a large-scale distributed environment
- At least one year of experience programming DevOps tools or products in at least one high-level language (e.g. Python, Golang, or Java) and practical knowledge of automating daily tasks with scripts
- Production experience with configuration management and deployment automation tools such as Terraform, Ansible, etc.
- At least one year of production experience running infrastructure in a public cloud such as AWS, Azure, or GCP
- Production experience designing and setting up monitoring systems (preferably Prometheus), creating checks and alerts, tuning thresholds, etc.
- Good understanding of web operations best practices
Skills, experience and attributes:
- Networking fundamentals: TCP/IP, HTTPS, VLANs, DNS, load balancing, firewalling, etc.
- Working knowledge of Linux OS is a must
- At least one year of production experience with Kubernetes or OpenShift, and strong knowledge of Linux containers, virtualization, microservices, etc.
- Working knowledge of BIG-IP is an advantage
- Strong communication skills
- Excellent customer service focus
The Job Description is intended to be a general representation of the responsibilities and requirements of the job. However, the description may not be all-inclusive, and responsibilities and requirements are subject to change.
Equal Employment Opportunity
It is the policy of F5 to provide equal employment opportunities to all employees and employment applicants without regard to unlawful considerations of race, religion, color, national origin, sex, sexual orientation, gender identity or expression, age, sensory, physical, or mental disability, marital status, veteran or military status, genetic information, or any other classification protected by applicable local, state, or federal laws. This policy applies to all aspects of employment, including, but not limited to, hiring, job assignment, compensation, promotion, benefits, training, discipline, and termination. Reasonable accommodation is available for qualified individuals with disabilities, upon request.