Sr Site Reliability Engineer (Public Cloud)
We are reshaping the cybersecurity market through our cloud-delivered security services, and our cloud infrastructure is growing rapidly across a global footprint. We’re looking for great SREs, as well as software engineers interested in production engineering, to help us scale the largest enterprise security cloud infrastructure in the world.
Palo Alto Networks reinvented the enterprise firewall, growing from a start-up to a multi-billion-dollar company. Our Application Framework, the latest offering in our cloud-delivered security services, ingests security events from hundreds of thousands of firewalls deployed across the globe to provide a massive data analytics platform for deep inspection, anomaly detection, and actionable security automation. Our cloud infrastructure is home to a series of massive and complicated distributed systems and virtualization software platforms which enable big data processing around security services, sandboxing and malware detection, URL categorization and malicious site/domain identification, and security research/response.
- You will be responsible for designing, building, maintaining, and scaling production services and server farms across multiple data centers for complex and data-intensive cloud services.
- You will design and enhance software architecture to improve scalability, service reliability, capacity, and performance.
- You will write automation code for provisioning and operating infrastructure at massive scale. You are not an operator; you’re an experienced software engineer focused on operations.
- You will work with development teams to make sure applications fit well within the infrastructure and that scalability and reliability are designed and implemented from the ground up. You will work with QA on building pipelines and automation for delivering and deploying applications to production.
- You will participate in the occasional on-call rotation supporting the infrastructure.
- You will roll up your sleeves to troubleshoot incidents, formulate theories, test your hypotheses, and narrow down possibilities to find the root cause.
- You will write postmortem reviews and remediation recommendations.
- Hands-on experience in building fault-tolerant and scalable systems.
- Strong development/automation skills. Must be very comfortable with reading and writing Python code. Java is a plus.
- 10+ years of Unix/Linux experience, with some experience in managing 100+ nodes.
- Tools-first mindset. You build tools for yourself and others to increase efficiency and to make hard or repetitive tasks easy and quick.
- Experience with AWS. Azure and/or GCP is a plus.
- Experience with Configuration Management and CI/CD. Salt and Jenkins preferred.
- Preferred experience: API Gateway, CloudFormation or Terraform, CloudWatch, EC2, IAM, Lambda, RDS, Route 53, S3, SNS, SQS, Step Functions, VPC
- Organized and focused on building, improving, resolving, and delivering. Good communicator within and across teams who is willing to take the lead.