Cloud Reliability & Operations Engineer
Lazard is looking for a senior cloud reliability and operations engineer to join our IT department. This individual will develop the operating model for the firm's cloud hosting zones and support those zones across a number of providers.
This is a key role that focuses on quality, availability, and performance to ensure the firm's cloud applications and services meet the demands of the firm's digital users today and in the future. The individual will need to be proficient in a variety of observability technologies, including availability and performance monitoring and tuning, and in automation, to help define and mature our cloud management and reporting capabilities. This role will also help transition 24x7 operational responsibilities to the standard operations teams by enabling new tooling, capabilities, training, and documentation so that the traditional operations team can take on the new cloud-centric responsibilities. After the support model is established, this role will serve as the L3 escalation point for cloud-based incidents and admin escalations from ops, appdev, and infrastructure teams.
- Ongoing production operations of AWS- and Azure-hosted infrastructure and applications
- Drive the development and use of new and self-service tooling to support the operating model for the cloud
- Improve resiliency for all cloud applications and infrastructure and ensure that HA, DR, and data protection requirements are appropriately engineered and implemented for each workload
- Stand up cloud environments based on established standards and guardrails
- Use configuration management, orchestration and management tooling to ensure cloud environments meet operational and security standards
- Be a subject matter expert in reducing and resolving production incidents by identifying preventive controls and driving proactive efforts
- Act as the gatekeeper for all access escalations across all cloud environments
- Drive toward a new operating model: enable tooling and processes so that all L1/L2 operations can be handled by more traditional NOC teams, while remaining the L3 escalation point for cloud incidents and requests
- Track system uptime and availability and promote incremental increases to change velocity
- Drive the innovation, prioritization, and engineering of new cloud capabilities to bolster the operating model
- 5+ years of reliability and operations experience - Linux, Windows, DevOps, Infrastructure, Network, Cyber
- 3+ years of experience with cloud - AWS, Azure, VMware
- Expertise in troubleshooting cloud environments - finding and fixing critical production issues
- Practical experience with modern scripting languages - Python, PowerShell, Perl, PHP, Shell
- Experience implementing Infrastructure as Code - Terraform, CloudFormation, Ansible, etc.
- Expertise in cloud management and monitoring tooling - CloudWatch, Splunk, Datadog, ELK, Prometheus, CloudTrail, etc.
- Experience with AIOps platforms to automate and shift-left operations functions
- Experience supporting mission-critical applications and infrastructure on a 24x7 basis
- Working knowledge of cloud security principles and best practices
- Working knowledge of cloud networking - DNS, SG, NACL, firewalls
- Expertise in driving good hygiene in cloud environments - in-place patching, immutability, compliance monitoring (AWS Config), cleanup of technical debt, IAM
- Experience with a DevOps delivery model for infrastructure, applications, and configuration
- Experience designing operational state to be policy- and automation-driven
- Strong communication skills
- Ability to multitask, work well under pressure, and prioritize work against competing deadlines and changing business priorities
Experience with Google Cloud Platform
Software development experience