We are looking for an experienced Lead Site Reliability Engineer to join our team and play a key role in building and maintaining robust, scalable, and efficient systems. This position focuses on improving infrastructure, streamlining processes through automation, and ensuring optimal performance across distributed systems and cloud platforms. You will collaborate with diverse teams, lead technical projects, and mentor team members to foster a culture of innovation and operational excellence.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Responsibilities

Enhance the performance of Linux-based operating systems for production services and distributed systems
Develop and implement advanced monitoring solutions using tools like Grafana, Prometheus, and Splunk to improve system observability
Resolve complex Kubernetes-related issues and establish team-wide best practices and standards
Create and maintain automation scripts with Bash and Python to streamline operational processes
Build and manage container orchestration platforms such as Kubernetes or EKS, sharing knowledge with the team
Design and manage reliable and scalable cloud infrastructure on AWS to ensure system availability
Lead initiatives to automate repetitive processes and drive efficiency across the team
Provide leadership and promote a collaborative work environment through effective communication and ownership
Encourage continuous learning and development among team members to foster a culture of growth and curiosity
Offer technical guidance and mentorship to team members to improve communication and operational efficiency
Plan and manage disaster recovery strategies and capacity planning to ensure system resilience and scalability
Automate deployment workflows using tools like Terraform or CloudFormation to improve reliability and productivity
Incorporate open-source technologies such as Cassandra, Kafka, Postgres, Solr, and Redis to advance SRE methodologies

Requirements

A bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
A minimum of five years of hands-on experience as a Site Reliability Engineer
At least one year of experience in a leadership or team management role
Proficiency in Bash for scripting and process automation
Experience with Grafana for system monitoring and visualization
Strong expertise in Linux systems and their optimization for high-performance environments
Familiarity with Microsoft Internet Information Services (IIS) for managing web servers
Knowledge of Prometheus for monitoring and alerting in distributed environments
Proficiency in Python for creating automation solutions and improving operational workflows
Fluency in English at a B2 level or higher, with strong verbal and written communication skills

Nice to have

Experience designing scalable solutions with Amazon Web Services (AWS)
Familiarity with cloud platforms and their integration into system designs
Advanced knowledge of Kubernetes for managing containerized applications
Experience using Splunk for log management and advanced telemetry
Expertise with Terraform and Terraform Cloud for infrastructure automation
Strong skills in troubleshooting and resolving complex technical challenges

We offer

Connectivity Bonus (ARS 15,000 paid with the salary receipt at the end of each month as a non-wage item)
Prepaid Health Plan (Medicina Prepaga; covers the employee and their immediate family)
Paternity Leave (Two days are added to the legally established leave, for a total of 4 days)
Discount card
English Training (English lessons, twice per week)
Training Program (Access to multiple customized training plans according to the needs of each role within the company)
Marriage Bonus (The company doubles the statutory allowance paid by ANSES)
Referral Program (A referral bonus is paid when a candidate referred by an employee joins the company)
External Agreements and Discounts
Vacations: 14 calendar days a year

By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.
