Lead Site Reliability Engineer

at EPAM Systems

Bahía Blanca, Argentina

We are looking for an experienced Lead Site Reliability Engineer to join our team and play a key role in building and maintaining robust, scalable, and efficient systems. This position focuses on improving infrastructure, streamlining processes through automation, and ensuring optimal performance across distributed systems and cloud platforms. You will collaborate with diverse teams, lead technical projects, and mentor team members to foster a culture of innovation and operational excellence.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Responsibilities
  • Enhance the performance of Linux-based operating systems for production services and distributed systems
  • Develop and implement advanced monitoring solutions using tools like Grafana, Prometheus, and Splunk to improve system observability
  • Resolve complex Kubernetes-related issues and establish team-wide best practices and standards
  • Create and maintain automation scripts with Bash and Python to streamline operational processes
  • Build and manage container orchestration platforms such as Kubernetes or EKS, sharing knowledge with the team
  • Design and manage reliable and scalable cloud infrastructure on AWS to ensure system availability
  • Lead initiatives to automate repetitive processes and drive efficiency across the team
  • Provide leadership and promote a collaborative work environment through effective communication and ownership
  • Encourage continuous learning and development among team members to foster a culture of growth and curiosity
  • Offer technical guidance and mentorship to team members to improve communication and operational efficiency
  • Plan and manage disaster recovery strategies and capacity planning to ensure system resilience and scalability
  • Automate deployment workflows using tools like Terraform or CloudFormation to improve reliability and productivity
  • Incorporate open-source technologies such as Cassandra, Kafka, Postgres, Solr, and Redis to advance SRE methodologies
Requirements
  • A bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
  • A minimum of five years of hands-on experience as a Site Reliability Engineer
  • At least one year of experience in a leadership or team management role
  • Proficiency in Bash for scripting and process automation
  • Experience with Grafana for system monitoring and visualization
  • Strong expertise in Linux systems and their optimization for high-performance environments
  • Familiarity with Microsoft Internet Information Services (IIS) for managing web servers
  • Knowledge of Prometheus for monitoring and alerting in distributed environments
  • Proficiency in Python for creating automation solutions and improving operational workflows
  • Fluency in English at a B2 level or higher, with strong verbal and written communication skills
Nice to have
  • Experience designing scalable solutions with Amazon Web Services (AWS)
  • Familiarity with cloud platforms and their integration into system designs
  • Advanced knowledge of Kubernetes for managing containerized applications
  • Experience using Splunk for log management and advanced telemetry
  • Expertise with Terraform and Terraform Cloud for infrastructure automation
  • Strong skills in troubleshooting and resolving complex technical challenges
We offer
  • Connectivity Bonus (15,000 ARS paid at the end of each month with the salary receipt, as a non-wage item)
  • Medicina Prepaga (covers the employee and their immediate family)
  • Paternity Leave (two days are added to the leave established by law, for a total of 4 days)
  • Discount card
  • English Training (English lessons, twice per week)
  • Training Program (Access to multiple customized training plans according to the needs of each role within the company)
  • Marriage Bonus (the company doubles the allowance established by law that ANSES offers)
  • Referral Program (a referral bonus is paid when a candidate referred by an employee joins the company)
  • External Agreements and Discounts
  • Vacation: 14 calendar days per year
By applying to our role, you agree that your personal data may be used as set out in EPAM's Privacy Notice and Policy.

Client-provided location(s): Argentina
Job ID: EPAM-epamgdo_blt0664a559e79a9a39_en-us_Other_Argentina
Employment Type: Other