EPAM Systems

Site Reliability Engineer

1 week ago · Sofia, Bulgaria

EPAM is committed to providing our global team of more than 41,150 EPAMers with inspiring careers from day one. EPAMers think creatively and lead with passion and honesty. Our people are the source of our success. We value collaboration, work in partnership with our customers, and strive for the highest standards of excellence. In today's market conditions, we're supporting operations for hundreds of clients around the world remotely. No matter where you are located, you'll join a dedicated, diverse community that will help you discover your fullest potential.

You are curious, persistent and logical, with a growth mindset: a true techie at heart. You enjoy living by the code of your craft and developing elegant solutions for complex problems. If this sounds like you, this could be the perfect opportunity to join EPAM as a Site Reliability Engineer or at any level above.
What You'll Do

  • Building and maintaining an observability ecosystem that includes Prometheus, Cortex/GEM, Grafana, Loki, Jaeger, Fluentd/Fluent Bit, etc.
  • Coding (Python and Go) on projects such as building APIs, self-service portals, and observability platform automation and integration
  • Helping to support a large-scale observability platform, including on-call shifts
  • Working with other team members to define the architectures and practices that should be adopted to deliver observability platform operational goals
  • Contributing to research and tooling for monitoring and performance improvement to provide solid SLAs for our customers
  • Assisting internal customers in getting the most from the company's observability platforms, both guiding them during the onboarding stage and encouraging best-practice methodologies
What You Have
  • 2+ years of experience with application and infrastructure monitoring tools
  • Understanding of foundational monitoring concepts: avoiding noise, defining SLIs/SLOs, etc.
  • Experience designing and implementing a central logging solution using tools such as Fluent Bit, Fluentd, Promtail or Logstash
  • Experience with containerization (e.g. Docker, Kubernetes, Amazon EKS or Azure AKS)
  • Familiarity with Terraform and the Ansible automation platform
  • 2+ years of experience working with cloud infrastructure based on the AWS and Azure platforms
  • 2+ years of programming/scripting experience with Python, PowerShell, shell scripting or Go
  • 2+ years of experience with CI/CD, Git, Terraform or Jenkins
  • Experience providing operational support for a production service
  • Understanding of Linux and networking fundamentals
  • Strong problem-solving skills
  • Clear, concise communication skills and good command of written and spoken English
Nice to have
  • Experience with Prometheus/PromQL/Cortex/Loki/Grafana platform
  • Experience in custom ETL design, implementation, and maintenance. SQL and relational database knowledge (MS SQL, MySQL, PostgreSQL)
We offer
  • Personal development program that will allow you to be valued for your strengths
  • Wide range of professional trainings and workshops
  • Unlimited access to LinkedIn Learning solutions
  • Attractive salary, additional health and dental insurance as well as other social benefits
  • A broad variety of projects and the possibility of moving between projects over time
  • Experience exchange with colleagues around the world
  • Work-life balance and a flexible schedule, team-building events and sports opportunities
  • Modern office in the Infinity Tower business center

If you are interested in this role, please send your CV in English. All applications will be treated as strictly confidential.
Only short-listed applicants will be contacted.

Client-provided location(s): Sofia, Bulgaria
Job ID: EPAM-60705