Principal Site Reliability Engineer

    • Sunnyvale, CA

Responsibilities:

  • Act as a key contributor in forming the team’s technical strategy and aligning the team and stakeholders with it
  • Initiate large projects with complex architecture, breaking them down into the right logical components so that other engineers can contribute effectively
  • Work frequently with other teams to coordinate major changes to cross-system architectures, influencing upstream and downstream teams toward the most efficient solutions
  • Collaborate with engineering teams to propose features that solve recurring patterns of customer complaints
  • Expertly design and implement scalable, distributed, fault-tolerant systems that satisfy complex requirements
  • Support services before they go live, through activities such as system design consulting, capacity planning and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Design and implement best practices for security, monitoring, and telemetry systems
  • Lead initiatives and meetings in the engineering organization and help your teammates be better engineers through better processes, practices and technical guidance
     

Requirements:

  • Strong CS fundamentals. BS degree in Computer Science or related technical field, or equivalent practical experience
  • Ability to manage competing priorities, a focus on shipping, and the ability to work well under pressure
  • Experience in designing, analyzing, scaling and troubleshooting large-scale distributed systems
  • A systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
  • A passion for automation; strong coding skills in at least one modern programming language (Java/Go/Python/Ruby)
  • Very strong Linux skills and excellent troubleshooting skills
  • Experience with a variety of Cloud technologies and familiarity with industry landscape and trends
  • Some configuration management experience; the specific product does not matter (any of Puppet, Chef, CFEngine, Fabric, Ansible, or Salt is fine)
  • Willingness to be part of on-call rotations
     

Nice to have:

  • Experience with large-scale OLTP and OLAP deployments
  • Cloud experience; the specific platform does not matter
  • Experience with tools like Elastic/Kibana, Jenkins, PagerDuty, Wavefront
  • Experience with software release tooling (Git, Jenkins, custom scripts)
  • Experience with algorithms, data structures, complexity analysis and software design

