The AWS ParallelCluster team is looking for a System Development Engineer to join our broader HPC Organization.
Our team owns a core set of technologies that allow our customers to plan, schedule, and execute HPC workloads across AWS compute services and capabilities.
As a team member, you'll have the opportunity to operate and engineer systems on a global scale, while touching and influencing large parts of the underlying AWS services. You'll be involved in the design and development of products that provide fully featured HPC infrastructures on demand. An HPC infrastructure is inherently complex: it involves provisioning multiple resources, such as compute, networking, and storage, and deploying and configuring the operating systems and software tools that enable our customers to run their HPC workloads. We don't expect you to be an expert in, or necessarily even familiar with, all of these technologies, but we do expect you to be excited to learn about them and use them to delight our customers.
You'll focus on operational excellence by implementing key DevOps methodologies, such as infrastructure as code and configuration management, and by integrating DevOps tools into the product architecture as capabilities for building HPC infrastructures. You'll improve and apply these practices and tools to help us innovate faster by automating and streamlining our software development and infrastructure management processes. You will become deeply familiar with the architecture of our systems and will drive the prioritization of operational issues.
This position involves on-call responsibilities, typically one week every two months. We work to ensure that our systems are fault tolerant so that alarms rarely wake us up in the middle of the night. When one does, we take action to ensure we will not get paged again for the same issue.
The team is dedicated to supporting new team members: you will grow through one-on-one mentoring and thorough, but kind, code reviews.
Over the years, we have developed a strong sense of team trust and we are looking for a new teammate who is enthusiastic, empathetic, motivated, and reliable.
Basic qualifications:
• Bachelor's degree in Computer Science / Engineering (or a related STEM subject), or equivalent experience
• Several years of professional experience with Linux/Unix system engineering
• Several years of professional experience in software development with Python
• Experience with DevOps practices and tools for continuous delivery, infrastructure as code, software deployment automation / configuration management
• Technically sound in software development activities and life cycles, including coding standards, source control management, testing and operations
• Proficiency in English as a working language
Preferred qualifications:
• Master's degree in Computer Science / Engineering, or equivalent experience
• Experience with configuration management systems such as Chef, Ansible, or Puppet
• Experience specifying, designing, and implementing system health and performance monitoring tools, and software management tools
• Experience developing automation to solve problems at scale
• Experience with HPC batch schedulers (LSF, PBS, GridEngine, Slurm, etc.) or other HPC and cluster management technologies
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice (https://www.amazon.jobs/en/privacy_page) to learn more about how we collect, use and transfer the personal data of our candidates.
All offers are conditional on references, verification of the right to work in the UK, and a successful background screening check. This will include previous employment verification, qualification verification (if relevant) and a relevant criminal check.