Amazon

Sr. Software Development Engineer - AWS ML Platforms

1 month ago | Santa Clara, CA

DESCRIPTION

About Us:
The mission of AWS-AI DeepEngine is to democratize machine learning by making it easy, fast, and universal across all practitioners. We build performance optimizations, profilers, debuggers, compilers, and elastic and distributed training frameworks to power 85% of all machine learning workloads in the cloud. Our world-class platform is the foundation that powers distributed training in AWS. Our customers include scientists and machine learning engineers building and deploying deep learning models in SageMaker and EC2. We upstream our innovations to PyTorch and TensorFlow. We set world records in scalability, reliability, and speed for distributed deep learning training workloads. We are a diverse group of scientists and engineers who enjoy tinkering with hard performance problems and geeking out about them.

A Few Achievements:

  1. https://aws.amazon.com/blogs/machine-learning/aws-and-nvidia-achieve-the-fastest-training-times-for-mask-r-cnn-and-t5-3b/#:~:text=In%202019%2C%20we%20demonstrated%20the,6%3A12%20minutes%20on%20TensorFlow
  2. https://www.amazon.science/blog/the-science-behind-sagemakers-cost-saving-debugger
  3. https://www.amazon.science/latest-news/the-science-of-amazon-sagemakers-distributed-training-engines
  4. https://aws.amazon.com/blogs/aws/announcing-torchserve-an-open-source-model-server-for-pytorch/
  5. https://aws.amazon.com/blogs/machine-learning/announcing-the-amazon-s3-plugin-for-pytorch/

About the job:
You will be at a rare intersection of AWS-scale machine learning infrastructure and cutting-edge advances in deep learning. This is a day-one, cross-disciplinary opportunity. You will be a founding member of Deep Engine Toolkit and will have the chance to shape the team culture and development mechanisms. On a typical day, you will dive deep into recent advances in machine learning systems, prototype new profiling techniques, and work with customers to optimize their workloads. You will launch new libraries that power all machine learning workloads in AWS.

About You:
You are passionate about optimizing large-scale deep learning models (for example, GPT models with 100+ billion parameters running on 1000s of GPU devices). You have a proven track record of bringing innovative research to customers. You thrive in an entrepreneurial environment and are not hindered by ambiguity or competing priorities. Ownership, delivering results, thinking big, and analytical leadership are essential to success in this role.

You have solid experience in multi-threaded asynchronous C++ development. You have prior experience in at least one of the following: internals of PyTorch and/or TensorFlow, high performance computing, profiling techniques, POSIX APIs (process control, communication, and device management), CUDA, CUPTI, Perf Capture, and Perfetto.

Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit https://www.amazon.jobs/en/disability/us.

Inclusive Team Culture
Here at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and we host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon's culture of inclusion is reinforced within our 14 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust.

Work/Life Balance
Our team puts a high value on work-life balance. It isn't about how many hours you spend at home or at work; it's about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.

Mentorship & Career Growth
Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we're building an environment that celebrates knowledge sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.

BASIC QUALIFICATIONS

  • 4+ years of professional software development experience
  • 3+ years of programming experience with at least one modern language such as Java, C++, or C# including object-oriented design
  • 2+ years of experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems
  • M.Sc. degree in computer science, engineering, statistics, mathematics, or a related field
  • 5+ years of experience in multi-threaded asynchronous development with C/C++
  • 5+ years of experience developing highly scalable, fault-tolerant, distributed systems at the kernel level
  • 2+ years of experience contributing to the architecture and design to improve the reliability, maintainability, and scalability of the system under development
  • Familiarity with at least one of: TensorFlow and/or PyTorch, high performance computing systems, MPI, NCCL, CUDA, BLAS libraries
  • Familiarity with machine learning techniques in Natural Language Understanding or Computer Vision tasks


PREFERRED QUALIFICATIONS

  • Ph.D. in computer science, statistics, engineering, mathematics, or a related field
  • 7+ years of experience in multi-threaded asynchronous development with C/C++
  • 7+ years of experience developing highly scalable, fault-tolerant, distributed systems at the kernel level
  • 4+ years of experience contributing to the architecture and design to improve the reliability, maintainability, and scalability of the system under development
  • 2+ years of experience with TensorFlow, PyTorch, high performance computing systems, MPI, NCCL, CUDA, BLAS libraries, and POSIX system APIs
  • 2+ years of experience optimizing Natural Language Understanding or Computer Vision tasks

Client-provided location(s): Santa Clara, CA, USA
Job ID: Amazon-1403602
