Systems Administrator, HPC
(Menlo Park, CA – New York, NY)
Facebook’s mission is to give people the power to share, and make the world more open and connected. Through our growing family of apps and services, we’re building a different kind of company that helps billions of people around the world connect and share what matters most to them. Whether we’re creating new products or helping a small business expand its reach, people at Facebook are builders at heart. Our global teams are constantly iterating, solving problems, and working together to make the world more open and accessible. Connecting the world takes every one of us—and we’re just getting started.
Facebook has developed a large-scale compute cluster that gives our industry-leading Artificial Intelligence Research group access to massive CPU and GPU resources. We are seeking a talented and enthusiastic systems administrator to continue improving the performance, reliability, and flexibility of our cluster. The ideal candidate will have sharp and tenacious troubleshooting skills, a burning desire to “Move Fast and Be Bold,” strong communication skills, and attention to detail. This position is full-time and located in either our Menlo Park or New York office.
Responsibilities
- Work closely with world-class Research Scientists and Engineers to accelerate their research projects.
- Understand and accommodate their diverse set of requirements.
- Select, implement, and improve technologies for workload management, shared storage, and low-latency networking.
- Provide simple and robust mechanisms for scientists and engineers to use the software packages they need on the compute cluster.
- Help engineers and scientists diagnose performance and reliability issues with distributed training jobs.
Requirements
- BS or MS in Computer Science, Engineering, or a related technical discipline, or equivalent experience
- 2+ years of experience administering a scaled HPC cluster with automated processes
- Experience with UNIX
- Understanding of both InfiniBand and Ethernet networking and switching topologies
- Ability to configure and troubleshoot high-performance shared storage systems such as Lustre
- Experience with Slurm or a similar workload management system
- Ability to develop maintainable scripts to automate common system management tasks