NVIDIA
Senior Technical Program Manager – AI Research Systems
Job Description
Joining NVIDIA’s AI Efficiency Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services. Our objective is to deliver a resilient and scalable environment for NVIDIA’s AI researchers, providing them with the necessary resources and scale to foster innovation.
As a Technical Program Manager (TPM) on this team, you'll confront and oversee the unique challenges of building and maintaining the AI and data infrastructure necessary for training flagship models at an unprecedented scale. Your focus will be on increasing researcher productivity, as well as improving system stability, availability, and performance.
What you’ll be doing:
- Understand the Challenges of Training Foundation Models: Delve into the specific workflows and resource requirements of training state-of-the-art LLMs and generative AI models. Identify the problems, scaling limitations, and failure points that researchers encounter.
- Engineer Scalable and Resilient Solutions: Design, implement, and continuously refine highly scalable AI/ML infrastructure. Prioritize fault tolerance, automated recovery mechanisms, and proactive monitoring to minimize disruptions to critical research projects.
- Increase Researcher Velocity: Develop streamlined processes for resource allocation, model deployment, and experiment tracking. Collaborate closely with researchers to ensure the infrastructure seamlessly supports their rapidly evolving needs.
- Lead Complex Technical Projects: Own the planning, execution, and delivery of complex infrastructure projects in a dynamic, fast-paced research environment. Balance agility with meticulous attention to detail, risk assessment, and long-term maintainability.
- Collaborate for Success: Partner with diverse teams across engineering, research, and operations to drive solutions that address the complexities of large-scale AI development.
- Resource Matching and Optimization: Collaborate with researchers to understand their computational needs (compute, memory, network bandwidth, storage performance) and ensure optimal resource utilization.
- Data Access and Pipelines: Design high-throughput data pipelines and storage solutions that integrate seamlessly with researcher workflows, enabling efficient access to massive datasets.
What we'll need to see:
- BS or MS degree, or equivalent experience.
- 8+ years of program management experience in the same or similar industries.
- Technical Expertise: Deep understanding of cloud infrastructure, distributed systems, large-scale ML/HPC workloads, Kubernetes, Slurm, and AWS services. Experience handling petabyte-scale data and extreme-scale systems with tens of thousands of compute nodes is a plus.
- Experience managing projects with one of the following workloads: ML (e.g., training and deploying large machine learning models) or HPC (e.g., deploying the hardware and software needed to run large batch compute jobs).
- Experience with cloud infrastructure, particularly compute, networking, and storage.
- Software Development Background: Familiarity with AI/ML frameworks.
- Project Management Mastery: Demonstrated ability to manage numerous projects, prioritize in a high-pressure setting, identify and mitigate risks, and ensure on-time delivery.
- Researcher-Centric Focus: Demonstrated ability to empathize with researchers, understand their problems, and translate their needs into technical requirements for AI infrastructure engineers and teams. Excellent communication and collaboration skills are essential.
- Understanding of data pipeline requirements for large-scale ML training, including data throughput needs, data preprocessing steps, and potential bottlenecks. Familiarity with tools like Apache Spark or Ray is a plus.
- Background in evaluating and selecting data storage solutions (HDFS, object storage, or distributed file systems such as Lustre) based on cost, performance, and research requirements.
The base salary range is 124,000 USD – 247,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.