Principal Software Architect, AI and HPC
Principal Software Architect, AI and HPC-October 2024
Santa Clara
Oct 29, 2025
ABOUT NVIDIA
NVIDIA is a computing platform company, innovating at the intersection of graphics, HPC, and AI.
10,000+ employees
Technology
About Principal Software Architect, AI and HPC

  We are now looking for a Principal Software Architect for AI and HPC.

  At NVIDIA, we are advancing the frontiers of AI capabilities. We seek an expert in high-performance computing and AI to design and develop software resiliency features for training AI models on the world’s most powerful and largest supercomputers.

  In this role, you will outline mission requirements for ultra large-scale AI supercomputers, thoroughly investigate and evaluate RAS feature designs, establish software requirements and evaluation metrics, and oversee the complete implementation of RAS features in software. As a leader in HPC and AI software development, you will interact with multiple teams across the organization. Your responsibilities include conducting regular reviews and check-ins with execution teams, ensuring the timely delivery of essential RAS software features such as checkpoint-recovery logic, error detection and attribution, error containment, SDC detection, and other related RAS elements. Leading cross-organizational efforts among various stakeholders and teams, you will coordinate priorities with senior leadership, provide timely updates, and ensure adequate resourcing for the projects.

  What You'll Be Doing:

  Collaborate with both internal and external customers and partners to define innovative Reliability, Availability, and Serviceability (RAS) requirements and objectives for present and future AI supercomputing products.

  Oversee and guide the development of RAS features across the entire AI stack, encompassing aspects from job-level scheduling and AI application frameworks (such as PyTorch), down to driver-level and hardware health monitoring on GPUs.

  Develop and maintain comprehensive software roadmaps, ensuring alignment with diverse engineering teams and synchronizing with engineering and product leadership for strategic coherence.

  Drive successful implementation and execution of RAS features in software, with demonstrable improvements in end-to-end metrics such as availability during large-scale training runs.

  What We Need to See:

  A Master's or Ph.D. in Computer Science, Electrical or Computer Engineering from a reputed university, or equivalent professional experience.

  15+ years of industry experience in systems architecture or related fields, demonstrating a deep understanding of system complexities.

  Proven ability to work and communicate effectively in a collaborative environment, bridging multiple engineering disciplines.

  At least 5 years of hands-on experience in software development, preferably in high-complexity projects involving HPC or AI.

  Ways to Stand Out From the Crowd:

  Demonstrated experience with large-scale AI supercomputing applications, particularly in training and inference stages.

  In-depth knowledge of the requirements for large-scale AI workload training and inference.

  A strong passion for and experience in developing system architectures tailored for AI applications, encompassing CPU, GPU, memory, storage, and networking.

  Hands-on involvement in the entire lifecycle – from design to deployment – of large-scale High-Performance Computing (HPC) systems.

  Practical experience in adopting and implementing HPC software development practices in large-scale system environments.

  As NVIDIA makes inroads into the datacenter business, our team plays a central role in getting the most out of our exponentially growing datacenter deployments as well as establishing a data-driven approach to hardware design and system software development. We collaborate with a broad cross section of teams at NVIDIA, ranging from DL research teams to CUDA kernel and DL framework development teams to silicon architecture teams. NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative and autonomous, we want to hear from you!

  The base salary range is 268,000 USD - 414,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

  You will also be eligible for equity and benefits (https://www.nvidia.com/en-us/benefits/). NVIDIA accepts applications on an ongoing basis.

  NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

  NVIDIA is a Learning Machine

  NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.

  Learn more about NVIDIA.

Copyright 2023-2025 - www.zdrecruit.com All Rights Reserved