Senior System Software Architect, HPC Networking

NVIDIA

Software Engineering, IT
Beijing, China
Posted on Sep 12, 2024

Our technology has no boundaries! NVIDIA is building the world’s most groundbreaking, state-of-the-art accelerated compute platforms for the world to use. It’s because of our work that scientists, researchers, and engineers can advance their ideas. We pioneered a supercharged form of computing loved by the fastest-paced computer users in the world: scientists, designers, artists, and gamers.

We are seeking a highly motivated architect to join our team of experts and take part in shaping the future of high-performance and ML/AI computing. Our next-generation Ethernet, InfiniBand, and NVLink systems will be at the heart of connecting and powering the world's most advanced compute clusters, from supercomputers used in AI research to high-performance clusters used in almost every industry today. As a system/software architect at NVIDIA, you will have the opportunity to work on some of the most innovative technology that is currently driving the world forward.

What you will be doing:

  • Create proofs-of-concept to evaluate and motivate extensions in AI frameworks (PyTorch/NeMo), new runtime designs, and new network hardware features.

  • Research, design, and implement features for AI and HPC communication middleware (NCCL, UCX, UCC) and deep learning frameworks such as TensorFlow and PyTorch.

  • Research, design, and develop hardware features relevant to scientific, deep learning, and data-intensive workloads.

  • Collaborate with customers to understand their needs and provide innovative solutions for them.

What we need to see:

  • Ph.D., Master's, or Bachelor's degree in computer science, computer engineering, electrical engineering, or a closely related field.

  • 5+ years of experience in DNNs, scaling of DNNs, parallelism in DNN frameworks, or deep learning training workloads.

  • Deep understanding of parallelism techniques, including data parallelism, pipeline parallelism, tensor parallelism, and Fully Sharded Data Parallel (FSDP).

  • Experience with AI network parallelism using collective libraries and RDMA/RoCE.

  • Background in algorithm design, system programming, and computer architecture.

  • Strong programming and software development skills.

  • Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Ways to stand out from the crowd:

  • Deep understanding of technology and passion for what you do.

  • Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment.

  • Background in designing communication middleware for high-performance computing systems, including RoCE and DPUs.

  • Background in CUDA programming, NVIDIA GPUs, and programming models for emerging architectures.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.