Senior Compute Cluster Deployment Engineer



Multiple locations
Posted on Tuesday, July 2, 2024

NVIDIA is looking for a hardworking Senior Compute Cluster Deployment Engineer to join our Professional Services team.

You'll join a small team working around the globe to build some of the most cutting-edge data centers in the world. This role focuses on deploying server and compute clusters built on brand-new GPU platforms for AI and machine learning. You'll be working with some of the world's largest and most sophisticated customers and supercomputers. You'll work alongside our InfiniBand and Ethernet network engineers to deploy a complete solution for customers looking to adopt NVIDIA solutions into their business.

Opportunities for global travel and learning about the newest GPU-related technologies are plentiful as we seek to build, shape and expand this new aspect of our business.

What you will be doing:

  • Primary responsibilities will include managing and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.

  • Support operational and reliability aspects of large-scale AI clusters, with a focus on performance at scale, real-time monitoring, logging and alerting.

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • 5+ years providing in-depth support and deployment services, solving problems for hardware and software products.

  • Knowledge of and experience with Linux system administration: process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).

  • Experience with cluster management technologies, e.g. Bright Cluster Manager.

  • Scripting proficiency.

  • Good social skills, with the ability to manage and deliver resolutions for customer-blocking issues as they arise.

  • Superb communication and presentation/oral skills.

  • Excellent verbal and written English skills.

  • Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.

  • Candidates should have a minimum of a four-year degree from an accredited university or college in Computer Science, or Electrical or Computer Engineering.

  • Industry-standard Linux certifications.

Ways to stand out from the crowd:

  • InfiniBand experience.

  • Experience with GPU-focused hardware/software.

  • Experience with MPI.

  • Automation tooling background (Ansible, Salt, Puppet, etc.).

  • Ethernet and Storage technologies.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family.