SRE Manager, NIM Factory

NVIDIA

Santa Clara, CA, USA
Posted on Saturday, August 17, 2024

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking an SRE Manager to build and manage a team of SREs who monitor and operate both the factory automation for NVIDIA Inference Microservices (NIMs) and its deployed services. The right person for this role brings leadership that encourages the team's technical drive and creativity to change the way NVIDIA provides high-performance inferencing for every AI model. Our NIM offerings are easy to use, optimized for performance, and developed using a highly automated software factory. We create containers available for download and hosted services. You will apply your expertise to lead the operation of highly available services that make effective use of the thousands of GPUs involved in this operation. Your team's services provide best-in-class performance, accuracy, and availability.

What you'll be doing:

  • This is a ground-floor opportunity to form a team and define the SRE role in the NIM program. Your team will operate a software factory that takes an AI model in and produces a deployable service that is validated across Cloud, On-prem and Kubernetes environments. With the development team, you will define and deliver rapid iterations on the group's technical strategies and roadmaps to evolve the NIM factory for continuous delivery of packaged NIMs. Your team is responsible for the operation of the factory, including its availability, observability, and stability. It will also track the deployment of our services into multiple cloud hosts and improve the efficiency, availability, and stability of these services.

  • You will partner with internal and external SRE team leadership to provide the best experience for our developers and for the users of the resulting services. Your team ensures our operation is secure through the proper configuration and management of infrastructure, including containers, databases, and networking, and by following and improving standard processes for security, scalability, and cost optimization. This requires working closely with our security teams tasked with responding to security threats.

  • Broad collaboration with multiple AI model teams is needed to understand their requirements and build an efficient infrastructure that supports and improves development and production execution of these models. You will define metrics and drive improvements based on user feedback. You will mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing your own skills and those of the people around you.

What we need to see:

  • Supportive mentoring and empathetic leadership in recruiting and growing successful teams and team members. Flexibility and a clear ability to adjust your direction and expectations given the needs of our customers.

  • Effective experience working with multi-functional teams, principals and architects, and across organizational boundaries.

  • Demonstrated advanced system engineering skills operating and improving the observability, security, and maintainability of distributed microservice cloud applications and services. Experience operating distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus. Use of infrastructure as code, such as Terraform, Puppet, Ansible, or others.

  • Experience identifying the root cause of failures and performance bottlenecks in distributed microservices or cloud systems. Understanding and application of good security practices for public-facing cloud services.

  • BS or MS in Computer Science, Computer Engineering or equivalent experience.

  • 7+ years of overall experience as an SRE or Developer working on high-performance microservices and cloud software; 3+ years leading or managing engineering teams.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills and the ability to engage a multi-functional team.

  • Experience with cloud-deployed infrastructure, effective security practices in high-risk environments, and hyperscaling applications for demand.

  • A history of building and deploying containers for Microservices, Cloud and On-prem deployments, and their associated CI/CD pipelines.

We are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and creative people in the world working for us. If you're creative and autonomous with a real passion for technology, we want to hear from you. We are an equal opportunity employer and value diversity at our company.

The base salary range is 220,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.