Senior Site Reliability Engineer - GeForce NOW

NVIDIA

Software Engineering
Multiple locations
Posted on Wednesday, February 7, 2024

NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its GeForce NOW (GFN) team. SREs at NVIDIA ensure that our internal and external-facing GPU cloud gaming services deliver the reliability and uptime promised to users, while enabling developers to make changes to the existing system through careful preparation and planning, with an eye on capacity, latency, and performance. Because SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle complex problems.

The person in this position will be responsible for Service Response and Workflows and will drive tool and service development to maintain and improve service SLOs. We partner with Service Owners to drive the reliability of the service. GFN is an exciting service in the growing game streaming industry.

What you will be doing:

  • Build tools to improve SRE observability.

  • Take part in the Kubernetes migration, including VMI setup and problem solving.

  • Rapidly debug and triage incidents and user-reported issues.

  • Take ownership of automation, scripting, and tooling for new and existing scripts to help the team achieve 100% automation of daily tasks.

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • MS or BS in Computer Science/Engineering or a related field or equivalent experience.

  • 8+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling.

  • Very strong Kubernetes background and the ability to work with complex, highly available VMI setups on K8s.

  • Experience leading significant production improvements, including change management, post-mortem reviews, and workflow processes, and designing and delivering software automation in various languages.

  • Proven strengths in problem solving and root-cause analysis, while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.

  • Experience with Jenkins (or similar CI/CD) setup, configuration, and deployment is a requirement.

  • Excellent communication, presentation, social, and analytical skills; the ability to communicate complex interaction concepts clearly and persuasively across different audiences and varying levels of the organization.

  • Experience with StackStorm, Prometheus, Kubernetes, and similar tools is a bonus.

  • Prior experience as an SRE or in Service Engineering is a huge plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 164,000 USD - 253,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.