Senior Site Reliability Engineer

NVIDIA

Software Engineering
Multiple locations
Posted on Oct 8, 2024

NVIDIA’s Infrastructure, Planning and Processes (IPP) organization is seeking a hard-working and experienced Site Reliability/DevOps Engineer with a strong background in Infrastructure Management, Monitoring, Automation, & System Administration to join our Sanity Operations Team in Pune. The IPP organization provides infrastructure, products & services for multiple software teams, including the GPU, Mobile, and Automotive divisions, working on NVIDIA's extraordinary products & services.

The team is responsible for hosting, enabling & running the large-scale private cloud systems & services for our in-house testing CI framework. The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows, Linux, Android, etc.), running with NVIDIA GPUs and Tegra processors.

What you’ll be doing:

  • Create resilient, scalable, and efficient test and deployment pipelines.

  • Design and implement complex automation platforms to identify & resolve operational inefficiencies.

  • Triage software, hardware and infrastructure issues and maintain high availability for our infrastructure & services.

  • Deploy & monitor critical high-performance, large-scale services running on geo-distributed systems.

  • Continuously strive for efficient utilization & management of the infrastructure.

  • Automate processes that enable developers to adopt self-service practices, while ensuring compliance with security standards.

  • Work with architects and engineers across the teams to review the designs & solutions during development and deployment phases.

  • Collaborate with our other engineering teams to deliver reliable, robust, and high-performance underlying infrastructure.

  • Mine & analyze data from multiple sources to identify scaling & optimization opportunities.

What we need to see:

  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or equivalent experience, with 8+ years of experience in a DevOps environment.

  • Strong hands-on experience in configuring, maintaining, and building upon deployments of industry-standard tools (e.g. Kubernetes, Jenkins, Docker, CMake, GitLab, Jira, etc.).

  • Working experience in monitoring & maintaining large-scale infrastructure applications running in a microservice-based architecture.

  • Proficiency with virtualization architecture, with strong experience in Kubernetes, VMs, and Docker.

  • Experience with continuous integration and continuous delivery systems such as GitLab, GitOps, Jenkins, Packer, and Terraform.

  • Strong Python scripting skills, with a proven background in using and writing JSON/REST APIs.

  • Fluency in writing queries for MySQL or equivalent databases, including NoSQL databases.

  • Solid understanding of configuration management tools like Chef, Puppet, Ansible, etc.

  • Working experience with Perforce, Git, or another version control system is necessary.

  • Experience with telemetry and alerting systems such as Kibana, Elasticsearch, Grafana, and Prometheus to create rich visualizations of system health over time.

  • Ability to self-manage, show leadership, mentor others and communicate well.

Ways to stand out from the crowd:

  • Understanding of networking concepts like TCP/IP and firewall management.

  • Exposure to web apps/dashboards on frameworks like Django, AngularJS, VueJS, etc.

  • High-level understanding of build and test systems.

  • Experience in building regression detection systems by analyzing real-time production data, emphasizing important metrics.

  • Innovating with industry-standard tools and collaborating with the open source community.