Senior Site Reliability Engineer - Storage
NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s an outstanding legacy of innovation that’s fueled by phenomenal technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
We are looking for a Staff Systems Storage – Site Reliability Engineer for our Global Storage Team, working from Yokneam, Israel. This is a highly specialized area that demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source and cloud-enabling technologies.
SRE at NVIDIA ensures that our internal and external-facing services deliver the reliability and uptime promised to users, while enabling developers to make changes to existing systems through careful preparation and planning, keeping an eye on capacity, latency, and performance. SREs take a big-picture view of how our systems relate to each other and use a breadth of tools and approaches to address a broad spectrum of problems. As part of this team, you will work with teams across the company and apply practices such as limiting time spent on reactive operational work to improve the reliability and performance of systems and services. Challenges include balancing project responsibilities with advanced operational issues. If your primary skill is in NAS, SAN, and/or object storage systems, and you are able to define and lead primarily lab and storage projects globally, this job might be for you.
What you will be doing:
Lead initiatives in a global team of storage engineers to design, build, deploy, and manage storage systems consisting of a variety of enterprise appliances, networks, and open-source technologies.
Engage with engineering teams worldwide to build and implement policy for capacity management and lifecycle management of storage resources.
Participate in global infrastructure operations under a follow-the-sun model, and create runbooks to run best-in-class infrastructure.
Build automation to improve observability, availability, scalability, latency, and efficiency.
Work closely with engineering teams to understand their requirements and influence the methodologies used to build, test, and deploy applications.
What we need to see:
8+ years of experience in the design, deployment, and management of enterprise NAS such as NetApp and Pure Storage, and of S3 storage.
Engineering degree in Computer Engineering or Computer Science, or equivalent experience.
Collaborative mindset, solid attention to detail, and excellent written and verbal communication skills, with the ability to operate with an SRE mindset focused on continuous improvement.
Experience working in a large-scale, globally distributed environment.
Self-motivated fast learner with strong verbal and written communication skills and a high service orientation, able to work both independently and as a member of a team.
Ways to stand out from the crowd:
Demonstrated experience with Lustre storage.
Hands-on experience with containers and container orchestration (Docker, Kubernetes).
Experience with CI/CD systems and building integrations into infrastructure automation systems using APIs.
With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people in the world working for us. If you're creative and motivated, we want to hear from you!
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.