Senior System Engineer - Site Reliability Accelerated Apache Spark

NVIDIA

Multiple locations
Posted on Thursday, November 16, 2023

We are seeking experienced Lead Site Reliability Engineers adept at Apache Spark to join our team. Apache Spark is the most popular distributed data processing engine in data centers. It is used for a wide variety of workloads, including data preparation, feature generation, reporting, and analytics. Data scientists spend a considerable amount of time exploring data and iterating over machine learning (ML) experiments. Every hour of compute required to sort through datasets, extract features, and fit ML algorithms impedes an efficient business workflow. NVIDIA has been working with the open source community to accelerate Apache Spark workloads for ETL and ML/DL. Our Spark accelerators are available on major cloud Spark services (including Databricks, Google Dataproc, AWS EMR) and on-prem Spark distributions (including Cloudera and HPE). Our enterprise customers have realized significant speedups and cost savings in production workloads.

At NVIDIA, we are passionate about working on hard problems that have an impact. You will work with a team that is developing the Spark RAPIDS open source library to accelerate Spark applications. You will work with cloud service providers to enable Apache Spark users to easily benefit from GPU acceleration.

What you'll be doing:

  • Define the deployment architecture of GPU accelerated Apache Spark services.

  • Develop and implement automation and monitoring strategies to ensure that services are delivered in an efficient and secure manner.

  • Collaborate with cloud service providers for service deployments.

  • Work with customers to resolve production issues.

  • Identify opportunities for service optimization and cost reduction.

What we need to see:

  • 8+ years of experience in systems/applications automation and incident response in 24x7 production service environments.

  • BS in Computer Science, Computer Engineering or equivalent experience.

  • Fluency with one or more current generation scripting languages used by DevOps professionals (PowerShell, Python, Perl, Go).

  • Excellent troubleshooting skills, with a systematic problem-solving approach and demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems.

  • Experience running and troubleshooting big data clusters: Apache Spark, Apache Hive, Apache Flink, Apache Hadoop, and Trino/Presto.

  • Background with infrastructure as code and configuration as code, utilizing tools like Terraform, CloudFormation, Chef, SaltStack, Puppet, DSC.

  • Experience with elastic scaling, fault tolerance, and other cloud architecture patterns, plus experience with Kubernetes and proficiency in packaging and configuration tools like Helm and Kustomize.

  • Proven strength in SaaS and experience with massive-scale operations.

  • Experience operating on AWS, GCP, Azure, or other public clouds (both PaaS and IaaS offerings).

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, passionate and self-motivated, we want to hear from you! NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.

The base salary range is 176,000 USD - 333,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.