Principal Engineer, AIOps

NVIDIA

Other Engineering
Santa Clara, CA, USA
Posted on Nov 23, 2024

We are looking for an AIOps Principal Engineer who can design, develop, and deploy AI-powered solutions for IT operations. You will work with a team of engineers, data scientists, and domain experts to create and implement innovative applications that leverage NVIDIA's Observability, Infrastructure and Gen AI platforms. You will also collaborate with internal and external customers to understand their needs, define requirements, and deliver high-quality products.

What you'll be doing:

  • Lead the design, development, testing, and deployment of the AIOps platform.

  • Apply machine learning, deep learning, natural language processing, and other AI techniques to solve IT operations challenges such as anomaly detection, root cause analysis, incident management, and automation.

  • Improve IT Infrastructure and Operations Management by defining and measuring AIOps metrics such as accuracy, reliability, scalability, performance, and efficiency.

  • Implement observability principles and practices such as monitoring, logging, tracing, and alerting.

  • Apply deep knowledge of data science engineering, including data collection, data cleaning, data analysis, data modeling, and data visualization.

  • Integrate AIOps tools with IT operations management (ITOM) and IT service management (ITSM) systems, including service desk, change management, and configuration management.

  • Demonstrate solid leadership skills and the ability to lead and empower engineers and data scientists.

  • Design and communicate the AIOps roadmap, vision, and strategy to the team and the partners.

  • Collaborate effectively with customers, such as IT managers, business users, vendors, and partners, to ensure alignment and satisfaction.

  • Play a pivotal role in harnessing AI, generative AI, and machine learning for NVIDIA IT teams.

What we need to see:

  • Bachelor's degree or higher in computer science, engineering, or a related field (or equivalent experience).

  • 15+ years of industry experience in extensive engineering projects, with a particular emphasis on infrastructure automation, distributed systems, and tool development for managing large-scale private or public cloud systems.

  • 5+ years of experience working with AIOps technologies and platforms, and a solid understanding of them.

  • Proficiency with Python-based AI frameworks and libraries such as TensorFlow or PyTorch.

  • Proficiency in Python and Go programming; your coding and debugging expertise are pivotal to your success in this role.

  • Demonstrated commitment to sound software engineering principles and a strong willingness to acquire new skills.

  • Experience in working with IT systems, tools, and processes such as ITSM, ITOM, monitoring, logging, and alerting.

  • Ability to work independently and collaboratively in a fast-paced and dynamic environment.

  • Hands-on experience designing and implementing the end-to-end architecture and large-scale rollout of an AIOps product.

  • Experience developing Gen AI applications using LLMs and RAG for incident diagnosis, root cause identification, and incident resolution.

Ways to stand out from the crowd:

  • Proficiency in developing and deploying generative AI solutions such as language models, chatbots, and conversational assistants.

  • Hands-on experience integrating workflow automation tools with AIOps for incident resolution and self-healing.

  • Deep background and understanding of Machine Learning: developing, training, and applying machine learning models across large operational datasets.

  • Experience with pre-training and fine-tuning LLMs and working with ML frameworks such as scikit-learn, XGBoost, PyTorch, and TensorFlow.

  • Hands-on experience with AIOps platforms such as BigPanda, Datadog, Moogsoft, ITOM Health, Splunk, Elastic Stack, Dynatrace, and New Relic.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're a creative individual who thrives on achieving goals and enjoys a dynamic learning environment, then why not seize this opportunity? Apply today!

The base salary range is 248,000 USD - 385,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.