Lead/Principal - Site Reliability Engineering

Salesforce

Software Engineering

Hyderabad, Telangana, India

Posted on Nov 6, 2024

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Details

About Salesforce

We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And we empower you to be a Trailblazer, too: driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good, you’ve come to the right place.

Role Description:
As a Lead/Principal Software Engineer in Site/Product Reliability Engineering, you will play a pivotal role in ensuring and scaling the reliability of our AgentForce platform. This role is based in our India operations center and requires shift work, including weekends, to support services aligned with US hours. You will be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce’s AgentForce platform, with a focus on production support for the generative and predictive AI platform.

Your Impact:
In this role, you will address end-to-end production challenges related to the AgentForce AI platform. You will lead the triaging of production issues for critical projects within our Generative AI platform, implement automated solutions to enhance reliability, and maintain comprehensive documentation of production incidents. Additionally, you will collaborate closely with AgentForce AI, product, and platform teams as part of a dynamic and innovative group of developers, architects, and product engineers.

Key Responsibilities:

Passion for triaging and solving complex problems in production systems.
You will establish the reliability process and collaborate closely with lead engineers.
Multi-System Debugging and Triage (must-have): AgentForce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
Capable of investigating alerts and customer-reported issues by comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to relevant engineering contacts when necessary and work to resolve the issue.
Salesforce Core Platform Knowledge (nice-to-have): Familiarity with the Salesforce Core platform and its architecture is a plus, given AgentForce’s diverse configurations, user permissions, and CRM licensing setups. Strong knowledge of feature provisioning, user permissions, and CRM licensing requirements is beneficial.
Production Support & Issue Triage: Lead and shape the production triage process for AgentForce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
Infrastructure & Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for Agentforce services in production.
Ensure that the scaling of Large Language Models (LLMs) and associated services in production stays in line with projected capacity requirements based on usage patterns. Regularly review chatbot and AI model utilization, and optimize capacity based on usage trends to prevent outages.
Automation & Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
Monitoring & Trust Management: Use the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
Identify automation gaps in production and compare them against established critical user journey (CUJ) benchmarks for reliability and trustworthiness.
Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.

Minimum Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related technical field.
Proven expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
8+ years of experience in production support and triaging roles with a focus on end-to-end, infrastructure, and operational reliability.
Experience in DevOps or data center management roles with expertise in Linux system engineering.
Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.

Preferred Qualifications:

Experience in managing large-scale AI applications and services, including monitoring and diagnostic techniques.
Expertise in deploying and managing LLMs and technologies like Retrieval-Augmented Generation (RAG).
Background in monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
AWS or Salesforce certifications are a plus.

What We Offer:

An opportunity to lead and scale key initiatives within our AI platform.
A collaborative work environment focused on innovation and impact.
Competitive compensation and benefits package.

Accommodations

If you require assistance due to a disability when applying for open positions, please submit a request via this Accommodations Request Form.

Posting Statement

At Salesforce we believe that the business of business is to improve the state of our world. Each of us has a responsibility to drive Equality in our communities and workplaces. We are committed to creating a workforce that reflects society through inclusive programs and initiatives such as equal pay, employee resource groups, inclusive benefits, and more. Learn more about Equality at www.equality.com and explore our company benefits at www.salesforcebenefits.com.

Salesforce is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status. Salesforce does not accept unsolicited headhunter and agency resumes. Salesforce will not pay any third-party agency or company that does not have a signed agreement with Salesforce.

Salesforce welcomes all.