Manager, Cloud Platform Operations
Tungsten Automation
Operations
Sofia City Province, Bulgaria
Posted on Mar 21, 2026
Tracking Code
E26-031
Job Location
Business Centre "Labirint" 5th Floor Liulin 10 District, Sofia, Sofia,
Job Level
Not Applicable
Category
Cloud
Position Type
Full-Time/Regular
Job Purpose
As the Manager, Cloud Platform Operations, you will lead a global team of 20+ engineers to ensure the stability, scalability, and reliability of our cloud infrastructure. You will act as a strategic leader, bridging high-level business goals with deep technical execution, while driving the adoption of Site Reliability Engineering (SRE) principles and intelligent automation.
Key Responsibilities
1. Operational Strategy & SRE Governance
- Define Reliability Standards: Establish and maintain SLIs (Service Level Indicators) and SLOs (Service Level Objectives), managing Error Budgets to balance system reliability with feature velocity.
- AIOps Implementation: Lead the transition to AI-driven observability, selecting and implementing tools that proactively identify anomalies before they impact operations.
- Toil Reduction: Audit team workflows to identify repetitive manual tasks and prioritize automation to eliminate inefficiencies.
2. Global Team Leadership
- Follow-the-Sun Model: Design seamless handover processes across global regions (APAC, EMEA, AMER) to ensure 24/7 service continuity.
- Workload Management: Balance "Keep the Lights On" (KTLO) tasks with high-value engineering projects, ensuring sustainable workloads for the team.
- Mentorship & Growth: Conduct regular 1-on-1s, provide career development guidance, and manage performance for a diverse, distributed team.
3. Technical Oversight & Architecture
- GitOps & IaC Standards: Enforce Infrastructure as Code (IaC) best practices using tools like Terraform and Ansible, ensuring version-controlled, automated deployments.
- Architectural Leadership: Act as a consulting architect, ensuring new services are designed for scalability, operability, and cost-efficiency in Azure or AWS.
- Kubernetes Expertise: Oversee the lifecycle of Kubernetes clusters, including upgrades, security patching, and the implementation of custom Operators.
4. Incident Management & On-Call Excellence
- Incident Escalation: Serve as the primary escalation point for major incidents, ensuring rapid resolution and clear communication with stakeholders.
- Post-Mortem Culture: Lead blameless post-mortems to identify root causes and implement preventative measures for future reliability.
5. Intelligent Automation & AI Enablement
- AI-Driven Efficiency: Leverage AI-enabled tools (e.g., chatbots, documentation automation platforms, analytics assistants) to enhance operational efficiency, improve data accuracy, and streamline routine workflows, while ensuring full compliance with company AI governance frameworks and data privacy standards.
- Process Optimization: Identify high-impact opportunities to embed AI and automation into support, monitoring, and reporting processes to reduce manual effort and increase team productivity.
Required Skills
Technical Expertise
- Cloud Platforms: Proficiency in Azure (preferred) or AWS.
- SRE Principles: Deep understanding of SRE practices, including toil reduction, performance monitoring, and SLO lifecycle management.
- Infrastructure as Code: Hands-on experience with Terraform, Ansible, and GitOps workflows.
- Kubernetes Mastery: Expertise in Kubernetes, including Operators and cluster lifecycle management.
- Observability Tools: Familiarity with AIOps tools and platforms like Prometheus, Grafana, New Relic, and Logz.io.
- System Administration: Proficiency in Linux kernel tuning or Windows Server administration.
- AI Prompting: Skill in prompting AI systems and assessing output quality.
- AI Enablement: Ability to use AI to ideate, develop, and scale solutions that meet the needs of the department.
Soft Skills
- Strategic Thinking: Ability to translate complex technical challenges into actionable strategies aligned with business goals.
- Communication: Skilled at providing clear, concise updates to executive leadership and fostering collaboration across teams.
- Problem Solving: Innovative thinker who thrives in high-pressure environments and advocates for engineering best practices.
Required Experience
Leadership & Experience
- Team Management: Minimum of 2 years managing teams of 10+ engineers, with 5+ years of hands-on experience in DevOps, SRE, or Cloud Operations.
- Global Operations: Proven experience managing distributed teams and navigating the complexities of 24/7 operational environments.