Manager, Cloud Platform Operations

Tungsten Automation

Operations

Londonderry, UK

Posted on Mar 21, 2026

Tracking Code

E26-031

Job Location

Business Centre "Labirint", 5th Floor, Liulin 10 District, Sofia, Sofia

Job Level

Not Applicable

Position Type

Full-Time/Regular

Job Purpose

As the Manager, Cloud Platform Operations, you will lead a global team of 20+ engineers to ensure the stability, scalability, and reliability of our cloud infrastructure. You will act as a strategic leader, bridging high-level business goals with deep technical execution, while driving the adoption of Site Reliability Engineering (SRE) principles and intelligent automation.

Key Responsibilities

1. Operational Strategy & SRE Governance

Define Reliability Standards: Establish and maintain SLIs (Service Level Indicators) and SLOs (Service Level Objectives), managing Error Budgets to balance system reliability with feature velocity.
AIOps Implementation: Lead the transition to AI-driven observability, selecting and implementing tools that proactively identify anomalies before they impact operations.
Toil Reduction: Audit team workflows to identify repetitive manual tasks and prioritize automation to eliminate inefficiencies.

2. Global Team Leadership

Follow-the-Sun Model: Design seamless handover processes across global regions (APAC, EMEA, AMER) to ensure 24/7 service continuity.
Workload Management: Balance "Keep the Lights On" (KTLO) tasks with high-value engineering projects, ensuring sustainable workloads for the team.
Mentorship & Growth: Conduct regular 1-on-1s, provide career development guidance, and manage performance for a diverse, distributed team.

3. Technical Oversight & Architecture

GitOps & IaC Standards: Enforce Infrastructure as Code (IaC) best practices using tools like Terraform and Ansible, ensuring version-controlled, automated deployments.
Architectural Leadership: Act as a consulting architect, ensuring new services are designed for scalability, operability, and cost-efficiency in Azure or AWS.
Kubernetes Expertise: Oversee the lifecycle of Kubernetes clusters, including upgrades, security patching, and the implementation of custom Operators.

4. Incident Management & On-Call Excellence

Incident Escalation: Serve as the primary escalation point for major incidents, ensuring rapid resolution and clear communication with stakeholders.
Post-Mortem Culture: Lead blameless post-mortems to identify root causes and implement preventative measures for future reliability.

5. Intelligent Automation & AI Enablement

AI-Driven Efficiency: Leverage AI-enabled tools (e.g., chatbots, documentation automation platforms, analytics assistants) to enhance operational efficiency, improve data accuracy, and streamline routine workflows, while ensuring full compliance with company AI governance frameworks and data privacy standards.
Process Optimization: Identify high-impact opportunities to embed AI and automation into support, monitoring, and reporting processes to reduce manual effort and increase team productivity.

Required Skills

Technical Expertise

Cloud Platforms: Proficiency in Azure (preferred) or AWS.
SRE Principles: Deep understanding of SRE practices, including toil reduction, performance monitoring, and SLO lifecycle management.
Infrastructure as Code: Hands-on experience with Terraform, Ansible, and GitOps workflows.
Kubernetes Mastery: Expertise in Kubernetes, including Operators and cluster lifecycle management.
Observability Tools: Familiarity with AIOps tools and observability platforms such as Prometheus, Grafana, New Relic, and Logz.io.
System Administration: Proficiency in Linux kernel tuning or Windows Server administration.
AI Prompting: Skill in prompting AI systems and assessing the quality of their output.
AI Enablement: Ability to use AI to ideate, develop, and scale solutions to meet the needs of the department.

Soft Skills

Strategic Thinking: Ability to translate complex technical challenges into actionable strategies aligned with business goals.
Communication: Skilled at providing clear, concise updates to executive leadership and fostering collaboration across teams.
Problem Solving: Innovative thinker who thrives in high-pressure environments and advocates for engineering best practices.

Required Experience

Leadership & Experience

Team Management: Minimum of 2 years managing teams of 10+ engineers, with 5+ years of hands-on experience in DevOps, SRE, or Cloud Operations.
Global Operations: Proven experience managing distributed teams and navigating the complexities of 24/7 operational environments.
