Senior Site Reliability Engineer

Professional.me

United Arab Emirates · Full-time · Mid-Senior

About the Client:

We are sourcing on behalf of a forward-thinking AI company based in Abu Dhabi that builds large-scale, data-driven platforms powering machine learning and predictive intelligence solutions across industries. As part of their mission to scale infrastructure and reliability alongside cutting-edge AI products, the company is seeking a Site Reliability Engineer (SRE) to join their growing engineering team.

Location:

Hybrid – Abu Dhabi, UAE

Role Summary:

As a Site Reliability Engineer, you will be responsible for ensuring the availability, scalability, and reliability of complex AI and big data systems. You will collaborate with software engineers, data scientists, and DevOps teams to automate infrastructure, monitor performance, resolve incidents, and proactively improve system robustness. This role is ideal for someone passionate about operational excellence at scale, particularly in data-heavy environments.

Key Responsibilities:

Design, implement, and maintain high-availability infrastructure supporting AI workloads and data pipelines.
Develop tools and automation to improve deployment, monitoring, alerting, and incident response.
Ensure SLAs and SLOs are defined, tracked, and met across production systems.
Support and maintain Kubernetes clusters and containerized microservices in cloud and hybrid environments.
Work closely with engineering and data teams to improve system reliability during model training and large-scale inference.
Implement security best practices across CI/CD, networking, and runtime environments.
Perform capacity planning, failure analysis, and chaos testing to identify bottlenecks.
Document incident post-mortems and lead blameless root cause analysis and remediation planning.
Integrate observability tools (e.g., Prometheus, Grafana, ELK stack) to improve visibility into application and infrastructure performance.

Required Qualifications & Experience:

Bachelor’s degree in Computer Science, Engineering, or a related field.
4+ years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.
Proficiency in Linux systems, containerization (Docker), and orchestration (Kubernetes).
Strong experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code (e.g., Terraform, Helm).
Hands-on experience building and scaling big data pipelines or machine learning infrastructure.
Solid programming/scripting skills in Python, Bash, or Go.
Experience managing monitoring and logging tools (e.g., Prometheus, Grafana, ELK, Datadog).
Strong understanding of networking, DNS, SSL/TLS, and security hardening.

Preferred Qualifications:

Experience supporting real-time data processing systems (e.g., Kafka, Spark, Flink).
Exposure to MLOps workflows and model deployment pipelines.
Familiarity with distributed system challenges and recovery strategies.
Previous experience in high-growth AI, SaaS, or cloud-native companies.

Key Skills:

Site Reliability & Incident Management
Cloud Infrastructure (AWS/GCP/Azure)
Kubernetes & CI/CD Automation
Monitoring & Observability
High Availability Systems Design
Big Data Infrastructure Support
Infrastructure as Code (Terraform, Helm)
Python/Bash Scripting
Performance Optimization
Security & Compliance

By applying to this position, you are granting us permission to keep your CV on file for consideration for this and future opportunities.

Posted: May 21, 2025
Type: Full-time
Level: Mid-Senior
Location: Abu Dhabi
Company: Professional.me

Industries

Internet Marketplace Platforms
