Professional.me

Senior Site Reliability Engineer

United Arab Emirates · Full-time · Mid-Senior

About the Client:


We are sourcing on behalf of a forward-thinking AI company based in Abu Dhabi that builds large-scale, data-driven platforms powering machine learning and predictive intelligence solutions across industries. As part of their mission to scale infrastructure and reliability alongside cutting-edge AI products, the company is seeking a Senior Site Reliability Engineer (SRE) to join their growing engineering team.


Location:


Hybrid – Abu Dhabi, UAE



Role Summary:


As a Site Reliability Engineer, you will be responsible for ensuring the availability, scalability, and reliability of complex AI and big data systems. You will collaborate with software engineers, data scientists, and DevOps teams to automate infrastructure, monitor performance, resolve incidents, and proactively improve system robustness. This role is ideal for someone passionate about operational excellence at scale, particularly in data-heavy environments.



Key Responsibilities:


  • Design, implement, and maintain high-availability infrastructure supporting AI workloads and data pipelines.
  • Develop tools and automation to improve deployment, monitoring, alerting, and incident response.
  • Ensure SLAs and SLOs are defined, tracked, and met across production systems.
  • Support and maintain Kubernetes clusters and containerized microservices in cloud and hybrid environments.
  • Work closely with engineering and data teams to improve system reliability during model training and large-scale inference.
  • Implement security best practices across CI/CD, networking, and runtime environments.
  • Perform capacity planning, failure analysis, and chaos testing to identify bottlenecks.
  • Document incident post-mortems and lead blameless root cause analysis and remediation planning.
  • Integrate observability tools (e.g., Prometheus, Grafana, ELK stack) to improve visibility into application and infrastructure performance.



Required Qualifications & Experience:


  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 4+ years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.
  • Proficiency in Linux systems, containerization (Docker), and orchestration (Kubernetes).
  • Strong experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code (e.g., Terraform, Helm).
  • Hands-on experience building and scaling big data pipelines or machine learning infrastructure.
  • Solid programming/scripting skills in Python, Bash, or Go.
  • Experience managing monitoring and logging tools (e.g., Prometheus, Grafana, ELK, Datadog).
  • Strong understanding of networking, DNS, SSL/TLS, and security hardening.



Preferred Qualifications:


  • Experience supporting real-time data processing systems (e.g., Kafka, Spark, Flink).
  • Exposure to MLOps workflows and model deployment pipelines.
  • Familiarity with distributed system challenges and recovery strategies.
  • Previous experience in high-growth AI, SaaS, or cloud-native companies.



Key Skills:


  • Site Reliability & Incident Management
  • Cloud Infrastructure (AWS/GCP/Azure)
  • Kubernetes & CI/CD Automation
  • Monitoring & Observability
  • High Availability Systems Design
  • Big Data Infrastructure Support
  • Infrastructure as Code (Terraform, Helm)
  • Python/Bash Scripting
  • Performance Optimization
  • Security & Compliance


By applying to this position, you are granting us permission to keep your CV on file for consideration for this and future opportunities.

Posted: May 21, 2025
Type: Full-time
Level: Mid-Senior
Location: Abu Dhabi
Industries: Internet Marketplace Platforms
Categories: Engineering, Information Technology
