We are sourcing on behalf of a forward-thinking AI company based in Abu Dhabi that builds large-scale, data-driven platforms powering machine learning and predictive intelligence solutions across industries. As part of their mission to scale infrastructure and reliability alongside cutting-edge AI products, the company is seeking a Site Reliability Engineer (SRE) to join their growing engineering team.
Hybrid – Abu Dhabi, UAE
As a Site Reliability Engineer, you will be responsible for ensuring the availability, scalability, and reliability of complex AI and big data systems. You will collaborate with software engineers, data scientists, and DevOps teams to automate infrastructure, monitor performance, resolve incidents, and proactively improve system robustness. This role is ideal for someone passionate about operational excellence at scale, particularly in data-heavy environments.
- Design, implement, and maintain high-availability infrastructure supporting AI workloads and data pipelines.
- Develop tools and automation to improve deployment, monitoring, alerting, and incident response.
- Ensure SLAs and SLOs are defined, tracked, and met across production systems.
- Support and maintain Kubernetes clusters and containerized microservices in cloud and hybrid environments.
- Work closely with engineering and data teams to improve system reliability during model training and large-scale inference.
- Implement security best practices across CI/CD, networking, and runtime environments.
- Perform capacity planning, failure analysis, and chaos testing to identify bottlenecks.
- Document incident post-mortems and lead blameless root cause analysis and remediation planning.
- Integrate observability tools (e.g., Prometheus, Grafana, ELK stack) to improve visibility into application and infrastructure performance.
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 4+ years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.
- Proficiency in Linux systems, containerization (Docker), and orchestration (Kubernetes).
- Strong experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code (Terraform, Helm, etc.).
- Hands-on experience building and scaling big data pipelines or machine learning infrastructure.
- Solid programming/scripting skills in Python, Bash, or Go.
- Experience managing monitoring and logging tools (e.g., Prometheus, Grafana, ELK, Datadog).
- Strong understanding of networking, DNS, SSL/TLS, and security hardening.
- Experience supporting real-time data processing systems (e.g., Kafka, Spark, Flink).
- Exposure to MLOps workflows and model deployment pipelines.
- Familiarity with distributed system challenges and recovery strategies.
- Previous experience in high-growth AI, SaaS, or cloud-native companies.
- Site Reliability & Incident Management
- Cloud Infrastructure (AWS/GCP/Azure)
- Kubernetes & CI/CD Automation
- Monitoring & Observability
- High Availability Systems Design
- Big Data Infrastructure Support
- Infrastructure as Code (Terraform, Helm)
- Python/Bash Scripting
- Performance Optimization
- Security & Compliance
By applying to this position, you are granting us permission to keep your CV on file for consideration for this and future opportunities.
- Posted: May 21, 2025
- Type: Full-time
- Level: Mid-Senior
- Location: Abu Dhabi
- Company: Professional.me