ML Ops/SRE Engineer

AppliedAI

United Arab Emirates · Full-time · Associate

Opus - ML Ops/Site Reliability Engineer

Location: Abu Dhabi, UAE

Position Overview

We're seeking a Site Reliability Engineer to join our growing team. In this role, you'll work at the intersection of operations and development, ensuring our platform's reliability, performance, and security while supporting our development teams in building and maintaining scalable solutions.

Key Responsibilities

Monitor and maintain production, development and staging environments, ensuring high availability and optimal performance of our architecture
Collaborate with DevOps, MLOps and Development teams to troubleshoot and resolve issues
Support and optimize our observability stack
Help maintain and improve our incident response processes
Assist in capacity planning and performance optimization for systems handling up to 10,000 requests per minute
Ensure compliance with security standards and regulatory requirements across our infrastructure
Participate in on-call rotation during regular business hours with occasional emergency support
Implement and maintain SLOs, SLIs, and SLAs
Optimize cloud costs while maintaining system performance and reliability
Support our CI/CD pipelines and deployment processes

Required Skills & Experience

Infrastructure Best Practices

Experience implementing infrastructure following AWS Well-Architected Framework principles
Knowledge of infrastructure patterns for high availability, fault tolerance, and disaster recovery
Understanding of infrastructure security best practices (principle of least privilege, network segmentation, encryption at rest/in transit)
Experience with infrastructure compliance and governance frameworks
Familiarity with cost optimization strategies and FinOps practices
Knowledge of Infrastructure as Code best practices (modularity, reusability, versioning)
Understanding of observability patterns (logging, metrics, tracing)

Technical Experience

3+ years of experience in SRE, DevOps, or similar roles
Strong experience with AWS services, including:

- Compute: Lambda, ECS, Fargate

- Networking: ALB, ELB, API Gateway, Route 53, CloudFront, AppSync

- Databases: DynamoDB, RDS (PostgreSQL), Aurora

- Messaging: EventBridge, SNS, SQS

- Security: Security Groups, Secrets Manager (SM), Systems Manager (SSM), IAM

- Developer Tools: ECR, CodeBuild, CodeDeploy

Experience with monitoring and observability tools
Knowledge of infrastructure as code (CDK and Terraform)
Understanding of event-driven architectures
Experience with containerization and microservices
Solid scripting and automation skills
Strong problem-solving abilities and systematic debugging skills
Experience working in Agile environments

Preferred Qualifications

AWS certifications
Experience with Next.js, Node.js, and Python
Knowledge of authentication systems (Auth0, SSO)
Familiarity with regulatory compliance requirements (SOC 2, HIPAA, GDPR, PCI DSS)
Experience with ML/LLM operations
Multi-region AWS deployment experience
Experience with high-traffic systems
Experience with database management and optimization
Knowledge of caching strategies and CDN implementations
Understanding of data lifecycle management and ETL processes
Experience with vector and graph databases is a plus

What We Offer

Opportunity to work with cutting-edge technologies
Collaborative environment with dedicated Architecture, Development, DevOps & MLOps teams
Growth potential in a rapidly scaling startup
Work with a globally distributed team
Regular working hours with flexibility for occasional emergency support
Chance to shape and improve SRE practices

Required Qualities

Strong communication skills
Problem-solving mindset
Team player attitude
Self-motivated and proactive approach
Ability to work in a fast-paced startup environment
Interest in continuous learning and improvement

The ideal candidate will combine technical expertise with a passion for system reliability and a collaborative approach to problem-solving. They should be comfortable working in a dynamic startup environment while maintaining high standards for system reliability and performance.

Key Skills

Ranked by relevance

devops, aws, infrastructure as code, high availability, mlops, technical expertise, incident response, containerization, fault tolerance, postgresql, dynamodb, hipaa, cloud, gdpr, cicd, etl, ecs

Posted: Oct 06, 2025
Type: Full-time
Level: Associate
Location: Abu Dhabi
Company: AppliedAI

Industries

Technology, Information and Internet
