Opus - ML Ops/Site Reliability Engineer
Location: Abu Dhabi, UAE
Position Overview
We're seeking a Site Reliability Engineer to join our growing team. In this role, you'll work at the intersection of operations and development, ensuring our platform's reliability, performance, and security while supporting our development teams in building and maintaining scalable solutions.
Key Responsibilities
- Monitor and maintain production, development and staging environments, ensuring high availability and optimal performance of our architecture
- Collaborate with DevOps, MLOps and Development teams to troubleshoot and resolve issues
- Support and optimize our observability stack
- Help maintain and improve our incident response processes
- Assist in capacity planning and performance optimization for systems handling up to 10,000 requests per minute
- Ensure compliance with security standards and regulatory requirements across our infrastructure
- Participate in an on-call rotation during regular business hours, with occasional emergency support
- Implement and maintain SLOs, SLIs, and SLAs
- Optimize cloud costs while maintaining system performance and reliability
- Support our CI/CD pipelines and deployment processes
Required Skills & Experience
Infrastructure Best Practices
- Experience implementing infrastructure following AWS Well-Architected Framework principles
- Knowledge of infrastructure patterns for high availability, fault tolerance, and disaster recovery
- Understanding of infrastructure security best practices (principle of least privilege, network segmentation, encryption at rest/in transit)
- Experience with infrastructure compliance and governance frameworks
- Familiarity with cost optimization strategies and FinOps practices
- Knowledge of Infrastructure as Code best practices (modularity, reusability, versioning)
- Understanding of observability patterns (logging, metrics, tracing)
Technical Experience
- 3+ years of experience in SRE, DevOps, or similar roles
- Strong experience with AWS services, including:
  - Compute: Lambda, ECS, Fargate
  - Networking: ALB, ELB, API Gateway, Route53, CloudFront, AppSync
  - Databases: DynamoDB, RDS (PostgreSQL), Aurora
  - Messaging: EventBridge, SNS, SQS
  - Security: Security Groups, Secrets Manager (SM), Systems Manager (SSM), IAM
  - Developer Tools: ECR, CodeBuild, CodeDeploy
- Experience with monitoring and observability tools
- Knowledge of Infrastructure as Code (CDK and Terraform)
- Understanding of event-driven architectures
- Experience with containerization and microservices
- Solid scripting and automation skills
- Strong problem-solving abilities and systematic debugging skills
- Experience working in Agile environments
Preferred Qualifications
- AWS certifications
- Experience with Next.js, Node.js and Python
- Knowledge of authentication systems (Auth0, SSO)
- Familiarity with regulatory compliance requirements (SOC 2, HIPAA, GDPR, PCI DSS)
- Experience with ML/LLM operations
- Multi-region AWS deployment experience
- Experience with high-traffic systems
- Experience with database management and optimization
- Knowledge of caching strategies and CDN implementations
- Understanding of data lifecycle management and ETL processes
- Experience with vector and graph databases
What We Offer
- Opportunity to work with cutting-edge technologies
- Collaborative environment with dedicated Architecture, Development, DevOps & MLOps teams
- Growth potential in a rapidly scaling startup
- Work with a globally distributed team
- Regular working hours with flexibility for occasional emergency support
- Chance to shape and improve SRE practices
Required Qualities
- Strong communication skills
- Problem-solving mindset
- Team player attitude
- Self-motivated and proactive approach
- Ability to work in a fast-paced startup environment
- Interest in continuous learning and improvement
The ideal candidate will combine technical expertise with a passion for system reliability and a collaborative approach to problem-solving. They should be comfortable working in a dynamic startup environment while maintaining high standards for system reliability and performance.
Posted: Oct 06, 2025
Type: Full-time
Level: Associate
Location: Abu Dhabi
Company: AppliedAI