AppliedAI
ML Ops/SRE Engineer
AppliedAI, United Arab Emirates
Full-time, Engineering

Opus - ML Ops/Site Reliability Engineer

Location: Abu Dhabi, UAE


Position Overview

We're seeking a Site Reliability Engineer to join our growing team. In this role, you'll work at the intersection of operations and development, ensuring our platform's reliability, performance, and security while supporting our development teams in building and maintaining scalable solutions.


Key Responsibilities

  • Monitor and maintain production, development and staging environments, ensuring high availability and optimal performance of our architecture
  • Collaborate with DevOps, MLOps and Development teams to troubleshoot and resolve issues
  • Support and optimize our observability stack
  • Help maintain and improve our incident response processes
  • Assist in capacity planning and performance optimization for systems handling up to 10,000 requests per minute
  • Ensure compliance with security standards and regulatory requirements across our infrastructure
  • Participate in the on-call rotation during regular business hours, with occasional emergency support
  • Implement and maintain SLOs, SLIs, and SLAs
  • Optimize cloud costs while maintaining system performance and reliability
  • Support our CI/CD pipelines and deployment processes


Required Skills & Experience


Infrastructure Best Practices

  • Experience implementing infrastructure following AWS Well-Architected Framework principles
  • Knowledge of infrastructure patterns for high availability, fault tolerance, and disaster recovery
  • Understanding of infrastructure security best practices (principle of least privilege, network segmentation, encryption at rest/in transit)
  • Experience with infrastructure compliance and governance frameworks
  • Familiarity with cost optimization strategies and FinOps practices
  • Knowledge of Infrastructure as Code best practices (modularity, reusability, versioning)
  • Understanding of observability patterns (logging, metrics, tracing)


Technical Experience

  • 3+ years of experience in SRE, DevOps, or similar roles
  • Strong experience with AWS services, including:
      - Compute: Lambda, ECS, Fargate
      - Networking: ALB, ELB, API Gateway, Route53, CloudFront, AppSync
      - Databases: DynamoDB, RDS (PostgreSQL), Aurora
      - Messaging: EventBridge, SNS, SQS
      - Security: Security Groups, Secrets Manager (SM), Systems Manager (SSM), IAM
      - Developer Tools: ECR, CodeBuild, CodeDeploy

  • Experience with monitoring and observability tools
  • Knowledge of Infrastructure as Code tools (AWS CDK and Terraform)
  • Understanding of event-driven architectures
  • Experience with containerization and microservices
  • Solid scripting and automation skills
  • Strong problem-solving abilities and systematic debugging skills
  • Experience working in Agile environments


Preferred Qualifications


  • AWS certifications
  • Experience with Next.js, Node.js and Python
  • Knowledge of authentication systems (Auth0, SSO)
  • Familiarity with regulatory compliance requirements (SOC 2, HIPAA, GDPR, PCI DSS)
  • Experience with ML/LLM operations
  • Multi-region AWS deployment experience
  • Experience with high-traffic systems
  • Experience with database management and optimization
  • Knowledge of caching strategies and CDN implementations
  • Understanding of data lifecycle management and ETL processes
  • Experience with vector and graph databases


What We Offer

  • Opportunity to work with cutting-edge technologies
  • Collaborative environment with dedicated Architecture, Development, DevOps & MLOps teams
  • Growth potential in a rapidly scaling startup
  • Work with a globally distributed team
  • Regular working hours with flexibility for occasional emergency support
  • Chance to shape and improve SRE practices


Required Qualities

  • Strong communication skills
  • Problem-solving mindset
  • Team player attitude
  • Self-motivated and proactive approach
  • Ability to work in a fast-paced startup environment
  • Interest in continuous learning and improvement


The ideal candidate will combine technical expertise with a passion for system reliability and a collaborative approach to problem-solving. They should be comfortable working in a dynamic startup environment while maintaining high standards for system reliability and performance.
