Opus - MLOps/Site Reliability Engineer
Location: Abu Dhabi, UAE
Position Overview
We're seeking a Site Reliability Engineer to join our growing team. In this role, you'll work at the intersection of operations and development, ensuring our platform's reliability, performance, and security while supporting our development teams in building and maintaining scalable solutions.
Key Responsibilities
- Monitor and maintain production, development and staging environments, ensuring high availability and optimal performance of our architecture
- Collaborate with DevOps, MLOps and Development teams to troubleshoot and resolve issues
- Support and optimize our observability stack
- Help maintain and improve our incident response processes
- Assist in capacity planning and performance optimization for systems handling up to 10,000 requests per minute
- Ensure compliance with security standards and regulatory requirements across our infrastructure
- Participate in on-call rotation during regular business hours with occasional emergency support
- Implement and maintain SLOs, SLIs, and SLAs
- Optimize cloud costs while maintaining system performance and reliability
- Support our CI/CD pipelines and deployment processes
Required Skills & Experience
Infrastructure Best Practices
- Experience implementing infrastructure following AWS Well-Architected Framework principles
- Knowledge of infrastructure patterns for high availability, fault tolerance, and disaster recovery
- Understanding of infrastructure security best practices (principle of least privilege, network segmentation, encryption at rest/in transit)
- Experience with infrastructure compliance and governance frameworks
- Familiarity with cost optimization strategies and FinOps practices
- Knowledge of Infrastructure as Code best practices (modularity, reusability, versioning)
- Understanding of observability patterns (logging, metrics, tracing)
Technical Experience
- 3+ years of experience in SRE, DevOps, or similar roles
- Strong experience with AWS services, including:
  - Compute: Lambda, ECS, Fargate
  - Networking: ALB, ELB, API Gateway, Route53, CloudFront, AppSync
  - Databases: DynamoDB, RDS (PostgreSQL), Aurora
  - Messaging: EventBridge, SNS, SQS
  - Security: Security Groups, Secrets Manager (SM), Systems Manager (SSM), IAM
  - Developer Tools: ECR, CodeBuild, CodeDeploy
- Experience with monitoring and observability tools
- Knowledge of infrastructure as code (CDK and Terraform)
- Understanding of event-driven architectures
- Experience with containerization and microservices
- Solid scripting and automation skills
- Strong problem-solving abilities and systematic debugging skills
- Experience working in Agile environments
Preferred Qualifications
- AWS certifications
- Experience with Next.js, Node.js, and Python
- Knowledge of authentication systems (Auth0, SSO)
- Familiarity with regulatory compliance requirements (SOC 2, HIPAA, GDPR, PCI DSS)
- Experience with ML/LLM operations
- Multi-region AWS deployment experience
- Experience with high-traffic systems
- Experience with database management and optimization
- Knowledge of caching strategies and CDN implementations
- Understanding of data lifecycle management and ETL processes
- Experience with vector and graph databases is a plus
What We Offer
- Opportunity to work with cutting-edge technologies
- Collaborative environment with dedicated Architecture, Development, DevOps & MLOps teams
- Growth potential in a rapidly scaling startup
- Work with a globally distributed team
- Regular working hours with flexibility for occasional emergency support
- Chance to shape and improve SRE practices
Required Qualities
- Strong communication skills
- Problem-solving mindset
- Team player attitude
- Self-motivated and proactive approach
- Ability to work in a fast-paced startup environment
- Interest in continuous learning and improvement
The ideal candidate will combine technical expertise with a passion for system reliability and a collaborative approach to problem-solving. They should be comfortable working in a dynamic startup environment while maintaining high standards for system reliability and performance.
Join AppliedAI and take your career to the next level!