Ampstek
Site Reliability Engineer
Ampstek, United Arab Emirates, 11 hours ago
Contract, Information Technology

Site Reliability Engineer (SRE)

From designing fault-tolerant architectures to leading incident responses, you’ll have the freedom to shape how we deliver stable, secure, and high-performance banking services.

About the Role

We’re looking for a talented Site Reliability Engineer (SRE) to keep our systems running smoothly, reliably, and at scale. Through smart automation, deep observability, and a calm head in a crisis, you’ll help us balance speed, compliance, and stability, working alongside DevOps, Cloud, Quality Engineering, and Product teams to drive continuous improvements in performance, security, and resilience.

You’ll play a key role in enhancing reliability, accelerating delivery, and ensuring seamless digital experiences for ADCB customers.

This role reports directly to our Lead SRE / Tribe Executive Manager.

What You Will Be Doing


• Define and implement SLIs/SLOs and error budgets for business-critical digital banking services.

• Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.

• Leverage AI-driven insights and anomaly detection (Dynatrace Davis AI or an equivalent AIOps platform) to predict and resolve reliability issues before they affect customers.

• Lead incident management, from on-call triage and root-cause analysis to blameless postmortems with actionable follow-ups.

• Improve deployment safety with robust rollout/rollback strategies, canary and blue-green deployments, and production readiness reviews.

• Support and optimize microservices-based architectures, ensuring service reliability, scalability, and inter-service resilience.

• Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.

• Automate operational toil, from runbooks and remediation scripts to proactive health checks and self-healing workflows.

• Collaborate with DevOps to embed reliability gates and validations into CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).

• Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.

• Maintain high-quality documentation, playbooks, and operational standards across environments.

• Ensure operational compliance and security alignment with internal controls and regulatory standards.

• Analyze system performance, availability, and cost data to continually optimize operations.

• Provide reliability support and escalation guidance for critical production systems during major incidents.

Experience and Qualifications


• 5+ years of experience in SRE or DevOps roles, building and managing large-scale, high-availability systems across banking, fintech, e-commerce, or other data-intensive digital ecosystems.

• Bachelor’s degree in Computer Science or equivalent technical experience.

• Strong experience with Linux environments and performance troubleshooting.

• Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.

• Proficiency with Kubernetes and container orchestration in microservices environments.

• Hands-on experience with AWS (preferred); exposure to Azure or GCP is an advantage.

• Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.

• Experience implementing AI/ML-driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).

• Practical understanding of CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).

• Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases.

• Strong scripting or programming skills in Python, Bash, or Go.

The Ideal Candidate

• Organized, structured, and meticulous in approach.

• Experienced in cross-functional collaboration and working with distributed teams.

• Strong analytical mindset with excellent troubleshooting skills for complex production systems.

• Calm and composed communicator under pressure, capable of leading during high-impact incidents.

• Proactive problem-solver who anticipates issues and drives preventive improvements.

• Passionate about AI-driven automation, observability, and reliability engineering.

• Continuously learning and keeping up to date with cloud-native, microservices, and SRE best practices.

• A collaborative and adaptable team player who thrives in a fast-paced, regulated environment and is passionate about building reliable, scalable systems that empower digital banking innovation.
