We are looking for an experienced Product Site Reliability Engineer (SRE) to help ensure the performance, scalability, and reliability of our customer-facing products and platforms. As a critical link between product development and operations, the SRE designs resilient systems, automates workflows, and builds observability into the product lifecycle to enable fast-paced innovation without compromising stability.
Key Responsibilities
System Reliability & Performance
- Ensure availability, latency, scalability, and overall system health align with SLAs and SLOs.
- Continuously improve monitoring, alerting, and observability capabilities.
Incident Management
- Lead root cause analysis and conduct blameless postmortems.
- Develop and maintain incident response playbooks to reduce MTTD and MTTR.
Automation & Tooling
- Automate operational tasks to reduce manual work and improve efficiency.
- Build and maintain CI/CD pipelines and infrastructure as code (IaC) for seamless product delivery.
Collaboration with Product & Engineering
- Work closely with engineering teams to embed reliability into product design.
- Promote best practices such as chaos testing, capacity planning, and progressive deployment strategies (blue/green, canary releases).
Continuous Improvement
- Define, measure, and track key reliability metrics (SLIs, SLOs, error budgets).
- Identify and implement infrastructure and architectural improvements to enhance system resilience.
Required Skills & Experience
Technical Skills
- Deep knowledge of cloud platforms (AWS, GCP, or Azure).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Proficiency in Infrastructure as Code tools (Terraform, Ansible, or similar).
- Expertise in CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI).
- Familiarity with observability and monitoring tools (Prometheus, Grafana, Datadog, New Relic).
- Strong scripting and programming skills (Python, Go, Bash, or similar).
- Understanding of distributed systems, networking, and database reliability (SQL/NoSQL).
Professional Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering.
- Strong analytical and problem-solving mindset.
- Excellent communication and collaboration skills across cross-functional teams.
- Demonstrated experience in incident management and conducting postmortems.
- Posted: Aug 20, 2025
- Type: Full-time
- Level: Mid-Senior
- Location: Abu Dhabi Emirate
- Company: RP International