Avrioc Technologies
Senior Site Reliability Engineer
Avrioc Technologies, United Arab Emirates, 4 hours ago
Full-time; Engineering, Information Technology

Job Description

Requirements:

  • 8+ years of experience in DevOps and SRE roles, with a focus on leading the implementation of SRE practices for enterprise applications.
  • Strong experience with cloud platforms (AWS, GCP, Azure) and services like EC2, S3, Lambda, RDS, etc.
  • Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, and Ansible.
  • Experience building and managing observability frameworks (monitoring, logging, alerting) to track system health and improve performance.
  • Proven experience automating key processes such as deployments, testing, and incident response, using CI/CD tools like Jenkins, Argo CD or similar.
  • Design, deploy, and manage observability tools and processes, including logging, monitoring, and alerting systems, using tools such as Elastic Stack, Grafana, Prometheus, Dynatrace, and New Relic.
  • Manage and optimize Kubernetes clusters, ensuring scalability, availability, and efficient container orchestration.
  • Design and manage Helm charts for scalable and reusable Kubernetes deployments, ensuring streamlined application releases and maintenance.
  • Hands-on experience with AWS managed databases and self-managed databases such as MySQL and Cassandra.
  • Experience designing and implementing business continuity (BCP) and disaster recovery (DR) strategies to maintain availability.
  • Build a proactive monitoring system that combines alerting with auto-healing to prevent service outages, and build customized dashboards.
  • Participate in on-call support, handle escalations, conduct incident reviews, and write project documentation.
  • Expertise in scripting languages like Python, Bash, or Go for automating workflows and infrastructure management.
  • Proactively monitor and plan for future capacity needs, ensuring scalable and resilient architectures across AWS resources.
  • Experience conducting fault injection testing and chaos engineering with open-source tools such as Chaos Mesh and Litmus, as well as AWS Fault Injection Service.


Responsibilities & Authorities

Responsibilities:

  • Architect and deploy scalable, highly available cloud infrastructure to support production workloads and applications.
  • Ensure systems are fault-tolerant, performant, and can handle high-traffic and growing demands.
  • Oversee and optimize application release processes within CI/CD pipelines, ensuring seamless, reliable updates.
  • Lead incident response efforts, ensuring minimal service disruption and quick resolution of issues.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure the reliability and performance of applications.
  • Maintain clear, detailed documentation on infrastructure, processes, incidents, and operational procedures.
  • Work closely with engineering, DevOps, and product teams to align on SRE best practices, promote knowledge sharing, and support application reliability needs.
  • Drive continuous improvements in processes, tools, and systems to improve the reliability and performance of production services.


Common responsibilities:

  • Comply with Avrioc’s information security and information service management policies, procedures, and standards.
  • Maintain the confidentiality and integrity of information and attend mandatory information security training.
  • Report information security incidents through Avrioc’s established incident reporting channel.
