AppliedAI
Site Reliability Engineer
AppliedAI · United Arab Emirates · 12 hours ago
Full-time · Engineering, Information Technology

AppliedAI is a pioneering AI technology company headquartered in Abu Dhabi, committed to innovation and excellence in artificial intelligence solutions for regulated industries such as healthcare, insurance, government, and financial services. We are seeking a skilled and dynamic Infrastructure Engineer to join our team.


Position Overview: 

As an Infrastructure Engineer, you will play a key role in delivering and maintaining the Opus Infrastructure component of our platform. This includes contributing to feature design, development, deployment, and testing. You’ll work hands-on with technologies like AWS, Azure, Crossplane, Terraform, Kubernetes, GitOps practices, IAM, and microservices, in an environment that values scalability, performance, and collaboration.

The ideal candidate brings solid engineering fundamentals and startup energy, and thrives in a fast-paced, growth-stage tech company with ambitious goals and tight feedback loops.


Key Responsibilities:

1. Infrastructure Deployment & Automation

  • Design, implement, and maintain cloud infrastructure using Terraform, Helm, Kustomize, and FluxCD (GitOps).
  • Manage and optimize multi-cloud environments (AWS, Azure) and cross-account deployments.
  • Build and maintain CI/CD pipelines ensuring smooth and automated deployments across environments.

2. Containerization & Orchestration

  • Migrate Docker images from public registries to internal/private repositories and manage them there.
  • Deploy and manage microservices using Kubernetes (EKS/AKS) clusters.
  • Optimize cluster performance, resource utilization, and workload distribution for scalability and cost efficiency.



3. Network & Security Hardening

  • Harden networks, load balancers, and API gateways to ensure secure communication between services.
  • Manage IAM roles, policies, and service permissions to ensure least-privilege access.
  • Implement and maintain VPC, security groups, and network policies for cross-environment isolation.

4. Monitoring, Observability & Reliability

  • Establish and maintain monitoring, logging, and tracing systems using tools like Prometheus, Grafana, CloudWatch, and ELK.
  • Proactively identify performance bottlenecks, network issues, and reliability risks.
  • Drive continuous optimization to improve system uptime, stability, and resilience.

5. Collaboration & Continuous Improvement

  • Work closely with backend and infrastructure teams to support feature rollouts and operational readiness.
  • Contribute to documentation, runbooks, and incident response processes.
  • Champion DevOps best practices to improve release cycles, infrastructure as code, and automation coverage.

Education:

  • Bachelor’s degree in Computer Science, Artificial Intelligence, or a related technical field.


Experience:

  • 2–5 years of experience in a relevant engineering or development role, ideally within a tech company or startup.
  • Proven experience with AWS (EKS, IAM, VPC), Terraform, Kubernetes, and microservice-oriented architecture.

Skills:

  • Strong problem-solving and debugging skills
  • Proficiency in GitOps workflows (FluxCD, ArgoCD)
  • Ability to ensure the reliability of CI/CD pipelines
  • Familiarity with network optimization and security hardening
  • Ability to work cross-functionally in a distributed team
  • Comfort operating in fast-paced, agile environments
  • Excellent communication and collaboration skills
  • Passion for innovation and learning new technologies
  • Familiarity with versioning practices
