DevOps Support Engineer

NorthBay - Pakistan

United Arab Emirates · Full-time · Not Applicable

About AI Factory

The AI Factory operates sovereign AI infrastructure including:

GPU clusters
Cloud subscriptions
Containerized workloads
API gateways
Multi-environment deployments (Sandbox → Staging → Production)

The DevOps Support Engineer ensures:

Infrastructure stability
Deployment reliability
Operational continuity for AI workloads

Role Overview

The DevOps Support Engineer is responsible for supporting:

Cloud infrastructure
CI/CD pipelines
Containerized AI workloads
API gateways
Production environments

The role focuses on:

Platform stability
Environment health
Deployment reliability
Infrastructure troubleshooting
Structured incident management
Environment discipline
Production governance

This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.

The engineer acts as the L1 operational responder for infrastructure and platform incidents, ensuring that issues are diagnosed, contained, escalated appropriately, and resolved within defined service levels.

Key Responsibilities

Infrastructure, Cloud & Environment Support

Support Azure subscriptions, resource groups, networking, and access control
Monitor GPU environments, container clusters, and AI runtime environments
Troubleshoot deployment failures across Sandbox, Staging, and Production

DevOps & CI/CD Support

Monitor CI/CD pipelines and resolve build/deployment issues
Support Git workflows, version control issues, and release rollouts
Ensure environment configuration consistency
Validate infrastructure changes post-deployment
Perform rollback support when required

GPU & AI Runtime Operations Support

Monitor GPU utilization and allocation
Identify memory saturation and CUDA/container runtime errors
Support AI model deployment on GPU nodes
Detect performance bottlenecks affecting inference services
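
As a rough illustration of the GPU monitoring duties above, the following sketch flags GPUs whose memory use crosses a threshold. It assumes the input is CSV text in the shape produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The names `GpuSample`, `parse_gpu_csv`, and `saturated_gpus`, and the 90% threshold, are hypothetical and not taken from the posting.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    index: int
    util_pct: float
    mem_used_mib: float
    mem_total_mib: float

    @property
    def mem_pct(self) -> float:
        # Share of GPU memory in use, as a percentage.
        return 100.0 * self.mem_used_mib / self.mem_total_mib


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse lines of 'index, util, mem_used, mem_total' (no header, no units)."""
    samples = []
    for line in text.strip().splitlines():
        idx, util, used, total = (part.strip() for part in line.split(","))
        samples.append(GpuSample(int(idx), float(util), float(used), float(total)))
    return samples


def saturated_gpus(samples: list[GpuSample], mem_threshold_pct: float = 90.0) -> list[int]:
    """Return the indices of GPUs at or above the (assumed) memory threshold."""
    return [s.index for s in samples if s.mem_pct >= mem_threshold_pct]


# Hypothetical sample output for two GPUs; real values would come from nvidia-smi.
sample_text = "0, 35, 4096, 81920\n1, 98, 79000, 81920\n"
flagged = saturated_gpus(parse_gpu_csv(sample_text))
```

In practice, a check like this would more likely live in an existing monitoring stack (the posting mentions Azure Monitor, Dynatrace, and Grafana) than in a standalone script.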

API Gateway, WAF & Integrations

Troubleshoot API gateway routing issues and throttling policies
Monitor rate limiting and traffic control mechanisms
Investigate WAF-related blocking incidents
Support secure external integrations
Support integrations with enterprise systems:

Microsoft 365
SharePoint
Teams
Oracle
Jira

Troubleshoot authentication issues, webhook failures, and API timeouts

Observability & Incident Response

Monitor service availability, CPU/GPU utilization, memory, storage, and logs
Detect infrastructure bottlenecks affecting AI workloads
Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
Perform triage using logs, metrics, system databases, and environment diagnostics
Classify incidents by severity and business impact in line with defined SLAs
Contain and mitigate production-impacting issues
Coordinate with L2/L3 teams and vendors
Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
Track incident lifecycle to closure and ensure no SLA breach
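
The classification and escalation steps above could be sketched as follows. The posting does not define the criteria for P0 to P3, so the rule in `classify` is purely an assumption for illustration; real thresholds would come from the team's SLA definitions. The `Escalation` record and its field names are also hypothetical, chosen to mirror the diagnostic context listed above (logs, metrics snapshots, timestamps, components).

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def classify(production_impact: bool, users_affected_pct: float) -> str:
    """Assumed severity rule, for illustration only."""
    if production_impact and users_affected_pct >= 50:
        return "P0"
    if production_impact:
        return "P1"
    if users_affected_pct > 0:
        return "P2"
    return "P3"


@dataclass
class Escalation:
    severity: str
    components: list
    log_excerpts: list
    metrics_snapshot: dict
    # UTC timestamp recorded when the escalation record is created.
    detected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Serialize the full diagnostic context for hand-off to L2/L3.
        return json.dumps(asdict(self), indent=2)


# Hypothetical incident: a production issue affecting a small share of users.
ticket = Escalation(
    severity=classify(production_impact=True, users_affected_pct=5),
    components=["api-gateway"],
    log_excerpts=["upstream timeout"],
    metrics_snapshot={"gpu_mem_pct": 96.4},
)
payload = ticket.to_json()
```

Bundling the context into one structured record means an L2/L3 engineer or vendor receives everything in a single hand-off, whatever ticketing system is in use.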

Documentation & Knowledge Management

Maintain and improve:

Infrastructure runbooks
Deployment troubleshooting guides
Environment configuration documentation
FAQs

Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
Handle ITSM/ticketing documentation
Capture and publish Root Cause Analysis (RCA) summaries for major incidents
Update environment diagrams and operational checklists after changes

Platform Reliability

Support Kubernetes clusters, Docker containers, and orchestration layers
Validate scaling, failover, and resilience mechanisms
Ensure uptime SLAs for AI products, platforms, and APIs

Security & Compliance Coordination

Support IAM, access control, WAF, and network configurations
Coordinate with security teams for incident remediation
Ensure adherence to environment governance policies

Required Technical Skills

Strong hands-on experience with Azure (AWS/GCP acceptable)
Experience supporting Kubernetes and Docker environments
Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
Understanding of networking, IAM, API gateways, and WAF
Experience supporting production cloud environments under SLA constraints
Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)

Experience

4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
Experience supporting containerized or AI workloads preferred
Exposure to regulated or government environments advantageous
Arabic language proficiency is a plus

Key Skills

Ranked by relevance

AI, DevOps, Cloud, CI/CD, Kubernetes, Docker, SLA, Storage, Git

Related Jobs

3 roles aligned with this opportunity

AI Support Engineer

2026-03-04

Full-time

Not Applicable

United Arab Emirates

IT Services

Information Technology

Senior Flutter Developer

2025-08-31

Full-time

Mid-Senior

United Arab Emirates

IT Services

Engineering

Senior Java Developer (Spring Boot)

2025-08-22

Full-time

Mid-Senior

United Arab Emirates

IT Services

Engineering

Posted: Mar 04, 2026
Type: Full-time
Level: Not Applicable
Location: United Arab Emirates
Company: NorthBay - Pakistan

Industries

IT Services, IT Consulting
