DevOps Support Engineer
NorthBay - Pakistan | United Arab Emirates
Full-time | Information Technology
About AI Factory

The AI Factory operates sovereign AI infrastructure including:

  • GPU clusters
  • Cloud subscriptions
  • Containerized workloads
  • API gateways
  • Multi-environment deployments (Sandbox → Staging → Production)

The DevOps Support Engineer ensures:

  • Infrastructure stability
  • Deployment reliability
  • Operational continuity for AI workloads

Role Overview

The DevOps Support Engineer is responsible for supporting:

  • Cloud infrastructure
  • CI/CD pipelines
  • Containerized AI workloads
  • API gateways
  • Production environments

The role focuses on:

  • Platform stability
  • Environment health
  • Deployment reliability
  • Infrastructure troubleshooting
  • Structured incident management
  • Environment discipline
  • Production governance

This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.

The engineer acts as the L1 operational responder for infrastructure and platform incidents, ensuring that:

  • Issues are diagnosed, contained, and escalated appropriately
  • Issues are resolved within defined service levels

Key Responsibilities

  • Infrastructure, Cloud & Environment Support
    • Support Azure subscriptions, resource groups, networking, and access control
    • Monitor GPU environments, container clusters, and AI runtime environments
    • Troubleshoot deployment failures across Sandbox, Staging, and Production
  • DevOps & CI/CD Support
    • Monitor CI/CD pipelines and resolve build/deployment issues
    • Support Git workflows, version control issues, and release rollouts
    • Ensure environment configuration consistency
    • Validate infrastructure changes post-deployment
    • Perform rollback support when required
  • GPU & AI Runtime Operations Support
    • Monitor GPU utilization and allocation
    • Identify memory saturation and CUDA/container runtime errors
    • Support AI model deployment on GPU nodes
    • Detect performance bottlenecks affecting inference services
  • API Gateway, WAF & Integrations
    • Troubleshoot API gateway routing issues and throttling policies
    • Monitor rate limiting and traffic control mechanisms
    • Investigate WAF-related blocking incidents
    • Support secure external integrations
    • Support integrations with enterprise systems:
      • Microsoft 365
      • SharePoint
      • Teams
      • Oracle
      • Jira
    • Troubleshoot authentication issues, webhook failures, and API timeouts
  • Observability & Incident Response
    • Monitor service availability, CPU/GPU utilization, memory, storage, and logs
    • Detect infrastructure bottlenecks affecting AI workloads
    • Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
    • Perform triage using logs, metrics, system databases, and environment diagnostics
    • Classify incidents by severity and business impact in line with defined SLAs
    • Contain and mitigate production-impacting issues
    • Coordinate with L2/L3 teams and vendors
    • Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
    • Track incident lifecycle to closure and ensure no SLA breach
  • Documentation & Knowledge Management
    • Maintain and improve:
      • Infrastructure runbooks
      • Deployment troubleshooting guides
      • Environment configuration documentation
      • FAQs
    • Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
    • Handle ITSM/ticketing documentation
    • Capture and publish Root Cause Analysis (RCA) summaries for major incidents
    • Update environment diagrams and operational checklists after changes
  • Platform Reliability
    • Support Kubernetes clusters, Docker containers, and orchestration layers
    • Validate scaling, failover, and resilience mechanisms
    • Ensure uptime SLAs for AI products, platforms, and APIs
  • Security & Compliance Coordination
    • Support IAM, access control, WAF, and network configurations
    • Coordinate with security teams for incident remediation
    • Ensure adherence to environment governance policies

Required Technical Skills

  • Strong hands-on experience with Azure (AWS/GCP acceptable)
  • Experience supporting Kubernetes and Docker environments
  • Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
  • Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
  • Understanding of networking, IAM, API gateways, and WAF
  • Experience supporting production cloud environments under SLA constraints
  • Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)

Experience

  • 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
  • Experience supporting containerized or AI workloads preferred
  • Exposure to regulated or government environments advantageous
  • Arabic language proficiency is a plus
