About AI Factory
The AI Factory operates sovereign AI infrastructure including:
- GPU clusters
- Cloud subscriptions
- Containerized workloads
- API gateways
- Multi-environment deployments (Sandbox → Staging → Production)
- Infrastructure stability
- Deployment reliability
- Operational continuity for AI workloads
The DevOps Support Engineer is responsible for supporting:
- Cloud infrastructure
- CI/CD pipelines
- Containerized AI workloads
- API gateways
- Production environments
- Platform stability
- Environment health
- Deployment reliability
- Infrastructure troubleshooting
- Structured incident management
- Environment discipline
- Production governance
The Engineer Acts As
- L1 operational responder for infrastructure/platform incidents
- The first point of ownership, ensuring issues are diagnosed, contained, and escalated appropriately
- The party accountable for resolution within defined service levels
Infrastructure, Cloud & Environment Support
- Support Azure subscriptions, resource groups, networking, and access control
- Monitor GPU environments, container clusters, and AI runtime environments
- Troubleshoot deployment failures across Sandbox, Staging, and Production
DevOps & CI/CD Support
- Monitor CI/CD pipelines and resolve build/deployment issues
- Support Git workflows, version control issues, and release rollouts
- Ensure environment configuration consistency
- Validate infrastructure changes post-deployment
- Perform rollback support when required
GPU & AI Runtime Operations Support
- Monitor GPU utilization and allocation
- Identify memory saturation and CUDA/container runtime errors
- Support AI model deployment on GPU nodes
- Detect performance bottlenecks affecting inference services
API Gateway, WAF & Integrations
- Troubleshoot API gateway routing issues and throttling policies
- Monitor rate limiting and traffic control mechanisms
- Investigate WAF-related blocking incidents
- Support secure external integrations
- Support integrations with enterprise systems: Microsoft 365, SharePoint, Teams, Oracle, and Jira
- Troubleshoot authentication issues, webhook failures, and API timeouts
Observability & Incident Response
- Monitor service availability, CPU/GPU utilization, memory, storage, and logs
- Detect infrastructure bottlenecks affecting AI workloads
- Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
- Perform triage using logs, metrics, system databases, and environment diagnostics
- Classify incidents by severity and business impact in line with defined SLAs
- Contain and mitigate production-impacting issues
- Coordinate with L2/L3 teams and vendors
- Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
- Track incident lifecycle to closure and ensure no SLA breach
Documentation & Knowledge Management
- Maintain and improve infrastructure runbooks, deployment troubleshooting guides, environment configuration documentation, and FAQs
- Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
- Handle ITSM/ticketing documentation
- Capture and publish Root Cause Analysis (RCA) summaries for major incidents
- Update environment diagrams and operational checklists after changes
Platform Reliability
- Support Kubernetes clusters, Docker containers, and orchestration layers
- Validate scaling, failover, and resilience mechanisms
- Ensure uptime SLAs for AI products, platforms, and APIs
Security & Compliance Coordination
- Support IAM, access control, WAF, and network configurations
- Coordinate with security teams for incident remediation
- Ensure adherence to environment governance policies
- Strong hands-on experience with Azure (AWS/GCP acceptable)
- Experience supporting Kubernetes and Docker environments
- Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
- Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
- Understanding of networking, IAM, API gateways, and WAF
- Experience supporting production cloud environments under SLA constraints
- Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)
- 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
- Experience supporting containerized or AI workloads preferred
- Exposure to regulated or government environments advantageous
- Arabic speaker is a plus
Key Skills (ranked by relevance): AI, DevOps, cloud, CI/CD, Kubernetes, Docker, SLA, storage, Git
Posted: Mar 04, 2026
Type: Full-time
Level: Not Applicable
Location: United Arab Emirates
Company: NorthBay - Pakistan
Industries: IT Services, IT Consulting
Categories: Information Technology