The AI Factory operates sovereign AI infrastructure including:
- GPU clusters
- Cloud subscriptions
- Containerized workloads
- API gateways
- Multi-environment deployments (Sandbox → Staging → Production)
- Infrastructure stability
- Deployment reliability
- Operational continuity for AI workloads
The DevOps Support Engineer is responsible for supporting:
- Cloud infrastructure
- CI/CD pipelines
- Containerized AI workloads
- API gateways
- Production environments
- Platform stability
- Environment health
- Deployment reliability
- Infrastructure troubleshooting
- Structured incident management
- Environment discipline
- Production governance
The engineer acts as the L1 operational responder for infrastructure and platform incidents, ensuring that issues are diagnosed, contained, escalated appropriately, and resolved within defined service levels.
Infrastructure, Cloud & Environment Support
- Support Azure subscriptions, resource groups, networking, and access control
- Monitor GPU environments, container clusters, and AI runtime environments
- Troubleshoot deployment failures across Sandbox, Staging, and Production
DevOps & CI/CD Support
- Monitor CI/CD pipelines and resolve build/deployment issues
- Support Git workflows, version control issues, and release rollouts
- Ensure environment configuration consistency
- Validate infrastructure changes post-deployment
- Perform rollback support when required
GPU & AI Runtime Operations Support
- Monitor GPU utilization and allocation
- Identify memory saturation and CUDA/container runtime errors
- Support AI model deployment on GPU nodes
- Detect performance bottlenecks affecting inference services
API Gateway, WAF & Integrations
- Troubleshoot API gateway routing issues and throttling policies
- Monitor rate limiting and traffic control mechanisms
- Investigate WAF-related blocking incidents
- Support secure external integrations
- Support integrations with enterprise systems: Microsoft 365, SharePoint, Teams, Oracle, Jira
- Troubleshoot authentication issues, webhook failures, and API timeouts
Observability & Incident Response
- Monitor service availability, CPU/GPU utilization, memory, storage, and logs
- Detect infrastructure bottlenecks affecting AI workloads
- Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
- Perform triage using logs, metrics, system databases, and environment diagnostics
- Classify incidents by severity and business impact in line with defined SLAs
- Contain and mitigate production-impacting issues
- Coordinate with L2/L3 teams and vendors
- Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
- Track incident lifecycle to closure and ensure no SLA breach
Documentation & Knowledge Management
- Maintain and improve infrastructure runbooks, deployment troubleshooting guides, environment configuration documentation, and FAQs
- Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
- Handle ITSM/ticketing documentation
- Capture and publish Root Cause Analysis (RCA) summaries for major incidents
- Update environment diagrams and operational checklists after changes
Platform Reliability
- Support Kubernetes clusters, Docker containers, and orchestration layers
- Validate scaling, failover, and resilience mechanisms
- Ensure uptime SLAs for AI products, platforms, and APIs
Security & Compliance Coordination
- Support IAM, access control, WAF, and network configurations
- Coordinate with security teams for incident remediation
- Ensure adherence to environment governance policies
- Strong hands-on experience with Azure (AWS/GCP acceptable)
- Experience supporting Kubernetes and Docker environments
- Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
- Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
- Understanding of networking, IAM, API gateways, and WAF
- Experience supporting production cloud environments under SLA constraints
- Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)
- 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
- Experience supporting containerized or AI workloads preferred
- Exposure to regulated or government environments advantageous
- Arabic speaker is a plus
Join NorthBay - Pakistan and take your career to the next level!

