DevOps Engineer - AI Infrastructure & GPU Orchestration
Company Description
NEXUS is revolutionizing the data center industry with the first AI-native Data Center Operating System. Addressing the growing complexity of AI-driven workloads and infrastructure, our platform unifies DCIM, APM, FinOps, Kubernetes orchestration, AI workload management, and full-stack observability into one intelligent, real-time system. With cutting-edge predictive intelligence and automated remediation, the platform ensures optimized performance, cost efficiency, and seamless AI deployment. At NEXUS, we are shaping a future with autonomous infrastructure intelligence for smarter, more efficient decisions.
Role Description
This is a full-time hybrid role for a DevOps Engineer specializing in AI Infrastructure and GPU Orchestration. The DevOps Engineer will be responsible for building and maintaining scalable infrastructure, implementing infrastructure as code (IaC), developing automation scripts, streamlining continuous integration workflows, and managing Linux-based systems. The role also involves optimizing GPU clusters, collaborating with software developers, and ensuring high system performance to support innovative AI-driven workloads.
Key Responsibilities
- GPU Workload Orchestration: Design and manage complex Kubernetes environments (EKS, AKS, GKE, or bare metal) specifically tuned for AI/ML workloads, including GPU scheduling, device plugins, and node affinity.
- DCIM Integration: Build and maintain infrastructure pipelines that interface with Data Center Infrastructure Management (DCIM) systems to monitor power, cooling, and hardware health at the rack level.
- Advanced APM & Telemetry: Implement deep Application Performance Monitoring (APM) and observability stacks (Prometheus, Grafana, Datadog) to track GPU utilization, memory bandwidth, and workload latency in real time.
- Infrastructure as Code (IaC): Architect and deploy scalable, multi-cloud and hybrid environments using Terraform or equivalents, ensuring our platform can deploy rapidly into diverse enterprise environments.
- CI/CD for AI Infrastructure: Own the CI/CD pipelines (GitHub Actions, GitLab CI) that deliver our orchestration software, ensuring zero-downtime deployments for mission-critical AI systems.
- Performance Tuning: Work closely with the core engineering team to optimize network routing, storage I/O, and compute resource allocation for heavy AI training and inference workloads.
Qualifications
- Minimum 3-5 years of professional experience in DevOps, SRE, or Infrastructure Engineering, with a strong focus on high-performance computing or AI infrastructure.
- Expert-level skills in Terraform, Ansible, or similar technologies and in CI/CD automation, coupled with strong scripting abilities in Python, Go, or Bash.
- Strong knowledge of Continuous Integration tools (e.g., Jenkins, GitHub Actions, GitLab CI/CD).
- Background in System Administration and expertise in managing multi-OS environments.
- Understanding of GPU clusters and experience handling modern AI workloads.
- Deep, hands-on experience with Kubernetes, specifically managing stateful workloads, custom resource definitions (CRDs), and GPU node provisioning.
- Proven ability to design and implement comprehensive APM and telemetry solutions for complex, distributed systems.
- Understanding of data center operations, including power, thermal management, and hardware-level monitoring.
- Multi-cloud infrastructure experience is a plus.
- Ability to troubleshoot and optimize performance across complex infrastructure.
- Strong problem-solving abilities and a collaborative mindset.
- Posted: May 08, 2026
- Type: Full-time
- Level: Entry
- Location: Dubai
- Company: NEXUS