We are enabling dependable GPU compute by operating Kubernetes and Linux platforms focused on Volcano scheduling and automated infrastructure operations. As a Middle DevOps Engineer, you will manage Kubernetes administration, run GPU clusters on Kubernetes and Linux nodes, and create automation with Python and UNIX shell scripting for a client-facing delivery team. Apply to help deliver stable, efficient AI compute environments at scale.
Responsibilities
- Provision, configure, and operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance optimized
- Set up and administer Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement
- Own Kubernetes administration across namespaces, RBAC, resource quotas, and workload isolation approaches
- Automate job submission, resource provisioning, and system reporting by creating and maintaining Python and shell scripts
- Coordinate with orchestration, optimization, and observability teams to raise scheduling efficiency, improve capacity utilization, and streamline researcher workflows
- Observe infrastructure health and resource utilization, supplying data and feedback for optimization and reporting needs
- Improve infrastructure, tooling, and automation workflows to increase performance, scalability, and usability
- Maintain operational processes that provide a smooth and efficient experience for researchers running diverse AI and computational workloads
Requirements
- 2+ years of hands-on experience in DevOps or infrastructure engineering within complex, large-scale environments
- Expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and distribution, PersistentVolumeClaims (PVCs), NFS storage, and resource quota management
- Practical experience with the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes
- Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes
- Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX shell scripting (for example, Bash)
- Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
- Solid understanding of infrastructure automation and orchestration concepts and related tooling
- Fluent English communication skills (spoken and written) for direct client interaction
Nice to have
- Knowledge of Helm package management for Kubernetes applications
- Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki
- Skills in Infrastructure as Code tools such as Terraform
- Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE
- Understanding of Azure Networking including VPN, ExpressRoute, and network security
- Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude
- Experience with hybrid (cloud and on-premises) scheduling and resource optimization
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
- Posted: Apr 02, 2026
- Type: Full-time
- Level: Associate
- Location: Brazil
- Company: EPAM Systems