We are a US-based software development outsourcing company that has delivered exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.
Over the past few years, we've hired specialists from all over the world, while our main development centers have been in Ukraine. We are now expanding and growing centers in other parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who currently live outside Ukraine. We stand with Ukraine and continue to support our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.
About This Opportunity
We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You'll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.
What's in it for you:
- Join a fast-scaling company shaping the future of AI infrastructure in Europe
- Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
- Collaborate with a top-tier international team and grow through global AI and cloud events
Requirements:
- 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
- Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
- Strong Python or Go skills for automation and observability
- Infrastructure-as-code experience (Terraform, Ansible, Helm)
- Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
- GPU resource management knowledge (MIG, NCCL, CUDA, containers)
- Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
- Linux systems engineering, CI/CD, and configuration management skills
- Strategic thinking with strong technical and business communication
- Organization, autonomy, adaptability
- Advanced English level
- Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration
In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.
- Automate deployment, scaling, and lifecycle management of GPU clusters
- Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
- Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
- Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
- Collaborate with teams to optimize performance, resources, and fault recovery at petascale
- Posted: Oct 27, 2025
- Type: Full-time
- Level: Mid-Senior
- Location: Greater Buenos Aires
- Company: Dev.Pro
- Industries: IT Services, IT Consulting
- Categories: Other