Over the past few years, we've hired specialists all over the world while our main development centers were in Ukraine. We continue to expand and are now growing our centers in other parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who currently live outside of Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while upholding the values of democracy, human rights, and state sovereignty.
About This Opportunity
We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You'll work with NVIDIA GPU technologies, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.
What's in it for you:
- Join a fast-scaling company shaping the future of AI infrastructure in Europe
- Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
- Collaborate with a top-tier international team and grow through global AI and cloud events
What we're looking for:
- 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
- Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
- Strong Python or Go skills for automation and observability
- Infrastructure-as-code experience (Terraform, Ansible, Helm)
- Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
- GPU resource management knowledge (MIG, NCCL, CUDA, containers)
- Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
- Linux systems engineering, CI/CD, and configuration management skills
- Strategic thinking with strong technical and business communication
- Organization, autonomy, adaptability
- Advanced English level
- Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration
In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale. Your key responsibilities:
- Automate deployment, scaling, and lifecycle management of GPU clusters
- Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
- Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
- Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation (a brief illustrative sketch follows this list)
- Collaborate with teams to optimize performance, resources, and fault recovery at petascale
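To give a flavor of the reliability side of this work, below is a minimal, illustrative Python sketch of an SLO error-budget check. The SLO target and request counts are hypothetical placeholders, not Dev.Pro metrics; a real setup would typically pull these numbers from the monitoring stack rather than hard-code them.

"""Illustrative only: a toy SLO / error-budget check.

Assumes a hypothetical 99.9% availability SLO over a fixed window;
the request counts below are made-up placeholders, not real metrics.
"""

from dataclasses import dataclass


@dataclass
class SloWindow:
    slo_target: float      # e.g. 0.999 for 99.9% availability
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def error_budget(self) -> int:
        # Number of failures the SLO allows over this window.
        return int(self.total_requests * (1.0 - self.slo_target))

    @property
    def budget_remaining(self) -> float:
        # Fraction of the error budget still unspent (can go negative).
        if self.error_budget == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.error_budget


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=10_000_000, failed_requests=4_200)
    print(f"Error budget: {window.error_budget} failed requests allowed")
    print(f"Budget remaining: {window.budget_remaining:.1%}")
    if window.budget_remaining < 0.25:
        # In a real setup this might freeze risky rollouts or trigger remediation.
        print("Error budget nearly exhausted: slow down releases and investigate burn rate")

In practice, the same calculation would usually live as recording rules or burn-rate alerts in the observability platform rather than an ad hoc script; the sketch only illustrates the kind of SLO reasoning the role involves.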
Ready to apply?
Join Dev.Pro and take your career to the next level!

