About The Role
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.
In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.
What You’ll Do
- Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
- Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
- Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
- Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
- Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
- Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
- Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.
Qualifications
- Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
- Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
- HPC Expertise: Strong knowledge of job schedulers (Slurm, Kubernetes, or Mesos), workload managers, and parallel/distributed computing.
- Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
- Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
- Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.
- Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
- Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond. The scale is massive, the challenges are unique, and your impact will be immediate.
About Andiamo
Andiamo is a globally recognized staffing and consulting firm specializing in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.
For over 20 years, we have been a tier-one vendor for firms such as Amazon, Bloomberg, Palantir, MasterCard, Visa, Two Sigma, and Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com