Andiamo
Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader
Andiamo | Canada | 1 day ago
Full-time | Remote Friendly | Engineering, Information Technology

About The Role

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.

In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.

What You’ll Do

  • Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
  • Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
  • Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
  • Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
  • Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
  • Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
  • Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.

What We’re Looking For

  • Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
  • Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
  • HPC Expertise: Strong knowledge of schedulers and workload managers (Slurm, Kubernetes, or Mesos) and of parallel/distributed computing.
  • Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
  • Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
  • Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.
  • Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
  • Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Why Join

This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond. The scale is massive, the challenges are unique, and your impact will be immediate.

About Andiamo

Andiamo is a globally recognized staffing and consulting firm specializing in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.

For over 20 years, we have been a tier-one vendor for firms such as Amazon, Bloomberg, Palantir, MasterCard, Visa, Two Sigma, and Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.

Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com.
