Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader

Andiamo

Canada · Full-time · Mid-Senior

About The Role

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.

In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.

What You’ll Do

Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.

What We’re Looking For

Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
HPC Expertise: Strong knowledge of job schedulers and orchestrators (Slurm, Kubernetes, or Mesos), workload managers, and parallel/distributed computing.
Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.
Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Why Join

This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond. The scale is massive, the challenges are unique, and your impact will be immediate.

About Andiamo

Andiamo is a globally recognized staffing and consulting firm specializing in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.

For over 20 years, we've maintained the status of tier-one vendor for firms such as Amazon, Bloomberg, Palantir, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.

Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com

Key Skills

cloud ai kubernetes prometheus terraform ansible grafana python linux cicd aws gcp elk
