Job Description:
Senior Operations Engineer HPC
Respond to and resolve operational incidents, identify root causes for critical issues, and implement strategies to prevent recurrence and improve platform resiliency.
Proactively create and manage monitoring, logging, and alerting systems to ensure high availability, performance, and visibility across all services.
Take a Site Reliability Engineering approach to our services, improving deployment, monitoring, and incident response end to end.
Solve complex technical problems with SCP applications, infrastructure, and end users' use of the services.
Administer platform tools like Ansible, Vault, Consul, Prometheus, and Grafana to support core functions like configuration management, secrets management, monitoring, and observability.
Mentor and coach junior engineers in the team, fostering a collaborative and high-performing culture.
Drive automation for deployment and management processes using GitOps workflows as well as CI/CD pipelines.
Essential Knowledge, Skills, and Experience
Experienced administering, maintaining and troubleshooting a Linux environment
Competent in automation and Bash scripting
Highly customer focused; able to explain IT technical concepts in a manner which non-IT experts can understand
Hands-on experience working in a DevOps team and using agile methodologies
Plus some of the following areas of expertise:
Hands-on knowledge of a range of scientific and HPC applications such as simulation software, bioinformatics tools or 3D data visualisation packages
Experience administering and optimising SLURM
Experience deploying and administering OpenStack
Experience with configuration automation and infrastructure as code (e.g. Ansible, HashiCorp Terraform, AWS CloudFormation, AWS Cloud Development Kit)
Experience deploying infrastructure and code to public cloud, especially AWS
Experience with software distribution frameworks such as EasyBuild or Spack
Familiarity with container runtimes such as Docker, Singularity or enroot
Experience with frameworks for regression tests and benchmarks for HPC applications, such as ReFrame
Join Avance Consulting and take your career to the next level!