IMPORTANT NOTE:
If you have experience with GPUs, please state it clearly in your projects or work experience (e.g., years of experience, responsibilities, or achievements in previous roles).
<Expectation for the role>
As a Site Reliability Engineer, your key responsibilities include:
- Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.
- Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.
- CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.
- GPU fleet management – run NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs when they appear.
- Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
- Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.
- Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.
<Must Have Requirements>
- Preferably a graduate of a top university worldwide.
- 4+ years of experience as a Site Reliability Engineer.
- Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
- Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
- Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
- Deep understanding of Linux and networking fundamentals.
- Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
- Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
- Solid background with Prometheus and Grafana at scale.
- Language: Working-level proficiency in English.
<Benefits>
- Paid Vacations
- Annual Bonus: 1-month salary
<Note>
- This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
- Devices: You will be expected to use your own computer to perform the work.
- Sole Employment: No second job is permitted.
Join Qlay and take your career to the next level!

