[Remote] Site Reliability Engineer

Qlay | Ukraine | 5 days ago

Full-time | Remote Friendly | Engineering, Information Technology

IMPORTANT NOTE:

If you have experience with GPU, please mention it clearly in your projects/work experience (e.g., years of experience, responsibility or achievement in previous work).

<Expectations for the role>

As a Site Reliability Engineer, your key responsibilities include:

Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.
Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.
CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.
GPU fleet management – run NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs when they appear.
Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.
Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.

<Must Have Requirements>

A degree from a top-ranked university is preferred.
4+ years of experience as a Site Reliability Engineer.
Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
Deep understanding of Linux and networking fundamentals.
Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
Solid background with Prometheus and Grafana at scale.
Language: Working-level proficiency in English.

<Benefits>

Paid vacation
Annual bonus: one month's salary

<Note>

This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
Devices: You will be expected to use your own computer to perform the work.
Sole Employment: No second job is permitted.
