Qlay
[Remote] Site Reliability Engineer
Qlay | Ukraine | 5 days ago
Full-time | Remote Friendly | Engineering, Information Technology

IMPORTANT NOTE:

If you have experience with GPUs, please state it clearly in your projects/work experience (e.g., years of experience, responsibilities, or achievements in previous roles).


<Expectation for the role>

As a Site Reliability Engineer, your key responsibilities include:

  • Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.
  • Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.
  • CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.
  • GPU fleet management – run NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs when they appear.
  • Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
  • Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.
  • Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.


<Must Have Requirements>

  • Preferably a graduate of a top university worldwide.
  • 4+ years of experience as a Site Reliability Engineer.
  • Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
  • Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
  • Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
  • Deep understanding of Linux and networking fundamentals.
  • Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
  • Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
  • Solid background with Prometheus and Grafana at scale.
  • Language: Working-level proficiency in English.


<Benefits>

  • Paid vacation
  • Annual Bonus: 1-month salary


<Note> 

  • This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
  • Devices: You will be expected to use your own computer to perform the work.
  • Sole Employment: No second job is permitted.
