Kerry Consulting

Senior Site Reliability Engineer - AI Infrastructure

Singapore · Full-time · Mid-Senior

Our client operates large-scale GPU cloud platforms across Asia-Pacific. As part of their expansion, they are looking for experienced platform engineers to build and scale their next-generation data center operations. This role offers direct impact in a well-funded technology company working at the forefront of sustainable AI infrastructure.



Role



You'll build the technical foundation for MLOps capabilities and platform infrastructure supporting cutting-edge NVIDIA GPU clusters. This position demands expertise in designing and operating Kubernetes environments for high-performance computing, implementing Infrastructure-as-Code frameworks, and building world-class observability platforms. You'll collaborate directly with founders and engineering leadership to establish DevOps standards, enhance CI/CD pipelines, and integrate enterprise-grade monitoring across distributed systems. The role requires ownership of incident response, active participation in the on-call rotation, and leadership of root cause analysis to raise operational maturity. You'll work with technologies including Terraform, Ansible, Prometheus, Grafana, Loki, and OpenTelemetry while managing infrastructure that supports thousands of servers across multiple data centers.



Requirements



We seek candidates with 7+ years of platform engineering, SRE, or DevOps experience who have built observability and infrastructure platforms from first principles. Deep proficiency with containerization, Kubernetes cluster management, Infrastructure-as-Code tools, and the LGTM observability stack (Loki, Grafana, Tempo, Prometheus/Thanos) is essential. You must demonstrate hands-on expertise with Linux internals, networking stacks, and distributed storage, along with scripting in languages such as Python, Go, or Bash. Experience with telemetry solutions (Redfish, gNMI, SNMP, eBPF) and compliance frameworks (SOC 2, ISO 27001) is highly valued. A Bachelor's degree in Computer Science or a related field is required.



To Apply



To apply, please submit your resume to Yien Quek at [email protected]. We regret that only shortlisted candidates will be notified. Licence No: 16S8060 | Registration No: R1109830

Key Skills

Ranked by relevance

Kubernetes, Grafana, DevOps, Loki, incident response, containerization, Prometheus, Terraform, Ansible, storage, Python, Linux, cloud, MLOps, Bash, SNMP, CI/CD, AI
Posted: Feb 09, 2026
Type: Full-time
Level: Mid-Senior
Location: Singapore

Industries

Technology Information Internet

Categories

Information Technology
