Day-to-Day Responsibilities:
- Build and enhance the Ray-based compute layer for distributed data processing and model training on Kubernetes.
- Work closely with the data science and engineering teams to set up the Ray infrastructure and integrate it into the existing system.
- AI Infrastructure Development: Collaborate with the ML Platform team to design and implement a robust AI infrastructure using Ray, enabling scalable data processing and model training. Use GitOps practices to keep cloud infrastructure on Kubernetes reproducible.
- Observability & Monitoring: Develop observability solutions for Ray, integrating monitoring and alerting with tools like Datadog, Prometheus, and Grafana. Contribute to operational guides and runbooks.
- Support for Data Science Teams: Assist data scientists in adopting Ray for their workloads and ensure smooth integration with existing tools and systems.
Required Skills and Qualifications:
- MLOps Expertise: Strong understanding of machine learning operations, particularly in distributed computing environments. Experience with frameworks like Ray, Dask, Modin, Beam, Horovod, or DeepSpeed is highly desirable.
- Technical Skills: Proficiency in Python and the broader ML ecosystem.
- Kubernetes Experience: Solid understanding of Kubernetes, with experience in deploying and managing systems. Familiarity with GitOps tools such as ArgoCD and configuration management tools like Helm and Kustomize is a plus.
- DevOps & Infrastructure Skills: Background in DevOps practices and Infrastructure as Code (IaC), with knowledge of Terraform or similar tools.
- Communication Skills: Strong written and verbal communication abilities, with a focus on collaboration and knowledge sharing.
Nice-to-Have:
- Ray Knowledge: A genuine interest in Ray is critical. The hiring manager treats a lack of interest in Ray as a red flag.
- Cloud Providers: Experience with cloud platforms, particularly AWS, is preferred.
- Incident Response & Security: Basic knowledge of incident response and security principles.
Key Skills
Ranked by relevance
kubernetes
ai
cloud
incident response
devops
git
configuration management
infrastructure as code
distributed computing
machine learning
prometheus
terraform
deepspeed
grafana
datadog
python
aws
- Posted: Dec 12, 2024
- Type: Contract
- Level: Associate
- Location: Singapore
- Company: Talentvis
- Industries: Technology, Information and Media
- Categories: Information Technology