Rakuten Asia Pte Ltd

AI/ML Infrastructure Engineer, GPU Infrastructure

Singapore · Full-time · Mid-Senior

Situated in the heart of Singapore's Central Business District, Rakuten Asia Pte. Ltd. is Rakuten's Asia Regional headquarters. Established in August 2012 as part of Rakuten's global expansion strategy, Rakuten Asia comprises various businesses that provide essential value-added services to Rakuten's global ecosystem. Through advertisement product development, product strategy, and data management, among others, Rakuten Asia is strengthening Rakuten Group's core competencies to take the lead in an increasingly digitalized world.


The Machine Learning and Deep Learning Engineering Department (MDE) is a group of engineers and scientists who specialize in natural language processing (NLP), search, and recommendation systems. We conduct state-of-the-art research and apply cutting-edge technologies, such as transformer models, dense retrieval, distributed GPU training, and large-scale machine learning, to a variety of Rakuten products and services. We are looking for passionate experts in machine learning research and engineering to join us in our journey to define the next-generation e-commerce experience.


The GPU Engineering team is at the forefront of delivering a robust GPU infrastructure and cutting-edge ML platforms that power the development and deployment of ML models across various teams of ML engineers and researchers within Rakuten. Use cases include semantic search, visual search, recommendation, LLMs, and more.


As an MLOps Engineer in the GPU Engineering team, you will be at the heart of Rakuten's ML operations, focusing on the deployment, monitoring, and management of ML models. You'll work closely with ML Engineers across the department to provide a reliable infrastructure that supports rapid model development, training, and deployment. Your expertise will contribute to the efficiency and scalability of our ML projects, directly impacting Rakuten's product innovation and service excellence.


Responsibilities

  • Design, implement, and maintain ML pipelines for automated training, testing, and deployment of machine learning models, ensuring scalability and efficiency.
  • Work collaboratively with ML engineers to troubleshoot and optimize model performance, ensuring models are production-ready and meet defined SLAs.
  • Manage and monitor Kubernetes clusters and related infrastructure to support high-volume ML workloads, implementing best practices for security and resilience.
  • Develop and maintain documentation on ML infrastructure, tools, and best practices, providing guidance and support to ML teams.
  • Continuously evaluate and incorporate new technologies and tools to enhance the ML platform's capabilities and performance.


Mandatory Qualifications

  • At least 3 years of experience in MLOps, with a proven track record of managing ML infrastructure
  • Kubernetes Proficiency: Deep understanding of Kubernetes (K8s) infrastructure and its application in managing ML workloads
  • Programming Skills: Proficiency in Python or Golang
  • Proven experience with Linux OS, with the ability to maintain system performance, ensure proper configuration, and leverage tools to troubleshoot software, hardware, and network-related issues
  • Education: Bachelor’s or higher degree in Computer Science, Engineering, or a related technical discipline
  • Strong communication and teamwork skills
  • Passion for technology and solving challenging problems


Good-to-Have Qualifications

  • Familiarity with ML frameworks (e.g., TensorFlow, PyTorch) and CUDA
  • CI/CD Tools: Experience with CI/CD tools (e.g., GitHub Actions, Jenkins, GitLab CI) and container technologies (e.g., Docker)
  • Experience training large models, including LLMs


Rakuten is an equal opportunities employer and welcomes applications regardless of sex, marital status, ethnic origin, sexual orientation, religious belief, or age.

Key Skills

Machine learning, Kubernetes, MLOps, natural language processing, deep learning, TensorFlow, GitLab CI, Jenkins, PyTorch, Python, GitLab, Linux, CI/CD

Posted: May 19, 2025
Type: Full-time
Level: Mid-Senior
Location: Singapore

Industries

Software Development

Categories

Information Technology, Engineering
