DevOps Engineer

DT Cloud is a global cloud computing and digital transformation company with a local presence. Since our launch in 2006, we have evolved from a software development and R&D service provider into a telco responsible for thousands of carrier network infrastructure installations, from 2G through 5G. In 2018, we drew on that experience to create an enterprise-grade, 5G-ready, multi-regional cloud computing and digital enablement company.

Today, DT creates hyper-scale cloud services and innovative solutions on an affordable, secure, and stable alternative platform. Through cloud computing, Big Data, and technologies such as AI/ML, IoT, and Blockchain, DT enables digital transformation and helps governments, organizations, and companies of all sizes and industries innovate and grow. As a technological enabler for both the physical and digital worlds, DT Cloud transforms, optimizes, and modernizes industries and people’s lives.

About the Project

We are building an advanced LLM Benchmarking Platform, designed to evaluate and compare large language models (LLMs) across a variety of tasks and environments. The platform will run on Kubernetes (K8s) infrastructure and orchestrate LLM workloads, benchmarks, and integrations with GPU-based execution environments.

Role Overview

As our DevOps Engineer, you will play a key role in designing, deploying, and maintaining the underlying infrastructure for the platform. You will be responsible for managing deployment pipelines, Kubernetes configurations, GPU resource orchestration, and the observability stack, and for ensuring CI/CD automation for all core services. You’ll collaborate closely with the Full Stack Engineer and Technical Lead in a fast-moving, agile development environment.

Key Responsibilities

Set up and manage Kubernetes clusters (k3s, k8s) for multi-service orchestration
Create and maintain Helm charts for platform components
Automate infrastructure provisioning using Terraform or similar IaC tools
Implement CI/CD pipelines (GitHub Actions or GitLab CI preferred)
Manage GPU resources for containerized model inference jobs
Integrate observability stack (e.g., Prometheus, Grafana, Loki, Langfuse)
Ensure secure and reproducible deployments across environments
Support model deployment via Dockerized MLflow, HuggingFace, or custom endpoints
Assist in setting up benchmarking workloads using Argo Workflows or Volcano
Collaborate on deployment of third-party open-source tools (MLflow, Jupyter, ChromaDB)

Must-Have Skills

Strong experience with Kubernetes and container orchestration
Solid knowledge of Docker, Helm, and CI/CD automation
Familiarity with GPU scheduling in Kubernetes (e.g., NVIDIA device plugin)
Hands-on experience with cloud-native monitoring/logging stacks
Experience managing secure and production-grade infrastructure
Good scripting knowledge (Bash, Python, or Go)

Nice-to-Have Skills

Experience with Argo Workflows or Volcano Scheduler
Understanding of LLMs, ML model serving, or MLflow
Familiarity with Langfuse or other LLM observability tools
Exposure to on-premises clusters and multi-cloud infrastructure

Soft Skills

Independent and proactive work attitude
Strong collaboration in small, cross-functional teams
Clear communication, especially in asynchronous/remote setups
Agile mindset and ability to adapt to changing priorities

Why Join Us?

Work on cutting-edge LLM infrastructure and benchmarking workflows
Build open-source, modular architecture for reproducible research
Influence core architectural decisions from day one
Opportunity for long-term collaboration and platform evolution

DevOps Engineer – Platform & AI Tooling
