DT Cloud is a global digital transformation cloud computing company with a local presence. Since our launch in 2006, we have evolved from a software development and R&D service provider into a telco responsible for thousands of carrier network infrastructure installations, from 2G through 5G. In 2018, we consolidated that experience into an enterprise-grade, 5G-ready, multi-regional cloud computing and digital enablement company.
Today, DT delivers hyper-scale cloud services and innovative solutions on an affordable, secure and stable alternative platform. Through cloud computing, Big Data and technologies such as AI/ML, IoT and Blockchain, DT enables digital transformation and helps governments, organizations and companies of all sizes and industries innovate and grow. DT Cloud transforms, optimizes and modernizes industries and people’s lives as a technological enabler for both the physical and digital worlds.
About the Project
We are building an advanced LLM Benchmarking Platform, designed to evaluate and compare large language models (LLMs) across a variety of tasks and environments. The platform will run on Kubernetes (K8s) infrastructure and orchestrate LLM workloads, benchmarks, and integrations with GPU-based execution environments.
Role Overview
As our DevOps Engineer, you will play a key role in designing, deploying, and maintaining the underlying infrastructure for the platform. You will be responsible for managing the deployment pipelines, Kubernetes configurations, GPU resource orchestration, and observability stack, and for ensuring CI/CD automation for all core services. You’ll collaborate closely with the Full Stack Engineer and Technical Lead in a fast-moving, agile development environment.
Key Responsibilities
- Set up and manage Kubernetes clusters (k3s, k8s) for multi-service orchestration
- Create and maintain Helm charts for platform components
- Automate infrastructure provisioning using Terraform or similar IaC tools
- Implement CI/CD pipelines (GitHub Actions or GitLab CI preferred)
- Manage GPU resources for containerized model inference jobs
- Integrate observability stack (e.g., Prometheus, Grafana, Loki, Langfuse)
- Ensure secure and reproducible deployments across environments
- Support model deployment via Dockerized MLflow, HuggingFace, or custom endpoints
- Assist in setting up benchmarking workloads using Argo Workflows or Volcano
- Collaborate on deployment of third-party open-source tools (MLflow, Jupyter, ChromaDB)
Must-Have Skills
- Strong experience with Kubernetes and container orchestration
- Solid knowledge of Docker, Helm, and CI/CD automation
- Familiarity with GPU scheduling in Kubernetes (e.g., NVIDIA device plugin)
- Hands-on experience with cloud-native monitoring/logging stacks
- Experience managing secure and production-grade infrastructure
- Good scripting knowledge (Bash, Python, or Go)
Nice-to-Have Skills
- Experience with Argo Workflows or Volcano Scheduler
- Understanding of LLMs, ML model serving, or MLflow
- Familiarity with Langfuse or other LLM observability tools
- Exposure to on-premises clusters and multi-cloud infrastructure
Soft Skills
- Independent and proactive work attitude
- Strong collaboration in small, cross-functional teams
- Clear communication, especially in asynchronous/remote setups
- Agile mindset and ability to adapt to changing priorities
Why Join Us?
- Work on cutting-edge LLM infrastructure and benchmarking workflows
- Build open-source, modular architecture for reproducible research
- Influence core architectural decisions from day one
- Opportunity for long-term collaboration and platform evolution
- Posted: Jul 07, 2025
- Type: Full-time
- Level: Associate
- Location: Türkiye
- Company: DT Cloud