Job Summary
We are looking for an experienced DevSecOps / Platform Engineer (DevSecOps + AI Infra) to build, secure, and scale our fully self-hosted production environment across Hetzner, Kubernetes, Kafka, MinIO, GitLab, Redis, and AI/LLM infrastructure. This role includes complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models (LLMs, embeddings, vision models) on GPU/CPU infrastructure. This is a fully remote opportunity, with potential for relocation to the UAE in 2027, subject to business needs and mutual interest.
Responsibilities
- Own infrastructure and production readiness across Hetzner and Cloudflare, including compute sizing (CPU/GPU/RAM/SSD), secure networking and firewalling, DNS/WAF/DDoS configuration, and automated backup, failover, and high-availability setup.
- Own self-hosted GitLab CI/CD and source control, including runner setup (Docker/VM/Kubernetes), secure secrets management, multi-environment pipelines (dev → staging → prod) with approvals, and GitOps integration using ArgoCD or FluxCD.
- Lead Kubernetes cluster engineering including cluster bootstrap (kubeadm/k3s/RKE/Terraform), RBAC and network policies, service mesh with mTLS (Istio/Linkerd), autoscaling (HPA/VPA), health probes, and robust etcd and cluster backup strategies.
- Own end-to-end monitoring, observability, and logging using Prometheus, Grafana, Alertmanager, Loki, and OMD (Nagios/Checkmk), with comprehensive alerting across nodes, pods, applications, networking, Kafka, databases, and storage.
- Manage and secure object storage using MinIO, including bucket policies, lifecycle rules, TLS integration, and secure credential management via Secrets/Vault.
- Design and manage secure networking and TLS using Traefik ingress with automated certificate rotation, Cloudflare WAF and DDoS protection with rate limiting, zero-trust networking principles, mTLS, and API gateway–level security policies.
- Implement secure container and image management using best-practice Docker builds (multi-stage, minimal base images), Harbor registry with RBAC, vulnerability scanning and image signing, along with automated image retention and cleanup policies.
- Design and operate highly available PostgreSQL (pgvector/CloudNativePG), MongoDB, and ClickHouse clusters with operator-based deployments, replica sets, PITR backups, connection pooling, performance tuning, and full observability via Prometheus exporters.
- Design and operate a highly available Kafka cluster (ZooKeeper/KRaft) with optimized topics, partitions, replication, lag monitoring, exporters, and robust retry/DLQ strategies for production reliability.
- Provision and manage Hetzner and Kubernetes infrastructure using Terraform with modular multi-environment setups, remote state management, and CI/CD-driven plan/apply workflows with approval gates.
- Implement end-to-end DevSecOps and compliance controls including secrets management, least-privilege RBAC, container image scanning, runtime security, and automated CVE detection and patching pipelines.
- Design, deploy, and operate secure, scalable self-hosted AI/LLM infrastructure including GPU model serving, multi-model routing, vector databases, AI DevSecOps, monitoring, and CI/CD for model lifecycle management.
- Define and maintain comprehensive backup and disaster recovery strategies for infrastructure, AI models, and object storage, including routine recovery testing and DR playbooks.
- Ensure production-grade application readiness through ingress/load balancer configuration, environment-specific configs and secrets, and performance, load, and chaos testing.
Requirements
- Mid/Senior-level (4–10 years).
- Bachelor's degree in Computer Science or a related field.
- Strong Kubernetes, Linux, networking, and Terraform expertise.
- Hands-on with GPU setups, CUDA, inference optimization.
- Experience with self-hosted AI/LLM models (Ollama, vLLM, TGI).
- Strong observability & security foundations.
Key Skills
Ranked by relevance
kubernetes
cloudflare
kafka
cicd
ai
prometheus
terraform
storage
gitlab
postgresql
bootstrap
grafana
docker
redis
linux
loki
- Posted: Feb 01, 2026
- Type: Full-time
- Level: Executive
- Location: Dubai
- Company: integra.works
- Industries: IT Services, IT Consulting
- Categories: Engineering