DevSecOps Engineer

integra.works

United Arab Emirates · Full-time · Executive

Job Summary

We are looking for an experienced DevSecOps / Platform Engineer with an AI infrastructure focus to build, secure, and scale our fully self-hosted production environment across Hetzner, Kubernetes, Kafka, MinIO, GitLab, Redis, and AI/LLM infrastructure. The role carries complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models (LLMs, embeddings, vision models) on GPU/CPU infrastructure. This is a fully remote opportunity, with potential for relocation to the UAE in 2027, subject to business needs and mutual interest.

Responsibilities

Own infrastructure and production readiness across Hetzner and Cloudflare, including compute sizing (CPU/GPU/RAM/SSD), secure networking and firewalling, DNS/WAF/DDoS configuration, and automated backup, failover, and high-availability setup.
Own self-hosted GitLab CI/CD and source control, including runner setup (Docker/VM/Kubernetes), secure secrets management, multi-environment pipelines (dev → staging → prod) with approvals, and GitOps integration using ArgoCD or FluxCD.
Lead Kubernetes cluster engineering including cluster bootstrap (kubeadm/k3s/RKE/Terraform), RBAC and network policies, service mesh with mTLS (Istio/Linkerd), autoscaling (HPA/VPA), health probes, and robust etcd and cluster backup strategies.
Own end-to-end monitoring, observability, and logging using Prometheus, Grafana, Alertmanager, Loki, and OMD (Nagios/Checkmk), with comprehensive alerting across nodes, pods, applications, networking, Kafka, databases, and storage.
Manage and secure object storage using MinIO, including bucket policies, lifecycle rules, TLS integration, and secure credential management via Secrets/Vault.
Design, implement, and manage secure networking and TLS using Traefik ingress with automated certificate rotation, Cloudflare WAF and DDoS protection with rate limiting, zero-trust networking principles, mTLS, and API gateway–level security policies.
Implement secure container and image management using best-practice Docker builds (multi-stage, minimal base images), Harbor registry with RBAC, vulnerability scanning and image signing, along with automated image retention and cleanup policies.
Design and operate highly available PostgreSQL (pgvector/CloudNativePG), MongoDB, and ClickHouse clusters with operator-based deployments, replica sets, PITR backups, connection pooling, performance tuning, and full observability via Prometheus exporters.
Design and operate a highly available Kafka cluster (ZooKeeper/KRaft) with optimized topics, partitions, replication, lag monitoring, exporters, and robust retry/DLQ strategies for production reliability.
Provision and manage Hetzner and Kubernetes infrastructure using Terraform with modular multi-environment setups, remote state management, and CI/CD-driven plan/apply workflows with approval gates.
Implement end-to-end DevSecOps and compliance controls including secrets management, least-privilege RBAC, container image scanning, runtime security, and automated CVE detection and patching pipelines.
Design, deploy, and operate secure, scalable self-hosted AI/LLM infrastructure including GPU model serving, multi-model routing, vector databases, AI DevSecOps, monitoring, and CI/CD for model lifecycle management.
Define and maintain comprehensive backup and disaster recovery strategies for infrastructure, AI models, and object storage, including routine recovery testing and DR playbooks.
Ensure production-grade application readiness through ingress/load balancer configuration, environment-specific configs and secrets, and performance, load, and chaos testing.

Qualifications

Mid- to senior-level experience (4–10 years).
Bachelor's degree in Computer Science or a related field.
Strong Kubernetes, Linux, networking, and Terraform expertise.
Hands-on with GPU setups, CUDA, inference optimization.
Experience with self-hosted AI/LLM models (Ollama, vLLM, TGI).
Strong observability & security foundations.

Key Skills

Ranked by relevance

kubernetes cloudflare kafka cicd ai prometheus terraform storage gitlab postgresql bootstrap grafana docker redis linux loki
