Company Description
At CluePoints, we’re redefining how clinical trials are run. As the premier provider of Risk-Based Quality Management (RBQM) and Data Quality Oversight software, we harness advanced statistics, artificial intelligence, and machine learning to ensure the quality, accuracy, and integrity of clinical trial data, helping life sciences organisations bring safer, more effective treatments to patients faster.
We’re proud to be an ambitious, fast-growing technology scale-up with a dynamic and diverse international team representing more than 40 nationalities. Collaboration, flexibility, and continuous learning are part of our DNA.
At CluePoints, you’ll find a culture where you can grow, make an impact, and have fun along the way. Guided by our values of Care, Passion, and Smart Disruption, we’re united by a shared mission: to create smarter ways to run efficient clinical trials and deliver AI-powered insights that improve human outcomes worldwide.
The Role
The SRE, LLMOps (AI Platform) ensures our LLM-powered services are reliable, observable, and safe in production on Azure and Kubernetes. You’ll blend classic SRE disciplines with LLM-specific operations: robust evaluation pipelines, prompt/version governance, model/vendor failover, guardrails, and cost/performance monitoring. You know how to build automation with LangChain/LangGraph, operate API-based LLMs in production, and manage the inherent non-determinism of models through rigorous testing and observability.
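To illustrate the model/vendor failover this role owns, here is a minimal Python sketch. All names are hypothetical; production code would catch provider-specific exceptions rather than bare `Exception`, add exponential backoff, and emit telemetry for each attempt.

```python
import time


def call_with_failover(prompt, providers, max_retries=2):
    """Try each provider in order; retry transient failures before failing over.

    `providers` is an ordered list of (name, callable) pairs, where each
    callable takes a prompt string and returns a completion string.
    Returns (provider_name, completion) from the first provider that succeeds.
    """
    errors = []
    for name, call in providers:
        for attempt in range(max_retries):
            try:
                return name, call(prompt)
            except Exception as exc:  # in practice: provider-specific error types
                errors.append((name, attempt, str(exc)))
                time.sleep(0)  # placeholder for exponential backoff with jitter
    raise RuntimeError(f"All providers failed: {errors}")
```

A secondary provider (or a smaller fallback model from the same vendor) keeps the feature degraded-but-available when the primary is rate-limited or down.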
Job Requirements
What You'll Bring
- Experience: 5+ years in SRE/DevOps/Platform Engineering with 1–2+ years operating LLM or ML-backed applications in production (API-based or hosted models).
- LLMOps: hands-on with LangChain/LangGraph building end-to-end chains/agents and RAG flows; comfort with vector stores (e.g., Azure AI Search, Pinecone), prompt/version control, and dataset tooling.
- Observability: proficiency instrumenting LLM traces and app telemetry, alert tuning, and root-cause analysis; familiarity with LangSmith and/or Arize Phoenix (token/cost tracking, latency, failure modes).
- Cloud & platform: strong Azure and Kubernetes (AKS) background; GitOps (Flux/ArgoCD), Helm/Kustomize; CI/CD (GitHub Actions/GitLab/Jenkins); IaC (Terraform); secrets, networking, and security baselines.
- Languages & tooling: Python (preferred) and one of TypeScript/Go; REST/GraphQL; OpenAI/Azure OpenAI/Anthropic APIs; experience with Redis caches, message queues, and feature flags.
What You'll Be Doing
- Instrument deep observability: implement tracing for LLM chains/agents (inputs, outputs, token usage, latency, model/version), correlate with app metrics/logs, and set actionable alerts; leverage LangSmith/Arize Phoenix (or similar) and OpenTelemetry where appropriate.
- Safety & guardrails: integrate content safety, PII redaction, jailbreak/prompt-injection defenses, and policy-based rails; document exceptions and reviewer workflows. Prefer native platform features (e.g., Azure AI Content Safety) or programmable rails (e.g., NVIDIA NeMo Guardrails).
- Cost & capacity management: monitor token and request costs, throughput, and rate limits; implement caching, request shaping, and multi-tier model selection to balance quality, latency, and spend.
- Build evaluation & testing pipelines: create golden datasets and automated evals (offline + CI/CD + canary) to catch regressions from code, prompt, data, or model changes; use LangSmith/OpenAI Evals (or equivalents) to track quality trends over time.
- Platform operations on Azure/Kubernetes: ensure secure, compliant, and cost-efficient operation; maintain IaC, secrets, networking, scaling, and DR/BCP; partner with Security and QA on regulated SaaS controls.
- Cross-functional enablement: work with product/dev teams to set acceptance criteria for AI features, add runtime feature flags/kill-switches, and embed evals/telemetry from day one.
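As a toy illustration of the PII-redaction guardrail mentioned above, the sketch below masks emails and phone numbers with regexes. This is deliberately naive: real deployments would use a dedicated service such as Azure AI Content Safety or a PII-detection model, not hand-rolled patterns.

```python
import re

# Naive patterns for illustration only; real PII detection needs a proper service.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace obvious emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

A rail like this would typically run on both user inputs (before they reach the model) and model outputs (before they reach logs or the user).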
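The token/cost monitoring bullet can be sketched as a small accounting helper. The prices and tier names below are made-up placeholders; actual per-token pricing varies by provider and model and changes over time.

```python
# Assumed per-1K-token prices (USD) for two hypothetical model tiers.
PRICES = {
    "small": {"in": 0.00015, "out": 0.0006},
    "large": {"in": 0.005, "out": 0.015},
}


def request_cost(model_tier, input_tokens, output_tokens):
    """Estimate the dollar cost of one request from its token counts."""
    p = PRICES[model_tier]
    return (input_tokens / 1000) * p["in"] + (output_tokens / 1000) * p["out"]
```

Aggregating this per feature, per tenant, and per model tier is what makes multi-tier model selection (route cheap requests to the small tier) an informed decision rather than a guess.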
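Finally, the golden-dataset evaluation pipeline can be reduced to its core loop: score a generation function against known-good pairs and gate on a pass-rate threshold. This sketch uses exact-match scoring for simplicity; real evals (e.g., via LangSmith or OpenAI Evals) would add semantic or LLM-as-judge scoring.

```python
def run_golden_eval(generate, golden, threshold=0.9):
    """Score `generate` against a golden dataset of (input, expected) pairs.

    Returns (pass_rate, gate_passed, failures); a CI step can fail the build
    when pass_rate drops below `threshold`, catching regressions from code,
    prompt, data, or model changes.
    """
    failures = []
    for prompt, expected in golden:
        output = generate(prompt)
        if output.strip() != expected.strip():  # swap in semantic scoring as needed
            failures.append((prompt, expected, output))
    pass_rate = 1 - len(failures) / len(golden)
    return pass_rate, pass_rate >= threshold, failures
```

Running the same eval offline, in CI, and against a canary deployment gives three chances to catch a regression before it reaches all users.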
🇧🇪 What We Offer – Belgium
- Health Insurance through Alan (100% hospitalisation cover, 80% ambulatory and dental)
- Mobility Budget for eco-transport, housing, or car allowance (flexible 3-pillar system)
- Group Insurance Plan with 6–12% employer pension contribution based on seniority
- Meal Vouchers (€8/day) and Eco Vouchers for sustainable purchases
- A hub-based hybrid model that blends flexibility with purpose — connecting teams through collaboration, learning, and a vibrant social culture.
CluePoints is an equal opportunities employer. We value and respect diversity in our workforce and do not tolerate discrimination based on gender, age, disability, ethnic origin, religion, sexual orientation, or any other protected ground under Belgian law.
Personal data collected as part of your application will be processed in compliance with the EU GDPR and Belgian data protection legislation.
You have the right to access, correct, or delete your personal data at any time by contacting [email protected].
Ready to apply?
Join CluePoints and take your career to the next level!
Application takes less than 5 minutes

