Position: Principal AI Engineer
Location: Abu Dhabi (on-site)
⸻
About the Client
We are sourcing on behalf of Alpheya, a B2B WealthTech startup based in Abu Dhabi backed by BNY Mellon and Lunate (a $100B AUM alternative asset management firm). The company has raised $300M to build a state-of-the-art wealth technology platform that enables banks and financial institutions in the Middle East to better serve affluent, HNW and UHNW investor segments. Alpheya operates as a true startup with cross-functional and agile teams while leveraging the capabilities and knowledge of established global partners.
About the Role
They are looking for a software engineer who builds production systems and who has spent the last few years applying that discipline to AI-powered products.
You will take validated AI prototypes and turn them into production-grade software systems. You'll focus on reliability, observability, maintainability, and clear architecture for AI-powered features in a regulated environment.
You will also lead and mentor a group of data and software engineers to deliver reliably and raise the engineering bar.
This is not a DevOps role. You will partner closely with the DevOps/SRE team (which owns core infrastructure, Kubernetes, and Terraform) to ensure AI services are operable and meet agreed SLAs.
Responsibilities
Productionising AI Features (core focus)
- Own the AI API surface in production: contracts/schemas, versioning, backward compatibility, and behaviour guarantees for downstream consumers
- Take RAG/agent prototypes from notebook/PoC to production services: clean interfaces, robust runtime behavior, and safe rollout paths
- Implement reliability patterns: timeouts, retries with backoff, idempotency, circuit breakers, rate limiting, graceful degradation, and fallbacks
- Build observability end-to-end: structured logging, metrics, tracing (OpenTelemetry), and actionable dashboards/alerts
- Own release quality: CI/CD for AI services, prompt/config versioning, regression tests, and staged deployments
- Drive operational readiness: runbooks, on-call-friendly diagnostics, incident retros, and continuous hardening
Architecture & System Design (important gap to fill)
- Design and evolve AI API contracts (endpoints/tool contracts), ensuring safe, stable interfaces and clear ownership boundaries
- Design service boundaries and interfaces for AI capabilities (APIs, contracts, and dependencies)
- Make pragmatic tradeoffs across latency, cost, quality, and compliance; document and communicate decisions
- Define patterns for state, memory, and persistence in agentic workflows (including partial failure handling and recovery)
- Establish integration patterns with existing platform services and data sources (without duplicating DevOps ownership)
Data & Retrieval Systems (as used by product features)
- Build/operate ingestion and refresh pipelines that support product knowledge bases (freshness, lineage, auditability)
- Implement retrieval quality monitoring (e.g., drift, relevance), caching strategies, and evaluation harnesses
- Partner with data/analytics teams on data contracts, validation checks, and SLAs
Team Leadership & Engineering Standards
- Lead and develop a team of data and software engineers. Set direction, review work, unblock people.
- Run design reviews and code reviews that raise the engineering bar without slowing delivery
- Establish shared patterns and standards for production AI systems that the team can scale on
- Collaborate across AI Product Engineering, Data Science, DevOps/SRE, Security, and Product to keep ownership boundaries clean
Innovation in AI SDLC & Product Delivery
- Own the evolution of our AI SDLC and AI stack: evaluate, pilot, and productionize tools/practices that measurably improve quality, reliability, delivery speed, latency, or cost (with clear success metrics and rollback paths)
- Enable innovation by AI product engineers/data scientists through reusable frameworks, templates, and paved paths
- Bring leading LLM engineering discipline into production
- Translate new capabilities (agents/tooling) into stable, well-governed product APIs without compromising operability or compliance
Requirements
- You are a software engineer first. 7+ years building production backend systems, with strong opinions about API design, error handling, testing, and operability
- Proven ability to turn ambiguous prototypes into reliable services with clear operational characteristics
- Comfortable owning systems across the full lifecycle: design → build → launch → operate
- TypeScript or Python at a production level: you write services, not scripts. Clean abstractions, proper error handling, tested code
- You can lead engineers. You've mentored, set technical direction, and delivered through a team, not just as an individual contributor
Technical Skills
- Strong production-grade Python (or similar backend language): API/service development, performance, testing discipline
- Solid understanding of reliability engineering: resiliency patterns, SLOs/SLAs, capacity planning, and incident response
- Observability expertise: OpenTelemetry, metrics/alerting, tracing, and debugging distributed systems
- Practical experience with LLM application stacks (RAG/agents/tooling) and evaluation/testing approaches
- SQL fluency for investigating system behavior and data issues
By applying to this position, you are granting us permission to process your CV and keep your profile on file for consideration for this and future opportunities.
Posted: May 11, 2026
Type: Full-time
Level: Mid-Senior
Location: Abu Dhabi Emirate
Company: Professional.me