Company Description
Open Innovation AI is a global technology company that specializes in developing advanced solutions for managing AI workloads. Its flagship product, the Open Innovation Cluster Manager (OICM), orchestrates complex AI tasks efficiently across diverse infrastructures. The platform is hardware-agnostic, optimized for a range of GPUs and accelerators, and facilitates seamless integration and scalability for enterprise AI applications. Open Innovation AI focuses on optimizing and simplifying AI workload management and making AI technologies accessible to organizations of all sizes. With its innovative solutions, companies can reduce operational costs, accelerate time to value, and maximize their return on investment, ensuring that their AI strategies contribute directly to enhanced business outcomes.
About the Role
We're looking for a Lead Platform Engineer to design and build OICM (Open Innovation Cluster Manager), our AI/ML orchestration platform for distributed computing. You'll work on systems that manage GPU workloads across cloud and on-premises infrastructure, focusing on reliability, performance, and scalability. This role involves building distributed systems, implementing resource scheduling algorithms, and creating fault-tolerant services that operate across multiple environments. You'll need strong systems architecture skills and experience solving complex engineering problems at scale.
What You'll Do:
- Build distributed systems that handle large-scale AI/ML workloads with high availability requirements
- Develop APIs and microservices that process high request volumes with low latency
- Implement/enhance scheduling algorithms for efficient GPU resource allocation and load balancing
- Drive adoption of clean architecture principles and engineering best practices
- Mentor senior engineers and lead technical initiatives
- Analyze and optimize system performance across distributed environments
Technical Requirements:
- 8+ years of experience in building distributed systems and platform infrastructure
- Expert proficiency in Python and Go
- Advanced Kubernetes experience: Custom operators, CRDs, networking, service mesh, and multi-cluster management
- GPU computing expertise: MIG/vGPU, scheduling, and ML framework integration
- Distributed systems knowledge: Consensus algorithms, caching, message queues, and fault tolerance
- Performance engineering: System profiling, benchmarking, and optimization
- Security practices: Security by design and secrets management
- Leadership experience: Leading technical projects and mentoring teams
Preferred Experience
- Ray, Kubeflow, PyTorch, or similar distributed computing frameworks
- Open-source contributions in Kubernetes or ML infrastructure
- Custom hardware integration and bare metal provisioning
- Advanced networking knowledge
- ML/AI model deployment and serving infrastructure
- Infrastructure automation: Terraform, Ansible, GitOps workflows
Ready to apply?
Join Open Innovation AI and take your career to the next level!

