Company Description
Open Innovation AI is a global technology company that specializes in developing advanced solutions for managing AI workloads. Its flagship product, the Open Innovation Cluster Manager (OICM), orchestrates complex AI tasks efficiently across diverse infrastructures. The platform is hardware-agnostic, optimized for a range of GPUs and accelerators, and facilitates seamless integration and scalability for enterprise AI applications. Open Innovation AI focuses on optimizing and simplifying AI workload management and making AI technologies accessible to organizations of all sizes. With its innovative solutions, companies can reduce operational costs, accelerate time to value, and maximize their return on investment, ensuring that their AI strategies contribute directly to enhanced business outcomes.
About the Role
We're looking for a Lead Platform Engineer to design and build OICM (Open Innovation Cluster Manager), our AI/ML orchestration platform for distributed computing. You'll work on systems that manage GPU workloads across cloud and on-premises infrastructure, focusing on reliability, performance, and scalability. This role involves building distributed systems, implementing resource scheduling algorithms, and creating fault-tolerant services that operate across multiple environments. You'll need strong systems architecture skills and experience solving complex engineering problems at scale.
What You'll Do:
- Build distributed systems that handle large-scale AI/ML workloads with high availability requirements
- Develop APIs and microservices that process high request volumes with low latency
- Implement/enhance scheduling algorithms for efficient GPU resource allocation and load balancing
- Drive adoption of clean architecture principles and engineering best practices
- Mentor senior engineers and lead technical initiatives
- Analyze and optimize system performance across distributed environments
Technical Requirements:
- 8+ years of experience in building distributed systems and platform infrastructure
- Expert proficiency in Python and Go
- Advanced Kubernetes experience: Custom operators, CRDs, networking, service mesh, and multi-cluster management
- GPU computing expertise: MIG/vGPU, scheduling, and ML framework integration
- Distributed systems knowledge: Consensus algorithms, caching, message queues, and fault tolerance
- Performance engineering: System profiling, benchmarking, and optimization
- Security practices: Security by design and secrets management
- Leadership experience: Leading technical projects and mentoring teams
Preferred Experience
- Ray, Kubeflow, PyTorch, or similar distributed computing frameworks
- Open-source contributions in Kubernetes or ML infrastructure
- Custom hardware integration and bare metal provisioning
- Advanced networking knowledge
- ML/AI model deployment and serving infrastructure
- Infrastructure automation: Terraform, Ansible, GitOps workflows
Ready to apply?
Join Open Innovation AI and take your career to the next level!

