Overview
Pluralis Research is pioneering Protocol Learning, a fully decentralised way to train and deploy AI models that opens this layer to individuals rather than only well-resourced corporations. By pooling compute from many participants, incentivising their efforts, and preventing any single party from controlling a model’s full weights, we’re creating a genuinely open, collaborative path to frontier-scale AI.
We’re looking for an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure powering our decentralized ML training platform. You will own core systems spanning infrastructure orchestration, distributed compute, and services integration, enabling continuous experimentation and large-scale model training.
Responsibilities
- Multi-Cloud Infrastructure: Design resource-management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform). Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.
- Distributed Training Systems: Architect fault-tolerant infrastructure for distributed ML, covering GPU clusters, the NVIDIA runtime, S3 checkpointing, large-dataset management and streaming, health monitoring, and resilient retry strategies.
- Real-World Networking: Build systems that simulate and handle real-world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and keeping data flowing efficiently across workers with heterogeneous connectivity. Our training runs on consumer nodes and non-co-located infrastructure, not in a datacenter.
Ideally, you’ll have 5+ years of work experience, with depth in:
- Infrastructure & Platform Engineering: Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments, lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale.
- Distributed Systems & ML Infrastructure: Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, decentralized networking (P2P, NAT traversal, traffic shaping), and real-world bandwidth constraints.
- Systems Programming & Reliability: Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling) with hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response.
- Experience in a startup environment with an emphasis on microservices orchestration, or a big-tech background
- Deep understanding of multi-cloud infra & distributed training systems
- A team player with high attention to detail
- A strong passion to join the team
Key Skills
Ranked by relevance: cloud, incident response, Python, AWS, GCP, EKS, NAT, AI, S3
- Posted: Feb 23, 2026
- Type: Full-time
- Level: Entry
- Location: Melbourne
- Company: Pluralis Research
- Industries: Technology, Information, Internet
- Categories: Engineering, Information Technology