Bagel Labs is an artificial intelligence research lab developing novel methods for distributed training of frontier diffusion models on commodity hardware. Our work enables training state-of-the-art generative models for image, video, and world modelling without centralized GPU superclusters, reducing training compute capex by up to 50%.
We ignore years of experience and pedigree. If you have high agency — meaning your default assumption is that you can control the outcome of whatever situation you are in — we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Description
You will build and run the systems that make decentralized diffusion training work in practice. Training pipelines, inference serving, GPU orchestration across commodity hardware — you own the engineering end-to-end.
Key Responsibilities
- Build and maintain distributed training pipelines across heterogeneous, commodity GPU hardware.
- Profile and optimize training throughput, memory usage, and fault tolerance. Write custom CUDA/Triton kernels when needed.
- Design and operate inference infrastructure: batching, routing, serving large generative models.
- Ship experiment tracking, CI/CD, and reproducibility tooling for the ML stack.
- Work directly with researchers to turn new algorithms into code that actually runs at scale.
Who You Might Be
- Strong in Python and PyTorch. Can read and write C++/CUDA when performance requires it.
- Experience with distributed training: FSDP, DeepSpeed, Megatron-LM, or custom tensor/pipeline/data parallelism.
- Systems thinker — you reason about networking, memory layouts, and failure modes upfront.
- Comfortable with Linux, Docker/Kubernetes, job schedulers, bare-metal and cloud GPU setups.
- Enough ML fundamentals (transformers, diffusion, optimization) to debug a training run end-to-end and hold your own with researchers.
What We Offer
- Top-of-market compensation.
- A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
- Paid travel to top ML conferences.

