Bagel Labs
Machine Learning Engineer
Bagel Labs, Canada (posted 2 days ago)
Full-time, Engineering, Information Technology

Bagel Labs is an artificial intelligence research lab developing novel methods for distributed training of frontier diffusion models on commodity hardware. Our work enables training of state-of-the-art generative models for image, video, and world modelling without centralized GPU superclusters, reducing training compute capex by up to 50%.


We ignore years of experience and pedigree. If you have high agency — meaning your default assumption is that you can control the outcome of whatever situation you are in — we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.


Role Description

You will build and run the systems that make decentralized diffusion training work in practice. Training pipelines, inference serving, GPU orchestration across commodity hardware — you own the engineering end-to-end.


Key Responsibilities

  • Build and maintain distributed training pipelines across heterogeneous, commodity GPU hardware.
  • Profile and optimize training throughput, memory usage, and fault tolerance. Write custom CUDA/Triton kernels when needed.
  • Design and operate inference infrastructure: batching, routing, serving large generative models.
  • Ship experiment tracking, CI/CD, and reproducibility tooling for the ML stack.
  • Work directly with researchers to turn new algorithms into code that actually runs at scale.


Who You Might Be

  • Strong in Python and PyTorch. Can read and write C++/CUDA when performance requires it.
  • Experience with distributed training: FSDP, DeepSpeed, Megatron-LM, or custom tensor/pipeline/data parallelism.
  • Systems thinker — you reason about networking, memory layouts, and failure modes upfront.
  • Comfortable with Linux, Docker/Kubernetes, job schedulers, bare-metal and cloud GPU setups.
  • Enough ML fundamentals (transformers, diffusion, optimization) to debug a training run end-to-end and hold your own with researchers.


What We Offer

  • Top-of-market compensation.
  • A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
  • Paid travel to top ML conferences.
