This is a key role in a growing team building deep technical expertise in ML training systems.
Responsibilities
- Optimize our model training pipeline to improve both speed and reliability, enabling faster and more efficient experimentation;
- Apply GPU-level optimization techniques using tools such as JAX, Triton, and low-level CUDA to improve training performance and efficiency at scale (see the sketch after this list);
- Identify and resolve performance bottlenecks across the entire ML pipeline — from data loading and preprocessing to CUDA kernels;
- Build tools and extend internal infrastructure to support scalable, reproducible, and high-performance training workflows;
- Mentor and support engineers and researchers in adopting performance best practices across the team;
- Help grow the team's GPU and systems-level capabilities, and contribute to a culture of engineering excellence and rapid experimentation.
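To make the kind of GPU-level work in these bullets concrete, here is a minimal, illustrative Triton sketch: a fused elementwise add + ReLU kernel that avoids an extra round trip to global memory. The function names, shapes, and block size are assumptions for illustration, not details from this posting.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusing the add and the ReLU avoids a second round trip to global memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)


def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: assumes contiguous, same-shape CUDA tensors.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Fusing small elementwise ops like this is a common first step when memory traffic or kernel launch overhead, rather than raw compute, dominates a training step.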
Requirements
- Demonstrated experience optimizing neural network training in production or large-scale research settings (e.g. reducing training time, improving hardware utilization, or accelerating feedback cycles for ML researchers);
- Extensive practical experience with ML frameworks such as PyTorch or JAX;
- Hands-on experience training and optimizing deep learning architectures such as LSTM- and Transformer-based models, including different attention mechanisms;
- Experience working with CUDA, Triton, or other low-level GPU technologies for performance tuning;
- Proficiency in profiling and debugging training pipelines, using tools such as Nsight, cProfile, cuda-gdb, and the PyTorch profiler (see the sketch after this list);
- Understanding of distributed training concepts (e.g. data/model/tensor/sequence/pipeline/context parallelism, memory and compute tradeoffs);
- A collaborative and proactive mindset, with strong communication skills and the ability to mentor teammates and partner effectively within the team;
- Strong proficiency in Python for building infrastructure-level tooling, debugging training systems, and integrating with ML frameworks and profiling tools.
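As an illustration of the profiling workflow referenced above, the sketch below records a few training steps with the PyTorch profiler and prints the most expensive CUDA kernels. The model, optimizer, and data are hypothetical placeholders, and the snippet assumes a CUDA-capable GPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Hypothetical toy model and data; assumes a CUDA-capable GPU is available.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(64, 1024, device="cuda")
targets = torch.randn(64, 1024, device="cuda")


def train_step():
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()


# Skip one step, warm up for two, then record three steps of CPU + CUDA activity.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),
    record_shapes=True,
) as prof:
    for _ in range(6):
        train_step()
        prof.step()

# Surface the most expensive kernels first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Sorting by total CUDA time is a quick way to see whether a step is dominated by a handful of heavy kernels or by many small launches, which in turn suggests whether to reach for fusion, better data loading, or a different parallelism strategy.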
We offer
- High base salary and social benefits;
- Generous bonus structure, with flexibility in discussing salary and conditions of employment;
- Cutting-edge hardware and software in production, plus deep in-house technical expertise that makes it possible to implement bold ideas and deliver strong results; ownership over initiatives that directly solve business problems;
- Ability to trade on dozens of international exchanges;
- Flexible workflow and working schedule, with little formalism or bureaucracy and no pressure or over-management;
- Tuition reimbursement and sponsorship for conferences and training.
Ready to apply?
Join Pinely and take your career to the next level!
Applying takes less than 5 minutes.