We are a small, well-funded team working on difficult, high-impact problems at the intersection of AI and distributed systems. We primarily work in-person from our office in downtown San Francisco.
Responsibilities
- Design and implement optimization techniques to increase model throughput and reduce latency across our suite of models
- Deploy and maintain large language models at scale in production environments
- Deploy new models as they are released by frontier labs
- Implement techniques like quantization, speculative decoding, and KV cache reuse
- Contribute regularly to open source projects such as SGLang and vLLM
- Dive deep into the codebases of TensorRT, PyTorch, TensorRT-LLM, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues
- Collaborate with the engineering team to bring new features and capabilities to our inference platform
- Develop robust and scalable infrastructure for AI model serving
- Create and maintain technical documentation for inference systems
Requirements
- 3+ years of experience writing high-performance, production-quality code
- Strong proficiency with Python and deep learning frameworks, particularly PyTorch
- Demonstrated experience with LLM inference optimization techniques
- Hands-on experience with SGLang and vLLM, with contributions to these projects strongly preferred
- Familiarity with Docker and Kubernetes for containerized deployments
- Experience with CUDA programming and GPU optimization
- Strong understanding of distributed systems and scalability challenges
- Proven track record of optimizing AI models for production environments
- Familiarity with TensorRT and TensorRT-LLM
- Knowledge of vision models and multimodal AI systems
- Experience implementing techniques like quantization and speculative decoding
- Contributions to open source machine learning projects
- Experience with large-scale distributed computing
We offer competitive compensation and equity in a high-growth startup. The base salary range for this role is $180,000 - $250,000, plus comprehensive benefits including:
- Full healthcare coverage
- Quarterly offsites
- Flexible PTO