Our team is led by Stefano Ermon (co-inventor of diffusion models, flash attention, and DPO; faculty at Stanford), Aditya Grover (co-inventor of node2vec and decision transformers; faculty at UCLA), and Volodymyr Kuleshov (prev. co-founder and CTO at Afresh Technologies; faculty at Cornell), and includes engineers from Google DeepMind, Meta AI, Microsoft AI, and OpenAI. We are currently deploying large-scale diffusion LLMs at Fortune 500 companies.
Role Overview
We seek experienced Machine Learning Engineers to shape how we collect, process, and curate the datasets that power our models. This interdisciplinary role combines engineering expertise with research insights to build scalable data pipelines, develop synthetic data generation techniques, and ensure our models are trained on high-quality, diverse datasets.
Key Responsibilities
- Design and implement scalable data pipelines for processing petabyte-scale datasets
- Build systems for web crawling, data ingestion, and real-time data processing to support model training operations
- Develop tools and frameworks for efficient data storage, retrieval, and versioning across distributed systems
- Develop techniques for collecting, augmenting, filtering, and synthesizing training data using LLMs and other ML methods
- Create evaluation frameworks to measure data diversity, quality, and representativeness
- Build systems for human-in-the-loop data validation and annotation workflows
- Ensure data collection adheres to privacy regulations
- Collaborate with ML researchers to identify data requirements and optimize training recipes
Qualifications
- BS/MS/PhD in Computer Science, Machine Learning, or related field (or equivalent experience)
- 3+ years of experience building data processing pipelines at scale, particularly with AI/ML applications
- Strong proficiency in Python and experience with data processing frameworks (Apache Spark, Beam, Airflow)
- Experience with distributed computing and large-scale data storage systems (HDFS, S3, BigQuery)
- Solid understanding of machine learning fundamentals and experience with ML frameworks (PyTorch, TensorFlow)
- Experience with SQL and NoSQL databases for managing structured and unstructured data
- Familiarity with version control (Git) and infrastructure as code practices
- Strong analytical skills with attention to detail in data quality assessment
- Excellent communication skills to work effectively with researchers and engineers
- Experience with large language models and understanding of tokenization, embeddings, and model architectures
- Familiarity with web scraping, crawling technologies, and Common Crawl datasets
- Experience managing human annotation workflows and quality control processes
- Experience with vector databases and embedding-based retrieval systems
- Familiarity with synthetic data generation techniques and data augmentation strategies
- Knowledge of data privacy regulations and ethical AI practices
What We Offer
- Impact: Deploy LLMs that transform how millions of users work, create, and solve real-world problems.
- Innovation: Pioneer novel data recipes for diffusion LLMs.
- Growth: Enjoy a fast-paced, collaborative environment where your contributions will directly shape the future of generative AI.
- Competitive salary and equity in a rapidly growing startup.
- Flexible vacation and paid time off (PTO).
- Health, dental, and vision insurance.
- Professional development opportunities (conferences, courses, etc.).
We are an equal opportunity employer and encourage candidates of all backgrounds to apply.
Join Inception and take your career to the next level!