Join our research team to solve information extraction 🙂
We are looking for two ML research interns completing their Master 2 to work on creating extraction-specific LLMs like NuExtract and NuMarkdown. Possibility of publications. If successful, this M2 internship will lead to a CIFRE PhD or full-time hire.
If interested, send your CV to [email protected] and briefly answer these two questions:
- What is your favorite internship topic from the list below? (You can also propose a topic.)
- How would you tackle it?
Company Description
We are a 3-year-old startup with 10 employees, located at Station F, Paris. We create LLMs for information extraction and develop a platform for using these LLMs. Our goal is to solve information extraction.
We created:
- NuNER (paper: https://arxiv.org/abs/2402.15343)
- NuExtract, for structured extraction (blog: https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction)
- NuMarkdown, the first reasoning OCR model (HF: https://huggingface.co/numind/NuMarkdown-8B-Thinking)
Check out our open-source models here: https://huggingface.co/numind
Workplace / Team
The internship is in Station F, Paris. Possibility to work from home a few days a week.
You will work directly with our CEO (https://www.linkedin.com/in/etiennebcp/) and our AI experts who created NuExtract (e.g. https://www.linkedin.com/in/alexandre-constantin-0800661ba/, https://www.linkedin.com/in/liam-cripwell-212a36105/).
You will also collaborate with researchers from the LLF (e.g. https://www.linkedin.com/in/timothée-bernard-47327986/, https://www.linkedin.com/in/benoit-crabbé-2982452/).
You will have access to H200 GPUs and frontier model APIs.
Qualifications
- Currently completing an M2 (Master 2) in AI/machine learning/NLP
Topic 1: Extraction Confidence
Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know which values NuExtract is confident about, and which ones it is not.
The goal of this project is to figure out how to obtain an uncertainty score for each value extracted by NuExtract. This is not trivial, because a field can have several correct answers and because answers are correlated with one another.
Topic 2: Extraction Localization
Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know where in the document the information comes from (or is deduced from).
The goal of this project is to figure out how to link each extracted value to its source location in the document.
Topic 3: Long Document Extraction
LLMs have a limited context length. The goal of this project is to figure out how NuExtract could extract information from documents much longer than its context length.
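As a point of reference, a naive baseline for this topic could split the document into overlapping chunks, extract from each chunk, and merge the results. The sketch below is only an illustration under stated assumptions: `split_into_chunks`, `extract_long`, and the `extract_chunk` callback are hypothetical names, not part of NuExtract, and chunk sizes are counted in characters rather than tokens.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: one extraction call on a single chunk (stands in for a model call).
Extractor = Callable[[str], Dict[str, Optional[str]]]


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows of at most chunk_size,
    each sharing `overlap` characters with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def extract_long(text: str, extract_chunk: Extractor,
                 chunk_size: int = 2000, overlap: int = 200) -> Dict[str, Optional[str]]:
    """Run the extractor on every chunk and keep the first non-empty value per field."""
    merged: Dict[str, Optional[str]] = {}
    for chunk in split_into_chunks(text, chunk_size, overlap):
        for field, value in extract_chunk(chunk).items():
            if value is not None and merged.get(field) is None:
                merged[field] = value
            else:
                merged.setdefault(field, None)
    return merged
```

This baseline ignores conflicting values across chunks and information that can only be deduced by combining distant passages, which is where the research question starts.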
Topic 4: Reasoning for Structured Extraction
The goal of this project is to train a version of NuExtract that can reason about its extraction via a private chain of thought.
Topic 5: Extraction Agent
Related to topic 4, the goal of this project is to give a reasoning NuExtract the ability to use tools (e.g. zooming in on a document or performing a web search) in order to improve extraction quality.
Topic 6: Conversational Extraction
In many industries, one needs to obtain various information from a customer. For example, to find the best insurance, a broker might want to know some features of a customer’s house (surface area, number of bedrooms, location, etc.). This task is essentially about filling out a form interactively.
Your goal will be to create a conversational AI specialized in obtaining such information. The AI should be given a JSON template defining the target information, and then be able to carry on a text-based conversation to obtain this information.
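To make the setup concrete, here is a minimal sketch of the bookkeeping side of such a system: given a JSON template and the answers collected so far, it lists the fields that still need to be asked about. The template format (field names mapped to type strings), the field names, and the function `missing_fields` are assumptions made for illustration, not a NuExtract specification.

```python
import json
from typing import Any, Dict, List

# Hypothetical template, following the house insurance example above.
TEMPLATE = json.loads("""
{
  "house": {
    "surface_m2": "number",
    "bedrooms": "integer",
    "location": "string"
  }
}
""")


def missing_fields(template: Dict[str, Any], answers: Dict[str, Any],
                   prefix: str = "") -> List[str]:
    """Return dotted paths of template fields that have no answer yet."""
    missing: List[str] = []
    for key, spec in template.items():
        path = f"{prefix}{key}"
        if isinstance(spec, dict):
            # Nested object: recurse, treating an absent sub-object as empty.
            sub = answers.get(key)
            missing += missing_fields(spec, sub if isinstance(sub, dict) else {}, path + ".")
        elif answers.get(key) is None:
            missing.append(path)
    return missing


# After the customer has only mentioned the number of bedrooms:
collected = {"house": {"bedrooms": 3}}
remaining = missing_fields(TEMPLATE, collected)
```

In this sketch the conversation would continue until `remaining` is empty; deciding what to ask next, and parsing free-text replies into the template, is the actual modeling problem.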
Topic 7: Structured Extraction Benchmark
Structured Extraction is the generic task of extracting information from a document and returning it as a structured output (e.g. a JSON), see https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction
Thanks to modern generative LLMs, we can now tackle this task even for deep extraction trees. However, unlike for classification and NER, there is no public benchmark to test models on.
Your goal will be to create such a benchmark.
Join NuMind (YC S22) and take your career to the next level!