Machine Learning Research (M2) Intern

NuMind (YC S22), France

Internship, Remote Friendly

Join our research team to solve information extraction 🙂

We are looking for two ML research interns completing their Master 2 to work on creating extraction-specific LLMs like NuExtract and NuMarkdown. Possibility of publications. If successful, this M2 internship will lead to a CIFRE PhD or full-time hire.

If interested, send your CV to [email protected] and briefly answer these two questions:

What is your favorite internship topic from the list below? (You can also propose a topic.)
How would you tackle it?

Company Description

We are a 3-year-old startup with 10 employees located at Station F, Paris. We create LLMs for information extraction and develop a platform for using these LLMs. Our goal is to solve information extraction.

We created:

NuNER (paper: https://arxiv.org/abs/2402.15343)
NuExtract, for structured extraction (blog: https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction)
NuMarkdown, the first reasoning OCR model (HF: https://huggingface.co/numind/NuMarkdown-8B-Thinking)

Check our open-source models here: https://huggingface.co/numind

Workplace / Team

The internship is based at Station F, Paris, with the possibility of working from home a few days a week.

You will work directly with our CEO (https://www.linkedin.com/in/etiennebcp/) and our AI experts who created NuExtract (e.g. https://www.linkedin.com/in/alexandre-constantin-0800661ba/, https://www.linkedin.com/in/liam-cripwell-212a36105/).

You will also collaborate with researchers from the LLF (e.g. https://www.linkedin.com/in/timothée-bernard-47327986/, https://www.linkedin.com/in/benoit-crabbé-2982452/).

You will have access to H200 GPUs and frontier model APIs.

Qualifications

Currently completing an M2 (Master 2) in AI/machine learning/NLP

Possible Internship Topics (non-exhaustive list):

Topic 1: Extraction Confidence

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know which values NuExtract is confident about, and which ones it is not.

The goal of this project is to figure out how we can get an uncertainty score for the values extracted by NuExtract. This is not trivial due to the multiplicity of correct answers and the correlations between answers.
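
As a point of reference, one naive baseline is to score each extracted value from the log-probabilities of its generated tokens. The sketch below is only an illustration, not how NuExtract works: the function names and the log-probability values are hypothetical, and it deliberately ignores the two difficulties mentioned above (multiple correct answers and correlations between answers).

```python
import math

def value_confidence(token_logprobs):
    """Joint probability of the tokens that make up one extracted value."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs))

def length_normalized_confidence(token_logprobs):
    """Geometric mean of token probabilities, less biased against long values."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two extracted fields.
extraction = {
    "invoice_number": [-0.01, -0.02, -0.01],
    "total_amount": [-0.9, -1.2],
}
scores = {field: value_confidence(lps) for field, lps in extraction.items()}
```

Under these made-up numbers, "invoice_number" would be ranked as more certain than "total_amount". A value that can be written in several equally valid ways would be unfairly penalized by this score, which is part of what makes the topic hard.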

Topic 2: Extraction Localization

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know where in the document the information comes from (or is deduced from).

The goal of this project is to figure out how to do this.

Topic 3: Long Document Extraction

LLMs have a limited context length. The goal of this project is to figure out how NuExtract could extract information from documents much longer than its context length.
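
A common starting point, shown here only as an assumed baseline, is to split the document into overlapping windows, run extraction on each window, and merge the partial results. The helper names, the window and overlap parameters, and the "first non-null value wins" merge rule are all hypothetical choices for illustration.

```python
def chunk_tokens(tokens, window, overlap):
    """Split a token list into windows of at most `window` tokens sharing `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def merge_extractions(partials):
    """Keep, for each field, the first non-null value found across chunks."""
    merged = {}
    for partial in partials:
        for field, value in partial.items():
            if merged.get(field) is None:
                merged[field] = value
    return merged
```

This simple merge breaks down when a value depends on information spread across distant chunks, or when chunks disagree, which is where the research question starts.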

Topic 4: Reasoning for Structured Extraction

The goal of this project is to train a NuExtract model that can reason about its extraction via a private chain of thought.

Topic 5: Extraction Agent

Related to topic 4, the goal of this project is to give a reasoning NuExtract the ability to use tools (e.g. zooming in on a document or performing a web search) in order to improve extraction quality.

Topic 6: Conversational Extraction

In many industries, one needs to obtain various pieces of information from a customer. For example, to find the best insurance, a broker might want to know some features of a customer’s house (surface, number of bedrooms, location, etc.). This task is essentially about filling out a form interactively.

Your goal will be to create a conversational AI specialized in obtaining such information. The AI should be given a JSON template defining the target information, and then be able to carry on a text-based conversation to obtain this information.
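
To make the form-filling view concrete, here is a minimal sketch of the bookkeeping such a system might need: given a template and the answers collected so far, list the fields still missing. The template format (field names mapped to type strings), the house example values, and the function name are assumptions made for illustration only.

```python
def missing_fields(template, state, prefix=""):
    """List dotted paths of template fields that are still unfilled in `state`."""
    missing = []
    for key, spec in template.items():
        path = f"{prefix}{key}"
        value = state.get(key) if isinstance(state, dict) else None
        if isinstance(spec, dict):
            # Nested section of the form: recurse with the partial answers, if any.
            missing.extend(missing_fields(spec, value or {}, prefix=f"{path}."))
        elif value is None:
            missing.append(path)
    return missing

# Hypothetical template and conversation state for the house example.
template = {
    "surface": "number",
    "bedrooms": "integer",
    "location": {"city": "string", "postal_code": "string"},
}
state = {"surface": 85, "location": {"city": "Paris"}}
next_questions = missing_fields(template, state)
```

With this state, the fields left to ask about are "bedrooms" and "location.postal_code". The hard part of the topic, deciding how to ask and how to parse free-text answers, is not covered by this sketch.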

Topic 7: Structured Extraction Benchmark

Structured Extraction is the generic task of extracting information from a document and returning it as a structured output (e.g. JSON); see https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction

Thanks to modern generative LLMs, we can now tackle this task even for deep extraction trees. However, unlike for classification and NER, there is no public benchmark to test models on.

Your goal will be to create such a benchmark.
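
One design question for a benchmark like this is how to score a predicted tree against a reference tree. The sketch below shows one possible metric, assumed here for illustration and not taken from NuMind: flatten both trees into (path, value) leaves and compute an exact-match F1 over them. The function names and the example data are hypothetical.

```python
def flatten(tree, prefix=""):
    """Flatten nested dicts/lists into a set of (path, value) leaf pairs; nulls are skipped."""
    if isinstance(tree, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in tree.items()]
    elif isinstance(tree, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(tree)]
    else:
        return {(prefix, tree)} if tree is not None else set()
    leaves = set()
    for path, value in items:
        leaves |= flatten(value, path)
    return leaves

def leaf_f1(predicted, gold):
    """Exact-match F1 between the leaves of a predicted and a reference tree."""
    pred, ref = flatten(predicted), flatten(gold)
    if not pred and not ref:
        return 1.0  # both empty: treat as a perfect match
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

gold = {"name": "Ada", "jobs": [{"title": "engineer", "year": 1843}]}
pred = {"name": "Ada", "jobs": [{"title": "engineer", "year": 1842}]}
score = leaf_f1(pred, gold)
```

Exact matching and index-based list paths are simplifications: a real benchmark would probably need to handle reordered lists and values that are correct but phrased differently.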
