NuMind (YC S22)
Machine Learning Research (M2) Intern
France
Internship, Remote Friendly

Join our research team to solve information extraction 🙂


We are looking for two ML research interns completing their Master 2 to work on creating extraction-specific LLMs like NuExtract and NuMarkdown. Possibility of publications. If successful, this M2 internship will lead to a CIFRE PhD or full-time hire.


If interested, send your CV to [email protected] and briefly answer these two questions:

  • What is your favorite internship topic from the list below? (You can also propose a topic.)
  • How would you tackle it?


Company Description

We are a 3-year-old startup with 10 employees, located at Station F, Paris. We create LLMs for information extraction and develop a platform for using these LLMs. Our goal is to solve information extraction.


We created:

  • NuNER (paper: https://arxiv.org/abs/2402.15343)
  • NuExtract, for structured extraction (blog: https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction)
  • NuMarkdown, the first reasoning OCR model (HF: https://huggingface.co/numind/NuMarkdown-8B-Thinking)


Check our open-source models here: https://huggingface.co/numind


Workplace / Team

The internship is based at Station F, Paris, with the possibility of working from home a few days a week.


You will work directly with our CEO (https://www.linkedin.com/in/etiennebcp/) and our AI experts who created NuExtract (e.g. https://www.linkedin.com/in/alexandre-constantin-0800661ba/, https://www.linkedin.com/in/liam-cripwell-212a36105/).


You will also collaborate with researchers from the LLF (e.g. https://www.linkedin.com/in/timothée-bernard-47327986/, https://www.linkedin.com/in/benoit-crabbé-2982452/).


You will have access to H200 GPUs and frontier model APIs.


Qualifications

  • Currently pursuing an M2 (second-year Master's) in AI/machine learning/NLP


Possible Internship Topics (non-exhaustive list):


Topic 1: Extraction Confidence

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know which values NuExtract is confident about, and which ones it is not.

The goal of this project is to figure out how to obtain an uncertainty score for each value extracted by NuExtract. This is not trivial because a field can have multiple correct answers and because answers are correlated with one another.
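To make the problem concrete, here is a naive baseline sketch, offered only as a possible starting point and not as a description of how NuExtract works. It assumes you can read per-token log-probabilities for each extracted value from the model; the field names and numbers below are hypothetical. It scores a value by the geometric mean of its token probabilities, which ignores both difficulties mentioned above (several correct phrasings splitting the probability mass, and dependencies between fields).

```python
import math


def value_confidence(token_logprobs):
    """Length-normalized probability (geometric mean) of a value's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for two extracted values.
extraction_logprobs = {
    "invoice_number": [-0.01, -0.02, -0.01],
    "due_date": [-0.9, -1.2, -0.4],
}

# Higher score = the model put more probability on the tokens it emitted.
scores = {
    field: value_confidence(logprobs)
    for field, logprobs in extraction_logprobs.items()
}
```

A score like this is cheap to compute, but it is not calibrated; checking how well it tracks actual correctness would be part of the project.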


Topic 2: Extraction Localization

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know where in the document each piece of information comes from (or is deduced from).

The goal of this project is to figure out how to do this.
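As an illustration of the simplest case only, the sketch below locates a value that appears (almost) verbatim in the source text by finding the longest character span it shares with the document, using Python's standard `difflib`. The function name `locate` and the sample document are hypothetical. It does nothing for values that are deduced or reformatted, which is where the research question lies.

```python
from difflib import SequenceMatcher


def locate(value, document):
    """Return (start, end) of the longest span of `document` shared with `value`.

    Returns None when the two strings have no character in common.
    """
    matcher = SequenceMatcher(None, document, value, autojunk=False)
    match = matcher.find_longest_match(0, len(document), 0, len(value))
    if match.size == 0:
        return None
    return match.a, match.a + match.size


# Hypothetical document and extracted value.
document = "Invoice total: 1,250.00 EUR due on 2024-05-01."
span = locate("1,250.00", document)
```

Here `document[span[0]:span[1]]` gives back the matched text, which a user interface could highlight.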


Topic 3: Long Document Extraction

LLMs have a limited context length. The goal of this project is to figure out how NuExtract could extract information from documents much longer than its context length.
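A common baseline to compare against, sketched here under simplifying assumptions, is to split the document into overlapping chunks, extract from each chunk, and merge the results. The helpers `chunk_text` and `merge_extractions` are hypothetical, sizes are counted in characters rather than tokens, and the merge rule (keep the first non-empty value per field) is deliberately naive.

```python
def chunk_text(text, size, overlap):
    """Split `text` into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters, so a value cut at one
    chunk boundary can still appear whole in the neighboring chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, stop, step)]


def merge_extractions(results):
    """Merge flat per-chunk extractions, keeping the first non-None value."""
    merged = {}
    for result in results:
        for field, value in result.items():
            if merged.get(field) is None:
                merged[field] = value
    return merged


chunks = chunk_text("abcdefghij", size=4, overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

This baseline breaks down when the answer depends on information spread across distant chunks, or when chunks disagree, which is what makes the topic interesting.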


Topic 4: Reasoning for Structured Extraction

The goal of this project is to train a NuExtract model that can reason about its extraction via a private chain of thought.


Topic 5: Extraction Agent

Related to Topic 4, the goal of this project is to give a reasoning NuExtract the ability to use tools (e.g. zooming in on a document or performing a web search) in order to improve extraction quality.


Topic 6: Conversational Extraction

In many industries, one needs to obtain various pieces of information from a customer. For example, to find the best insurance, a broker might want to know some features of a customer’s house (surface, number of bedrooms, location, etc.). This task essentially amounts to filling out a form interactively.

Your goal will be to create a conversational AI specialized in obtaining such information. The AI should be given a JSON template defining the target information, and then be able to carry on a text-based conversation to obtain this information.
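To illustrate the form-filling view of the task, the sketch below tracks which fields of a template are still unanswered, so that a dialogue loop would know what to ask next. The template format, the field names, and the helper `missing_fields` are all hypothetical; this is bookkeeping only, and the conversational model itself is the actual subject of the internship.

```python
def missing_fields(template, answers, prefix=""):
    """List dotted paths of template fields that have no answer yet."""
    missing = []
    for key, spec in template.items():
        path = f"{prefix}{key}"
        if isinstance(spec, dict):
            # Nested section: recurse into the matching part of the answers.
            missing += missing_fields(spec, answers.get(key) or {}, path + ".")
        elif answers.get(key) is None:
            missing.append(path)
    return missing


# Hypothetical template for the insurance example above.
template = {
    "house": {
        "surface_m2": "number",
        "bedrooms": "integer",
        "location": "string",
    }
}
answers_so_far = {"house": {"surface_m2": 85}}

todo = missing_fields(template, answers_so_far)
# todo == ["house.bedrooms", "house.location"]
```

The conversation would end once this list is empty.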


Topic 7: Structured Extraction Benchmark

Structured Extraction is the generic task of extracting information from a document and returning it as structured output (e.g. JSON); see https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction

Thanks to modern generative LLMs, we can now tackle this task even for deep extraction trees. However, unlike for classification and NER, there is no public benchmark to test models on.

Your goal will be to create such a benchmark.
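Part of designing such a benchmark is choosing how to score a predicted tree against a reference. The sketch below shows one simple candidate metric, given purely as an assumption for discussion: flatten both JSON objects into (path, value) pairs and compute an exact-match F1 over those pairs. The helper names and the example records are hypothetical.

```python
def flatten(obj, path=()):
    """Yield (path, leaf_value) pairs for a nested JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, path + (index,))
    elif obj is not None:  # null leaves count as "not extracted"
        yield path, obj


def field_f1(predicted, expected):
    """Exact-match F1 over flattened (path, value) pairs."""
    pred = set(flatten(predicted))
    gold = set(flatten(expected))
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and prediction: one of five leaves is wrong.
expected = {"name": "Ada", "orders": [{"id": 1, "total": 10.0}, {"id": 2, "total": 5.0}]}
predicted = {"name": "Ada", "orders": [{"id": 1, "total": 10.0}, {"id": 2, "total": 7.0}]}
score = field_f1(predicted, expected)
```

This metric has clear weaknesses: list items are aligned by index, so a correct but reordered list is penalized, and near-miss values get no partial credit. Deciding how to handle such cases is part of the benchmark design.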
