NuMind (YC S22)
Research Scientist – Large Language Models
France | Posted 18 days ago
Full-time | Remote Friendly

Join our research team to solve information extraction! 🙂


*Recent PhD required*

*You need to be an ML, NLP, and LLM expert*


We are looking for a Research Scientist who has recently completed a PhD to create LLMs & VLMs such as NuExtract and NuMarkdown, which power the https://nuextract.ai/ platform.


Your job will involve creating datasets, training LLMs, running experiments and ablation studies, and so on. Check the list of typical topics below.


You will join a team of brilliant ML scientists supervised by our CEO (https://www.linkedin.com/in/etiennebcp/).


We are a 3-year-old AI startup with 12 employees, based at Station F, Paris. We went through Y Combinator (S22).


We have a hybrid work model -- you should be able to work from our office regularly (at least once a week).


Requirements

  • You should have recently completed a PhD or post-doc.
  • You should have an ML/NLP/LLM background.
  • You should be self-driven, creative, and passionate about ML/NLP/LLMs.
  • You should have both a researcher and a hacker/builder mindset.
  • You should like working in a startup environment (fast pace, frequent changes of direction).


Responsibilities

  • Training task-specific LLMs
  • Running experiments/ablation studies
  • Creating datasets
  • Developing software related to LLMs
  • Staying up to date with relevant LLM & NLP research


Typical R&D topics we are working on (non-exhaustive list):


1. Extraction Confidence

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values in the JSON output. To do so, they need to know which values NuExtract is confident about, and which ones it is not.

We want to figure out how to get an uncertainty score for each value NuExtract extracts. This is not trivial, because a field can have multiple correct answers and because answers are correlated with one another.


2. Extraction Localization

Users of NuExtract.ai want to be able to quickly verify the validity of extracted values. To do so, they need to know where in the document the information comes from (or is deduced from).

We want to figure out how to do this.


3. Long Document Extraction

LLMs have a limited context length, which limits the size of the documents they can process. We want to figure out how NuExtract could extract information from documents much longer than its context length.


4. Reasoning for Structured Extraction

We want to train NuExtract to reason about its extraction via a private chain of thought.


5. Extraction Agent

We want to give a reasoning NuExtract the ability to use tools (e.g. zooming in on a document or performing a web search) in order to improve extraction quality.


6. Structured Extraction Benchmark

There is no public benchmark for structured extraction. We want to create such a benchmark and make it public.


Links:


  • Platform: https://nuextract.ai/
  • Blog posts: https://numind.ai/blog
  • Hugging Face: https://huggingface.co/numind
  • GitHub: https://github.com/numindai
  • Discord: https://discord.com/invite/3tsEtJNCDe
  • NuNER paper: https://arxiv.org/abs/2402.15343
