Louis Brulé Naudet

louisbrulenaudet

AI & ML interests

Research in business taxation and development, University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program and the Google Cloud Platform for Startups program | Hugging Face for Legal 🤗

louisbrulenaudet's activity

replied to tomaarsen's post 8 days ago

A dream update! I was just about to start working on a hard negatives mining function, @tomaarsen , so I'm gaining hours of sleep thanks to you and the rest of the community 😅

I'm going to test it out as soon as possible!

PS: I'd like to take this opportunity to thank you again for the new documentation, which is just perfect.

posted an update 9 days ago
The Romulus model series has been released on Hugging Face, continually pre-trained on 34,864,949 tokens of French laws and intended to serve as a foundation for fine-tuning on labeled data 🤗

The training code, dataset, and model weights are open and freely available on HF, and training ran on an H100 provided by Microsoft for Startups, using Unsloth AI by @danielhanchen and @shimmyshimmer 🦥

Link to the base model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1

Link to the instruct model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct

Link to the dataset: louisbrulenaudet/Romulus-cpt-fr

Please note that these models have not been aligned to produce usable text as they stand, and they will certainly need to be fine-tuned for the desired tasks in order to produce satisfactory results.
posted an update 11 days ago
An example of the application of LegalKit is the production of knowledge graphs; here is a demo Space 🔗

With the update of the French legal code data model uploaded to 🤗 and the introduction of a column dedicated to HTML text, it's now easy to extract links between different articles and produce complex graphs with just a few lines of Python.

This simplified demo highlights the ease of implementation and the creative potential, and it enables the generation of complete datasets, although a powerful graphics card is required for display. The framework used for the moment is D3.js, but other solutions are certainly possible. I'd be delighted to hear your suggestions, and I look forward to hearing from the community.
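The extraction step can be sketched in a few lines. The rows, column name, and markup below are hypothetical stand-ins for the actual dataset schema, which may differ:

```python
import re

# Hypothetical rows standing in for the dataset's HTML column.
articles = [
    {"id": "LEGIARTI-1", "html": '<p>Voir <a href="#LEGIARTI-2">art. 2</a> et <a href="#LEGIARTI-3">art. 3</a>.</p>'},
    {"id": "LEGIARTI-2", "html": '<p>Renvoi : <a href="#LEGIARTI-3">art. 3</a>.</p>'},
]

def extract_edges(rows):
    """Collect (source, target) pairs from the anchors in each article's HTML."""
    return [
        (row["id"], target)
        for row in rows
        for target in re.findall(r'href="#([^"]+)"', row["html"])
    ]

# Each edge can then be handed to a graph layer (here, D3.js) for display.
edges = extract_edges(articles)
```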

Link to the 🤗 Space: louisbrulenaudet/legalkit-knowledge-graph
replied to victor's post 11 days ago

Hello @victor , thank you very much for this call. I have a few ideas, some of which overlap with those of other members of the community:

  • the possibility of sending DMs to other users or organizations;
  • the ability to disable update notifications for a Space when you're the producer. This would be particularly useful for daily tasks, so as not to pollute the feeds of people who follow users who update datasets every day;
  • image-text-to-text model support in the serverless API;
  • JSON mode, in line with the OpenAI API, for text-generation Inference Endpoints.
posted an update 18 days ago
Understanding the JSON response format with HF's Serverless Inference API 🤗

As it stands, there seems to be an inconsistency with the OpenAI documentation regarding the implementation of the JSON response format using the InferenceClient chat completion API.

After investigating the InferenceClient source code, I'm sharing the official solution using a JSON Schema. This consolidates the structure of the response and simplifies parsing as part of an automated process for extracting metadata and information:
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

messages = [
    {
        "role": "user",
        "content": "I saw a puppy, a cat, and a raccoon during my bike ride in the park. What did I see, and where?",
    },
]

response_format = {
    "type": "json",
    "value": {
        "properties": {
            "location": {"type": "string"},
            "activity": {"type": "string"},
            "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
            "animals": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["location", "activity", "animals_seen", "animals"],
    },
}

response = client.chat_completion(
    messages=messages,
    response_format=response_format,
    max_tokens=500,
)

print(response.choices[0].message.content)
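Since message.content comes back as a JSON string, one last parsing step turns it into a Python dictionary. The sample string below is a hypothetical response, not actual model output:

```python
import json

# Hypothetical schema-constrained output; a real call returns a string
# shaped by the response_format defined above.
content = '{"location": "park", "activity": "bike ride", "animals_seen": 3, "animals": ["puppy", "cat", "raccoon"]}'

data = json.loads(content)  # dict ready for downstream metadata extraction
```

In an automated pipeline, wrapping json.loads in a try/except guards against the rare malformed generation.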

As a reminder, JSON mode is activated with the OpenAI client as follows:
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[...],
    response_format={"type": "json_object"},
)

One question remains unanswered, however, and will perhaps be answered by the community: an incompatibility seems to persist for the generation of lists of dictionaries, and currently, the production of flat dictionaries seems to be the only functional option.
posted an update about 1 month ago
🚀 RAGoon is now available on PyPI, GitHub, and as a Space on Hugging Face for batched embeddings generation 🤗

RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

At this stage, 5 major classes are available via RAGoon to facilitate:
- the production of chained embeddings across several models, to simplify a continuous deployment process;
- the production of LLM requests for web querying and content retrieval via the Google API;
- recursive chunking by tokens;
- data visualization, with a function to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D graph;
- the creation of binary indexes for search with scalar (int8) rescoring.
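The recursive chunking idea, for instance, can be sketched as follows. This is a naive version that counts tokens by whitespace splitting; RAGoon's actual implementation and API may differ:

```python
def recursive_chunk(text, max_tokens, separators=("\n\n", ". ", " ")):
    """Split text on progressively finer separators until each chunk
    fits the token budget. Whitespace splitting stands in for a real
    tokenizer here."""
    if len(text.split()) <= max_tokens or not separators:
        return [text.strip()]
    chunks = []
    for part in text.split(separators[0]):
        if part.strip():
            chunks.extend(recursive_chunk(part, max_tokens, separators[1:]))
    return chunks

chunks = recursive_chunk("one two three. four five six. seven", max_tokens=4)
```

Splitting first on paragraph breaks, then sentences, then words keeps chunks semantically coherent for as long as the budget allows.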

Link to GitHub: https://github.com/louisbrulenaudet/ragoon
Link to the 🤗 Space: louisbrulenaudet/ragoon
replied to severo's post about 2 months ago

That's fantastic, but could we go one step further and make it possible to follow batches of people automatically using a button, as is now possible on X?

posted an update 2 months ago
You can now find OBIS, the Ocean Biodiversity Information System, on Hugging Face with 128M rows, streamable via the Datasets package 🤗

The datasets are integrated, allowing seamless search and mapping by species name, higher taxonomic level, geographic area, depth, time, and environmental parameters. OBIS originates from the Census of Marine Life (2000-2010) and was adopted as a project under IOC-UNESCO's International Oceanographic Data and Information Exchange (IODE) programme in 2009.

Collectively, they have provided over 45 million observations of nearly 120,000 marine species, ranging from bacteria to whales, from the surface to 10,900 meters depth, and from the tropics to the poles.

Link to the dataset: louisbrulenaudet/obis
posted an update 2 months ago
Introducing the first two projects on the HFforLegal community: the 'Laws' dataset and the associated search tool based on @nreimers and @tomaarsen 's Sentence Transformers library 🤗

The objective of these two tools is to centralize in a single format a set of rules from different countries and legal systems in order to facilitate NLP in the field of comparative law, enabling more accurate and comprehensive legal analysis across different jurisdictions 🌍

Link to the dataset: HFforLegal/laws
Link to the space: HFforLegal/laws-retrieval

We need your contributions to enrich this new knowledge base, and you will find in the 'Laws' dataset all the information you need to format your data and submit it to the appropriate split.
posted an update 3 months ago
Announcing the creation of the "HF for Legal" organization, an open-source community dedicated to demystifying language models for legal professionals 🤗

Whether you're a practicing attorney, a legal scholar, or a technologist interested in legal applications of AI, HF for Legal may be your hub for exploration, learning, and free innovation ⚗️

On the occasion of this launch, you'll be able to find several notebooks I've been developing over the last few months: TSDAE pre-training of embedding models, the generation of indexes for semantic search (based on the formidable work of @tomaarsen and @nreimers , adapted to the field of French law), and the addition of information retrieval tasks to the MTEB.

Join us in our mission to make AI more accessible and understandable for the legal world, ensuring that the power of language models can be harnessed effectively and ethically.

Link to the org: https://huggingface.co/HFforLegal

Special thanks to @clem for encouraging me to start this organization. Let's hope we can bring together all the enthusiasts who work in this field.

Let's code and share together! 🚀🔗
posted an update 3 months ago
I am delighted to announce the publication of my LegalKit, a French labeled dataset built for legal ML training 🤗

This dataset comprises over 50,000 query-document pairs curated for training sentence embedding models within the domain of French law.

The labeling process follows a systematic approach to ensure consistency and relevance:
- Initial Query Generation: Three instances of the LLaMA-3-70B model independently generate three different queries based on the same document.
- Selection of Optimal Query: A fourth instance of the LLaMA-3-70B model, using a dedicated selection prompt, evaluates the generated queries and selects the most suitable one.
- Final Label Assignment: The chosen query is used to label the document, aiming to ensure that the label accurately reflects the content and context of the original text.
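The three-generate / one-select loop can be sketched as follows. The generate and select_best functions below are hypothetical stubs standing in for the LLaMA-3-70B calls and the dedicated selection prompt, and the queries are placeholder text:

```python
# Stub for the three generation instances; a real run would call
# LLaMA-3-70B with sampling enabled (queries here are placeholders).
def generate(document, seed):
    canned = [
        "Quelles sont les conditions de la residence fiscale ?",
        "Comment la residence fiscale est-elle definie ?",
        "Qui est considere comme resident fiscal ?",
    ]
    return canned[seed % len(canned)]

# Stub for the fourth, selecting instance; picking the longest query is
# a deterministic placeholder for the dedicated selection prompt.
def select_best(document, queries):
    return max(queries, key=len)

def label(document):
    queries = [generate(document, seed=i) for i in range(3)]
    return {"document": document, "query": select_best(document, queries)}

example = label("Article 4 B du Code general des impots")
```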

Dataset: louisbrulenaudet/legalkit

Stay tuned for further updates and release information 🔥

@clem , if we can create an "HF for Legal" organization, similar to what exists for journalists, I am available!

Note: My special thanks to @alvdansen for their illustration models ❤️
replied to their post 3 months ago

Hi Julius,

The error message indicates that the token used is invalid. Perhaps the explanation lies in the model used: have you accepted its license (I'm thinking in particular of the Llama models), and if so, have you configured a read-access key in your settings?

posted an update 3 months ago
Mixtral or Llama 70B on Google Spreadsheet thanks to Hugging Face's Serverless Inference API 🤗

The add-on is now available on the HF repo "Journalists on Hugging Face" and allows rapid generation of synthetic data, automatic translation, question answering, and more from simple spreadsheet cells 🖥️

Link to the 🤗 Space: JournalistsonHF/huggingface-on-sheets

Although this tool was initially developed for journalists, it actually finds a much wider audience among daily users of the Google suite, and many use cases remain to be explored.

Only a free Hugging Face API key is required to start using this no-code extension.

Do not hesitate to submit ideas for features that we could add!

Thanks to @fdaudens for initiating this development.
replied to fdaudens's post 3 months ago

Wonderful, I love the demo! Is there already a GitHub repo for the project?

Thanks a lot and have a nice day.

posted an update 4 months ago
I've just open-sourced RAGoon, a small utility I use to integrate knowledge from the web into LLM inference, based on Groq speed and pure Google search performance ⚡

RAGoon is a Python library available on PyPI that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, parallel web scraping, and data augmentation techniques. It offers an integration of various APIs (OpenAI, Groq), enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.
from groq import Groq
# from openai import OpenAI
from ragoon import RAGoon

# Initialize RAGoon instance
ragoon = RAGoon(
    google_api_key="your_google_api_key",
    google_cx="your_google_cx",
    completion_client=Groq(api_key="your_groq_api_key")
)

# Search and get results
query = "I want to do a left join in python polars"
results = ragoon.search(
    query=query,
    completion_model="Llama3-70b-8192",
)

# Print list of results
print(results)

For the time being, this project remains simple, but can easily be integrated into a RAG pipeline.

Link to GitHub: https://github.com/louisbrulenaudet/ragoon
posted an update 4 months ago
Integrating the French Taxation Embedding Benchmark Task (beta) into the MTEB 🤗

I'm excited to announce an integration of the French Taxation Embedding Benchmark task into the Massive Text Embedding Benchmark (MTEB).

This addition expands the diverse set of tasks available within MTEB, enabling researchers and practitioners to develop and evaluate retrieval models focused on retrieving relevant tax articles or content based on provided queries.

Link to the 🤗 Dataset: louisbrulenaudet/tax-retrieval-benchmark

Link to the GitHub repo: https://github.com/louisbrulenaudet/tax-retrieval-benchmark

Notes:
The Massive Text Embedding Benchmark for French Taxation and the Dataset are currently in beta and may not be suitable for direct use in production. The size of the Dataset may not be sufficient to handle a wide range of queries and scenarios encountered in real-world settings.

As the Dataset grows and matures, I will provide updates and guidance on its suitability for production use cases.
posted an update 5 months ago
LegalKit Retrieval, a binary search with scalar (int8) rescoring through the French legal codes, is now available as a 🤗 Space.

This process is designed to be memory efficient and fast, with the binary index being small enough to fit in memory and the int8 index being loaded as a view. Additionally, the binary index is much faster (up to 32x) to search than the float32 index, while the rescoring is also extremely efficient.
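As a rough illustration of the mechanics, here is a NumPy sketch assuming sign-based binarization and symmetric int8 quantization; it is not the Space's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 256)).astype(np.float32)  # float32 embeddings
query = rng.normal(size=(256,)).astype(np.float32)

# Binary index: keep only the sign of each dimension (32x smaller).
corpus_bin = np.packbits(corpus > 0, axis=1)  # (1000, 32) uint8
query_bin = np.packbits(query > 0)            # (32,) uint8

# Fast coarse search: rank the corpus by Hamming distance to the query.
hamming = np.unpackbits(corpus_bin ^ query_bin, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:50]

# Rescoring: int8-quantized candidates against the float32 query.
scale = 127.0 / np.abs(corpus).max()
corpus_int8 = np.round(corpus * scale).astype(np.int8)
scores = corpus_int8[candidates].astype(np.float32) @ query
top_10 = candidates[np.argsort(-scores)[:10]]
```

The binary index stays small enough for memory, the Hamming pass narrows the search cheaply, and only the short candidate list pays the cost of a dot-product rescoring.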

This Space also showcases tsdae-lemone-mbert-base, a BERT-based sentence embedding model fitted using a Transformer-based Sequential Denoising Auto-Encoder for unsupervised sentence embedding learning, with one objective: French legal domain adaptation.

Link to the 🤗 Space: louisbrulenaudet/legalkit-retrieval

Notes:
The SentenceTransformer model currently in use is in beta and may not be suitable for direct use in production.
posted an update 6 months ago
To date, louisbrulenaudet/Maxine-34B-stock is the "Best 🤝 base merges and moerges model of around 30B" on the Open LLM Leaderboard ❤️‍🔥

It is a practical application of the model stock method recently implemented by @arcee-ai in MergeKit:
models:
    - model: ConvexAI/Luminex-34B-v0.2
    - model: fblgit/UNA-34BeagleSimpleMath-32K-v1
merge_method: model_stock
base_model: abacusai/Smaug-34B-v0.1
dtype: bfloat16

Model: louisbrulenaudet/Maxine-34B-stock
LLM Leaderboard best models ❤️‍🔥 Collection: open-llm-leaderboard/llm-leaderboard-best-models-652d6c7965a4619fb5c27a03
replied to victor's post 7 months ago