This repository is focused on generating synthetic datasets using large language models (LLMs) for training specialized non-LLM models like Sentence Transformers. This approach is particularly useful in domains where data scarcity is a barrier to developing robust models.
Using LLMs for synthetic data generation opens new possibilities across NLP tasks, including text classification and sentiment analysis. This folder provides scripts, tutorials, and references to help you generate, manage, and use synthetic datasets effectively.
This Jupyter notebook serves as a detailed tutorial for generating and using synthetic data to fine-tune Sentence Transformer models. The notebook outlines:
- The motivation for fine-tuning Sentence Transformers in specific domains or for particular tasks.
- Steps to generate synthetic datasets using the `distilabel` library and a custom LLM.
- Practical guidance on setting up and running a data generation pipeline, including library installations and using Hugging Face Inference Endpoints.
This resource is ideal for those looking to customize embeddings for unique domains or tasks where conventional models may fall short.
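To give a feel for what the notebook builds, here is a minimal sketch of a `distilabel` generation pipeline. It assumes the distilabel 1.x API; the model id, dataset repo, and column mapping are illustrative placeholders rather than the notebook's exact values.

```python
# Minimal sketch of a distilabel generation pipeline (assumes the 1.x API).
# The repo id, model id, and column names are placeholders, not this repo's values.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-data") as pipeline:
    # Load seed examples from the Hub and expose them under the "instruction"
    # column that the TextGeneration task expects.
    load = LoadDataFromHub(name="load_dataset", output_mappings={"prompt": "instruction"})
    generate = TextGeneration(
        name="generate",
        llm=InferenceEndpointsLLM(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={"load_dataset": {"repo_id": "my-org/seed-data", "split": "train"}},
    )
```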
This is a custom LLM (see https://distilabel.argilla.io/latest/sections/learn/tutorial/llm/#defining-custom-llms). It adds a `grammar` option to the existing Hugging Face Inference Endpoints LLM from the `distilabel` library. The grammar option lets you define rules that constrain the generated text, making it easier to create synthetic datasets with a specific structure or format. This feature is expected to be added in a future release of the `distilabel` library, so you don't need to worry too much about this file.
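To illustrate what the grammar option buys you, here is a hypothetical sketch at the `huggingface_hub` level, which is the layer the custom LLM wraps. It assumes a Text Generation Inference backend with grammar support and a recent `huggingface_hub`; the endpoint URL, prompt, and JSON schema are placeholders.

```python
# Sketch of grammar-constrained generation against a TGI-backed Inference
# Endpoint. The endpoint URL and schema below are placeholders, not values
# from this repository.
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://your-endpoint.endpoints.huggingface.cloud")

# Constrain the output to a JSON object with "positive" and "negative" fields.
schema = {
    "type": "object",
    "properties": {
        "positive": {"type": "string"},
        "negative": {"type": "string"},
    },
    "required": ["positive", "negative"],
}

result = client.text_generation(
    "Given the anchor sentence '...', write one semantically similar (positive) "
    "and one dissimilar (negative) sentence as JSON.",
    max_new_tokens=256,
    grammar={"type": "json", "value": schema},
)
print(result)  # a JSON string that conforms to the schema
```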
This Python script sets up and executes a custom pipeline for generating synthetic datasets suitable for training Sentence Transformers. Key functionalities include:
- Loading datasets from the Hugging Face Hub and processing them.
- Generating text using a custom LLM with grammar rules defined in `custom_llm.py`.
- Mining hard negatives and managing dataset columns for efficient data usage (see the sketch after this list).
- Pushing the processed data back to the Hugging Face Hub.
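For the mining and upload steps, here is a hedged sketch using the `mine_hard_negatives` utility from `sentence-transformers` (available in recent releases, 3.1+); the dataset ids and embedding model are placeholders, and the actual script may mine negatives differently.

```python
# Hypothetical sketch of hard-negative mining and pushing results to the Hub.
# Assumes sentence-transformers >= 3.1; dataset ids and model are placeholders.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Synthetic (anchor, positive) pairs produced by the generation pipeline.
pairs = load_dataset("my-org/synthetic-pairs", split="train")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# For each pair, find similar-but-wrong candidates to use as hard negatives.
triplets = mine_hard_negatives(
    pairs,
    model,
    num_negatives=1,   # one hard negative per pair -> (anchor, positive, negative)
    range_min=10,      # skip the top-ranked candidates, which are often false negatives
    range_max=50,      # only sample negatives from the next 40 candidates
)
triplets.push_to_hub("my-org/synthetic-triplets", private=True)
```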
This Jupyter notebook outlines the process of training Sentence Transformer models using the synthetic datasets generated by the scripts in this repository. Key aspects include:
- Loading and preparing the synthetic dataset for training.
- Setting up the Sentence Transformer model architecture.
- Configuring training parameters and initiating the training process (see the sketch after this list).
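In terms of code, the training loop follows the standard `sentence-transformers` v3 trainer API, roughly as sketched below; the dataset id, base model, and hyperparameters are placeholders rather than the notebook's exact settings.

```python
# Sketch of fine-tuning with the sentence-transformers v3 trainer API.
# Dataset id, base model, and hyperparameters are placeholders.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# (anchor, positive, negative) triplets generated by the pipeline above.
train_dataset = load_dataset("my-org/synthetic-triplets", split="train")
model = SentenceTransformer("microsoft/mpnet-base")  # base encoder to fine-tune

args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-synthetic",
    num_train_epochs=1,
    per_device_train_batch_size=32,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # in-batch negatives + mined hard negatives
)
trainer.train()
```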
- Hugging Face Collection: an example of the outputs generated by the scripts in this repository.
- Training Sentence Transformers with Synthetic Data: documentation on training Sentence Transformers with synthetic data.
- Creating Synthetic Similarity Datasets: a blog post on generating synthetic data for training Sentence Transformers.