embedding-datasets

Generating Embedding Data with LLMs

This repository focuses on generating synthetic datasets with large language models (LLMs) to train specialized non-LLM models such as Sentence Transformers. The approach is particularly useful in domains where scarce data is a barrier to building robust models.

Overview

Using LLMs for synthetic data generation opens new possibilities across NLP tasks such as text classification and sentiment analysis. This folder provides scripts, tutorials, and references to help you generate, manage, and use synthetic datasets effectively.

Repository Structure

tutorial.ipynb

This Jupyter notebook serves as a detailed tutorial for generating and using synthetic data to fine-tune Sentence Transformer models. The notebook outlines:

  • The motivation for fine-tuning Sentence Transformers in specific domains or for particular tasks.
  • Steps to generate synthetic datasets using the distilabel library and a custom LLM.
  • Practical guidance on setting up and running a data generation pipeline, including library installations and using Hugging Face Inference Endpoints.

This resource is ideal for those looking to customize embeddings for unique domains or tasks where conventional models may fall short.

custom_llm.py

This file defines a custom LLM class (see https://distilabel.argilla.io/latest/sections/learn/tutorial/llm/#defining-custom-llms). It adds a grammar option to the existing Hugging Face Inference Endpoints LLM from the distilabel library. The grammar option lets users define rules that constrain the generated text, which makes it easier to create synthetic datasets with a specific structure or format. This feature is expected to be added in a future release of the distilabel library, so you should not need to worry much about this file.
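To make the idea of a grammar more concrete, here is a minimal sketch of what such a rule set could look like and how its output might be checked. It does not reproduce custom_llm.py: the `{"type": "json", "value": ...}` shape is an assumption modeled on the JSON-schema grammars used for guided generation, and `GRAMMAR`, `parse_structured_output`, and the `positive`/`negative` field names are hypothetical.

```python
import json
from typing import Optional

# Hypothetical grammar: asks the model for a JSON object with two string fields.
# The {"type": "json", "value": <JSON Schema>} layout is an assumption, not
# taken from custom_llm.py.
GRAMMAR = {
    "type": "json",
    "value": {
        "type": "object",
        "properties": {
            "positive": {"type": "string"},
            "negative": {"type": "string"},
        },
        "required": ["positive", "negative"],
    },
}


def parse_structured_output(raw: str, grammar: dict) -> Optional[dict]:
    """Return the parsed object if every required field is a string, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. output cut off before the JSON was complete
    if not isinstance(data, dict):
        return None
    for key in grammar["value"]["required"]:
        if not isinstance(data.get(key), str):
            return None
    return data


# Usage: a well-formed response parses, a truncated one is rejected.
ok = parse_structured_output('{"positive": "a", "negative": "b"}', GRAMMAR)
bad = parse_structured_output('{"positive": "a", "neg', GRAMMAR)
```

Even with grammar-constrained generation, a defensive check like this can be useful, for instance when a response is cut off by a token limit.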

pipeline.py

This Python script sets up and executes a custom pipeline for generating synthetic datasets suitable for training Sentence Transformers. Key functionalities include:

  • Loading datasets from Hugging Face Hub and processing them.
  • Generating text using a custom LLM with grammar rules defined in custom_llm.py.
  • Mining hard negatives and managing dataset columns for efficient data usage.
  • Pushing the processed data back to the Hugging Face Hub.
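The hard-negative mining step listed above can be illustrated with a small, dependency-free sketch. It is not the logic in pipeline.py: the `anchor`/`positive`/`negative` column names are assumptions, and the bag-of-words `embed` function is a toy stand-in for a real embedding model. The idea is that, for each anchor, the negative is the positive passage from another row that looks most similar to the anchor.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mine_hard_negatives(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Add a 'negative' column: the most anchor-like positive from another row."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to mine negatives")
    vectors = [embed(r["positive"]) for r in rows]
    mined = []
    for i, row in enumerate(rows):
        anchor_vec = embed(row["anchor"])
        # Score every other row's positive against this anchor.
        candidates = [
            (cosine(anchor_vec, vectors[j]), j) for j in range(len(rows)) if j != i
        ]
        _, best = max(candidates)
        mined.append({**row, "negative": rows[best]["positive"]})
    return mined


rows = [
    {"anchor": "how do i reset my password",
     "positive": "use the reset link to set a new password"},
    {"anchor": "how do i change my password hint",
     "positive": "the password hint is under security settings"},
    {"anchor": "what is the refund policy",
     "positive": "refunds are issued within thirty days"},
]
mined = mine_hard_negatives(rows)
```

In a real pipeline the toy `embed` would be replaced by a trained encoder, and the resulting rows would be the ones pushed back to the Hub.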

training.ipynb

This Jupyter notebook outlines the process of training Sentence Transformer models using the synthetic datasets generated by the scripts in this repository. Key aspects include:

  • Loading and preparing the synthetic dataset for training.
  • Setting up the Sentence Transformer model architecture.
  • Configuring training parameters and initiating the training process.
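As a rough illustration of what training on (anchor, positive, negative) data optimizes, the sketch below computes a triplet margin loss on plain Python vectors. The loss actually used in training.ipynb is not stated here, so the triplet objective, the cosine distance, and the `margin` value are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,  # hypothetical value, chosen only for this example
) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(
        0.0,
        cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin,
    )


# Usage: a well-separated triplet incurs no loss, a reversed one is penalized.
well_separated = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
reversed_pair = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Hard negatives matter under an objective like this because easy negatives already satisfy the margin and contribute nothing to the loss.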

Key Resources