Skip to content

Latest commit

 

History

History
244 lines (206 loc) · 12.5 KB

File metadata and controls

244 lines (206 loc) · 12.5 KB

Fuzzy Dedup

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Contributors

Description

The fdedup transform eliminates documents that are highly similar to each other (but not necessarily identical) from a set of Parquet files. This ensures that the resulting dataset contains only unique or sufficiently distinct entries.

Fuzzy dedup is a complex process made up of a pipeline that performs four main steps:

  1. Signature Calculation: creates a set of minhashes for each document, and uses them to create band signatures for the document.
  2. Cluster Analysis: groups documents into clusters based on matching band signatures. Within each cluster, it retains only the documents that have a Jaccard similarity above a specified threshold, and it identifies which documents to keep as unique and which ones to mark as duplicates.
  3. Duplicate List Generation: combines the similarity clusters identified in each band to create a single, unified list of duplicate documents.
  4. Data Cleaning: processes the documents by either filtering out duplicates or adding annotations to distinguish duplicates from non-duplicates.

Each one of these steps is described in more detail below.

Signature Calculation

This transform computes num_permutations minhashes and num_bands signatures for each document in the dataset, by following these processing steps:

  1. Shingle Generation: create a set of character or word shingles, using a specified window length. Character shingles are more effective at detecting similar documents, but require more computational resources compared to word shingles.
  2. Minhash Calculation: using the shingles as input, compute num_permutations minhashes for each document.
  3. Band Signature Calculation: divide the minhashes into num_bands, where each band contains num_minhashes_per_band minhashes. For each document, generate a unique signature for every band.

The values for num_bands and num_minhashes_per_band determine the likelihood that documents with a certain Jaccard similarity will be marked as duplicates. A Jupyter notebook in the utils folder generates a graph of this probability function, helping users explore how different settings for num_bands and num_minhashes_per_band impact the deduplication process.

To help distribute the workload and speed up processing of the next steps, the hash space of each band is divided into num_segments segments. The band signatures, the minhashes, the document ids, and lengths are stored in an organized output folder structure bands/band=b/segment=s, where b is the band number and s is the segment number.

Cluster Analysis

This transform leverages segmented processing to analyze the data generated by the Signature Calculation step efficiently and in parallel. Each worker processes a specific segment s of a band b by loading and analyzing all Parquet files from the folder bands/band=b/segment=s. Each row in the Parquet files contains, for a document:

  • band_hash, the document's band signature, and
  • data, a structure with three fields: the unique document_id, document's minhashes, and document_size.

The transform runs the following processing steps:

  1. Data Loading: combine into a single dataframe all Parquet files in bands/band=b/segment=s.
  2. Clustering: run a group_by operation on the band_hash column that will group documents with the same band signature into clusters.
  3. Similarity Analysis: for each cluster, calculate Jaccard similarity between pairs of documents using their minhashes, and move documents below the specified Jaccard similarity threshold into new clusters.
  4. Duplicate Identification: in clusters with more than one document remaining, retain the largest document with the smallest document id, and mark as duplicates all other documents in the cluster.
  5. Persist Results: save the duplicate clusters in a file.

Duplicate List Generation

The Cluster Analysis step identifies duplicates across multiple bands, meaning a document can be marked as a duplicate in one or more bands (e.g., if two documents are identical, one will be marked as a duplicate in all bands). This transform consolidates all duplicate information from each band segment into a single file, providing a unified record of duplicates detected across the dataset.

Data Cleaning

This transform processes the original dataset using the list of duplicate documents generated by the Duplicate List Generation step. It imports each file in the original dataset into a table and produces a new dataset. The directory structure of the input dataset is preserved, but the contents of the output files depend on the selected operating mode:

  1. Annotate - add a new duplicate column to the dataset, that contains a d for documents marked as duplicates, and is empty for non-duplicates
  2. Filter duplicates - removes all documents identified as duplicates from the dataset.
  3. Filter non-duplicates - removes from the dataset all documents that were not marked as duplicates, leaving only the duplicates.

The output dataset reflects the selected mode, providing flexibility for downstream processing.

Input Columns Used by This Transform

Input Column Name Data Type Description
Column specified by the contents_column configuration argument str Column that stores document text
Column specified by the document_id_column configuration argument int64 Column that stores document ID

Output Columns Annotated by This Transform

Output Column Name Data Type Description
duplicate str Column added if fuzzy dedup runs in 'annotate' mode. Value is 'd' for duplicate documents, empty for non-duplicates

Configuration and Usage

Fuzzy Deduplication Transform

The set of dictionary keys holding Fuzzy Dedup configuration for values are as follows:

--input_folder INPUT_FOLDER
                    Input folder path
--output_folder OUTPUT_FOLDER
                    Output folder path
--operation_mode {filter_duplicates,filter_non_duplicates,annotate}
                    operation mode for data cleanup: filter out duplicates/non-duplicates, or annotate duplicate documents
--contents_column CONTENTS_COLUMN
                    name of the column that stores document text
--document_id_column DOCUMENT_ID_COLUMN
                    name of the column that stores document ID
--seed SEED         seed of the random number generator
--num_permutations NUM_PERMUTATIONS
                    number of permutations to use for minhash calculation
--num_bands NUM_BANDS
                    number of bands to use for band hash calculation
--num_minhashes_per_band NUM_MINHASHES_PER_BAND
                    number of minhashes to use in each band
--word_shingle_size WORD_SHINGLE_SIZE
                    number of words included in one shingle
--jaccard_similarity_threshold JACCARD_SIMILARITY_THRESHOLD
                    jaccard similarity threshold above which two documents are similar
--num_segments NUM_SEGMENTS
                    the number of segments dividing the hashing space for each band (for scalability)
--duplicate_list_location DUPLICATE_LIST_LOCATION
                    path to the file with all the duplicate document ids
--services SERVICES   Comma-separated list of services to run (e.g., SignatureCalculation,ClusterAnalysis,GetDuplicateList,DataCleaning)
--use_s3 USE_S3       use s3
--s3_cred S3_CRED     ast string of options for s3 credentials
--shingle_option SHINGLE_OPTION
                    Option used for shingling

Signature Calculation Transform

The set of dictionary keys holding SignatureCalcTransform configuration for values are as follows:

--minhash_document_id_column MINHASH_DOCUMENT_ID_COLUMN
                    name of the column storing the unique ID assigned to each document
--minhash_contents_column MINHASH_CONTENTS_COLUMN
                    name of the column storing the contents of each document
--minhash_seed MINHASH_SEED
                    the seed used to instantiate the random number generator
--minhash_num_permutations MINHASH_NUM_PERMUTATIONS
                    number of permutations (minhashes) calculated for each document
--minhash_word_shingle_size MINHASH_WORD_SHINGLE_SIZE
                    the size of the word shingles calculated for each document
--minhash_num_bands MINHASH_NUM_BANDS
                    the number of bands to use in the banding technique
--minhash_num_minhashes_per_band MINHASH_NUM_MINHASHES_PER_BAND
                    the number of minhashes to use in each band
--minhash_num_segments MINHASH_NUM_SEGMENTS
                    the number of segments across which we divide the hashing space for each band
--minhash_shingle_option MINHASH_SHINGLE_OPTION
                    Shingling option ('word' or 'char')

Cluster Analysis Transform

The set of dictionary keys holding ClusterAnalysisTransform configuration for values are as follows:

--cluster_jaccard_similarity_threshold CLUSTER_JACCARD_SIMILARITY_THRESHOLD
                      Jaccard similarity threshold above which two documents are duplicates
--cluster_num_bands CLUSTER_NUM_BANDS
                      The number of bands used in the banding technique
--cluster_num_segments CLUSTER_NUM_SEGMENTS
                      The number of segments dividing the hashing space for each band

Get Duplicates List Transform

This transform currently has no configuration parameters.

Data Cleaning Transform

The set of dictionary keys holding DataCleaningTransform configuration for values are as follows:

  --fdclean_document_id_column FDCLEAN_DOCUMENT_ID_COLUMN
                        name of the column storing the unique ID assigned to each document
  --fdclean_operation_mode {filter_duplicates,filter_non_duplicates,annotate}
                        operation mode: filter out duplicates/non-duplicates, or annotate duplicate documents

Running the samples

To run the samples, use the following make target to create a virtual environment:

make venv

Subsequently, the main orchestration program can run with:

source venv/bin/activate
cd src
python fdedup_transform_python.py

Alternatively the transforms included in fuzzy dedup can be launched independently:

source venv/bin/activate
cd src
python signature_calc_local_python.py
python cluster_analysis_local_python.py
python get_duplicate_list_local_python.py
python data_cleaning_local_python.py

After running the transforms, execute:

ls output

To see results of the transform.

Code example

TBD (link to the notebook will be provided)

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.

Testing

For testing fuzzy deduplication in a pure python runtime, use the following make targets. To launch integration tests for all the component transforms of fuzzy dedup (signature calculation, cluster analysis, get duplicate list and data cleaning) use:

make test-src

To test the creation of the Docker image for fuzzy dedup transform and the capability to run a local program inside that image, use:

make test-image

Further Resources

The following is a list of references to research articles and github repositories that inspired the module's design:

  1. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, Chapter 3: Finding Similar Items
  2. G Penedo et al., The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  3. Datatrove github repo