Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
- Nelson Bore ([email protected])
- Constantin Adam ([email protected])
The fdedup transform eliminates documents that are highly similar to each other (but not necessarily identical) from a set of Parquet files. This ensures that the resulting dataset contains only unique or sufficiently distinct entries.
Fuzzy dedup is a complex process made up of a pipeline that performs four main steps:
- Signature Calculation: creates a set of minhashes for each document, and uses them to create band signatures for the document.
- Cluster Analysis: groups documents into clusters based on matching band signatures. Within each cluster, it retains only the documents that have a Jaccard similarity above a specified threshold, and it identifies which documents to keep as unique and which ones to mark as duplicates.
- Duplicate List Generation: combines the similarity clusters identified in each band to create a single, unified list of duplicate documents.
- Data Cleaning: processes the documents by either filtering out duplicates or adding annotations to distinguish duplicates from non-duplicates.
Each one of these steps is described in more detail below.
This transform computes num_permutations
minhashes and num_bands
signatures for each document in the dataset, by
following these processing steps:
- Shingle Generation: create a set of character or word shingles, using a specified window length. Character shingles are more effective at detecting similar documents, but require more computational resources compared to word shingles.
- Minhash Calculation: using the shingles as input, compute
num_permutations
minhashes for each document. - Band Signature Calculation: divide the minhashes into
num_bands
, where each band containsnum_minhashes_per_band
minhashes. For each document, generate a unique signature for every band.
The values for num_bands
and num_minhashes_per_band
determine the likelihood that documents with a certain Jaccard
similarity will be marked as duplicates. A Jupyter notebook in the utils folder generates a graph of this
probability function, helping users explore how different settings for num_bands
and num_minhashes_per_band
impact
the deduplication process.
To help distribute the workload and speed up processing of the next steps, the hash space of each band is divided into
num_segments
segments. The band signatures, the minhashes, the document ids, and lengths are stored in an organized
output folder structure bands/band=b/segment=s
, where b
is the band number and s
is the segment number.
This transform leverages segmented processing to analyze the data generated by the Signature Calculation step
efficiently and in parallel. Each worker processes a specific segment s
of a band b
by loading and analyzing all
Parquet files from the folder bands/band=b/segment=s
. Each row in the Parquet files contains, for a document:
band_hash
, the document's band signature, anddata
, a structure with three fields: the uniquedocument_id
, document'sminhashes
, anddocument_size
.
The transform runs the following processing steps:
- Data Loading: combine into a single dataframe all Parquet files in
bands/band=b/segment=s
. - Clustering: run a
group_by
operation on theband_hash
column that will group documents with the same band signature into clusters. - Similarity Analysis: for each cluster, calculate Jaccard similarity between pairs of documents using their minhashes, and move documents below the specified Jaccard similarity threshold into new clusters.
- Duplicate Identification: in clusters with more than one document remaining, retain the largest document with the smallest document id, and mark as duplicates all other documents in the cluster.
- Persist Results: save the duplicate clusters in a file.
The Cluster Analysis step identifies duplicates across multiple bands, meaning a document can be marked as a duplicate in one or more bands (e.g., if two documents are identical, one will be marked as a duplicate in all bands). This transform consolidates all duplicate information from each band segment into a single file, providing a unified record of duplicates detected across the dataset.
This transform processes the original dataset using the list of duplicate documents generated by the Duplicate List Generation step. It imports each file in the original dataset into a table and produces a new dataset. The directory structure of the input dataset is preserved, but the contents of the output files depend on the selected operating mode:
- Annotate - add a new
duplicate
column to the dataset, that contains ad
for documents marked as duplicates, and is empty for non-duplicates - Filter duplicates - removes all documents identified as duplicates from the dataset.
- Filter non-duplicates - removes from the dataset all documents that were not marked as duplicates, leaving only the duplicates.
The output dataset reflects the selected mode, providing flexibility for downstream processing.
Input Column Name | Data Type | Description |
---|---|---|
Column specified by the contents_column configuration argument | str | Column that stores document text |
Column specified by the document_id_column configuration argument | int64 | Column that stores document ID |
Output Column Name | Data Type | Description |
---|---|---|
duplicate | str | Column added if fuzzy dedup runs in 'annotate' mode. Value is 'd' for duplicate documents, empty for non-duplicates |
The set of dictionary keys holding Fuzzy Dedup configuration for values are as follows:
--input_folder INPUT_FOLDER
Input folder path
--output_folder OUTPUT_FOLDER
Output folder path
--operation_mode {filter_duplicates,filter_non_duplicates,annotate}
operation mode for data cleanup: filter out duplicates/non-duplicates, or annotate duplicate documents
--contents_column CONTENTS_COLUMN
name of the column that stores document text
--document_id_column DOCUMENT_ID_COLUMN
name of the column that stores document ID
--seed SEED seed of the random number generator
--num_permutations NUM_PERMUTATIONS
number of permutations to use for minhash calculation
--num_bands NUM_BANDS
number of bands to use for band hash calculation
--num_minhashes_per_band NUM_MINHASHES_PER_BAND
number of minhashes to use in each band
--word_shingle_size WORD_SHINGLE_SIZE
number of words included in one shingle
--jaccard_similarity_threshold JACCARD_SIMILARITY_THRESHOLD
jaccard similarity threshold above which two documents are similar
--num_segments NUM_SEGMENTS
the number of segments dividing the hashing space for each band (for scalability)
--duplicate_list_location DUPLICATE_LIST_LOCATION
path to the file with all the duplicate document ids
--services SERVICES Comma-separated list of services to run (e.g., SignatureCalculation,ClusterAnalysis,GetDuplicateList,DataCleaning)
--use_s3 USE_S3 use s3
--s3_cred S3_CRED ast string of options for s3 credentials
--shingle_option SHINGLE_OPTION
Option used for shingling
The set of dictionary keys holding SignatureCalcTransform configuration for values are as follows:
--minhash_document_id_column MINHASH_DOCUMENT_ID_COLUMN
name of the column storing the unique ID assigned to each document
--minhash_contents_column MINHASH_CONTENTS_COLUMN
name of the column storing the contents of each document
--minhash_seed MINHASH_SEED
the seed used to instantiate the random number generator
--minhash_num_permutations MINHASH_NUM_PERMUTATIONS
number of permutations (minhashes) calculated for each document
--minhash_word_shingle_size MINHASH_WORD_SHINGLE_SIZE
the size of the word shingles calculated for each document
--minhash_num_bands MINHASH_NUM_BANDS
the number of bands to use in the banding technique
--minhash_num_minhashes_per_band MINHASH_NUM_MINHASHES_PER_BAND
the number of minhashes to use in each band
--minhash_num_segments MINHASH_NUM_SEGMENTS
the number of segments across which we divide the hashing space for each band
--minhash_shingle_option MINHASH_SHINGLE_OPTION
Shingling option ('word' or 'char')
The set of dictionary keys holding ClusterAnalysisTransform configuration for values are as follows:
--cluster_jaccard_similarity_threshold CLUSTER_JACCARD_SIMILARITY_THRESHOLD
Jaccard similarity threshold above which two documents are duplicates
--cluster_num_bands CLUSTER_NUM_BANDS
The number of bands used in the banding technique
--cluster_num_segments CLUSTER_NUM_SEGMENTS
The number of segments dividing the hashing space for each band
This transform currently has no configuration parameters.
The set of dictionary keys holding DataCleaningTransform configuration for values are as follows:
--fdclean_document_id_column FDCLEAN_DOCUMENT_ID_COLUMN
name of the column storing the unique ID assigned to each document
--fdclean_operation_mode {filter_duplicates,filter_non_duplicates,annotate}
operation mode: filter out duplicates/non-duplicates, or annotate duplicate documents
To run the samples, use the following make
target to create a virtual environment:
make venv
Subsequently, the main orchestration program can run with:
source venv/bin/activate
cd src
python fdedup_transform_python.py
Alternatively the transforms included in fuzzy dedup can be launched independently:
source venv/bin/activate
cd src
python signature_calc_local_python.py
python cluster_analysis_local_python.py
python get_duplicate_list_local_python.py
python data_cleaning_local_python.py
After running the transforms, execute:
ls output
To see results of the transform.
TBD (link to the notebook will be provided)
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
For testing fuzzy deduplication in a pure python runtime, use the following make
targets. To launch integration tests
for all the component transforms of fuzzy dedup (signature calculation, cluster analysis, get duplicate list and data
cleaning) use:
make test-src
To test the creation of the Docker image for fuzzy dedup transform and the capability to run a local program inside that image, use:
make test-image
The following is a list of references to research articles and github repositories that inspired the module's design: