Two-stage encoder-based solution for paraphrase detection; achieved 1st rank in the CRI-COMP-2022 Text Paraphrase Competition (leaderboard page)
This repository contains a two-stage solution for the CRI-Competition 2022 using encoders. The first stage uses a fine-tuned bi-encoder to generate sparse candidate pairs. The second stage uses a pretrained cross-encoder to generate the final dense prediction pairs.
The idea is that cross-encoders give better predictions/performance on the sentence similarity task; however, with the provided dataset of around 100k samples, scoring every possible pair with a cross-encoder would mean around 10B comparisons, which is simply impractical. We therefore first generate sparse candidate pairs with a much more efficient bi-encoder (fine-tuned in our case) using the sentence-transformers library along with a custom post-processing function, and then generate the dense predictions using a pretrained cross-encoder and a final post-processing step.
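For context, with n ≈ 100,000 sentences there are n² = 10¹⁰ ≈ 10B ordered pairs, hence the retrieve-and-rerank design. Below is a minimal sketch of that pattern with the sentence-transformers library; the tiny corpus, the `top_k` value, and the candidate filtering are illustrative only, and the hub id `cross-encoder/stsb-roberta-large` is assumed to correspond to the cross-encoder named later in this README.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]

# Stage 1 (cheap): the bi-encoder embeds each sentence once (O(n) forward
# passes), so likely paraphrases are found by fast cosine-similarity search
# rather than by running an expensive model on all O(n^2) pairs.
bi_encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
embeddings = bi_encoder.encode(sentences, convert_to_tensor=True)
hits = util.semantic_search(embeddings, embeddings, top_k=2)

# Keep each sentence's best non-self match as a sparse candidate pair.
candidates = []
for i, sentence_hits in enumerate(hits):
    for hit in sentence_hits:
        if hit["corpus_id"] != i:
            candidates.append((i, hit["corpus_id"]))
            break

# Stage 2 (accurate): the cross-encoder reads both sentences of each
# candidate jointly, which is slower per pair but scores more reliably.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
scores = cross_encoder.predict([(sentences[i], sentences[j]) for i, j in candidates])
print(list(zip(candidates, scores)))
```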
- I used the pre-trained sentence-transformer model `paraphrase-multilingual-mpnet-base-v2`
- Fine-tuned it for 2 epochs on the training dataset with some target smoothing to avoid overfitting
- Used the fine-tuned model to extract embeddings for each sentence in the validation dataset
- Used cosine similarity to compare the embeddings and find the closest pairs
- Applied threshold filtering tuned on the validation dataset to cover most of the sentences (high recall) rather than to maximize F1-score (raising the threshold would increase precision but lower recall, leaving too few candidate pairs for stage 2)
- As there can only be one pair per unique sentence ID, I kept, for each ID, the pair with the maximum confidence score generated by the model (a sketch of this stage follows the list)
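A minimal sketch of the stage-1 fine-tuning and sparse candidate generation described above, using the sentence-transformers training API. The smoothed labels (0.9/0.1), batch size, warmup steps, similarity threshold, and toy data are assumptions for illustration; the actual values live in `stage_1_training.py` and `stage_1_inference.py`.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Target smoothing: soften the hard 0/1 paraphrase labels so the cosine
# regression does not overfit (0.9/0.1 are assumed smoothing values).
train_examples = [
    InputExample(texts=["sentence a", "a paraphrase of sentence a"], label=0.9),
    InputExample(texts=["sentence a", "a completely unrelated sentence"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=2, warmup_steps=100)

# Inference: embed every validation sentence once, score pairs by cosine
# similarity, and keep high-recall candidates above the tuned threshold.
sentences = ["sentence a", "a paraphrase of sentence a", "a completely unrelated sentence"]
embeddings = model.encode(sentences, convert_to_tensor=True)
cos_scores = util.cos_sim(embeddings, embeddings)

threshold = 0.75  # assumed value; tuned on the validation set for high recall
best = {}  # best-scoring candidate pair per sentence index
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cos_scores[i][j].item()
        if score >= threshold:
            for sid in (i, j):
                if sid not in best or score > best[sid][1]:
                    best[sid] = ((i, j), score)
sparse_pairs = sorted({pair for pair, _ in best.values()})
```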
- I used the pre-trained cross-encoder model `stsb-roberta-large`
- The cross-encoder takes a sentence pair directly as input and outputs a confidence score for the pair
- Used the intermediate pairs generated in stage 1 to produce the dense predictions with the cross-encoder
- Applied thresholding to keep only the pairs with a high confidence score
- Kept the maximum-likelihood pair per unique sentence ID
- Finally, one extra post-processing condition is added to resolve chains: if score(id1, id2) > score(id2, id3), include the (id1, id2) pair; otherwise include the (id2, id3) pair (see the sketch after this list)
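A hedged sketch of the stage-2 re-scoring and post-processing described above. The 0.66 threshold is taken from the result filename used later in this README, the hub id `cross-encoder/stsb-roberta-large` is assumed, and the greedy chain resolution is one plausible rendering of the score(id1, id2) > score(id2, id3) rule, not the repo's exact code.

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")  # hub id assumed

def dense_predictions(sparse_pairs, text_by_id, threshold=0.66):
    """Re-score stage-1 candidate pairs and apply the stage-2 post-processing."""
    scores = cross_encoder.predict(
        [(text_by_id[i], text_by_id[j]) for i, j in sparse_pairs]
    )

    # Thresholding: keep only the pairs with a high confidence score.
    survivors = [(p, s) for p, s in zip(sparse_pairs, scores) if s >= threshold]

    # One pair per unique sentence ID, resolving chains greedily by score:
    # if score(id1, id2) > score(id2, id3), then (id1, id2) claims id2 first
    # and (id2, id3) is dropped, matching the rule described above.
    taken, final = set(), []
    for (i, j), s in sorted(survivors, key=lambda x: x[1], reverse=True):
        if i not in taken and j not in taken:
            taken.update((i, j))
            final.append(((i, j), s))
    return final
```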
- Install the required libraries using `pip install -r requirements.txt`
- Put the required dataset files inside the `data/` directory (includes `indices`, `validation_text`, `training_text`, `training/info`, `validation_eval_new.csv`, etc.)
- Run the training script: `python stage_1_training.py`
- Update the fine-tuned bi-encoder model path inside the `stage_1_inference.py` file
- Generate the sparse predictions using: `python stage_1_inference.py`
- Generate the final dense predictions using: `python stage_2_inference.py`
- Compare the results using the evaluation script (a sketch of the metric computation follows this list): `python evaluation.py --ans data/validation_eval_new.csv --evl cross_enc_result_0.66.csv`
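For reference, a hedged sketch of what a pair-level evaluation like `evaluation.py` presumably computes; the CSV column names (`id1`, `id2`) and the exact metric definitions are assumptions, not the repo's actual code.

```python
import argparse
import pandas as pd

def load_pairs(path):
    # Store each pair as a frozenset so (id1, id2) and (id2, id1) match.
    df = pd.read_csv(path)
    return {frozenset(p) for p in zip(df["id1"], df["id2"])}  # column names assumed

parser = argparse.ArgumentParser()
parser.add_argument("--ans", required=True, help="ground-truth pairs CSV")
parser.add_argument("--evl", required=True, help="predicted pairs CSV")
args = parser.parse_args()

truth, pred = load_pairs(args.ans), load_pairs(args.evl)
true_positives = len(truth & pred)
recall = true_positives / len(truth)
precision = true_positives / len(pred)
f1 = 2 * precision * recall / (precision + recall)
print(f"Recall: {recall}\nPrecision: {precision}\nF1-score: {f1}")
```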
- The results are not always exactly the same due to some randomness introduced in the stage-1 training sample generation. However, after a few experiments you should be able to achieve a good score.
- The results are as follows (from a random run, not my best):
  - Recall: 0.7294955122253173
  - Precision: 0.8105226960110041
  - F1-score: 0.7678775044795569
This still achieves the best score on the leaderboard, with rank 1.
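As a quick sanity check (not part of the repo), the reported F1-score is indeed the harmonic mean of the precision and recall above:

```python
p, r = 0.8105226960110041, 0.7294955122253173
print(2 * p * r / (p + r))  # ≈ 0.76788, matching the reported F1-score
```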