Text-Paraphrase-Detection-Winner-Solution

A two-stage, encoder-based solution for paraphrase detection that achieved 1st rank in the CRI-COMP-2022 Text Paraphrase Competition (leaderboard page).
This repository contains the two-stage solution for CRI-Competition 2022. The first stage uses a fine-tuned bi-encoder to generate sparse candidate pairs. The second stage uses a pretrained cross-encoder to rescore those candidates and produce the final dense prediction pairs.

main solution

The idea is that cross-encoders give better performance on sentence similarity tasks. However, the provided dataset has around 100k samples, so scoring every possible pair would mean around 10B comparisons, which is impractical. We therefore first generate sparse prediction pairs with a much more efficient bi-encoder (fine-tuned in our case) using the sentence-transformers library, together with a custom post-processing function. Finally, we generate dense predictions with a pretrained cross-encoder and apply a final post-processing step.
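As a quick check of the pair-count argument, the arithmetic can be written out as below. The figure of 100,000 sentences is the approximate size quoted above, not an exact count.

```python
from math import comb

n_sentences = 100_000  # approximate dataset size (assumption: rounded figure)

# Every (a, b) combination, counting both orders and self-pairs.
ordered_pairs = n_sentences * n_sentences

# Each distinct pair of different sentences counted once.
unordered_pairs = comb(n_sentences, 2)

print(f"ordered:   {ordered_pairs:,}")
print(f"unordered: {unordered_pairs:,}")
```

Even the unordered count is roughly 5B cross-encoder forward passes, which is why a cheaper first stage is needed to narrow the candidates.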


Stage 1: Fine-tuned Bi-Encoder

stage 1

  • I used the pre-trained sentence-transformer model paraphrase-multilingual-mpnet-base-v2
  • Fine-tuned it for 2 epochs on the training dataset, with some target smoothing to avoid overfitting
  • Used the fine-tuned model to extract an embedding for each sentence in the validation dataset
  • Used cosine similarity to compare the embeddings and find the closest pairs
  • Applied threshold filtering, with the threshold chosen on the validation dataset to cover most of the sentences (high recall) rather than to maximize F1-score. The F1-optimal threshold raises precision but lowers recall, and pairs dropped here cannot be recovered in stage 2
  • As each sentence ID can belong to only one pair, I kept the pair with the maximum confidence generated by the model for each ID
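The post-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the toy vectors stand in for bi-encoder embeddings, the function names and the threshold value are hypothetical, and the greedy one-pair-per-ID rule is only one possible reading of the filtering described above. A brute-force loop over all pairs is used for clarity and would be replaced by a batched similarity search at real scale.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(embeddings, threshold):
    """Return (score, id_a, id_b) for all pairs scoring at or above the threshold."""
    scored = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            scored.append((score, id_a, id_b))
    return scored


def one_pair_per_id(scored):
    """Greedily keep the highest-scoring pairs so each ID is used at most once."""
    used, kept = set(), []
    for score, id_a, id_b in sorted(scored, reverse=True):
        if id_a not in used and id_b not in used:
            kept.append((id_a, id_b, score))
            used.update((id_a, id_b))
    return kept


# Toy embeddings standing in for the bi-encoder output (hypothetical values).
embeddings = {
    "s1": [1.0, 0.0],
    "s2": [0.9, 0.1],
    "s3": [0.8, 0.3],
    "s4": [0.0, 1.0],
}

# A deliberately permissive threshold keeps recall high for the next stage.
pairs = one_pair_per_id(candidate_pairs(embeddings, threshold=0.9))
print(pairs)
```

Here three pairs pass the threshold, but s1, s2 and s3 compete for partners, so only the strongest pair (s1, s2) survives the per-ID filter.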

Stage 2: Pretrained Cross Encoder

stage 2 solution

  • I used the pre-trained cross-encoder model stsb-roberta-large
  • A cross-encoder takes a sentence pair directly as input and outputs a confidence score for that pair
  • Used the intermediate pairs from stage 1 to generate dense predictions with the cross-encoder
  • Applied thresholding to keep only the pairs with a high confidence score
  • Kept the maximum-likelihood pair for each unique sentence ID
  • Finally, added one extra post-processing condition: if score(id1,id2) > score(id2,id3), include the (id1,id2) pair; otherwise include the (id2,id3) pair
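A rough sketch of the stage 2 logic follows. To keep it runnable without downloading a model, a token-overlap function stands in for the cross-encoder. With the sentence-transformers library, the scores would instead come from something like `CrossEncoder("cross-encoder/stsb-roberta-large").predict(pairs)`. All function names, the threshold and the example sentences are hypothetical, and the "both IDs must prefer the pair" rule is one interpretation of the final condition above.

```python
def overlap_score(text_a, text_b):
    """Stand-in scorer (assumption): Jaccard overlap of lowercase tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)


def rescore(candidates, texts, score_fn):
    """Attach a pair score to every candidate ID pair from stage 1."""
    return {(a, b): score_fn(texts[a], texts[b]) for a, b in candidates}


def best_pair_per_id(scores, threshold):
    """For each sentence ID, pick its highest-scoring pair at or above the threshold."""
    best = {}
    for pair, score in scores.items():
        if score < threshold:
            continue
        for sid in pair:
            if sid not in best or score > scores[best[sid]]:
                best[sid] = pair
    return best


def resolve_chains(best):
    """Keep a pair only if both of its IDs prefer it.

    For a chain id1-id2-id3 this keeps (id1, id2) when it outscores (id2, id3).
    """
    return sorted(pair for pair in set(best.values())
                  if all(best[sid] == pair for sid in pair))


texts = {
    "s1": "the cat sat on the mat",
    "s2": "a cat sat on the mat",
    "s3": "a cat sat on a rug",
}
candidates = [("s1", "s2"), ("s2", "s3")]  # sparse pairs from stage 1

scores = rescore(candidates, texts, overlap_score)
final_pairs = resolve_chains(best_pair_per_id(scores, threshold=0.5))
print(final_pairs)
```

Both candidate pairs pass the threshold, but s2 appears in each. Because (s1, s2) scores higher than (s2, s3), only (s1, s2) is kept.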

Steps to reproduce the solution score:

  • Install the required libraries using pip install -r requirements.txt
  • Put the required dataset files inside data/ directory (includes indices, validation_text, training_text, training/info, validation_eval_new.csv etc)
  • Run the training script: python stage_1_training.py
  • Update the fine-tuned bi-encoder model path inside the stage_1_inference.py file
  • Generate the sparse predictions using the script: python stage_1_inference.py
  • Generate the final dense predictions using the script: python stage_2_inference.py
  • Compare the results using the evaluation script: python evaluation.py --ans data/validation_eval_new.csv --evl cross_enc_result_0.66.csv
  • The results are not always exactly the same, because stage 1 training sample generation introduces some randomness. However, after a few runs you should be able to achieve a comparable score.
  • The results are as follows (random run, not my best):
Recall: 0.7294955122253173
Precision: 0.8105226960110041
F1-score: 0.7678775044795569

Even this run achieves the best score on the leaderboard (rank 1).
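For readers who want to sanity-check these numbers, pair-level precision, recall and F1 can be computed along the following lines. This is not the repository's evaluation.py. It is a minimal sketch that assumes predictions and answers are compared as unordered ID pairs, and the function name and toy IDs are hypothetical.

```python
def pair_metrics(predicted, gold):
    """Precision, recall and F1 over unordered ID pairs."""
    pred = {frozenset(p) for p in predicted}
    true = {frozenset(p) for p in gold}
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example: 2 of 3 predictions are correct, 2 of 4 gold pairs are found.
predicted = [("a", "b"), ("c", "d"), ("e", "f")]
gold = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
precision, recall, f1 = pair_metrics(predicted, gold)

# The reported F1 is the harmonic mean of the reported precision and recall.
p, r = 0.8105226960110041, 0.7294955122253173
f1_from_reported = 2 * p * r / (p + r)
print(precision, recall, f1, f1_from_reported)
```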