Zero-Shot and Few-Shot methods for NER in the biomedical domain. This repository contains the code resulting from a research collaboration between Bayer Pharma and The Institute for Artificial Intelligence Research and Development of Serbia.
Each dataset has been converted with a dedicated script into a binary format: tokens belonging to the target named entity (NE) class are labeled 1, while all other tokens are labeled 0.
The conversion process can be implemented with the following steps (a minimal sketch is given after the dataset list below):
- Load the original dataset.
- Define a conversion function that maps tokens of the target NE class to 1 and all other tokens to 0.
- Apply the conversion function to each dataset separately.
- Merge the converted datasets into a single dataset.

The datasets used are:
- CHEMDNER
- CDR-Chemical
- NCBI-Disease
- CDR-Disease
- JNLPBA
- n2c2/i2b2
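As a minimal sketch, the relabeling can look like the following (assuming token-level BIO tags; the function name and example are illustrative, not the repository's actual per-dataset scripts):

```python
# Minimal sketch of the binary relabeling step (illustrative only;
# each dataset in the repository has its own conversion script).

def to_binary_labels(tags, target_entity):
    """Map BIO tags of the target entity type to 1, everything else to 0.

    tags: token-level tags such as ["O", "B-Chemical", "I-Chemical", "O"]
    target_entity: the entity type to keep, e.g. "Chemical"
    """
    keep = {f"B-{target_entity}", f"I-{target_entity}"}
    return [1 if tag in keep else 0 for tag in tags]

# Example with a CDR-style sentence containing two entity types:
tags = ["O", "B-Chemical", "I-Chemical", "O", "B-Disease"]
print(to_binary_labels(tags, "Chemical"))  # [0, 1, 1, 0, 0]
```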
To preprocess the dataset, follow the steps outlined below:
- Splitting into Train, Validation, and Test Sets: The dataset is divided into three subsets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used for tuning hyperparameters and evaluating the model during training, and the test set is used for the final evaluation.
- Creating Transformer Encodings: Generate transformer encodings for the dataset with two fields: "class" and "text". The "class" field contains the labels for each instance, and the "text" field contains the corresponding input text. These encodings are required for feeding the data into a transformer-based model.
- Aligning Labels with BERT Tokens: Map the labels to the corresponding tokens generated by the BERT tokenizer. Ensure that the labels are correctly aligned with the appropriate subword tokens to maintain the integrity of the dataset.
- Transforming into a Torch Dataset: Convert the preprocessed dataset into a Torch dataset, which serves as the input for the model. The Torch dataset provides compatibility with PyTorch and enables efficient training and evaluation. A condensed sketch of these steps follows this list.
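The following is a condensed sketch of these steps using Hugging Face `transformers` and scikit-learn; the model name (`bert-base-cased`), split sizes, and the `"class"`/`"text"` field names are assumptions for illustration, not necessarily the repository's exact choices:

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Step 1: split into train / validation / test sets.
examples = [{"text": ["Aspirin", "inhibits", "COX"], "class": [1, 0, 0]},
            {"text": ["Fever", "was", "observed"], "class": [1, 0, 0]},
            {"text": ["No", "adverse", "events"], "class": [0, 0, 0]},
            {"text": ["Ibuprofen", "reduces", "pain"], "class": [1, 0, 0]}]
train, test = train_test_split(examples, test_size=1, random_state=42)
train, val = train_test_split(train, test_size=1, random_state=42)

# Steps 2-3: transformer encodings ("class"/"text" fields) and label
# alignment with the subword tokens produced by the BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_and_align(example):
    """Tokenize pre-split words and align word-level 0/1 labels to subword
    tokens; special tokens and padding get -100 so the loss ignores them."""
    enc = tokenizer(example["text"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=64)
    enc["labels"] = [-100 if wid is None else example["class"][wid]
                     for wid in enc.word_ids()]
    return enc

# Step 4: wrap the encodings into a torch Dataset.
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, examples):
        self.encodings = [encode_and_align(ex) for ex in examples]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        return {k: torch.tensor(v) for k, v in self.encodings[idx].items()}

train_ds = NERDataset(train)
print(len(train_ds), train_ds[0]["labels"].shape)  # 2 torch.Size([64])
```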
By following these preprocessing steps, you will have a properly prepared dataset ready for training your model. Execute each step accurately and thoroughly, as the reliability of the final results depends on it.
In order to train the model, the following steps are performed:
- Omitting Examples of One Class in the Training Set: During training, examples of one class are omitted from the training set. This helps the model focus on learning the patterns and distinguishing features of the remaining classes, while being forced to generalize in order to make accurate predictions on unseen instances of the omitted class.
- Using the Entire Dataset for Validation: The entire dataset, including all classes, is used for validation. This ensures that the model's performance is evaluated on a diverse set of examples representing all classes present in the dataset, which helps in monitoring the model's generalization ability and detecting potential overfitting.
- Testing on Unseen Class Examples (Zero-Shot or Few-Shot Learning): Testing is conducted on examples from a class that has not been seen during training, known as zero-shot testing. This scenario assesses the model's ability to generalize and make predictions on entirely new classes. Alternatively, the model can be fine-tuned with a small number (1, 10, or 100) of examples from the target class, known as few-shot learning, allowing it to adapt and improve its performance on that class. A sketch of this class-holdout scheme follows this list.
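As a sketch, the class-holdout logic might look like this (the `entity_type` field and function name are hypothetical, used only to illustrate the training regime):

```python
import random

def split_for_heldout_class(examples, heldout_type, k=0, seed=42):
    """Drop the held-out entity class from the training set and return k
    of its examples for optional few-shot fine-tuning (k=0 is zero-shot).
    The "entity_type" field is a hypothetical per-example annotation."""
    train = [ex for ex in examples if ex["entity_type"] != heldout_type]
    heldout = [ex for ex in examples if ex["entity_type"] == heldout_type]
    random.Random(seed).shuffle(heldout)
    return train, heldout[:k]

examples = [{"entity_type": "Chemical"}, {"entity_type": "Disease"},
            {"entity_type": "Disease"}, {"entity_type": "Gene"}]

# Zero-shot: the model never sees "Disease" during training.
train, shots = split_for_heldout_class(examples, "Disease", k=0)

# Few-shot: add back k examples (k = 1, 10, or 100) for fine-tuning.
train, shots = split_for_heldout_class(examples, "Disease", k=1)
print(len(train), len(shots))  # 2 1
```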
Please cite: Košprdić, Miloš, et al. "From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-Shot Contexts." Available at SSRN 4463335. https://arxiv.org/abs/2305.04928
BibTeX:

    @misc{košprdić2023zero,
          title={From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts},
          author={Miloš Košprdić and Nikola Prodanović and Adela Ljajić and Bojana Bašaragin and Nikola Milošević},
          year={2023},
          eprint={2305.04928},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }