Zero-Shot and Few-Shot methods for NER in the biomedical domain. This repository contains the code resulting from a research collaboration between Bayer Pharma and The Institute for Artificial Intelligence Research and Development of Serbia.
Each dataset has been converted with a dedicated script into a binary format: tokens belonging to the target named entity (NE) class are labeled 1, while all other tokens are labeled 0.
The conversion process can be implemented with the following steps (a minimal sketch is given after the dataset list below):
- Load the original dataset.
- Define a conversion function that maps tokens of the target NE class to 1 and all other tokens to 0.
- Apply the conversion function to each dataset separately.
- Merge the converted datasets into a single dataset.

The datasets used are:
- CHEMDNER
- CDR-Chemical
- NCBI-Disease
- CDR-Disease
- JNLPBA
- n2c2/i2b2
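As a minimal sketch, the relabeling can look like the following (assuming token-level BIO tags; the function name and example are illustrative, not the repository's actual per-dataset scripts):

```python
# Minimal sketch of the binary relabeling step (illustrative only;
# each dataset in the repository has its own conversion script).

def to_binary_labels(tags, target_entity):
    """Map BIO tags of the target entity type to 1, everything else to 0.

    tags: token-level tags such as ["O", "B-Chemical", "I-Chemical", "O"]
    target_entity: the entity type to keep, e.g. "Chemical"
    """
    keep = {f"B-{target_entity}", f"I-{target_entity}"}
    return [1 if tag in keep else 0 for tag in tags]

# Example with a CDR-style sentence containing two entity types:
tags = ["O", "B-Chemical", "I-Chemical", "O", "B-Disease"]
print(to_binary_labels(tags, "Chemical"))  # [0, 1, 1, 0, 0]
```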
To preprocess the dataset, follow the steps outlined below:
- Splitting into Train, Validation, and Test Sets: The dataset is divided into three subsets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used for tuning hyperparameters and evaluating the model during training, and the test set is used for the final evaluation.
- Creating Transformer Encodings: Generate transformer encodings for the dataset with two fields: "class" and "text". The "class" field contains the labels for each instance, and the "text" field contains the corresponding input text. These encodings are required for feeding the data into a transformer-based model.
- Aligning Labels with BERT Tokens: Map the labels to the corresponding tokens generated by the BERT tokenizer. Ensure that the labels are correctly aligned with the appropriate subword tokens to maintain the integrity of the dataset.
- Transforming into a Torch Dataset: Convert the preprocessed dataset into a Torch dataset, which serves as the input for the model. The Torch dataset provides compatibility with PyTorch and enables efficient training and evaluation. A condensed sketch of these steps follows this list.
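The following is a condensed sketch of these steps using Hugging Face `transformers` and scikit-learn; the model name (`bert-base-cased`), split sizes, and the `"class"`/`"text"` field names are assumptions for illustration, not necessarily the repository's exact choices:

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Step 1: split into train / validation / test sets.
examples = [{"text": ["Aspirin", "inhibits", "COX"], "class": [1, 0, 0]},
            {"text": ["Fever", "was", "observed"], "class": [1, 0, 0]},
            {"text": ["No", "adverse", "events"], "class": [0, 0, 0]},
            {"text": ["Ibuprofen", "reduces", "pain"], "class": [1, 0, 0]}]
train, test = train_test_split(examples, test_size=1, random_state=42)
train, val = train_test_split(train, test_size=1, random_state=42)

# Steps 2-3: transformer encodings ("class"/"text" fields) and label
# alignment with the subword tokens produced by the BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_and_align(example):
    """Tokenize pre-split words and align word-level 0/1 labels to subword
    tokens; special tokens and padding get -100 so the loss ignores them."""
    enc = tokenizer(example["text"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=64)
    enc["labels"] = [-100 if wid is None else example["class"][wid]
                     for wid in enc.word_ids()]
    return enc

# Step 4: wrap the encodings into a torch Dataset.
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, examples):
        self.encodings = [encode_and_align(ex) for ex in examples]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        return {k: torch.tensor(v) for k, v in self.encodings[idx].items()}

train_ds = NERDataset(train)
print(len(train_ds), train_ds[0]["labels"].shape)  # 2 torch.Size([64])
```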
By following these preprocessing steps, you will have a properly prepared dataset ready for training your model. Execute each step accurately and thoroughly, as the reliability of the final results depends on it.
In order to train the model, the following steps are performed:
- Omitting Examples of One Class in the Training Set: During training, examples of one class are omitted from the training set. This helps the model focus on learning the patterns and distinguishing features of the remaining classes, while being forced to generalize in order to make accurate predictions on unseen instances of the omitted class.
- Using the Entire Dataset for Validation: The entire dataset, including all classes, is used for validation. This ensures that the model's performance is evaluated on a diverse set of examples representing all classes present in the dataset, which helps in monitoring the model's generalization ability and detecting potential overfitting.
- Testing on Unseen Class Examples (Zero-Shot or Few-Shot Learning): Testing is conducted on examples from a class that has not been seen during training, known as zero-shot testing. This scenario assesses the model's ability to generalize and make predictions on entirely new classes. Alternatively, the model can be fine-tuned with a small number (1, 10, or 100) of examples from the target class, known as few-shot learning, allowing it to adapt and improve its performance on that class. A sketch of this class-holdout scheme follows this list.
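As a sketch, the class-holdout logic might look like this (the `entity_type` field and function name are hypothetical, used only to illustrate the training regime):

```python
import random

def split_for_heldout_class(examples, heldout_type, k=0, seed=42):
    """Drop the held-out entity class from the training set and return k
    of its examples for optional few-shot fine-tuning (k=0 is zero-shot).
    The "entity_type" field is a hypothetical per-example annotation."""
    train = [ex for ex in examples if ex["entity_type"] != heldout_type]
    heldout = [ex for ex in examples if ex["entity_type"] == heldout_type]
    random.Random(seed).shuffle(heldout)
    return train, heldout[:k]

examples = [{"entity_type": "Chemical"}, {"entity_type": "Disease"},
            {"entity_type": "Disease"}, {"entity_type": "Gene"}]

# Zero-shot: the model never sees "Disease" during training.
train, shots = split_for_heldout_class(examples, "Disease", k=0)

# Few-shot: add back k examples (k = 1, 10, or 100) for fine-tuning.
train, shots = split_for_heldout_class(examples, "Disease", k=1)
print(len(train), len(shots))  # 2 1
```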
Please cite: Košprdić, Miloš, et al. "From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-Shot Contexts." Available at SSRN 4463335. https://arxiv.org/abs/2305.04928
BibTeX:

    @misc{košprdić2023zero,
          title={From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts},
          author={Miloš Košprdić and Nikola Prodanović and Adela Ljajić and Bojana Bašaragin and Nikola Milošević},
          year={2023},
          eprint={2305.04928},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }