Skip to content

tigerchen52/long_tail_kbc

Repository files navigation

Knowledge Base Completion for Long-Tail Entities

In this paper, we propose an unsupervised framework for knowledge base completion and a new dataset named MALT for evaluation. The main benefits of our approach are:

  • fully prompt-based. To extract a new relation, the only thing needed in this framework is to design a prompt.
  • can deal with multi-token and ambiguous entities
  • work well on long-tail entities

Our method employs two different LMs in a two-stage pipeline as shown in the above Figure. The first stage generates candidate answers to input prompts and gives cues to retrieve informative sentences from Wikipedia and other sources. The second stage validates the candidates and disambiguates the retained answer strings onto entities in the underlying KG (e.g., mapping “Yves Desrosiers” to Yves Desrosiers (guitarist)).

Usage

The repo structure is shown below. The two_stage_pipeline.py file is the code of our two-stage framework, the GENRE path is a submodule that for the corroboration step, and the data path contains the MALT dataset

ROOT:.
¦   candidate_generation.py
¦   corroboration.py
¦   evaluate.py
¦   LICENSE
¦   README.md
¦   template.py
¦   two_stage_pipeline.py
¦   utils.py
¦   
+---GENRE
¦
+---data
¦       entity_name_qid.txt
¦       gold_wikidata.json
¦       malt_eval.txt
¦       malt_hold_out.txt
¦       wikipedia.json
¦
+---figure
¦       framework.png
¦       malt.png
¦       prompt.png
¦       

Data Preparation

We developed a new dataset with an emphasis on the long-tail challenge, called MALT) (for “Multi-token, Ambiguous, Long-Tailed facts”). After downloading, put the MALT file in the root path. There are five files in the MALT dataset:

  • malt_eval.txt contains entity IDs for evaluation
  • malt_hold_out.txt contains entity IDs for adjusting the hyper-parameters
  • gold_wikidata.json contains the gold facts
  • wikipedia.json contains the corresponding Wikipedia pages
  • entity_name_qid.txt

The below table shows the stats of the MALT dataset, which contains many multi-token, ambiguous, and long-tail facts.

Run Example

Given the input document:

Lhasa de Sela said that the song was about inner happiness and "feeling my feet in the earth, having a place in the world, of things taking care of themselves.“ In May 2009, her collaboration with Patrick Watson was released.

We'd like to extract the collaborators for the singer "Lhasa de Sela".

python two_stage_pipeline.py -run_example True

After, the output is shown below:

( Lhasa de Sela, collaborator, Patrick Watson, 0.4763992584808626 )
( Lhasa de Sela, collaborator, Patrick Watson (musician), 0.3224404241174992 )
( Lhasa de Sela, collaborator, Patrick Watson (producer), 0.2401321410226018 )

Run MALT

python two_stage_pipeline.py -run_example False

After, you can get two files:

  • extracted_facts.txt stores the facts extracted by our framework
  • score.txt stores the evaluation scores (precision, recall, and f1)

Resource

Acknowledgements

This work was partially funded by ANR-20-CHIA0012-01 (“NoRDF”).

License

The MALT dataset is licensed under CC BY-SA 4.0 and the code is licensed under the MIT license

Releases

No releases published

Packages

No packages published

Languages