In this paper, we propose an unsupervised framework for knowledge base completion and a new dataset named MALT for evaluation. The main benefits of our approach are:
- fully prompt-based. To extract a new relation, the only thing needed in this framework is to design a prompt.
- can deal with multi-token and ambiguous entities
- work well on long-tail entities
The repo structure is shown below. The two_stage_pipeline.py
file is the code of our two-stage framework, the GENRE path is a submodule that for the corroboration step, and the data
path contains the MALT dataset
ROOT:.
¦ candidate_generation.py
¦ corroboration.py
¦ evaluate.py
¦ LICENSE
¦ README.md
¦ template.py
¦ two_stage_pipeline.py
¦ utils.py
¦
+---GENRE
¦
+---data
¦ entity_name_qid.txt
¦ gold_wikidata.json
¦ malt_eval.txt
¦ malt_hold_out.txt
¦ wikipedia.json
¦
+---figure
¦ framework.png
¦ malt.png
¦ prompt.png
¦
We developed a new dataset with an emphasis on the long-tail challenge, called MALT) (for “Multi-token, Ambiguous, Long-Tailed facts”). After downloading, put the MALT file in the root path. There are five files in the MALT dataset:
malt_eval.txt
contains entity IDs for evaluationmalt_hold_out.txt
contains entity IDs for adjusting the hyper-parametersgold_wikidata.json
contains the gold factswikipedia.json
contains the corresponding Wikipedia pagesentity_name_qid.txt
The below table shows the stats of the MALT dataset, which contains many multi-token, ambiguous, and long-tail facts.
Given the input document:
Lhasa de Sela said that the song was about inner happiness and "feeling my feet in the earth, having a place in the world, of things taking care of themselves.“ In May 2009, her collaboration with Patrick Watson was released.
We'd like to extract the collaborators for the singer "Lhasa de Sela".
python two_stage_pipeline.py -run_example True
After, the output is shown below:
( Lhasa de Sela, collaborator, Patrick Watson, 0.4763992584808626 )
( Lhasa de Sela, collaborator, Patrick Watson (musician), 0.3224404241174992 )
( Lhasa de Sela, collaborator, Patrick Watson (producer), 0.2401321410226018 )
python two_stage_pipeline.py -run_example False
After, you can get two files:
extracted_facts.txt
stores the facts extracted by our frameworkscore.txt
stores the evaluation scores (precision, recall, and f1)
This work was partially funded by ANR-20-CHIA0012-01 (“NoRDF”).
The MALT dataset is licensed under CC BY-SA 4.0 and the code is licensed under the MIT license