This repository covers the coding part of the thesis. It contains the scripts used during the work, instructions for reproducing the experimental results, and guidance on using the system. Some generative operations are computationally demanding, so links to download the previously generated files are provided below. The reproduction steps are nevertheless described in this document.
You can download the following files that were used during the work by clicking on them:
- First generated KG
- Second generated KG
- Second generated KG v2
- Third generated KG
- First benchmark dataset
- Second benchmark dataset
- Long-format questions used to generate the first benchmark
- Embeddings of documents from second KG
- Embeddings of documents from third KG
Evaluation-related files are stored in this repository under the root directory. In the filenames:
- "KG2" denotes the second knowledge graph
- "KG3" denotes the third knowledge graph
- "b1" denotes the first benchmark
- "b2" denotes the second benchmark
The following files store the rank of each sample instance decided by the retrieval methods in different settings:
evaluation_results_2nd_retrieval_method_KG2_5_5.json
evaluation_results_2nd_retrieval_method_KG2_10_10.json
evaluation_results_2nd_retrieval_method_KG2_20_20.json
evaluation_results_2nd_retrieval_method_KG3_5_5.json
evaluation_results_2nd_retrieval_method_KG3_10_10.json
evaluation_results_2nd_retrieval_method_KG3_20_20.json
evaluation_results_bm25_KG2_b1.json
evaluation_results_bm25_KG2_b2.json
evaluation_results_bm25_KG3_b1.json
evaluation_results_bm25_KG3_b2.json
evaluation_results_embedding-based_KG2_b1.json
evaluation_results_embedding-based_KG2_b2.json
evaluation_results_embedding-based_KG3_b1.json
evaluation_results_embedding-based_KG3_b2.json
To print the MRR and Hits@K for k in {10, 20, 50, 100}, run the script general_working_directory/[email protected].
Make sure that the script and the JSON files listed above are in the same directory when executing the script.
Using the following scripts, you can try the RAG models:
- RAG1.py → The first RAG model, similar to naive RAG.
- RAG2.py → The second RAG model, where we use CEL to re-rank the initially retrieved instances.
Both systems use the embedding-based retriever, since it is the best-performing retrieval model. The response is a set of IRIs representing images, together with a textual description generated by the LLM that briefly explains the selected images. To execute the scripts, you need the LLM and embedding services enabled. These services are hosted on servers provided by Paderborn University and are outside the scope of this repository, because enabling them requires permission from the responsible parties at the university.
To reproduce the results and the necessary data structures, only general_working_directory needs to be considered. The other directories can be ignored.
Prerequisites for reproducing the KG generation and evaluation results:
Install Python v3.10.13 or later.
Clone the repository, create a virtual environment, and install the dependencies:
# 1. clone
git clone https://github.com/dice-group/MRAG-KG.git
# 2. setup virtual environment
python -m venv venv
# 3. activate the virtual environment
source venv/bin/activate # for Unix and macOS
.\venv\Scripts\activate # for Windows
# 4. install dependencies
pip install -r requirements.txt
Move to the general working directory:
cd general_working_directory
The rest of this document gives details about each part of the thesis and describes the reproduction steps.
Overview and quick navigation:
- KG generation
- Benchmark generation
- Ranking made by the first retrieval method
- Ranking made by the second retrieval method
- MRR and Hits@K
- CEL models evaluation
- TSNE plot
We generated three knowledge graphs from the Fashionpedia dataset. Here we explain how the KGs are generated.
Fashionpedia provides the following data that we make use of:
The first file is a zip archive containing a folder with all the images of the Fashionpedia dataset.
The second file is a JSON file that contains the Fashionpedia ontology. In this ontology the main individual is an "annotation", which holds the data for a certain image. An annotation describes some part of the image (a garment or garment part) by specifying the category of the garment and its attributes, among other information.
The first generation step consists of creating an RDF KG that represents the data given in instances_attributes_train2020.json.
The structure of the fashionpedia ontology is given below:
{
"info": info,
"categories": [category],
"attributes": [attribute],
"images": [image],
"annotations": [annotation],
"licenses": [license]
}
info{
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime,
}
category{
"id" : int,
"name" : str,
"supercategory" : str, # parent of this label
"level": int, # levels in the taxonomy
"taxonomy_id": string,
}
attribute{
"id" : int,
"name" : str,
"supercategory" : str, # parent of this label
"level": int, # levels in the taxonomy
"taxonomy_id": string,
}
image{
"id" : int,
"width" : int,
"height" : int,
"file_name" : str,
"license" : int,
"time_captured": string,
"original_url": string,
"isstatic": int, 0: the original_url is not a static url,
"kaggle_id": str,
}
annotation{
"id" : int,
"image_id" : int,
"category_id" : int,
"attribute_ids": [int],
"segmentation" : [polygon] or [rle]
"bbox" : [x,y,width,height], # int
"area" : int
"iscrowd": int (1 or 0)
}
polygon: [x1, y1, x2, y2, ...], where x, y are the coordinates of vertices, int
rle: {"size": (height, width), "counts": str}
license{
"id" : int,
"name" : str,
"url" : str
}
In the script named first_kg_generation.py we use rdflib to create a graph, which we populate by adding axioms via the same library.
- First, we add a class for each of the following items: "info", "category", "attribute", "image", "annotation" and "license".
- Then we add object properties for connections that are made using "id". For example, an annotation has an "image_id" that refers to the image it belongs to. Therefore, for the class annotation we create an object property "hasImage". The same is done for each id-connected entity.
- For the rest of the data that an entry has, we create datatype properties to represent it in the knowledge base.
- The last step consists of adding the individuals by going through each entry in the dataset and adding the respective classes and properties to it.
By the end of the fourth step, the first knowledge base generation is complete.
For the second knowledge base generation, the second_kg_generation.py script is used.
For the second generation we want images to be the only individuals. Therefore, we have only one class, which is Image.
These image individuals contain all the information from the annotations belonging to that image. An image can therefore contain more than one wearable item, each described by an annotation. In other words, we have merged all the information available for an image.
This knowledge graph has only data properties and no object properties, because we describe only literal data for images and there is no need for relations between the images.
In this generation we include only the apparel-descriptive information and omit the rest, except for file_name, width and height. All the information of an annotation that belongs to the image is merged into a string and attached to the image using a data property.
The structure of the data is given below:
image{
"file_name" : str,
"width" : int,
"height" : int,
"descriptions": {
"desc1": str,
"desc2": str,
...
}
}
For ease of understanding we show this in JSON format, but the data exists only in RDF/XML format. Each annotation is represented by a "has_description" property, denoted as "desc1", "desc2", "..." in the example above.
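The merge of annotations into one record per image could look like the sketch below. It works on plain dictionaries in the Fashionpedia layout and is only an approximation of the idea: the function name `merge_annotations` and the exact layout of the description string ("category (attr1, attr2)") are assumptions, not what second_kg_generation.py necessarily produces.

```python
from collections import defaultdict


def merge_annotations(data: dict) -> dict:
    """Group annotation descriptions by the image they belong to."""
    categories = {c["id"]: c["name"] for c in data["categories"]}
    attributes = {a["id"]: a["name"] for a in data["attributes"]}

    descriptions = defaultdict(list)
    for ann in data["annotations"]:
        text = categories[ann["category_id"]]
        attrs = ", ".join(attributes[i] for i in ann["attribute_ids"])
        if attrs:
            text += f" ({attrs})"  # assumed string layout
        descriptions[ann["image_id"]].append(text)

    # Keep only file_name, width and height besides the descriptions.
    return {
        img["id"]: {
            "file_name": img["file_name"],
            "width": img["width"],
            "height": img["height"],
            "descriptions": descriptions.get(img["id"], []),
        }
        for img in data["images"]
    }


sample = {
    "categories": [{"id": 0, "name": "shirt"}, {"id": 1, "name": "sleeve"}],
    "attributes": [{"id": 5, "name": "plain"}],
    "images": [{"id": 1, "file_name": "a.jpg", "width": 10, "height": 20}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 0, "attribute_ids": [5]},
        {"id": 11, "image_id": 1, "category_id": 1, "attribute_ids": []},
    ],
}
merged = merge_annotations(sample)
print(merged[1]["descriptions"])
```

Each entry of `descriptions` would then correspond to one "has_description" value on the image individual.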
For the third KG, we use an LLM to generate a short description of each image in the dataset. This description is then added to each instance of the second KG using a data property.
* The LLM service is required before running the following scripts.
- python first_kg_generation.py → Generates the first KG (From JSON to RDF/XML)
- python second_kg_generation.py → Generates the second KG (Subgraph Summarization)
- python third_kg_generation.py → Generates the third KG (Enrichment with Multimodal LLM-generated Context)
- python second_kg_generation_v2.py → Generates the second KG v2 (KG for CEL)
We create two benchmark datasets. The goal is to represent each instance (image) of the KG by a question/query generated by an LLM. The benchmark makes it possible to evaluate the retrieval models: for a given question, we expect the retriever to return the relevant instance specified in the benchmark.
The second benchmark is restructured to group similar questions together by finding the k nearest neighbors of each question. Each instance in the first benchmark can now be mapped to its k nearest neighbor instances in terms of question similarity, which expands the set of relevant instances for a given question.
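The neighbor lookup behind the second benchmark can be illustrated with a brute-force search over question embeddings. The sketch below assumes cosine similarity as the measure and excludes an instance from its own neighbor list; both choices, as well as the function names, are assumptions and may differ from 2nd_benchmark_generation.py.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest(embeddings: dict, k: int) -> dict:
    """Map each instance IRI to the k instances with the most similar question."""
    neighbours = {}
    for iri, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_iri)
            for other_iri, other_vec in embeddings.items()
            if other_iri != iri  # assumption: an instance is not its own neighbor
        ]
        scored.sort(reverse=True)
        neighbours[iri] = [other_iri for _, other_iri in scored[:k]]
    return neighbours


# Toy 2-dimensional "question embeddings" keyed by placeholder IRIs.
question_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
nearest = k_nearest(question_embeddings, k=1)
print(nearest)
```

The pairwise loop is quadratic in the number of instances, so for the full benchmark a vectorised or indexed search would be the more practical choice.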
* The LLM and embedding model services are required before running the following scripts.
- python question_generation.py → Generates a question for each instance and stores the data in questions.json.
- python 1st_benchmark_generation.py → Generates multiple simpler questions based on the single question per instance generated in the previous step. The result is the first benchmark dataset (first_benchmark.json), where each instance is mapped to a string that contains multiple questions separated by * and sometimes by numerical values (enumerated).
- python 2nd_benchmark_generation.py → For each instance in the first benchmark, selects a random question and generates its embedding, finds the k nearest neighbors using the embeddings, and stores them for each instance. The result is the second benchmark dataset (second_benchmark.csv). The question embeddings are also stored (filename: question_embeddings.csv).
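Because the first benchmark stores several questions in one string, a consumer has to split that string before picking a question. The following is one possible way to do it, assuming the separators are `*` and enumerators of the form `1.` or `2)`; the regular expression and the function names are illustrative and not taken from the repository.

```python
import random
import re

# Assumed separators: a "*" or an enumerator such as "1." / "2)" followed by
# whitespace. A number that ends a sentence (e.g. "... in 2020. ") would also
# match, so real data may need a stricter pattern.
_SEPARATOR = re.compile(r"\s*\*\s*|(?:^|\s)\d+[.)]\s+")


def split_questions(raw: str) -> list[str]:
    """Split a benchmark string into its individual questions."""
    return [part.strip() for part in _SEPARATOR.split(raw) if part.strip()]


def pick_question(raw: str, rng: random.Random) -> str:
    """Select one question at random, as done before embedding it."""
    return rng.choice(split_questions(raw))


starred = "* What is red? * Which dress is long?"
numbered = "1. What is red? 2. Which dress is long?"
print(split_questions(starred))
print(split_questions(numbered))
print(pick_question(starred, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the selection reproducible across runs.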
We test two retrieval models in our first retrieval method: the embedding-based retriever and the BM25 retriever. For the embedding-based retriever we first need to generate embeddings for the documents. A document is the concatenation of all descriptions of an instance in the KG. We generate embeddings for the documents in the second KG and in the third KG.
* The embedding model service is required before running the following scripts.
python docs_embedding_generation.py -kg_path fashionpedia-second-generation.owl
python docs_embedding_generation.py -kg_path fashionpedia-third-generation.owl
python embedding-based_retriever_evaluation_b1.py -embeddings embeddings_second_kg.csv
python embedding-based_retriever_evaluation_b1.py -embeddings embeddings_third_kg.csv
python embedding-based_retriever_evaluation_b2.py -embeddings embeddings_second_kg.csv
python embedding-based_retriever_evaluation_b2.py -embeddings embeddings_third_kg.csv
python bm25_evaluation_b1.py -kg_path fashionpedia-second-generation.owl
python bm25_evaluation_b1.py -kg_path fashionpedia-third-generation.owl
python bm25_evaluation_b2.py -kg_path fashionpedia-second-generation.owl
python bm25_evaluation_b2.py -kg_path fashionpedia-third-generation.owl
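For readers unfamiliar with BM25, the scoring used by the second retriever can be sketched in plain Python. This is a textbook Okapi BM25 with whitespace tokenisation and the common defaults k1 = 1.5 and b = 0.75; the evaluation scripts may rely on a library and different parameters, so treat the function name and settings as assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: dict, k1: float = 1.5, b: float = 0.75) -> dict:
    """Score every document against the query (docs must be non-empty)."""
    tokenised = {iri: text.lower().split() for iri, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised.values()) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(t for tokens in tokenised.values() for t in set(tokens))

    scores = {}
    for iri, tokens in tokenised.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[iri] = score
    return scores


# Toy documents keyed by placeholder instance names.
docs = {
    "img1": "red dress with long sleeves",
    "img2": "blue jeans",
    "img3": "red hat",
}
scores = bm25_scores("long red dress", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The rank stored for a benchmark question would then be the 1-based position of its expected instance in `ranking`.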
In the second retrieval method we re-rank the documents after initially ranking them with the embedding-based retrieval method. The re-ranking process involves learning a class expression using a CEL model and classifying instances in the KG. The top-k documents of the classified instances are used to generate a summary, which is encoded into the vector space. The classified instances are then ranked by the cosine similarity between the summary and their documents, where a higher similarity score indicates higher relevance.
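The final ordering step of this method, ranking the classified instances against the summary embedding, can be sketched as below. The class expression learning and the summary generation are left out because they depend on the external services; the function name `rerank` and the toy vectors are assumptions for illustration only.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(classified: dict, summary_vec) -> list:
    """Order classified instances by similarity of their document to the summary.

    `classified` maps the IRI of each instance returned by the learned class
    expression to the embedding of its document.
    """
    return sorted(
        classified,
        key=lambda iri: cosine(summary_vec, classified[iri]),
        reverse=True,
    )


# Toy embeddings: "b" points in the same direction as the summary.
classified_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
summary_embedding = [0.0, 1.0]
reranked = rerank(classified_docs, summary_embedding)
print(reranked)
```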
* The LLM and embedding model services are required before running the following scripts.
python 2nd_retrieval_method_evaluation.py -lp_size 40 -kg_path fashionpedia-second-generation.owl
python 2nd_retrieval_method_evaluation.py -lp_size 20 -kg_path fashionpedia-second-generation.owl
python 2nd_retrieval_method_evaluation.py -lp_size 10 -kg_path fashionpedia-second-generation.owl
python 2nd_retrieval_method_evaluation.py -lp_size 40 -kg_path fashionpedia-third-generation.owl
python 2nd_retrieval_method_evaluation.py -lp_size 20 -kg_path fashionpedia-third-generation.owl
python 2nd_retrieval_method_evaluation.py -lp_size 10 -kg_path fashionpedia-third-generation.owl
After generating all the ranking files, we use a single script to calculate and print the MRR and Hits@k for k in {10,20,50,100}.
python [email protected]
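As a reference for what the two metrics measure, here is a minimal sketch that computes them from a list of ranks. It assumes 1-based ranks, one per benchmark question; how the ranks are laid out inside the evaluation_results_*.json files is not shown here, so loading them is left out.

```python
def mean_reciprocal_rank(ranks) -> float:
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k: int) -> float:
    """Fraction of queries whose expected instance is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy ranks for four queries.
ranks = [1, 2, 4, 200]
print("MRR:", mean_reciprocal_rank(ranks))
for k in (10, 20, 50, 100):
    print(f"Hits@{k}:", hits_at_k(ranks, k))
```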
We also evaluate the performance of three CEL models from Ontolearn in classifying clusters of instances (identified by the KNN algorithm, as in the second benchmark).
python cel_evaluation.py -model tdl
python cel_evaluation.py -model celoe
python cel_evaluation.py -model drill
The TSNE plot shows the clusters formed by the evaluation sample instances. To reproduce the plot, use the following command:
python TSNE_plot.py