Official implementation for our paper A Comparative Analysis of Retrievability and PageRank Measures with code and results notebooks. In this paper, we compute rank correlation coefficents (Kendall's
The accessibility of documents within a collection holds a pivotal role in Information Retrieval, signifying the ease of locating specific content in a collection of documents. This accessibility can be achieved via two distinct avenues. The first is through some retrieval model using a keyword or other feature-based search, and the other is where a document can be navigated using links associated with them, if available. Metrics such as PageRank, Hub, and Authority illuminate the pathways through which documents can be discovered within the network of content while the concept of Retrievability is used to quantify the ease with which a document can be found by a retrieval model. In this paper, we compare these two perspectives, PageRank and retrievability, as they quantify the importance and discoverability of content in a corpus. Through empirical experimentation on benchmark datasets, we demonstrate a subtle similarity between retrievability and PageRank particularly distinguishable for larger datasets.
The retrievability measure quantifies how easily a document can be retrieved in a specific configuration of an Information Retrieval (IR) system. The retrievability score, denoted as
Where:
-
$r(d)$ : The retrievability score of document$d$ . -
$d \in D$ : The document $ d $ is a member of the collection$D$ . -
$Q$ : The set of all possible queries that could be answered by the collection$D$ . This set encompasses the entire range of queries that the IR system might encounter. -
$o_q$ : The opportunity weight associated with query$q$ , which quantifies the likelihood or frequency of selecting query$q$ from the set$Q$ . -
$f(k_{dq}, c)$ : A function that measures the effectiveness of retrieving document$d$ in response to query$q$ , where$k_{dq}$ denotes the retrieval rank of document$d$ to query$q$ , and$c$ represents rank cutoff after which user ceases to examine the ranked list.
The PageRank formula is as follows:
Where:
-
$PR(A)$ : The PageRank score of page$A$ . -
$d$ : The damping factor, typically set to 0.85. It balances the influence of following links on the current page and the randomness of jumping to any page on the web. -
$n$ : The total number of pages that link to page$A$ . -
$PR(T_i)$ : The PageRank score of page$T_i$ , a page that links to$A$ . -
$C(T_i)$ : The number of outgoing links from page$T_i$ . -
$\frac{PR(T_i)}{C(T_i)}$ : The share of the PageRank score that page$A$ receives from page$T_i$ , where the score is divided by the number of outgoing links from$T_i$ .
Dataset | Kendall's |
Spearman's |
RBO |
---|---|---|---|
WT10g | 0.0487 | 0.0730 | 0.5173 |
Wikipedia | 0.1532 | 0.2247 | 0.5633 |
We observe, while lower Kendall’s
@inproceedings{10.1145/3632754.3632760,
author = {Sinha, Aman and Mall, Priyanshu Raj and Roy, Dwaipayan and Ghosh, Kripabandhu},
title = {A Comparative Analysis of Retrievability and PageRank Measures},
year = {2024},
isbn = {9798400716324},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3632754.3632760},
doi = {10.1145/3632754.3632760},
booktitle = {Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation},
pages = {67–72},
numpages = {6},
location = {Panjim, India},
series = {FIRE '23}
}