October 19, 2017
This document describes:
- How to generate the official scores for a KB/SF submission for 2017;
- How to interpret the scores output; and
- How to read the debug file.
Refer to README-Usage for details on how to use the various scripts.
The scorer runs over a submission in SF format. For participants in the KB variant of ColdStart, the KB must first be transformed to SF format, which is then used by the scorer. README-Usage details how to generate a valid input file for evaluation of the slot-filling component from a KB submission. We treat redundant responses differently depending on whether we are scoring an SF submission or an SF file generated from a submitted KB (referred to as KB->SF). Redundant responses are considered spurious if they come from an SF submission and are ignored otherwise. This is why the post-policy decision has to be specified differently. Both cases are described separately below:
We are reporting an AP-based score, as described in the task description, in addition to the scores produced last year. In particular, we report (1) macro-average AP scores for the SF and LDC-MEAN variants and, (2) as last year, micro- and macro-average P/R/F1 for the SF and LDC-MAX variants, and macro-average P/R/F1 for the LDC-MEAN variant. The scores based on last year's specifications ignore the redundant responses by default. In order to compute scores that are comparable to last year's, see Section 3.2 below.
The micro-average P/R/F score computes a single P/R/F1 by summing counts across all selected queries. The macro-average P/R/F score computes P/R/F1 for each query, and finally takes the mean of the query-level P/R/F1 scores for queries that have a known answer. (Note that because the macro-average score ignores queries that have no known answer, a separate metric is needed to evaluate queries with no known answer.)
perl CS-Score.pl -output_file CSrun_score -queries queryids.txt -error_file CSrun_score.errlog -expand CSSF17 tac_kbp_2017_cold_start_evaluation_queries.xml CSrun.valid.ldc.tab.txt pool.assessed.fqec
Notice that the queries file used here is the LDC-level queries file tac_kbp_2017_cold_start_evaluation_queries.xml. In this case the scorer will use the expansion string CSSF17, as specified by the -expand CSSF17 switch, to internally convert the LDC queries into SF queries, which are then used to score the submission. Given this, the scorer will produce SF, LDC-MAX, and LDC-MEAN scores.
You may also invoke the scorer without the -expand CSSF17 switch, using an SF queries file (instead of the LDC queries file), for example tac_kbp_2017_cold_start_slot_filling_evaluation_queries.xml, or a language-specific SF queries file, for example tac_kbp_2017_chinese_cold_start_slot_filling_evaluation_queries.xml, tac_kbp_2017_english_cold_start_slot_filling_evaluation_queries.xml, or tac_kbp_2017_spanish_cold_start_slot_filling_evaluation_queries.xml. In this case only SF scores will be produced.
Also notice that we have used the -queries queryids_file switch to produce scores using only the selected queries, as specified by the queryids_file.
You may optionally specify the bootstrap resample file (made available to you) using the -samples switch, in which case the scorer will produce bootstrap sample statistics and confidence intervals using the percentile method.
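For reference, the percentile method can be reproduced from per-resample aggregate scores along the following lines. This is a minimal Python sketch, not the scorer's Perl implementation; the `sample_scores` input and the 95% level are illustrative assumptions.

```python
# Minimal sketch of the percentile method for bootstrap confidence intervals.
# The per-resample aggregate scores (one value per bootstrap resample, e.g.
# a macro-average F1) are assumed to be available already; the scorer itself
# derives them from the resample file passed via -samples.

def percentile_ci(sample_scores, level=0.95):
    """Return (lower, upper) bounds of the percentile-method CI."""
    scores = sorted(sample_scores)
    n = len(scores)
    alpha = (1.0 - level) / 2.0
    lo_idx = max(0, int(alpha * n))              # e.g. 2.5th percentile
    hi_idx = min(n - 1, int((1.0 - alpha) * n))  # e.g. 97.5th percentile
    return scores[lo_idx], scores[hi_idx]

# Example with made-up resample scores:
# percentile_ci([0.31, 0.35, 0.29, 0.33, 0.36, 0.30]) -> (0.29, 0.36)
```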
The scores will be produced in a set of files whose prefix is specified through the -output_file CSrun_score switch.
A subset of the following files would be produced depending on selected options:
- CSrun_score.params
- CSrun_score.confidence
- CSrun_score.debug
- CSrun_score.errlog
- CSrun_score.ap (New in 2017; contains AP-based scores)
- CSrun_score.ldcmax
- CSrun_score.ldcmean
- CSrun_score.sample
- CSrun_score.samplescores
- CSrun_score.sf
- CSrun_score.summary
It is worth mentioning here that the scorer also produces a debug file, CSrun_score.debug, as output. The objective of this file is to increase the transparency of the scoring system. You may use it to further your research by examining the assessments applied to each response in your submission. In particular, each response in the submission file is mapped to its corresponding response assessment from the assessment file (empty if the response is not assessed), together with assessments at the PREPOLICY and POSTPOLICY levels.
In addition to the information included in the debug file last year, this year we are adding: (1) the TARGET_QUERY_ID (the hash to be used for the next-round query) with each submission line, (2) the FQNODEID (the fully-qualified KB node ID) with each submission line, and (3) debug information for computing the AP-based score.
The PREPOLICY assessment refers to the correctness of the response as decided by the assessor, whereas the POSTPOLICY assessment categorizes responses into RIGHT, WRONG, and IGNORE as specified by the -right, -wrong, and -ignore switches. The RIGHT, WRONG, and IGNORE counts are used primarily by the scorer to produce scores.
Policy options are a colon-separated list drawn from the following:
- CORRECT: Number of assessed correct responses. Legal choice for -right.
- DUPLICATE: Number of duplicate responses. Legal choice for -right, -wrong, and -ignore.
- INCORRECT: Number of assessed incorrect responses. Legal choice for -wrong.
- INCORRECT_PARENT: Number of responses that had an incorrect (grand-)parent. Legal choice for -wrong and -ignore.
- INEXACT: Number of assessed inexact responses. Legal choice for -right, -wrong, and -ignore.
- UNASSESSED: Number of unassessed responses. Legal choice for -wrong and -ignore.
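For illustration, the post-policy categorization can be viewed as a mapping from pre-policy categories to RIGHT/WRONG/IGNORE, as in the Python sketch below. This is a hedged sketch of the logic, not the scorer's actual code, and the default lists shown are assumptions for the example (consult the usage message for the real defaults).

```python
# Sketch of turning a pre-policy assessment category into a post-policy
# decision (RIGHT / WRONG / IGNORE) using the colon-separated lists passed
# via -right, -wrong, and -ignore. The defaults below are illustrative only.

def postpolicy(prepolicy_category,
               right="CORRECT",
               wrong="INCORRECT:INCORRECT_PARENT:INEXACT",
               ignore="UNASSESSED:DUPLICATE"):
    if prepolicy_category in right.split(":"):
        return "RIGHT"
    if prepolicy_category in wrong.split(":"):
        return "WRONG"
    if prepolicy_category in ignore.split(":"):
        return "IGNORE"
    raise ValueError("category not covered by any policy: " + prepolicy_category)

# Example: postpolicy("DUPLICATE") -> "IGNORE" under these assumed defaults,
# mirroring the KB->SF behavior in which redundant responses are ignored.
```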
Scoring an SF submission differs from scoring a KB->SF submission because of the way we treat redundant responses. Multiple justifications are not allowed in an SF submission, so it should not return a redundant response, the same as last year. Therefore, we change the post-policy decision accordingly.
An example command to score such a submission is shown below. Note that this also applies when we want to compute a score for a KB->SF submission that is comparable to last year's.
perl CS-Score.pl -output_file CSrun_score -queries queryids.txt -error_file CSrun_score.errlog -expand CSSF17 -justifications 1:1 -ignore UNASSESSED -wrong INCORRECT:INCORRECT_PARENT:INEXACT:DUPLICATE tac_kbp_2017_cold_start_evaluation_queries.xml CSrun.valid.ldc.tab.txt pool.assessed.fqec
The scorer may produce either the SF score variant alone or all of the following variants, depending on the options selected. The aggregates reported for each variant are:
| Score Variant | Micro-average AP | Micro-average P/R/F1 | Macro-average AP | Macro-average P/R/F1 |
|---------------|------------------|----------------------|------------------|----------------------|
| SF            | No               | Yes                  | Yes              | Yes                  |
| LDC-MAX       | No               | Yes                  | No               | Yes                  |
| LDC-MEAN      | No               | No                   | Yes              | Yes                  |
Aggregates are computed for each hop individually and also combined across all hops.
Micro-averages are computed as:
Total_Precision = Total_Right / (Total_Right + Total_Wrong)
Total_Recall = Total_Right / Total_GT
Total_F1 = 2 * Total_Precision * Total_Recall / (Total_Precision + Total_Recall)
Note that Total_GT is the total number of true answers, i.e. the total number of equivalence classes found for the query. For single-valued slots a Total_GT value of 1 is used even if there is more than one equivalence class in the assessment file.
Macro-averages are computed as the mean Precision, mean Recall, mean F1 and mean AP.
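To make the micro/macro distinction concrete, here is a minimal Python sketch of both aggregations. It assumes the per-query Right/Wrong/GT counts have already been derived, and it glosses over details such as the single-valued-slot GT adjustment described above.

```python
# Illustrative micro- vs macro-average P/R/F1 over per-query counts.
# Each entry holds (right, wrong, gt) for one query with a known answer.

def prf1(right, wrong, gt):
    p = right / (right + wrong) if (right + wrong) else 0.0
    r = right / gt if gt else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def micro_average(counts):
    # Sum the counts across queries, then compute a single P/R/F1.
    total_right = sum(c[0] for c in counts)
    total_wrong = sum(c[1] for c in counts)
    total_gt = sum(c[2] for c in counts)
    return prf1(total_right, total_wrong, total_gt)

def macro_average(counts):
    # Compute P/R/F1 per query, then take the mean of each measure.
    per_query = [prf1(*c) for c in counts]
    n = len(per_query)
    return tuple(sum(q[i] for q in per_query) / n for i in range(3))

# counts = [(2, 1, 3), (0, 2, 1)]  # two hypothetical queries
# micro_average(counts) -> P = 2/5, R = 2/4, F1 ~ 0.44
# macro_average(counts) -> mean of (0.67, 0.67, 0.67) and (0, 0, 0)
```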
The official score for 2017 is an AP-based score for which the following score variants are produced:
1. SF: the slot-filling score variant, treating each entrypoint as a separate query, and
2. LDC-MEAN: the LDC-level score variant, in which the AP for each LDC query is the mean of the APs of all entrypoints for that LDC query.
The scorer may produce the following P/R/F1 score variants depending on the query file used for scoring (a sketch of the LDC-level aggregations follows this list):
1. SF: the slot-filling score variant, treating each entrypoint as a separate query,
2. LDC-MAX: the LDC-level score variant, considering the run's best entrypoint per LDC query on the basis of the F1 score combined across both hops, and
3. LDC-MEAN: the LDC-level score variant, in which the Precision, Recall, and F1 for each LDC query are the mean Precision, Recall, and F1 over all entrypoints for that LDC query.
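The sketch below illustrates the two LDC-level aggregations. It is a minimal Python sketch under the assumption that entrypoint-level (P, R, F1) triples combined across both hops are already available; the `sf_scores` and `ldc_of` inputs are hypothetical and not part of the scorer.

```python
# Illustrative LDC-MAX / LDC-MEAN aggregation over entrypoint-level scores.
# `sf_scores` maps each SF (entrypoint) query ID to its (P, R, F1) combined
# across both hops; `ldc_of` maps an SF query ID to its LDC query ID.

def ldc_max(sf_scores, ldc_of):
    # Pick, per LDC query, the entrypoint with the best combined F1.
    best = {}
    for sf_id, (p, r, f1) in sf_scores.items():
        ldc_id = ldc_of[sf_id]
        if ldc_id not in best or f1 > best[ldc_id][2]:
            best[ldc_id] = (p, r, f1)
    return best

def ldc_mean(sf_scores, ldc_of):
    # Average P, R, and F1 over all entrypoints of each LDC query.
    groups = {}
    for sf_id, triple in sf_scores.items():
        groups.setdefault(ldc_of[sf_id], []).append(triple)
    return {ldc_id: tuple(sum(t[i] for t in triples) / len(triples)
                          for i in range(3))
            for ldc_id, triples in groups.items()}
```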
In order to have the scorer produce LDC-MAX and LDC-MEAN score variants, the scorer must use:
a. LDC queries file (for example, `tac_kbp_2017_cold_start_evaluation_queries.xml`), and
b. -expand switch with correct expansion string (for example, `-expand CSSF17`).
A summary of the scores is presented towards the end of the scorer output.
The official score for 2017 is an AP-based score which is computed as described below.
Before we describe the official score computation, recall that Average Precision (AP) is a rank-based measure used to report the quality of a ranked list. Let there be five known correct answers to a query for which we got the following ranked list: RNRRNR (where the left-most item represents the first result, R represents a relevant result, and N a non-relevant one).
Rank | R/N | Precision at Rank |
---|---|---|
1 | R | 1/1 |
2 | N | - |
3 | R | 2/3 |
4 | R | 3/4 |
5 | N | - |
6 | R | 4/6 |
AP = (1 + 2/3 + 3/4 + 4/6)/5
Similarly, AP of RRRRNN is (1+1+1+1)/5 and that of NNRRRR is (1/3+2/4+3/5+4/6)/5.
The official score computation differs from the standard AP computation because, for our ranking of the nodes, the V value is not binary, as shown below. The ranked list of V values in the example below is 0.6667, 1, 0, 0, 0, 0.
Rank | V | Precision at Rank |
---|---|---|
1 | 0.6667 | 0.6667/1 |
2 | 1.0000 | 1.6667/2 |
3 | 0.0000 | - |
4 | 0.0000 | - |
5 | 0.0000 | - |
6 | 0.0000 | - |
Therefore the official AP-based score is (0.6667 + 1.6667/2)/4 = 0.3750. The corresponding entry in the scorer's AP debug output is shown below:
=============================================================================================
QUERY_ID: CSSF17_ENG_afddc4ed21
LEVEL: 1
AP: 0.3750
NUM_GROUND_TRUTH: 4
GROUND TRUTH:
CSSF17_ENG_afddc4ed21:1:1
CSSF17_ENG_afddc4ed21:1:2
CSSF17_ENG_afddc4ed21:2:1
CSSF17_ENG_afddc4ed21:2:2
RANKING:
........
RANK NODEID CONFIDENCE MAPPED_EC V
---------------------------------------------------------------------------------------------
1 CSSF17_ENG_afddc4ed21:Entity102:Entity111 0.7396 CSSF17_ENG_afddc4ed21:1:2 0.6667
2 CSSF17_ENG_afddc4ed21:Entity102:Entity110 0.6001 CSSF17_ENG_afddc4ed21:1:1 1.0000
3 CSSF17_ENG_afddc4ed21:Entity103:Entity111 0.4653 - 0.0000
4 CSSF17_ENG_afddc4ed21:Entity104:Entity110 0.4581 - 0.0000
5 CSSF17_ENG_afddc4ed21:Entity104:Entity111 0.4513 - 0.0000
6 CSSF17_ENG_afddc4ed21:Entity103:Entity110 0.4172 - 0.0000
=============================================================================================
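The Python sketch below reproduces this computation; it is illustrative only (not the scorer's Perl code), but it matches the worked example and debug entry above, and it reduces to standard AP when the V values are all 0 or 1.

```python
# Illustrative AP-based score over a confidence-ranked list of V values.
# At every rank with V > 0, the "precision at rank" is the running sum of
# V divided by the rank; the final score divides by the number of ground
# truth equivalence classes (NUM_GROUND_TRUTH).

def ap_based_score(v_values, num_ground_truth):
    running_sum = 0.0
    score = 0.0
    for rank, v in enumerate(v_values, start=1):
        running_sum += v
        if v > 0:
            score += running_sum / rank
    return score / num_ground_truth

# Binary example from above (RNRRNR with five known correct answers):
# ap_based_score([1, 0, 1, 1, 0, 1], 5)       -> (1 + 2/3 + 3/4 + 4/6) / 5
# Non-binary example from above:
# ap_based_score([0.6667, 1, 0, 0, 0, 0], 4)  -> (0.6667/1 + 1.6667/2) / 4 = 0.375
```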
As mentioned above, the scorer produces a subset of the following files depending on the options selected.
CSrun_score.params Lists when, how, and with what options the scorer was invoked
CSrun_score.errlog Reports errors encountered while scoring, if any
CSrun_score.ap Lists SF and LDC-MEAN variants of the AP scores including the summary at the end
CSrun_score.debug Includes response level pre-policy and post-policy assessments and AP computation steps
CSrun_score.sf Lists counts and scores based on SF variant
CSrun_score.ldcmax Lists counts and scores based on LDC-MAX variant
CSrun_score.ldcmean Lists counts and scores based on LDC-MEAN variant
CSrun_score.summary Lists summary of the P/R/F1 scores
CSrun_score.sample Lists the bootstrap resamples in a format matching the samplescores
CSrun_score.samplescores Lists the bootstrap resample statistics
CSrun_score.confidence Reports confidence intervals based on the bootstrap resampling statistics using the percentile method
By default, for each query and hop level, the scorer outputs the following counts:
GT Total number of ground truth answers (equivalence classes) as found by the assessors,
Submitted Number of responses in the submission,
Correct Number of responses in the submission that were assessed as Correct,
Incorrect Number of responses in the submission that were assessed as Wrong,
Inexact Number of responses in the submission that were assessed as Inexact,
PIncorrect Number of responses in the submission that had an incorrect ancestor,
Unassessed Number of responses in the submission that were not assessed,
Dup Number of responses that were assessed as Correct but found to be duplicate of another Correct in the same submission,
Right Number of responses in the submission counted as Right, as specified by the argument to switch `-right`,
Wrong Number of responses in the submission counted as Wrong, as specified by the argument to switch `-wrong`,
Ignored Number of responses in the submission that were ignored for the purpose of score computation, as specified by the argument to switch `-ignore`,
AP Computed as described in Section 4.3 above,
Precision Computed as: Right / (Right + Wrong),
Recall Computed as: Right / GT,
F1 Computed as: 2 * Precision * Recall / (Precision + Recall).
where Right, Wrong, and Ignored are computed based on the selected post-policy decision, specified using the switches -right, -wrong, and -ignore (see the usage for details).
Note that scores for round # 2 (or hop-1) are reported at the equivalence class (EC) level and not at the generated-query level. For example, the scores given for CSSF15_ENG_0458206f71:2 are the scores corresponding to the entity found as the answer to the round # 1 (or hop-0) query CSSF15_ENG_0458206f71 that was placed by the assessors in equivalence class 2. Similarly, the scores given for CSSF15_ENG_0458206f71:0 are the scores corresponding to all hop-1 answers that correspond to incorrect hop-0 fillers.
The debug file has been extended in 2017. Major changes include:
(1) Addition of debug information for AP computation, and (2) Storing TARGET_QID and FQNODEID with each submission entry.
This section describes the content of the debug file in detail. The file can be broken down into two major parts:
(1) per-query response pre/post-policy assessment, and (2) AP computation debug information.
The debug file begins with the per-query response pre- and post-policy assessment information, where each query response is mapped to its pre- and post-policy assessment; in particular, if the response has an assessment, the corresponding line from the assessment file is paired with it.
An example entry is shown below:
FQNODEID: CSSF17_ENG_afddc4ed21:Entity104
SUBMISSION: CSSF17_ENG_afddc4ed21 per:children SCORER_TS_SPLIT SIMPSONS_012:67-112 Lisa Simpson PER SIMPSONS_012:81-94 0.9864 :Entity104
TARGET_QID: CSSF17_ENG_afddc4ed21_1fa227e2e937
ASSESSMENT: CSSF17_ENG_afddc4ed21_0_011 CSSF17_ENG_afddc4ed21:per:children SIMPSONS_012:67-112 Lisa Simpson SIMPSONS_012:81-94 C NAM C CSSF17_ENG_afddc4ed21:1 NAM
PREPOLICY ASSESSMENT: CORRECT
POSTPOLICY ASSESSMENT: IGNORE,REDUNDANT
These fields are described here:
FQNODEID Fully-qualified KB node ID corresponding to the response in the case of a KB->SF submission (or the automatically generated node ID in the case of an SF submission)
SUBMISSION A response line in the submission file
TARGET_QID Query ID of the next round query
ASSESSMENT Line from the assessment file that corresponds to this answer, if any
PREPOLICY ASSESSMENT The assessment of the response in the assessment file
POSTPOLICY ASSESSMENT The categorization of the response for the purpose of scoring; NOTE that the categorization of REDUNDANT responses affects only P/R/F1
AP computation debug information is printed at the end of the debug file beginning with line:
AP COMPUTATION DEBUG INFO BEGINS:
This section presents debug information for each query-and-hop separately. An example is shown below:
QUERY_ID: CSSF17_ENG_afddc4ed21
LEVEL: 1
AP: 1.0000
NUM_GROUND_TRUTH: 4
GROUND TRUTH:
CSSF17_ENG_afddc4ed21:1:1
CSSF17_ENG_afddc4ed21:1:2
CSSF17_ENG_afddc4ed21:2:1
CSSF17_ENG_afddc4ed21:2:2
RANKING:
........
RANK NODEID CONFIDENCE MAPPED_EC V
1 CSSF17_ENG_afddc4ed21:Entity103:Entity111 0.7516 CSSF17_ENG_afddc4ed21:1:2 1.0000
2 CSSF17_ENG_afddc4ed21:Entity103:Entity110 0.6633 CSSF17_ENG_afddc4ed21:1:1 1.0000
3 CSSF17_ENG_afddc4ed21:Entity102:Entity111 0.5120 CSSF17_ENG_afddc4ed21:2:2 1.0000
4 CSSF17_ENG_afddc4ed21:Entity102:Entity110 0.5031 CSSF17_ENG_afddc4ed21:2:1 1.0000
5 CSSF17_ENG_afddc4ed21:Entity104:Entity110 0.3694 - 0.0000
6 CSSF17_ENG_afddc4ed21:Entity104:Entity111 0.3573 - 0.0000
For each query-and-hop, we provide the following:
QUERY_ID The query ID
LEVEL The hop number
AP AP-based score computed using the V values in the list given in the corresponding RANKING section
NUM_GROUND_TRUTH The number of correct answers (used as the denominator for the AP computation)
GROUND TRUTH The list of equivalence classes assigned by LDC *1
RANKING The ranked list used for AP computation *2
*1: The number of equivalence classes as assigned by LDC might differ from NUM_GROUND_TRUTH (e.g. in the case of single-valued slots, if LDC found two ages of a person, NUM_GROUND_TRUTH would be set to 1)
*2: The ranked list provides a list of nodes with (1) the node's confidence, (2) the equivalence class to which the node is aligned (MAPPED_EC), and (3) the value V
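To illustrate how this RANKING block can be consumed programmatically, the rough Python sketch below recomputes the AP-based score from the V column. The whitespace splitting and block boundaries are assumptions based on the layout shown above, so adjust as needed.

```python
# Rough sketch: recompute the AP-based score from the V column of a single
# query-and-hop RANKING block in the debug file. It assumes the block looks
# like the example above (a header line starting with "RANK", then one
# whitespace-separated row per node whose last column is the V value).

def recompute_ap_from_ranking(lines, num_ground_truth):
    v_values = []
    in_rows = False
    for line in lines:
        line = line.strip()
        if line.startswith("RANK"):          # "RANKING:" or the table header
            in_rows = True
            continue
        if in_rows and line and line[0].isdigit():
            v_values.append(float(line.split()[-1]))   # last column is V
    running, score = 0.0, 0.0
    for rank, v in enumerate(v_values, start=1):
        running += v
        if v > 0:
            score += running / rank
    return score / num_ground_truth
```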
In 2017, KB teams were asked to include multiple justifications for a relation, whereas the format for SF teams remained the same as last year's. Therefore, in order to support an apples-to-apples comparison between a KB submission and an SF submission, a -justifications switch was added to the scorer. The default value of this switch is 1:3, which means that one justification is allowed per document and that there may be a total of three justifications. The scorer can be run with -justifications 1:1 over KB runs to make the scores comparable to SF runs. For the purpose of this README, we will refer to -justifications 1:3 as K=3 and -justifications 1:1 as K=1.
Briefly,
- F1 score does not give credit for retrieving extra correct justifications, and
- AP score penalizes a system that fails to retrieve min(K, N) justifications, where K is as defined above and N is the number of known correct justifications.
Computation of F1 generally does not depend on the value of K; however, in some cases the scores may differ. This section explains, using an example, the case in which we get different F1 scores for different K values.
Let us assume that a KB system retrieves two justifications (for a KB node) in response to a non-single-valued-slot query, such that the more confident justification is WRONG while the other one is RIGHT. When scoring this submission using K=3, one WRONG and one RIGHT response would be used for computing F1, whereas when using K=1 the only response considered would be the most confident one, which is WRONG, causing the difference.
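A hedged Python sketch of one way the -justifications D:K cap can be interpreted is shown below; the grouping and tie-breaking rules are assumptions for illustration, not a description of CS-Score.pl internals.

```python
# Illustrative interpretation of -justifications D:K -- keep at most D
# justifications per document and at most K overall, preferring higher
# confidence. The exact selection rules inside CS-Score.pl may differ;
# this only mirrors the behavior described in this README.

def cap_justifications(responses, per_doc=1, total=3):
    """responses: list of (doc_id, confidence, assessment) tuples."""
    kept, per_doc_count = [], {}
    for doc_id, conf, assessment in sorted(responses, key=lambda r: -r[1]):
        if len(kept) >= total:
            break
        if per_doc_count.get(doc_id, 0) >= per_doc:
            continue
        per_doc_count[doc_id] = per_doc_count.get(doc_id, 0) + 1
        kept.append((doc_id, conf, assessment))
    return kept

# The K=1 vs K=3 example from the paragraph above (hypothetical doc IDs):
# responses = [("DOC_1", 0.9, "WRONG"), ("DOC_2", 0.7, "RIGHT")]
# cap_justifications(responses, 1, 3) keeps both (one WRONG, one RIGHT),
# while cap_justifications(responses, 1, 1) keeps only the WRONG one.
```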