
Commit

update the compute_overall.py script and the example data for deepseek model

gmftbyGMFTBY committed Nov 11, 2024
1 parent a32badb commit 5eca479
Showing 5 changed files with 47 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .gitignore
@@ -162,5 +162,5 @@ cython_debug/
data/*
data/
example_data/*
-!example_data/prediction_v1.3.tgz
+!example_data/deepseek-67b-chat-hf-prediction.tgz
inference/outputs
9 changes: 9 additions & 0 deletions README.md
@@ -190,6 +190,15 @@ The evaluation results of GPT-4 under `save_dir` is `jsonl`, and each line conta
* 10 denotes the best performance
* 8 denotes performance comparable to our human-annotated high-quality critiques; scores higher than 8 indicate that the evaluated critiques perform better.


To compute the overall score, please run this command:
```bash
# the results of GPT-4 on the test set
python compute_overall.py --mode obj --feedback_overall 63.54 --correction_overall 69.67 --comp_feedback_overall 57.33 --meta_feedback_overall 62.90
```
* replacing `--mode obj` with `--mode sub` computes the overall score for the subjective evaluation, as in the example below
* ensure that at least one of the four dimensions' overall scores is provided
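
For instance, a subjective-evaluation run in which only the feedback and correction dimensions are available might look like this (the two scores below are placeholder values, not reported results):

```bash
# subjective evaluation; only two of the four dimensions are provided
python compute_overall.py --mode sub --feedback_overall 7.85 --correction_overall 8.12
```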

## Benchmark Results

The subjective evaluation results of some representative LLMs are shown below:
37 changes: 37 additions & 0 deletions critic_bench/compute_overall.py
@@ -0,0 +1,37 @@
import argparse
import numpy as np
import math


'''This script computes the overall score for the objective and subjective evaluations'''

parser = argparse.ArgumentParser()
parser.add_argument("--mode", help="must be obj or sub, representing the objective and subjective evaluation", default='obj')
parser.add_argument("--feedback_overall", help="the overall score of feedback dimension", default=math.inf, type=float)
parser.add_argument("--comp_feedback_overall", help="the overall score of comparison dimension", default=math.inf, type=float)
parser.add_argument("--correction_overall", help="the overall score of correction dimension", default=math.inf, type=float)
parser.add_argument("--meta_feedback_overall", help="the overall score of meta-feedback dimension", default=math.inf, type=float)
args = parser.parse_args()


def normalize(score):
    # rescale a score from [-100, 100] to [0, 100] for the objective evaluation
    return (score + 100) / 2


if __name__ == "__main__":
    scores = []
    for index, score in enumerate([
        args.feedback_overall,
        args.comp_feedback_overall,
        args.correction_overall,
        args.meta_feedback_overall
    ]):
        # math.inf is the default value, marking a dimension that was not provided
        if score != math.inf:
            # in obj mode, the feedback (index 0) and meta-feedback (index 3)
            # scores are rescaled from [-100, 100] to [0, 100] before averaging
            if args.mode == 'obj' and index in [0, 3]:
                score = normalize(score)
            scores.append(score)
    print(scores)

    assert len(scores) > 0, 'No valid scores'
    print('Overall Scores:', round(np.mean(scores), 4))
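
For reference, the obj-mode GPT-4 command shown in the README above is averaged as follows: the feedback and meta-feedback scores are rescaled with `(score + 100) / 2`, so the overall score works out to roughly `((63.54 + 100) / 2 + 57.33 + 69.67 + (62.90 + 100) / 2) / 4 ≈ 72.56` (a back-of-the-envelope check, not a figure reported in the repository).
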
Binary file added example_data/deepseek-67b-chat-hf-prediction.tgz
Binary file not shown.
Binary file removed example_data/prediction_v1.3.tgz
Binary file not shown.
