[v2] Refactor evaluators and Abstasks #1707

Samoed · 2025-01-04T18:03:56Z

I've refactored the task evaluators, except for Retrieval, which still needs a significant overhaul.

Additionally, I have some suggestions for classification:

Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?
Should we hardcode split names to avoid reproduction issues, as they are also passed in the constructor?
Should n_experiments and k become fixed parts of the task definitions, to avoid reproduction issues, since they are currently passed in as arguments?

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

KennethEnevoldsen

Great additions!

Should we hardcode split names to avoid reproduction issues, as they are also passed in the constructor?

Yes

Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?

Hmm, not sure what you are referring to here.

KennethEnevoldsen · 2025-01-08T15:27:46Z

mteb/abstasks/AbsTaskClusteringFast.py

@@ -276,6 +261,7 @@ def clustering_downsample(
    dataset: DatasetDict, seed: int, max_samples_in_cluster: int = 2048
 ) -> DatasetDict:
    """In cases where it is not possible to convert the dataset to a fast version, we can downsample the dataset to speed up the evaluation.
+    Only used in ArXivHierarchicalClusteringP2P


we could probably just reupload it and remove this part then

Moved this function to ArXivHierarchicalClusteringP2P.v2, because ArXivHierarchicalClusteringP2P uses the same dataset.

mteb/abstasks/AbsTaskMultilabelClassification.py

mteb/abstasks/AbsTaskReranking.py

mteb/abstasks/MultilingualTask.py

KennethEnevoldsen · 2025-01-08T15:32:40Z

mteb/evaluation/evaluators/BitextMiningEvaluator.py

@@ -31,6 +32,7 @@ def __init__(
        self.pairs = pair_columns
        self.n = len(sentences)
        self.sentences = sentences
+        # TODO used only by BUCC


you are probably already thinking this, but let us just re-upload it

KennethEnevoldsen · 2025-01-08T15:33:14Z

mteb/evaluation/evaluators/ClassificationEvaluator.py


        self.k = k

-    def __call__(self, model, test_cache=None):
+    def __call__(
+        self, model: Encoder, *, encode_kwargs: dict[str, Any] = {}, test_cache=None


is test_cache a Path | None?

No, these are the embeddings of the test split.

mteb/mteb/evaluation/evaluators/ClassificationEvaluator.py

Lines 145 to 153 in 8d033f3

    if test_cache is None:
        X_test = model.encode(
            self.sentences_test,
            task_name=self.task_name,
            **self.encode_kwargs,
        )
        test_cache = X_test
    else:
        X_test = test_cache

In each experiment we sample N examples per label for training and use the full split for testing, so the test embeddings only need to be computed once and can be reused.
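To make the role of test_cache concrete, here is a minimal sketch of the pattern, not the actual evaluator code. `CountingEncoder`, `sample_per_label` and `run_experiments` are hypothetical names made up for illustration; only the "encode the test split once, resample the training set every experiment" idea comes from the snippet above.

```python
import random
from collections import defaultdict


class CountingEncoder:
    """Hypothetical stand-in for a model; counts how often encode() is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences, **kwargs):
        self.calls += 1
        # Toy "embedding": sentence length and vowel count.
        return [[float(len(s)), float(sum(c in "aeiou" for c in s))] for s in sentences]


def sample_per_label(sentences, labels, n_per_label, rng):
    """Pick up to n_per_label training examples for each label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for label in sorted(by_label):
        idxs = by_label[label]
        chosen.extend(rng.sample(idxs, min(n_per_label, len(idxs))))
    return [sentences[i] for i in chosen], [labels[i] for i in chosen]


def run_experiments(model, train, test_sentences, n_experiments, n_per_label, seed=42):
    """Resample the training set per experiment, but encode the test split only once."""
    rng = random.Random(seed)
    test_cache = None
    train_sizes = []
    for _ in range(n_experiments):
        sub_sentences, sub_labels = sample_per_label(*train, n_per_label, rng)
        model.encode(sub_sentences)  # training embeddings differ in every experiment
        if test_cache is None:
            test_cache = model.encode(test_sentences)  # computed in the first experiment only
        train_sizes.append(len(sub_labels))
    return test_cache, train_sizes
```

With three experiments the encoder is called four times (three training subsets plus one test split) instead of six, which is the saving the cache is there for.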

mteb/evaluation/evaluators/SummarizationEvaluator.py

Samoed · 2025-01-08T15:44:43Z

Should we limit it to a single scoring method to avoid reproduction issues, since scoring methods are currently passed in the constructor?

Classification currently supports KNN, PytorchKNN and LogReg (the default), and I think we can keep only LogReg.

mteb/abstasks/MultilingualTask.py

mteb/evaluation/evaluators/ClassificationEvaluator.py

KennethEnevoldsen · 2025-01-08T21:40:47Z

I think we can leave only LogReg

Agreed. Set it as a class attribute so tasks can redefine it if needed (similar to the evaluator for SummEval).

Then people can just redefine the class if needed.
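A rough sketch of what the class-attribute approach could look like. All names here (`ClassificationTaskSketch`, `classifier_cls`, `eval_splits`, the stand-in classifier classes) and the values are assumptions for illustration, not the names or defaults used in mteb.

```python
class LogRegStandIn:
    """Placeholder for the default logistic-regression classifier."""

    name = "logreg"


class KNNStandIn:
    """Placeholder for an alternative kNN classifier."""

    name = "knn"


class ClassificationTaskSketch:
    # Fixed in the task definition instead of being passed to the constructor,
    # so two runs of the same task cannot silently differ.
    classifier_cls = LogRegStandIn
    eval_splits = ("test",)  # illustrative value
    n_experiments = 10  # illustrative value

    def describe(self):
        return {
            "classifier": self.classifier_cls.name,
            "splits": list(self.eval_splits),
            "n_experiments": self.n_experiments,
        }


class KNNTaskSketch(ClassificationTaskSketch):
    # A task that really needs another classifier redefines the attribute;
    # everything else is inherited unchanged.
    classifier_cls = KNNStandIn
```

The override then lives in the task's source rather than in a call site, which is what makes the setting reproducible.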

Samoed · 2025-01-09T20:23:32Z

I’ve made AbsTaskClassification the parent class for AbsTaskMultilabelClassification because they share the evaluate and _calculate_metrics_from_split functions. I considered making AbsTaskClustering the parent class for AbsTaskClusteringFast, but since they only share _calculate_metrics_from_split, I don’t think it’s as important.
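As a simplified illustration of that parent/child split (not the real signatures): the parent owns evaluate and _calculate_metrics_from_split, and the multilabel subclass overrides only the part that differs. The `_score_example` hook and the Jaccard scoring are hypothetical, chosen just to keep the sketch small.

```python
class ClassificationSketch:
    def evaluate(self, data):
        # Shared driver: loop over splits and delegate the scoring.
        return {split: self._calculate_metrics_from_split(rows) for split, rows in data.items()}

    def _calculate_metrics_from_split(self, rows):
        # Shared aggregation: average the per-example score over the split.
        scores = [self._score_example(pred, gold) for pred, gold in rows]
        return {"main_score": sum(scores) / len(scores)}

    def _score_example(self, pred, gold):
        # Hypothetical hook: exact match for single-label classification.
        return float(pred == gold)


class MultilabelClassificationSketch(ClassificationSketch):
    def _score_example(self, pred, gold):
        # Hypothetical override: Jaccard overlap between predicted and gold label sets.
        pred, gold = set(pred), set(gold)
        union = pred | gold
        return len(pred & gold) / len(union) if union else 1.0
```

When only one method is shared, as with the two clustering classes, the same inheritance buys much less, which matches the reasoning above.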

Samoed added 2 commits January 4, 2025 20:51

refactor evaluators and tasks

277e0bc

remove slow/fast loading

ce4038b

Samoed requested a review from KennethEnevoldsen January 7, 2025 09:50

Samoed added 3 commits January 7, 2025 14:50

fix imports

2a2f5a0

fix summ evaluator

d3e5d44

fix evaluator

0d3ea3f

KennethEnevoldsen reviewed Jan 8, 2025

orionw reviewed Jan 8, 2025

mteb/abstasks/MultilingualTask.py

orionw reviewed Jan 8, 2025

mteb/evaluation/evaluators/ClassificationEvaluator.py

Samoed added 9 commits January 9, 2025 21:27

Merge branch 'refs/heads/v2.0.0' into refactor_evaluators

46cfd43

# Conflicts: # mteb/abstasks/AbsTaskClusteringFast.py # mteb/abstasks/AbsTaskSummarization.py

make classification parent class for AbsTaskMultilabelClassification

115dd8d

fix descriptive stat

99f4287

fix import

f898b90

add typehint

aa5aa0c

remove clustering_downsample

59ea8be

fix tests

d3b2076

remove prints

cec31df

remove all inits

eacce2e

Samoed added 2 commits January 9, 2025 23:32

fix all abstasks

5dc039d

fix tests

cf4a5b3

Samoed requested a review from KennethEnevoldsen January 10, 2025 09:50
