Embeddings underperform compared to llama.cpp #353

Open · codinguncut opened this issue Nov 30, 2024 · 0 comments
codinguncut commented Nov 30, 2024

I benchmarked document embeddings from ollama==0.4.2 against llama_cpp_python==0.2.69: pretrained LLMs produce the document embeddings, and a scikit-learn LogisticRegression classifies the documents.

The Llama model results are in the same general ballpark, but Qwen2.5 1.5b performs much worse in ollama-python than in llama-cpp-python.
The classification code is exactly the same for both libraries, and I assume the models pulled are similar too. I don't know what causes the difference: pooling, quantization, or random sampling error.
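
One way to narrow this down would be to embed the same text with both backends and compare the vectors directly. A minimal sketch (the ollama tag and the GGUF path are assumptions and must point at the same weights and quantization for the comparison to mean anything):

import numpy as np
import ollama
from llama_cpp import Llama, LLAMA_POOLING_TYPE_MEAN

text = "It was the best of times, it was the worst of times."

# ollama's embedding for the text (tag assumed to match the GGUF below)
v1 = np.array(ollama.embed(model="llama3.2:1b", input=[text])["embeddings"][0])

# llama-cpp-python's embedding with mean pooling
llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_0.gguf", embedding=True,
            pooling_type=LLAMA_POOLING_TYPE_MEAN, verbose=False)
v2 = np.array(llm.embed([text])[0])

# a cosine similarity near 1.0 would rule out pooling/preprocessing differences;
# anything much lower points at a real discrepancy between the backends
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity: {cos:.4f}")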

On a separate note, llama-cpp-python is also about 4x faster than ollama-python.

These are my results:

ollama==0.4.2, ollama.embed(model=model_name, input=[...])

llama3.2:1b: 63.0% (60s)
qwen2.5:1.5b: 50.0% (59s)
llama3.2:3b: 63.5% (100s)
qwen2.5:3b: 60.0% (95s)

llama_cpp_python==0.2.69

Llama-3.2-1B-Instruct-Q4_0.gguf: 66% (12s) (LLAMA_POOLING_TYPE_MEAN)
qwen2-1_5b-instruct-q4_0.gguf: 71% (16s) (LLAMA_POOLING_TYPE_MEAN)
Llama-3.2-3B-Instruct-Q4_0.gguf: 70% (26s) (LLAMA_POOLING_TYPE_MEAN)
qwen2.5-3b-instruct-q8_0.gguf: 66% (26s) (LLAMA_POOLING_TYPE_MEAN)
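
The llama-cpp-python side was set up roughly like this (a sketch; the model path is one of the files above, the remaining constructor arguments were left at their defaults):

from llama_cpp import Llama, LLAMA_POOLING_TYPE_MEAN

llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_0.gguf",
            embedding=True,
            pooling_type=LLAMA_POOLING_TYPE_MEAN,  # the pooling noted above
            verbose=False)

docs = ["first document text", "second document text"]  # stand-ins for the dataset texts
embeddings = llm.embed(docs)  # one mean-pooled vector per input string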

To check for a potential model mismatch, I also pulled the exact GGUF files used with llama-cpp-python and ran them in ollama-python (pull snippet after these results):

hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_0: 64.5% (61s)
hf.co/Qwen/Qwen2-1.5B-Instruct-GGUF:q4_0: 69.0% (60s)
hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_0: 68.0% (101s)
hf.co/Qwen/Qwen2.5-3B-Instruct-GGUF:q8_0: 54.0% (89s)
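
For reference, pulling these into ollama just uses the hf.co tag directly (sketch):

import ollama

tag = "hf.co/Qwen/Qwen2.5-3B-Instruct-GGUF:q8_0"
ollama.pull(tag)  # fetches the same GGUF file that llama-cpp-python loads from disk
resp = ollama.embed(model=tag, input=["test sentence"])
print(len(resp["embeddings"][0]))  # embedding dimensionality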

In the above iteration Qwen2-1.5B is doing much better, but Qwen2.5-3B is still performing much worse.

Full source code below:

from datasets import load_dataset
from setfit import sample_dataset

indices = [2, 9]  # the two author ids to classify (setfit accuracy on this pair: 55.9%)

def map_label(elt):
    # remap the original author ids to 0/1 class labels
    elt["label"] = indices.index(elt["label"])
    return elt

dataset = load_dataset("contemmcm/victorian_authorship")

# select two authors for binary classification and rename label column
dataset = (dataset
           .rename_column('author', 'label')
           .filter(lambda elt: elt["label"] in indices)
           .map(map_label))

# select 100 examples per label
dataset["train"] = sample_dataset(dataset["train"], label_column="label",
                                  num_samples=100)
dataset["test"] = sample_dataset(dataset["test"], label_column="label",
                                  num_samples=100)
dataset.save_to_disk("victorian_2_9")

from datasets import load_from_disk
from sklearn.linear_model import LogisticRegression
import ollama
import time

models = [
    'llama3.2:1b',
    'qwen2.5:1.5b',
    'smollm2:1.7b',
    'gemma2:2b',
    'phi:2.7b',  # phi-2
    'llama3.2:3b',
    'qwen2.5:3b',
]

for model_name in models:
    print(f"pulling {model_name}...", end=" ")
    ollama.pull(model_name)
    print(f"done")

    dataset = load_from_disk("victorian_2_9")

    start_time = time.time()
    prompt = "Who is the author of the following text:\n\nText: "
    print(f"embedding...", end=" ")
    X_train = ollama.embed(model=model_name, input=[prompt + ex["text"] for ex in dataset["train"]])["embeddings"]
    X_test = ollama.embed(model=model_name, input=[prompt + ex["text"] for ex in dataset["test"]])["embeddings"]
    y_train = [ex["label"] for ex in dataset["train"]]
    y_test = [ex["label"] for ex in dataset["test"]]
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"done")
    
    # Initialize and train the classifier
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    
    # Evaluate the classifier
    accuracy = classifier.score(X_test, y_test)
    print(f"{model_name}: {accuracy*100:.1f}% ({elapsed_time:.0f}s)")
jmorganca self-assigned this Nov 30, 2024