Embeddings underperform compared to llama.cpp #353

Open · codinguncut opened this issue Nov 30, 2024 · 0 comments
codinguncut commented Nov 30, 2024

I benchmarked document embeddings from ollama==0.4.2 against llama_cpp_python==0.2.69: pretrained LLMs produce the document embeddings, and a scikit-learn LogisticRegression classifies the documents.

The Llama model results are in the same general ballpark, but Qwen2.5 1.5b performs much worse in ollama-python than in llama-cpp-python.
The classification code is exactly the same for both libraries, and I assume the models pulled are similar too. I don't know what causes the difference: pooling, quantization, or random sampling error.
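
One way to narrow this down would be to embed the same text with both backends and compare the vectors directly. A minimal sketch (the ollama tag and the GGUF path are assumptions and must point at the same weights and quantization for the comparison to mean anything):

import numpy as np
import ollama
from llama_cpp import Llama, LLAMA_POOLING_TYPE_MEAN

text = "It was the best of times, it was the worst of times."

# ollama's embedding for the text (tag assumed to match the GGUF below)
v1 = np.array(ollama.embed(model="llama3.2:1b", input=[text])["embeddings"][0])

# llama-cpp-python's embedding with mean pooling
llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_0.gguf", embedding=True,
            pooling_type=LLAMA_POOLING_TYPE_MEAN, verbose=False)
v2 = np.array(llm.embed([text])[0])

# a cosine similarity near 1.0 would rule out pooling/preprocessing differences;
# anything much lower points at a real discrepancy between the backends
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity: {cos:.4f}")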

On a separate note, llama-cpp-python is also about 4x faster than ollama-python.

These are my results:

ollama==0.4.2, ollama.embed(model=model_name, input=[...])

llama3.2:1b: 63.0% (60s)
qwen2.5:1.5b: 50.0% (59s)
llama3.2:3b: 63.5% (100s)
qwen2.5:3b: 60.0% (95s)

llama_cpp_python==0.2.69

Llama-3.2-1B-Instruct-Q4_0.gguf: 66% (12s) (LLAMA_POOLING_TYPE_MEAN)
qwen2-1_5b-instruct-q4_0.gguf: 71% (16s) (LLAMA_POOLING_TYPE_MEAN)
Llama-3.2-3B-Instruct-Q4_0.gguf: 70% (26s) (LLAMA_POOLING_TYPE_MEAN)
qwen2.5-3b-instruct-q8_0.gguf: 66% (26s) (LLAMA_POOLING_TYPE_MEAN)
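
The llama-cpp-python side was set up roughly like this (a sketch; the model path is one of the files above, the remaining constructor arguments were left at their defaults):

from llama_cpp import Llama, LLAMA_POOLING_TYPE_MEAN

llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_0.gguf",
            embedding=True,
            pooling_type=LLAMA_POOLING_TYPE_MEAN,  # the pooling noted above
            verbose=False)

docs = ["first document text", "second document text"]  # stand-ins for the dataset texts
embeddings = llm.embed(docs)  # one mean-pooled vector per input string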

To check for a potential model mismatch, I also pulled the exact GGUF files used with llama-cpp-python and ran them in ollama-python (pull snippet after these results):

hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_0: 64.5% (61s)
hf.co/Qwen/Qwen2-1.5B-Instruct-GGUF:q4_0: 69.0% (60s)
hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_0: 68.0% (101s)
hf.co/Qwen/Qwen2.5-3B-Instruct-GGUF:q8_0: 54.0% (89s)
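
For reference, pulling these into ollama just uses the hf.co tag directly (sketch):

import ollama

tag = "hf.co/Qwen/Qwen2.5-3B-Instruct-GGUF:q8_0"
ollama.pull(tag)  # fetches the same GGUF file that llama-cpp-python loads from disk
resp = ollama.embed(model=tag, input=["test sentence"])
print(len(resp["embeddings"][0]))  # embedding dimensionality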

In the above iteration Qwen2-1.5B is doing much better, but Qwen2.5-3B is still performing much worse.

Full source code below:

from datasets import load_dataset
from setfit import sample_dataset

indices = [2, 9]  # the two author ids to classify (setfit accuracy on this pair: 55.9%)

def map_label(elt):
    # remap the original author ids to 0/1 class labels
    elt["label"] = indices.index(elt["label"])
    return elt

dataset = load_dataset("contemmcm/victorian_authorship")

# select two authors for binary classification and rename label column
dataset = (dataset
           .rename_column('author', 'label')
           .filter(lambda elt: elt["label"] in indices)
           .map(map_label))

# select 100 examples per label
dataset["train"] = sample_dataset(dataset["train"], label_column="label",
                                  num_samples=100)
dataset["test"] = sample_dataset(dataset["test"], label_column="label",
                                  num_samples=100)
dataset.save_to_disk("victorian_2_9")

from datasets import load_from_disk
from sklearn.linear_model import LogisticRegression
import ollama
import time

models = [
    'llama3.2:1b',
    'qwen2.5:1.5b',
    'smollm2:1.7b',
    'gemma2:2b',
    'phi:2.7b',  # phi-2
    'llama3.2:3b',
    'qwen2.5:3b',
]

for model_name in models:
    print(f"pulling {model_name}...", end=" ")
    ollama.pull(model_name)
    print(f"done")

    dataset = load_from_disk("victorian_2_9")

    start_time = time.time()
    prompt = "Who is the author of the following text:\n\nText: "
    print(f"embedding...", end=" ")
    X_train = ollama.embed(model=model_name, input=[prompt + ex["text"] for ex in dataset["train"]])["embeddings"]
    X_test = ollama.embed(model=model_name, input=[prompt + ex["text"] for ex in dataset["test"]])["embeddings"]
    y_train = [ex["label"] for ex in dataset["train"]]
    y_test = [ex["label"] for ex in dataset["test"]]
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"done")
    
    # Initialize and train the classifier
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    
    # Evaluate the classifier
    accuracy = classifier.score(X_test, y_test)
    print(f"{model_name}: {accuracy*100:.1f}% ({elapsed_time:.0f}s)")
jmorganca self-assigned this Nov 30, 2024