
Commit

release v0.3.9
aisi-inspect committed May 14, 2024
1 parent aec9b20 commit 671a239
Showing 63 changed files with 2,171 additions and 1,461 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# Changelog

## v0.3.9 (14 May 2024)

- Add `ollama` local model provider.
- Add `multi_scorer()` and `majority_vote()` functions for combining multiple scorers into a single score.
- Add support for multiple model graders in `model_graded_qa()`.
- Raise `TypeError` for solvers and scorers not declared as `async`.
- Fall back to standard parsing if `NaN` or `Inf` is encountered while reading the log file header.
- Remove deprecated support for matching partial model names (e.g. "gpt" or "claude").
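
The `async` requirement above applies to custom solvers and scorers. A rough sketch of a conforming scorer, assuming the current scorer API (the `exact_answer` name and matching logic are illustrative, not from this release):

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def exact_answer():
    # the inner score function must be declared async; as of v0.3.9 a plain
    # (synchronous) function here raises TypeError
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion.strip()
        return Score(value=1 if answer == target.text else 0, answer=answer)

    return score
```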

## v0.3.8 (07 May 2024)

- Exclude null config values from listings in log viewer.
2 changes: 1 addition & 1 deletion README.md
@@ -18,4 +18,4 @@ $ cd inspect_ai
$ pip install -e ".[dev]"
```

If you use VS Code, you should be sure to have installed the recommended extensions (Python, Ruff, and MyPy). Note that you'll be promoted to install these when you open the project in VS Code.
If you use VS Code, you should be sure to have installed the recommended extensions (Python, Ruff, and MyPy). Note that you'll be prompted to install these when you open the project in VS Code.
2 changes: 1 addition & 1 deletion docs/eval-suites.qmd
@@ -46,7 +46,7 @@ if __name__ == "__main__":
eval(security_guide, model="google/gemini-1.0-pro")
```

Doing this allows your source file to be both a Python script that is convenient to run during development as well as be a Python module that tasks can be read from without executing the eval. There is no real downside to this, and it's a good way in general to write all of your eval scripts and notebooks (see the docs on [\_\_main\_\_](https://docs.python.org/3/library/main.html) for additional details.)
Doing this allows your source file to be both a Python script that is convenient to run during development as well as be a Python module that tasks can be read from without executing the eval. There is no real downside to this, and it's a good way in general to write all of your eval scripts and notebooks (see the docs on [\_\_main\_\_](https://docs.python.org/3/library/__main__.html) for additional details.)

## Use Cases

2 changes: 1 addition & 1 deletion docs/index.qmd
@@ -75,7 +75,7 @@ $ inspect eval ctf.py --model together/Qwen/Qwen1.5-72B-Chat
```
:::

In addition to the model providers shown above, Inspect also supports models hosted on Azure AI, AWS Bedrock, and Cloudflare. See the documentation on [Models](#sec-models) for additional details.
In addition to the model providers shown above, Inspect also supports models hosted on Azure AI, AWS Bedrock, and Cloudflare, as well as local models with Ollama. See the documentation on [Models](#sec-models) for additional details.

## Hello, Inspect {#sec-hello-inspect}

12 changes: 10 additions & 2 deletions docs/models.qmd
@@ -11,13 +11,18 @@ Inspect has built in support for a variety of language model API providers and c
| Google | `pip install google-generativeai` | `GOOGLE_API_KEY` |
| Mistral | `pip install mistralai` | `MISTRAL_API_KEY` |
| Hugging Face | `pip install transformers` | `HF_TOKEN` |
| Ollama | `pip install openai` | None required |
| TogetherAI | `pip install openai` | `TOGETHER_API_KEY` |
| AWS Bedrock | `pip install boto3` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` |
| Azure AI | None required | `AZURE_API_KEY` and `INSPECT_EVAL_MODEL_BASE_URL` |
| Cloudflare | None required | `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` |

: {tbl-colwidths="\[18,45,37\]"}

::: {.callout-note appearance="minimal"}
Note that some providers ([Ollama](https://github.com/ollama/ollama/blob/main/docs/openai.md) and [TogetherAI](https://docs.together.ai/docs/openai-api-compatibility)) support the OpenAI Python package as a client, which is why you need to `pip install openai` for these providers even though you aren't actually interacting with the OpenAI service when you use them.
:::
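
For example, here is a minimal sketch of evaluating against a local Ollama model. It assumes an Ollama server is running with `llama3` pulled, and reuses the `security_guide` task from the eval suites example above:

```python
from inspect_ai import eval
from security_guide import security_guide  # hypothetical task module

# assumes `pip install openai`, a running Ollama server (`ollama serve`),
# and a locally pulled model (`ollama pull llama3`)
eval(security_guide, model="ollama/llama3")
```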

## Using Models

To select a model for use in an evaluation task, you specify it using a *model name*. Model names include their API provider and the specific model to use (e.g. `openai/gpt-4`). Here are the supported providers along with example model names and links to documentation on all available models:
@@ -29,6 +34,7 @@ To select a model for use in an evaluation task you specify it using a *model na
| Google | `google/gemini-1.0-pro` | [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) |
| Mistral | `mistral/mistral-large-latest` | [Mistral Models](https://docs.mistral.ai/platform/endpoints/) |
| Hugging Face | `hf/openai-community/gpt2` | [Hugging Face Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) |
| Ollama | `ollama/llama3` | [Ollama Models](https://ollama.com/library) |
| TogetherAI | `together/lmsys/vicuna-13b-v1.5` | [TogetherAI Models](https://docs.together.ai/docs/inference-models#chat-models) |
| AWS Bedrock | `bedrock/meta.llama2-70b-chat-v1` | [AWS Bedrock Models](https://aws.amazon.com/bedrock/) |
| Azure AI | `azureai/azure-deployment-name` | [Azure AI Models](https://ai.azure.com/explore/models) |
@@ -71,6 +77,7 @@ Each model also can use a different base URL than the default (e.g. if running t
| Google | `GOOGLE_BASE_URL` |
| Mistral | `MISTRAL_BASE_URL` |
| TogetherAI | `TOGETHER_BASE_URL` |
| Ollama | `OLLAMA_BASE_URL` |
| AWS Bedrock | `BEDROCK_BASE_URL` |
| Azure AI | `AZUREAI_BASE_URL` |
| Cloudflare | `CLOUDFLARE_BASE_URL` |
@@ -310,13 +317,14 @@ The additional `model_args` are forwarded as follows for the various providers:
| Google | `genai.configure` |
| Mistral | `MistralAsyncClient` |
| Hugging Face | `AutoModelForCausalLM.from_pretrained` |
| Ollama | `AsyncOpenAI` |
| TogetherAI | `AsyncOpenAI` |
| AzureAI | Chat HTTP Post Body |
| Cloudflare | Chat HTTP Post Body |

: {tbl-colwidths="\[30,70\]"}

See the OpenAI, Anthropic, Google, Mistral, Hugging Face, TogetherAI, Azure AI, and Cloudflare provider documentation for more information on the additional options available.
See the OpenAI, Anthropic, Google, Mistral, Hugging Face, Ollama, TogetherAI, Azure AI, and Cloudflare provider documentation for more information on the additional options available.
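
As a sketch of how this forwarding can be used (assuming `get_model()` passes extra keyword arguments through as `model_args`), additional options for an Ollama or TogetherAI model become `AsyncOpenAI` constructor arguments:

```python
from inspect_ai.model import get_model

# timeout and max_retries are standard AsyncOpenAI constructor options;
# per the table above they are forwarded to the client (values illustrative)
model = get_model("ollama/llama3", timeout=600, max_retries=3)
```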

## Custom Models

@@ -358,4 +366,4 @@ model = get_model("custom/name-of-model")
eval(math, model = "custom/name-of-model")
```

In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
59 changes: 57 additions & 2 deletions docs/scorers.qmd
@@ -60,7 +60,7 @@ Task(
)
```

### Model Graded
## Model Graded

Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).

@@ -87,7 +87,25 @@ The default model graded QA scorer is tuned to grade answers to open ended quest

The `model_graded_fact()` scorer works identically to `model_graded_qa()`, and simply provides an alternate `template` oriented around judging whether a fact is included in the model output.

If you want to understand how the default templates for `model_graded_qa()` and `model_graded_fact()` work, see their [source code](https://github.com/AI-Safety-Institute/inspect_ai/blob/main/src/inspect_ai/scorer/_model.py).
If you want to understand how the default templates for `model_graded_qa()` and `model_graded_fact()` work, see their [source code](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/scorer/_model.py).

### Multiple Models

The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:

```python
model_graded_qa(
    models = [
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)
```

The implementation of multiple grader models takes advantage of the `multi_scorer()` and `majority_vote()` functions, both of which can be used in your own scorers (as described in the [Multi Scorer](#sec-multi-scorer) section below).



## Custom Scorers

@@ -239,6 +257,43 @@ Note that the call to `model_grader.generate()` is done with `await`—this is c

Note also we use the `input_text` property of the `TaskState` to access a string version of the original user input to substitute it into the grading template. Using the `input_text` has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in `messages`); and (2) It normalises the input to a string (as it could have been a message list).

## Multi Scorer

It's possible to run multiple scorers in parallel and then combine their outputs into a final overall score. This is done using the `multi_scorer()` function. For example, this is roughly how the built-in model graders use multiple models for grading:

```python
multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = majority_vote
)
```

Using `multi_scorer()` requires both a list of scorers and a _reducer_, a function that takes a list of scores and turns it into a single score. In this case we use the built-in `majority_vote()` reducer, which returns the score that appeared most frequently in the answers.

You can imagine a variety of different strategies for reducing scores (take the average, take the high or low, majority vote, etc.). For example, here's a reducer that computes the average score:

```python
import numpy as np

from inspect_ai.scorer import Score

def average_score(scores: list[Score]) -> Score:
    values = [score.as_float() for score in scores]
    avg = np.mean(values).item()
    return Score(
        value=avg,
        explanation=f"average of {', '.join(str(v) for v in values)}"
    )
```

This reducer might then be used as follows:

```python
multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = average_score
)
```
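
Other reducers follow the same shape. For instance, here is a sketch of a reducer that keeps the highest score, relying on the same `Score.as_float()` method used above:

```python
from inspect_ai.scorer import Score

def max_score(scores: list[Score]) -> Score:
    # return the score whose numeric value is largest
    return max(scores, key=lambda score: score.as_float())
```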


## Metrics

Each scorer provides one or more built-in metrics (typically `accuracy` and `bootstrap_std`). In addition, you can specify other metrics (either built-in or custom) to compute when defining a `Task`:
4 changes: 2 additions & 2 deletions docs/tools.qmd
@@ -9,7 +9,7 @@ Inspect natively supports registering Python functions as tools and providing th
::: {.callout-note}
### Tools and Agents

One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. We'll cover how to use agent scaffolds in [Agent Solvers](#agents) below.
One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. We'll cover how to use agent scaffolds in [Agent Solvers](#agent-solvers) below.
:::

## Tool Basics
@@ -149,7 +149,7 @@ plan = [

## Web Search

Inspect has a built in `web_search()` tool that provides models with the ability to enhance their context window by performing a search. By default web searches retrieves 10 results from a provider, uses a model to determine if the contents is relevant then returns the top 3 relevant search results to the main model. Here is the definition of the `web_search()` function:
Inspect has a built-in `web_search()` tool that provides models with the ability to enhance their context window by performing a search. By default, web searches retrieve 10 results from a provider, use a model to determine whether the content is relevant, and then return the top 3 relevant search results to the main model. Here is the definition of the `web_search()` function:

``` python
def web_search(
4 changes: 2 additions & 2 deletions docs/workflow.qmd
@@ -234,7 +234,7 @@ if __name__ == "__main__"
```

::: {.callout-note appearance="minimal"}
If you aren't familiar with the `__name__ == "__main__"` idiom, see the docs on [\_\_main\_\_](https://docs.python.org/3/library/main.html) for additional details.
If you aren't familiar with the `__name__ == "__main__"` idiom, see the docs on [\_\_main\_\_](https://docs.python.org/3/library/__main__.html) for additional details.
:::

Now we can take the same script and use it with `inspect eval` (while leaving our exploratory code intact and protected by the `__main__` check):
@@ -255,7 +255,7 @@ We refer to notebooks above but show scripts in all of the examples. Everything

1. You can use the `__name__ == "__main__"` check to protect cells that should only be run in exploratory mode.

2. You can pass a notebook to `insect eval` just the same as a script (including passing task parameters)
2. You can pass a notebook to `inspect eval` just the same as a script (including passing task parameters)

For example, imagine that all of the code shown above for `security.py` was in `security.ipynb`. You could run the eval and optionally pass a task parameter as follows:

2 changes: 1 addition & 1 deletion src/inspect_ai/__init__.py
@@ -6,7 +6,7 @@
from inspect_ai._eval.list import list_tasks
from inspect_ai._eval.registry import task
from inspect_ai._eval.score import score, score_async
from inspect_ai._eval.task import Task, TaskInfo, Tasks
from inspect_ai._eval.types import Task, TaskInfo, Tasks
from inspect_ai._util.constants import PKG_NAME

__version__ = importlib_version(PKG_NAME)
2 changes: 1 addition & 1 deletion src/inspect_ai/_cli/list.py
@@ -11,7 +11,7 @@
from inspect_ai._cli.common import CommonOptions, common_options, resolve_common_options
from inspect_ai._cli.util import parse_cli_args
from inspect_ai._eval.list import list_tasks
from inspect_ai._eval.task import TaskInfo
from inspect_ai._eval.types import TaskInfo
from inspect_ai.log import list_eval_logs


3 changes: 2 additions & 1 deletion src/inspect_ai/_cli/score.py
@@ -6,6 +6,7 @@
from inspect_ai._display import display
from inspect_ai._display.logger import init_logger
from inspect_ai._eval.loader import load_tasks
from inspect_ai._eval.score import task_score
from inspect_ai._util.constants import SCORED_SUFFIX
from inspect_ai._util.dotenv import init_dotenv
from inspect_ai.log._file import JSONRecorder
@@ -76,7 +77,7 @@ async def score(
score_task = load_tasks([task], model)[0]

# re-score the task
eval_log = await score_task.score(eval_log)
eval_log = await task_score(score_task, eval_log)

# re-write the log (w/ a -score suffix if requested)
scored = f"{SCORED_SUFFIX}.json"
72 changes: 38 additions & 34 deletions src/inspect_ai/_eval/eval.py
@@ -26,8 +26,10 @@
from inspect_ai.util._context import init_async_context

from .loader import resolve_tasks
from .log import EvalLogger
from .task import Tasks, TaskSpec, task_file, task_run_dir
from .task.log import TaskLogger
from .task.run import task_run
from .task.util import task_file, task_run_dir
from .types import Tasks, TaskSpec

log = logging.getLogger(__name__)

@@ -130,34 +132,35 @@ async def eval_async(
) -> list[EvalLog]:
r"""Evaluate tasks using a Model (async).
tasks: (Tasks): Task(s) to evaluate. If None, attempt
to evaluate a task in the current working directory
model (str | Model | None): Model for evaluation. If not
specified uses the current eval's model, or failing that
the value of the INSPECT_EVAL_MODEL environment variable.
model_base_url: (str | None): Base URL for communicating
with the model API.
model_args (dict[str,Any]): Model creation parameters
task_args (dict[str,Any]): Task arguments
plan (Solver | list[Solver] | None): Alternative plan
for evaluating task(s). Optional (uses task plan by default).
log_level (str | None): "debug", "http", "info", "warning", "error",
or "critical" (defaults to "info")
log_dir (str | None): Output path for logging results
(defaults to file log in ./logs directory).
limit (int | tuple[int, int] | None): Limit evaluated samples
(defaults to all samples).
epochs (int | None): Number of times to repeat evaluation of
samples (defaults to 1)
max_messages (int | None): Maximum number of messages to allow
in a task conversation.
max_subprocesses (int | None): Maximum number of subprocesses to
run in parallel (default is os.cpu_count())
log_samples: (bool | None): Log detailed samples and scores (defaults to True)
log_images: (bool | None): Log base64 encoded version of images,
even if specified as a filename or URL (defaults to True)
score (bool): Score output (defaults to True)
**kwargs (GenerateConfigArgs): Model generation options.
Args:
tasks: (Tasks): Task(s) to evaluate. If None, attempt
to evaluate a task in the current working directory
model (str | Model | None): Model for evaluation. If not
specified uses the current eval's model, or failing that
the value of the INSPECT_EVAL_MODEL environment variable.
model_base_url: (str | None): Base URL for communicating
with the model API.
model_args (dict[str,Any]): Model creation parameters
task_args (dict[str,Any]): Task arguments
plan (Solver | list[Solver] | None): Alternative plan
for evaluating task(s). Optional (uses task plan by default).
log_level (str | None): "debug", "http", "info", "warning", "error",
or "critical" (defaults to "info")
log_dir (str | None): Output path for logging results
(defaults to file log in ./logs directory).
limit (int | tuple[int, int] | None): Limit evaluated samples
(defaults to all samples).
epochs (int | None): Number of times to repeat evaluation of
samples (defaults to 1)
max_messages (int | None): Maximum number of messages to allow
in a task conversation.
max_subprocesses (int | None): Maximum number of subprocesses to
run in parallel (default is os.cpu_count())
log_samples: (bool | None): Log detailed samples and scores (defaults to True)
log_images: (bool | None): Log base64 encoded version of images,
even if specified as a filename or URL (defaults to True)
score (bool): Score output (defaults to True)
**kwargs (GenerateConfigArgs): Model generation options.
Returns:
List of EvalLog (one for each task)
@@ -214,7 +217,7 @@
)

run_id = uuid()
loggers: list[EvalLogger] = []
loggers: list[TaskLogger] = []
results: list[EvalLog] = []
for index, name, version, task in zip(
range(0, len(task_names)), task_names, task_versions, eval_tasks
@@ -227,10 +230,10 @@
task_eval_config.max_messages = task.max_messages

# create and track the logger
logger = EvalLogger(
logger = TaskLogger(
task_name=name,
task_version=version,
task_file=task_file(task, True),
task_file=task_file(task, relative=True),
task_run_dir=task_run_dir(task),
task_id=task_id if task_id else uuid(),
run_id=run_id,
@@ -245,7 +248,8 @@
loggers.append(logger)

# run the eval
result = await task.run(
result = await task_run(
task=task,
sequence=(index + 1, len(task_names)),
model=model,
logger=logger,