
Commit

release v0.3.9
aisi-inspect committed May 14, 2024
1 parent aec9b20 commit 671a239
Showing 63 changed files with 2,171 additions and 1,461 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# Changelog

## v0.3.9 (14 May 2024)

- Add `ollama` local model provider.
- Add `multi_scorer()` and `majority_vote()` functions for combining multiple scorers into a single score.
- Add support for multiple model graders in `model_graded_qa()`.
- Raise `TypeError` for solvers and scorers not declared as `async`.
- Fall back to standard parsing if `NaN` or `Inf` is encountered while reading the log file header.
- Remove deprecated support for matching partial model names (e.g. "gpt" or "claude").
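
The `async` requirement above applies to custom solvers and scorers. A rough sketch of a conforming scorer, assuming the current scorer API (the `exact_answer` name and matching logic are illustrative, not from this release):

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def exact_answer():
    # the inner score function must be declared async; as of v0.3.9 a plain
    # (synchronous) function here raises TypeError
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion.strip()
        return Score(value=1 if answer == target.text else 0, answer=answer)

    return score
```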

## v0.3.8 (07 May 2024)

- Exclude null config values from listings in log viewer.
2 changes: 1 addition & 1 deletion README.md
@@ -18,4 +18,4 @@ $ cd inspect_ai
$ pip install -e ".[dev]"
```

If you use VS Code, you should be sure to have installed the recommended extensions (Python, Ruff, and MyPy). Note that you'll be promoted to install these when you open the project in VS Code.
If you use VS Code, you should be sure to have installed the recommended extensions (Python, Ruff, and MyPy). Note that you'll be prompted to install these when you open the project in VS Code.
2 changes: 1 addition & 1 deletion docs/eval-suites.qmd
@@ -46,7 +46,7 @@ if __name__ == "__main__":
eval(security_guide, model="google/gemini-1.0-pro")
```

Doing this allows your source file to be both a Python script that is convenient to run during development as well as be a Python module that tasks can be read from without executing the eval. There is no real downside to this, and it's a good way in general to write all of your eval scripts and notebooks (see the docs on [\_\_main\_\_](https://docs.python.org/3/library/main.html) for additional details.)
Doing this allows your source file to be both a Python script that is convenient to run during development as well as be a Python module that tasks can be read from without executing the eval. There is no real downside to this, and it's a good way in general to write all of your eval scripts and notebooks (see the docs on [\_\_main\_\_](https://docs.python.org/3/library/__main__.html) for additional details.)

## Use Cases

2 changes: 1 addition & 1 deletion docs/index.qmd
@@ -75,7 +75,7 @@ $ inspect eval ctf.py --model together/Qwen/Qwen1.5-72B-Chat
```
:::

In addition to the model providers shown above, Inspect also supports models hosted on Azure AI, AWS Bedrock, and Cloudflare. See the documentation on [Models](#sec-models) for additional details.
In addition to the model providers shown above, Inspect also supports models hosted on Azure AI, AWS Bedrock, and Cloudflare, as well as local models with Ollama. See the documentation on [Models](#sec-models) for additional details.

## Hello, Inspect {#sec-hello-inspect}

12 changes: 10 additions & 2 deletions docs/models.qmd
@@ -11,13 +11,18 @@ Inspect has built in support for a variety of language model API providers and c
| Google | `pip install google-generativeai` | `GOOGLE_API_KEY` |
| Mistral | `pip install mistralai` | `MISTRAL_API_KEY` |
| Hugging Face | `pip install transformers` | `HF_TOKEN` |
| Ollama | `pip install openai` | None required |
| TogetherAI | `pip install openai` | `TOGETHER_API_KEY` |
| AWS Bedrock | `pip install boto3` | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` |
| Azure AI | None required | `AZURE_API_KEY` and `INSPECT_EVAL_MODEL_BASE_URL` |
| Cloudflare | None required | `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` |

: {tbl-colwidths="\[18,45,37\]"}

::: {.callout-note appearance="minimal"}
Note that some providers ([Ollama](https://github.com/ollama/ollama/blob/main/docs/openai.md) and [TogetherAI](https://docs.together.ai/docs/openai-api-compatibility)) support the OpenAI Python package as a client, which is why you need to `pip install openai` for these providers even though you aren't actually interacting with the OpenAI service when you use them.
:::
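
For example, here is a minimal sketch of evaluating against a local Ollama model. It assumes an Ollama server is running with `llama3` pulled, and reuses the `security_guide` task from the eval suites example above:

```python
from inspect_ai import eval
from security_guide import security_guide  # hypothetical task module

# assumes `pip install openai`, a running Ollama server (`ollama serve`),
# and a locally pulled model (`ollama pull llama3`)
eval(security_guide, model="ollama/llama3")
```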

## Using Models

To select a model for use in an evaluation task, you specify it using a *model name*. Model names include their API provider and the specific model to use (e.g. `openai/gpt-4`). Here are the supported providers along with example model names and links to documentation on all available models:
@@ -29,6 +34,7 @@ To select a model for use in an evaluation task you specify it using a *model na
| Google | `google/gemini-1.0-pro` | [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) |
| Mistral | `mistral/mistral-large-latest` | [Mistral Models](https://docs.mistral.ai/platform/endpoints/) |
| Hugging Face | `hf/openai-community/gpt2` | [Hugging Face Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) |
| Ollama | `ollama/llama3` | [Ollama Models](https://ollama.com/library) |
| TogetherAI | `together/lmsys/vicuna-13b-v1.5` | [TogetherAI Models](https://docs.together.ai/docs/inference-models#chat-models) |
| AWS Bedrock | `bedrock/meta.llama2-70b-chat-v1` | [AWS Bedrock Models](https://aws.amazon.com/bedrock/) |
| Azure AI | `azureai/azure-deployment-name` | [Azure AI Models](https://ai.azure.com/explore/models) |
@@ -71,6 +77,7 @@ Each model also can use a different base URL than the default (e.g. if running t
| Google | `GOOGLE_BASE_URL` |
| Mistral | `MISTRAL_BASE_URL` |
| TogetherAI | `TOGETHER_BASE_URL` |
| Ollama | `OLLAMA_BASE_URL` |
| AWS Bedrock | `BEDROCK_BASE_URL` |
| Azure AI | `AZUREAI_BASE_URL` |
| Cloudflare | `CLOUDFLARE_BASE_URL` |
@@ -310,13 +317,14 @@ The additional `model_args` are forwarded as follows for the various providers:
| Google | `genai.configure` |
| Mistral | `MistralAsyncClient` |
| Hugging Face | `AutoModelForCausalLM.from_pretrained` |
| Ollama | `AsyncOpenAI` |
| TogetherAI | `AsyncOpenAI` |
| AzureAI | Chat HTTP Post Body |
| Cloudflare | Chat HTTP Post Body |

: {tbl-colwidths="\[30,70\]"}

See the OpenAI, Anthropic, Google, Mistral, Hugging Face, TogetherAI, Azure AI, and Cloudflare provider documentation for more information on the additional options available.
See the OpenAI, Anthropic, Google, Mistral, Hugging Face, Ollama, TogetherAI, Azure AI, and Cloudflare provider documentation for more information on the additional options available.
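
As a sketch of how this forwarding can be used (assuming `get_model()` passes extra keyword arguments through as `model_args`), additional options for an Ollama or TogetherAI model become `AsyncOpenAI` constructor arguments:

```python
from inspect_ai.model import get_model

# timeout and max_retries are standard AsyncOpenAI constructor options;
# per the table above they are forwarded to the client (values illustrative)
model = get_model("ollama/llama3", timeout=600, max_retries=3)
```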

## Custom Models

@@ -358,4 +366,4 @@ model = get_model("custom/name-of-model")
eval(math, model = "custom/name-of-model")
```

In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
In this example, the `model_name` argument passed to `__init__()` will be "name-of-model".
59 changes: 57 additions & 2 deletions docs/scorers.qmd
@@ -60,7 +60,7 @@ Task(
)
```

### Model Graded
## Model Graded

Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).

@@ -87,7 +87,25 @@ The default model graded QA scorer is tuned to grade answers to open ended quest

The `model_graded_fact()` scorer works identically to `model_graded_qa()`, and simply provides an alternate `template` oriented around judging whether a fact is included in the model output.

If you want to understand how the default templates for `model_graded_qa()` and `model_graded_fact()` work, see their [source code](https://github.com/AI-Safety-Institute/inspect_ai/blob/main/src/inspect_ai/scorer/_model.py).
If you want to understand how the default templates for `model_graded_qa()` and `model_graded_fact()` work, see their [source code](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/scorer/_model.py).

### Multiple Models

The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:

```python
model_graded_qa(
    models = [
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)
```

The implementation of multiple grader models takes advantage of the `multi_scorer()` and `majority_vote()` functions, both of which can be used in your own scorers (as described in the [Multi Scorer](#sec-multi-scorer) section below).



## Custom Scorers

@@ -239,6 +257,43 @@ Note that the call to `model_grader.generate()` is done with `await`—this is c

Note also we use the `input_text` property of the `TaskState` to access a string version of the original user input to substitute it into the grading template. Using the `input_text` has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in `messages`); and (2) It normalises the input to a string (as it could have been a message list).

## Multi Scorer

It's possible to run multiple scorers in parallel and then combine their outputs into a final overall score. This is done using the `multi_scorer()` function. For example, this is roughly how the built-in model graders use multiple models for grading:

```python
multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = majority_vote
)
```

Using `multi_scorer()` requires both a list of scorers and a _reducer_, a function that takes a list of scores and turns it into a single score. In this case we use the built-in `majority_vote()` reducer, which returns the score that appeared most frequently in the answers.

You can imagine a variety of different strategies for reducing scores (take the average, take the high or low, majority vote, etc.). For example, here's a reducer that computes the average score:

```python
import numpy as np

from inspect_ai.scorer import Score

def average_score(scores: list[Score]) -> Score:
    values = [score.as_float() for score in scores]
    avg = np.mean(values).item()
    return Score(
        value=avg,
        explanation=f"average of {', '.join(str(v) for v in values)}"
    )
```

This reducer might then be used as follows:

```python
multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = average_score
)
```
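
Other reducers follow the same shape. For instance, here is a sketch of a reducer that keeps the highest score, relying on the same `Score.as_float()` method used above:

```python
from inspect_ai.scorer import Score

def max_score(scores: list[Score]) -> Score:
    # return the score whose numeric value is largest
    return max(scores, key=lambda score: score.as_float())
```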


## Metrics

Each scorer provides one or more built-in metrics (typically `accuracy` and `bootstrap_std`). In addition, you can specify other metrics (either built-in or custom) to compute when defining a `Task`:
4 changes: 2 additions & 2 deletions docs/tools.qmd
@@ -9,7 +9,7 @@ Inspect natively supports registering Python functions as tools and providing th
::: {.callout-note}
### Tools and Agents

One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. We'll cover how to use agent scaffolds in [Agent Solvers](#agents) below.
One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. We'll cover how to use agent scaffolds in [Agent Solvers](#agent-solvers) below.
:::

## Tool Basics
@@ -149,7 +149,7 @@ plan = [

## Web Search

Inspect has a built in `web_search()` tool that provides models with the ability to enhance their context window by performing a search. By default web searches retrieves 10 results from a provider, uses a model to determine if the contents is relevant then returns the top 3 relevant search results to the main model. Here is the definition of the `web_search()` function:
Inspect has a built-in `web_search()` tool that provides models with the ability to enhance their context window by performing a search. By default, web searches retrieve 10 results from a provider, use a model to determine whether the content is relevant, and then return the top 3 relevant search results to the main model. Here is the definition of the `web_search()` function:

``` python
def web_search(
4 changes: 2 additions & 2 deletions docs/workflow.qmd
@@ -234,7 +234,7 @@ if __name__ == "__main__"
```

::: {.callout-note appearance="minimal"}
If you aren't familiar with the `__name__ == "__main__"` idiom, see the docs on [\_\_main\_\_](https://docs.python.org/3/library/main.html) for additional details.
If you aren't familiar with the `__name__ == "__main__"` idiom, see the docs on [\_\_main\_\_](https://docs.python.org/3/library/__main__.html) for additional details.
:::

Now we can take the same script and use it with `inspect eval` (while leaving our exploratory code intact and protected by the `__main__` check):
@@ -255,7 +255,7 @@ We refer to notebooks above but show scripts in all of the examples. Everything

1. You can use the `__name__ == "__main__"` check to protect cells that should only be run in exploratory mode.

2. You can pass a notebook to `insect eval` just the same as a script (including passing task parameters)
2. You can pass a notebook to `inspect eval` just the same as a script (including passing task parameters)

For example, imagine that all of the code shown above for `security.py` was in `security.ipynb`. You could run the eval and optionally pass a task parameter as follows:

2 changes: 1 addition & 1 deletion src/inspect_ai/__init__.py
@@ -6,7 +6,7 @@
from inspect_ai._eval.list import list_tasks
from inspect_ai._eval.registry import task
from inspect_ai._eval.score import score, score_async
from inspect_ai._eval.task import Task, TaskInfo, Tasks
from inspect_ai._eval.types import Task, TaskInfo, Tasks
from inspect_ai._util.constants import PKG_NAME

__version__ = importlib_version(PKG_NAME)
2 changes: 1 addition & 1 deletion src/inspect_ai/_cli/list.py
@@ -11,7 +11,7 @@
from inspect_ai._cli.common import CommonOptions, common_options, resolve_common_options
from inspect_ai._cli.util import parse_cli_args
from inspect_ai._eval.list import list_tasks
from inspect_ai._eval.task import TaskInfo
from inspect_ai._eval.types import TaskInfo
from inspect_ai.log import list_eval_logs


3 changes: 2 additions & 1 deletion src/inspect_ai/_cli/score.py
@@ -6,6 +6,7 @@
from inspect_ai._display import display
from inspect_ai._display.logger import init_logger
from inspect_ai._eval.loader import load_tasks
from inspect_ai._eval.score import task_score
from inspect_ai._util.constants import SCORED_SUFFIX
from inspect_ai._util.dotenv import init_dotenv
from inspect_ai.log._file import JSONRecorder
@@ -76,7 +77,7 @@ async def score(
score_task = load_tasks([task], model)[0]

# re-score the task
eval_log = await score_task.score(eval_log)
eval_log = await task_score(score_task, eval_log)

# re-write the log (w/ a -score suffix if requested)
scored = f"{SCORED_SUFFIX}.json"
72 changes: 38 additions & 34 deletions src/inspect_ai/_eval/eval.py
@@ -26,8 +26,10 @@
from inspect_ai.util._context import init_async_context

from .loader import resolve_tasks
from .log import EvalLogger
from .task import Tasks, TaskSpec, task_file, task_run_dir
from .task.log import TaskLogger
from .task.run import task_run
from .task.util import task_file, task_run_dir
from .types import Tasks, TaskSpec

log = logging.getLogger(__name__)

@@ -130,34 +132,35 @@ async def eval_async(
) -> list[EvalLog]:
r"""Evaluate tasks using a Model (async).
tasks: (Tasks): Task(s) to evaluate. If None, attempt
to evaluate a task in the current working directory
model (str | Model | None): Model for evaluation. If not
specified uses the current eval's model, or failing that
the value of the INSPECT_EVAL_MODEL environment variable.
model_base_url: (str | None): Base URL for communicating
with the model API.
model_args (dict[str,Any]): Model creation parameters
task_args (dict[str,Any]): Task arguments
plan (Solver | list[Solver] | None): Alternative plan
for evaluating task(s). Optional (uses task plan by default).
log_level (str | None): "debug", "http", "info", "warning", "error",
or "critical" (defaults to "info")
log_dir (str | None): Output path for logging results
(defaults to file log in ./logs directory).
limit (int | tuple[int, int] | None): Limit evaluated samples
(defaults to all samples).
epochs (int | None): Number of times to repeat evaluation of
samples (defaults to 1)
max_messages (int | None): Maximum number of messages to allow
in a task conversation.
max_subprocesses (int | None): Maximum number of subprocesses to
run in parallel (default is os.cpu_count())
log_samples: (bool | None): Log detailed samples and scores (defaults to True)
log_images: (bool | None): Log base64 encoded version of images,
even if specified as a filename or URL (defaults to True)
score (bool): Score output (defaults to True)
**kwargs (GenerateConfigArgs): Model generation options.
Args:
tasks: (Tasks): Task(s) to evaluate. If None, attempt
to evaluate a task in the current working directory
model (str | Model | None): Model for evaluation. If not
specified uses the current eval's model, or failing that
the value of the INSPECT_EVAL_MODEL environment variable.
model_base_url: (str | None): Base URL for communicating
with the model API.
model_args (dict[str,Any]): Model creation parameters
task_args (dict[str,Any]): Task arguments
plan (Solver | list[Solver] | None): Alternative plan
for evaluating task(s). Optional (uses task plan by default).
log_level (str | None): "debug", "http", "info", "warning", "error",
or "critical" (defaults to "info")
log_dir (str | None): Output path for logging results
(defaults to file log in ./logs directory).
limit (int | tuple[int, int] | None): Limit evaluated samples
(defaults to all samples).
epochs (int | None): Number of times to repeat evaluation of
samples (defaults to 1)
max_messages (int | None): Maximum number of messages to allow
in a task conversation.
max_subprocesses (int | None): Maximum number of subprocesses to
run in parallel (default is os.cpu_count())
log_samples: (bool | None): Log detailed samples and scores (defaults to True)
log_images: (bool | None): Log base64 encoded version of images,
even if specified as a filename or URL (defaults to True)
score (bool): Score output (defaults to True)
**kwargs (GenerateConfigArgs): Model generation options.
Returns:
List of EvalLog (one for each task)
@@ -214,7 +217,7 @@
)

run_id = uuid()
loggers: list[EvalLogger] = []
loggers: list[TaskLogger] = []
results: list[EvalLog] = []
for index, name, version, task in zip(
range(0, len(task_names)), task_names, task_versions, eval_tasks
@@ -227,10 +230,10 @@
task_eval_config.max_messages = task.max_messages

# create and track the logger
logger = EvalLogger(
logger = TaskLogger(
task_name=name,
task_version=version,
task_file=task_file(task, True),
task_file=task_file(task, relative=True),
task_run_dir=task_run_dir(task),
task_id=task_id if task_id else uuid(),
run_id=run_id,
@@ -245,7 +248,8 @@
loggers.append(logger)

# run the eval
result = await task.run(
result = await task_run(
task=task,
sequence=(index + 1, len(task_names)),
model=model,
logger=logger,