
[bot] Add Ollama Python SDK integration for chat, generate, and embed instrumentation #389

@braintrust-bot


Summary

The Ollama Python SDK (ollama) is the official Python client for Ollama, the leading platform for running LLMs locally. It provides execution APIs for chat completions (chat()), text generation (generate()), and embeddings (embed()), with an API surface distinct from OpenAI's format. This repository has no instrumentation for any Ollama SDK surface: no integration, no wrapper, no patcher, no auto_instrument() support.

The ollama package has 9.9k GitHub stars, is used by ~34.8k downstream projects, and is actively maintained (latest: v0.6.2, April 29, 2026). It is one of the most widely used interfaces for local LLM inference in the Python ecosystem.

While Ollama also exposes an OpenAI-compatible HTTP endpoint, most Python users interact through the native ollama SDK, which has its own request/response schemas. wrap_openai() cannot be used with the native ollama.Client or with the module-level functions. The AgentScope integration in this repo patches agentscope.model.OllamaChatModel.__call__ (an AgentScope wrapper), not the Ollama SDK itself, so direct ollama.chat() calls produce no Braintrust spans.

What needs to be instrumented

The ollama package (v0.6.2) exposes these execution surfaces via module-level functions, Client, and AsyncClient, none of which are instrumented:

Chat (highest priority)

| SDK Method | Description | Streaming | Return type |
| --- | --- | --- | --- |
| `ollama.chat(model, messages, ...)` | Chat completions with conversation history and tool use | `stream=True` returns an iterator of dicts | dict with `message`, `model`, `eval_count`, `prompt_eval_count` |

Response shape: Returns a dict with message (role + content), model, eval_count (completion tokens), prompt_eval_count (prompt tokens), total_duration, load_duration, prompt_eval_duration, eval_duration. Token usage and latency metrics are directly available.

Tool calling: Supports tools parameter for function calling. Tool calls appear in message.tool_calls.
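The response shape and tool-call fields described above can be sketched against a simulated response dict. The dict below mirrors the documented fields; the metric names on the output side (`prompt_tokens`, `completion_tokens`, `tokens`) are assumptions about the integration's conventions, not part of the ollama SDK.

```python
# A simulated ollama.chat() response, mirroring the fields described above.
response = {
    "model": "llama3.2",
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
        ],
    },
    "prompt_eval_count": 26,          # prompt tokens
    "eval_count": 298,                # completion tokens
    "total_duration": 5_191_566_416,  # nanoseconds
}

def extract_metrics(resp: dict) -> dict:
    """Derive token metrics from an Ollama chat/generate response dict."""
    prompt = resp.get("prompt_eval_count", 0)
    completion = resp.get("eval_count", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "tokens": prompt + completion,
    }

metrics = extract_metrics(response)
tool_calls = response["message"].get("tool_calls", [])
print(metrics)                            # {'prompt_tokens': 26, 'completion_tokens': 298, 'tokens': 324}
print(tool_calls[0]["function"]["name"])  # get_weather
```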

Generate

| SDK Method | Description | Streaming | Return type |
| --- | --- | --- | --- |
| `ollama.generate(model, prompt, ...)` | Text generation from a prompt | `stream=True` returns an iterator of dicts | dict with `response`, `model`, `eval_count`, `prompt_eval_count` |

Embed

| SDK Method | Description | Return type |
| --- | --- | --- |
| `ollama.embed(model, input)` | Generate embeddings from text (single or batch) | dict with `embeddings`, `model` |

All methods have corresponding Client instance methods and AsyncClient async variants with identical signatures.

Implementation notes

Module-level and client-level API: The ollama package exposes both module-level convenience functions (ollama.chat(...)) and class-based clients (Client().chat(...), AsyncClient().chat(...)). Both need instrumentation.

Patching strategy: The module-level functions delegate to a default Client instance. Patching Client.chat, Client.generate, Client.embed and corresponding AsyncClient methods should cover both usage patterns.

Streaming: Both chat and generate support stream=True, returning iterators of partial response dicts. The integration must accumulate chunks and finalize the span when the stream is exhausted.
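A minimal sketch of the streaming case: a generator wraps the `stream=True` iterator, accumulates partial content, and finalizes when the stream is exhausted. The chunk shape mirrors the partial dicts described above; `finalize` is a hypothetical span-closing callback, not an existing API.

```python
def traced_stream(chunks, finalize):
    """Yield chunks unchanged while accumulating them for the span."""
    parts = []
    final = {}
    try:
        for chunk in chunks:
            parts.append(chunk.get("message", {}).get("content", ""))
            final = chunk  # the last chunk carries done=True and the counters
            yield chunk
    finally:
        # Runs when the stream is exhausted (or the consumer abandons it).
        finalize(
            output="".join(parts),
            prompt_tokens=final.get("prompt_eval_count"),
            completion_tokens=final.get("eval_count"),
        )

captured = {}
chunks = [
    {"message": {"content": "Hel"}, "done": False},
    {"message": {"content": "lo"}, "done": True,
     "prompt_eval_count": 5, "eval_count": 2},
]
for _ in traced_stream(iter(chunks), lambda **kw: captured.update(kw)):
    pass
print(captured["output"])  # Hello
```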

Rich timing metrics: Ollama responses include total_duration, load_duration, prompt_eval_duration, and eval_duration in nanoseconds, providing fine-grained latency data beyond what most cloud providers expose.
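Converting those nanosecond fields into span metrics could look like the following; the source field names match the response shape above, while the `_seconds` metric names are an assumption about how the integration would record them.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds

def duration_metrics(resp: dict) -> dict:
    """Convert Ollama's nanosecond duration fields to seconds."""
    out = {}
    for field in ("total_duration", "load_duration",
                  "prompt_eval_duration", "eval_duration"):
        if field in resp:
            out[field.replace("_duration", "_seconds")] = resp[field] / NS_PER_S
    return out

print(duration_metrics({"total_duration": 5_191_566_416,
                        "eval_duration": 4_799_921_000}))
```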

Parameters relevant for span metadata: model, options (contains temperature, top_p, top_k, num_predict, stop, seed), format (structured output), tools, keep_alive.

No API key: Ollama runs locally, so there's no API key to sanitize in VCR cassettes. However, testing requires a running Ollama server with models pulled.

Proposed span shape

chat() / generate()

| Span field | Content |
| --- | --- |
| input | `messages` (chat) or `prompt` (generate), `system`, `tools` |
| output | `message` (chat) or `response` (generate) |
| metadata | `provider: "ollama"`, `model`, `options` (temperature, etc.) |
| metrics | `tokens`, `prompt_tokens`, `completion_tokens`, `time_to_first_token` (streaming), Ollama-specific durations |
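Assembling that span shape from a chat() request/response pair could be sketched as follows; the field names follow the table above and are a proposal, not an existing Braintrust API.

```python
def build_chat_span(request: dict, resp: dict) -> dict:
    """Map a (simulated) chat() call onto the proposed span shape."""
    return {
        "input": {
            "messages": request.get("messages"),
            "tools": request.get("tools"),
        },
        "output": resp.get("message"),
        "metadata": {
            "provider": "ollama",
            "model": resp.get("model"),
            "options": request.get("options", {}),
        },
        "metrics": {
            "prompt_tokens": resp.get("prompt_eval_count"),
            "completion_tokens": resp.get("eval_count"),
        },
    }

span = build_chat_span(
    {"messages": [{"role": "user", "content": "hi"}],
     "options": {"temperature": 0}},
    {"model": "llama3.2",
     "message": {"role": "assistant", "content": "hey"},
     "prompt_eval_count": 3, "eval_count": 2},
)
print(span["metadata"]["model"])  # llama3.2
```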

embed()

| Span field | Content |
| --- | --- |
| input | input text(s) |
| output | embedding dimensions/count |
| metadata | `provider: "ollama"`, `model` |
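For embed(), recording dimensions and count rather than the raw vectors keeps spans small. A sketch against a simulated embed() response (the response fields mirror the embed table above; the span field names are a proposal):

```python
def build_embed_span(inputs, resp: dict) -> dict:
    """Summarize an embed() response as dimensions/count, not raw vectors."""
    embeddings = resp.get("embeddings", [])
    return {
        "input": inputs,
        "output": {
            "count": len(embeddings),
            "dimensions": len(embeddings[0]) if embeddings else 0,
        },
        "metadata": {"provider": "ollama", "model": resp.get("model")},
    }

span = build_embed_span(
    ["hello", "world"],
    {"model": "nomic-embed-text",
     "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
)
print(span["output"])  # {'count': 2, 'dimensions': 3}
```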

No coverage in any instrumentation layer

  • No integration directory (py/src/braintrust/integrations/ollama/)
  • No wrapper function (e.g. wrap_ollama())
  • No patcher in any existing integration (the AgentScope _OllamaChatModelPatcher patches AgentScope's model wrapper, not the ollama SDK)
  • No nox test session (test_ollama)
  • No version entry in py/src/braintrust/integrations/versioning.py
  • No mention in py/src/braintrust/integrations/__init__.py

A grep for ollama across py/src/braintrust/integrations/ returns only agentscope/patchers.py, which patches AgentScope's own OllamaChatModel class, not the ollama SDK.

Braintrust docs status

not_found — Ollama is not listed on the Braintrust tracing guide or the integrations directory. The custom providers page documents using Ollama's OpenAI-compatible endpoint via the proxy, but this does not cover native ollama SDK calls.

Upstream references

Local repo files inspected

  • py/src/braintrust/integrations/ — no ollama/ directory exists on main
  • py/src/braintrust/wrappers/ — no Ollama wrapper
  • py/noxfile.py — no test_ollama session
  • py/src/braintrust/integrations/__init__.py — Ollama not listed in integration registry
  • py/src/braintrust/integrations/versioning.py — no Ollama version matrix
  • py/src/braintrust/integrations/agentscope/patchers.py — patches agentscope.model.OllamaChatModel, not the native ollama SDK
