Larger semantic chunks sent to Ollama embeddings cause indexing failure #86

@jani-laakso

What happened?

CCE aborts indexing because it sends oversized semantic chunks directly to Ollama’s /api/embed endpoint.
NOTE: This was perfectly valid Go code, around 70 lines long; it just contained enough long type names and strings to push the resulting chunks over the size limit.

During cce init, indexing starts normally, but the first embedding batch fails with:

Embedding failed: Client error '400 Bad Request' for url 'http://localhost:11434/api/embed'

After locally patching CCE to print the Ollama response body, the actual backend error is:

{"error":"the input length exceeds the context length"}
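For anyone reproducing this, the backend's explanation can be folded into the error message instead of being dropped. The following is only a sketch, not CCE's actual code: `describe_embed_error` is a hypothetical helper, and it assumes the backend answers with a JSON object carrying an `error` string, as Ollama did here.

```python
import json


def describe_embed_error(status_code: int, body: str) -> str:
    """Build an error message that includes the backend's own explanation.

    Hypothetical helper. Falls back to the raw body when the response is
    not the JSON shape seen in this report ({"error": "..."}).
    """
    detail = body.strip()
    try:
        parsed = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None
    if isinstance(parsed, dict) and isinstance(parsed.get("error"), str):
        detail = parsed["error"]
    if not detail:
        detail = "<empty response body>"
    return f"Embedding failed with HTTP {status_code}: {detail}"


print(describe_embed_error(400, '{"error":"the input length exceeds the context length"}'))
```

In the embedder this could be fed from `resp.status_code` and `resp.text` just before `resp.raise_for_status()`, which is where the local patch was placed.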

The failed request batch contained large chunks, for example:

text count: 64
chunk sizes: 1147, 898, 0, 281, 195, 1029, 3814, 11115, 54309, ...
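A summary like the one above can be produced with a few lines of Python. This is a rough sketch of the kind of diagnostic used, not the exact patch: `summarize_batch` is a hypothetical name, and the 3000-character threshold is an assumed value borrowed from the fallback proposed later in this report, not a verified model limit.

```python
def summarize_batch(texts: list[str], max_chars: int = 3000) -> dict:
    """Report the size problems in one embedding batch.

    max_chars is an assumed cap for illustration only.
    """
    sizes = [len(t) for t in texts]
    return {
        "text_count": len(texts),
        "empty_indexes": [i for i, n in enumerate(sizes) if n == 0],
        "oversized_indexes": [i for i, n in enumerate(sizes) if n > max_chars],
        "largest": max(sizes, default=0),
    }


# Placeholder texts with the sizes listed above (the elided rest of the batch is omitted).
sample = ["x" * n for n in (1147, 898, 0, 281, 195, 1029, 3814, 11115, 54309)]
print(summarize_batch(sample))
```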

The oversized chunks came from normal repository files, not minified/generated one-line blobs. Examples included:

  • A Markdown document or large Markdown section emitted as one chunk of around 54k characters.
  • Another Markdown section emitted as one chunk of around 11k characters.
  • Normal multi-line source-code functions emitted as single chunks of around 5–7k characters.

This appears to happen because CCE treats semantic units, such as whole Markdown sections or whole source-code functions, as embedding inputs without applying a defensive maximum size before calling the embedding backend.

This makes indexing fail completely for repositories that contain large-but-valid documentation sections or source-code functions. Empty chunks also appear to be included in the embedding batch, as shown by a 0-length chunk in the same request.

What did you expect?

CCE should split or cap oversized chunks before sending them to the embedding backend.

Expected behavior:

  • cce init should complete indexing instead of aborting on one oversized chunk.
  • CCE should not send any input to /api/embed that can exceed the selected model’s context length.
  • Empty chunks should be skipped.
  • Large semantic chunks should be split into smaller embedding-safe child chunks.
  • Split chunks should preserve parent metadata such as file path, symbol or heading name, and line range where available.
  • If a chunk cannot be embedded, CCE should warn and continue indexing the rest of the repository instead of failing the whole indexing run.

Conceptually, semantic chunks and embedding inputs should be treated separately:

semantic chunk: whole function, class, Markdown section, etc.
embedding input: smaller model-safe slice of that semantic chunk
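One way to express that separation in code is with two small record types. The names `SemanticChunk` and `EmbeddingInput` and their fields are hypothetical and only meant to illustrate the idea; CCE's internal chunk type may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticChunk:
    """A whole function, class, or Markdown section, as the chunker emits it."""
    file_path: str
    name: str        # symbol or heading name
    start_line: int
    end_line: int
    text: str


@dataclass(frozen=True)
class EmbeddingInput:
    """A model-safe slice of a SemanticChunk; `parent` carries the metadata."""
    parent: SemanticChunk
    part: int        # 0-based index of this slice within the parent
    text: str


section = SemanticChunk("docs/guide.md", "Installation", 10, 900, "x" * 54309)
first = EmbeddingInput(parent=section, part=0, text=section.text[:3000])
print(first.parent.file_path, first.parent.name, first.part, len(first.text))
```

Search results can then point back to the parent's file path and line range regardless of which slice matched.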

A reasonable fallback would be something like:

MAX_EMBED_CHARS = 3000
OVERLAP_CHARS = 200

if chunk is empty:
    skip
elif chunk length <= MAX_EMBED_CHARS:
    embed as-is
else:
    split into child chunks with overlap
    preserve parent metadata
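As a runnable version of the outline above, here is a minimal sketch. `split_for_embedding` and the `part` / `char_start` keys are hypothetical names, and plain character slicing is used for simplicity, so a slice may end mid-line; a real implementation would probably prefer to cut at line boundaries.

```python
MAX_EMBED_CHARS = 3000
OVERLAP_CHARS = 200


def split_for_embedding(text, metadata, max_chars=MAX_EMBED_CHARS, overlap=OVERLAP_CHARS):
    """Return child chunks (dicts) that each hold at most max_chars characters.

    Every child copies the parent's metadata and records its own position.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    if not text.strip():
        return []  # skip empty chunks
    if len(text) <= max_chars:
        return [{**metadata, "part": 0, "char_start": 0, "text": text}]
    step = max_chars - overlap  # consecutive slices share `overlap` characters
    children = []
    start = 0
    while True:
        piece = text[start:start + max_chars]
        children.append({**metadata, "part": len(children), "char_start": start, "text": piece})
        if start + max_chars >= len(text):
            break  # this slice reached the end of the text
        start += step
    return children


parent = {"file_path": "docs/guide.md", "heading": "Installation"}
parts = split_for_embedding("x" * 54309, parent)
print(len(parts), max(len(p["text"]) for p in parts))
```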

Token-aware splitting would be better, but even a conservative character cap would prevent hard failures with Ollama embedding models.
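Short of a real tokenizer, a character cap could be derived from the model's context length with a pessimistic characters-per-token ratio. The sketch below is purely illustrative: `conservative_char_cap` is a hypothetical helper, and the context length, ratio, and safety margin are placeholder assumptions, not measured or documented values for any model.

```python
def conservative_char_cap(context_tokens: int, chars_per_token: float = 2.0, safety: float = 0.8) -> int:
    """Estimate a character cap that should stay under a token limit.

    chars_per_token is deliberately low (assumed): code with long
    identifiers and strings can tokenize into many short tokens.
    """
    if context_tokens <= 0 or chars_per_token <= 0 or not 0 < safety <= 1:
        raise ValueError("context_tokens and chars_per_token must be positive, safety in (0, 1]")
    return int(context_tokens * chars_per_token * safety)


# 2048 is a placeholder context length, not a verified limit for nomic-embed-text.
print(conservative_char_cap(2048))
```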

Steps to reproduce

1. Install CCE and initialize it in a repository with normal but large Markdown documentation sections or source-code functions.

2. Use Ollama as the embedding backend with nomic-embed-text.

3. Verify Ollama is running and the model is available:

       ollama list
       curl -s http://localhost:11434/api/version

4. Run:

   ```bash
   cce init
   ```

5. Observe that indexing starts, then fails during the first embedding batch:

   ```text
   Embedding failed: Client error '400 Bad Request' for url 'http://localhost:11434/api/embed'
   ```

6. Patch CCE locally to print the Ollama response body before `resp.raise_for_status()` in the embedder.

7. Run again:

   ```bash
   cce init
   ```

8. Observe the real Ollama error:

   ```json
   {"error":"the input length exceeds the context length"}
   ```

9. Also observe that the failed embedding request contains oversized chunks, for example chunks around 11k and 54k characters, plus at least one empty chunk.
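The backend response can also be checked without CCE by posting a batch straight to the endpoint. This is a sketch under assumptions: `build_embed_request` and `send` are hypothetical helpers, the URL and model name are the ones from this report, and the request body uses the `model` / `input` fields of Ollama's `/api/embed` endpoint. Whether a given input actually trips the limit depends on the model and on how the text tokenizes.

```python
import json
import urllib.error
import urllib.request

OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def build_embed_request(texts, model="nomic-embed-text", url=OLLAMA_EMBED_URL):
    """Build (but do not send) a POST request carrying one embedding batch."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return (status, body) without raising on HTTP 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=60) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


# With Ollama running locally, paste a real oversized chunk in place of the placeholder:
# status, body = send(build_embed_request(["<oversized chunk text>", ""]))
# print(status, body)
```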


### Relevant logs or error output

```shell
jani@io$ cce init

  Code Context Engine  ·  dlm
  ────────────────────────────────────────────

  Detecting embedding backend... ready (ollama, 768-d, nomic-embed-text)
  Ollama detected — LLM summarization enabled.

  ✓ Git hooks installed  (3 hooks, auto-updates on commit)
  ✓ MCP server already configured in .mcp.json
  ✓ MCP server already configured for OpenAI Codex
    ~/.codex/config.toml  →  [mcp_servers.cce-dlm-9ef807]
  ✓ Memory hooks already configured
  · Memory capture not yet active — `cce serve` hasn't been started for this project.
    Run `cce serve` in a separate terminal so the loopback hook server starts;
    until it's running, hooks fire successfully but capture is silently dropped.
    Verify any time with `cce sessions status`.
  ✓ .gitignore updated with CCE entries

  Indexing project...
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  50/1559 files  3%
    Embedding 322 chunks (batch 1/32)…
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/322 chunks embedded  0%Embedding failed: Client error '400 Bad Request' for url 'http://localhost:11434/api/embed'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
Traceback (most recent call last):
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/context_engine/indexer/pipeline.py", line 465, in _embed_and_ingest
    emb.embed(batch_chunks, progress_fn=embed_progress_fn)
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/context_engine/indexer/embedder.py", line 503, in embed
    self._embed_all(miss_chunks, batch_size, progress_fn=progress_fn)
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/context_engine/indexer/embedder.py", line 543, in _embed_all
    for i, vec in enumerate(iterator(texts, batch_size=batch_size)):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/context_engine/indexer/embedder.py", line 354, in iter_embed
    for vec in self._embed_batch(texts[i : i + batch_size]):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/context_engine/indexer/embedder.py", line 342, in _embed_batch
    resp.raise_for_status()
  File "/home/jani/.local/share/uv/tools/code-context-engine/lib/python3.12/site-packages/httpx/_models.py", line 829, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://localhost:11434/api/embed'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400

  ✗ Error: Embedding failed: Client error '400 Bad Request' for url 'http://localhost:11434/api/embed'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
  ✓ Indexed 0 chunks from 0 files

  Done!  Restart your AI coding agent to activate CCE.
```

### Python version

3.13.12

### OS

Linux Debian

### CCE version

cce, version 0.4.21
