[labs] Chroma rejects None metadata from LlamaIndex chunking pipeline (ValueError in add_chunks_to_collection) #52

@ManojRamani

Description
Problem

Running the setup notebooks that build the ChromaDB vector store fails with:

```
ValueError: Expected metadata value to be a str, int, float or bool, got None which is a NoneType in add.
```

Triggered during `ChromaDBWrapperClient.add_chunks_to_collection(chunks)` → `collection.add(metadatas=[chunk.metadata ...])`.

Root cause

In the `SentenceSplitterChunkingStrategy.to_ragchunks(...)` helper, each `RAGChunk`'s metadata is built as:

```python
metadata={
    **node.metadata,
    'relative_path': self._extract_relative_path(node.metadata['file_path'])
}
```

Two sources of `None`:

  1. `_extract_relative_path(...)` explicitly returns `None` when the regex doesn't match the `input_dir` prefix.
  2. LlamaIndex `Document` / `Node` metadata for markdown files routinely carries `None` values for fields that can't be inferred (e.g. `creation_date`, `last_modified_date` depending on filesystem / loader version).

Chroma's `validate_metadata` rejects any `None` value, so the entire batch `add` fails.
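To see why a single `None` poisons the whole batch, here is a small standalone stand-in for that validation (mimicking the observed behavior, not Chroma's actual source):

```python
# Hypothetical stand-in for Chroma's metadata validation: only
# str/int/float/bool values are accepted, anything else raises.
def validate_metadata(metadata: dict) -> None:
    for key, value in metadata.items():
        if not isinstance(value, (str, int, float, bool)):
            raise ValueError(
                f"Expected metadata value to be a str, int, float or bool, "
                f"got {value} which is a {type(value).__name__} in add."
            )

# Metadata shaped like what LlamaIndex produces for a markdown file
# whose creation date could not be inferred:
meta = {"file_path": "docs/index.md", "creation_date": None}

try:
    validate_metadata(meta)
except ValueError as e:
    print(e)
```

Because `collection.add` validates every metadata dict in the batch up front, one `None` anywhere fails the entire `add` call.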

Affected notebooks

Both files share the same `SentenceSplitterChunkingStrategy` + `ChromaDBWrapperClient` code path:

  • `labs/module2/notebooks/1_setup.ipynb` (cell-6 builds metadata, cell-9/11 calls `add_chunks_to_collection`)
  • `labs/module3/notebooks/1_setup.ipynb` (same code)

Downstream labs (`module2/2_prompt_chaining` .. `6_evaluator_optimizer`, `module3/2_agent_memory` .. `4_agent_retrieval`) all expect the persisted `data/chroma` store this setup notebook creates, so the failure blocks most of modules 2 and 3 for students running from a clean clone.

Proposed fix

A one-line filter inside `to_ragchunks` that strips `None` values before handing metadata to Chroma:

```python
def to_ragchunks(self, nodes: List[Node]) -> List[RAGChunk]:
    chunks = []
    for node in nodes:
        metadata = {
            **node.metadata,
            'relative_path': self._extract_relative_path(node.metadata['file_path'])
        }
        # Chroma rejects None metadata values; drop them.
        metadata = {k: v for k, v in metadata.items() if v is not None}
        chunks.append(RAGChunk(id=node.node_id, text=node.text, metadata=metadata))
    return chunks
```

Alternatively, a tiny helper `_clean_metadata(m: dict) -> dict` could be added to `labs_common` so both module2 and module3 setup notebooks import it instead of duplicating the filter.

How to reproduce

  1. Fresh clone + `uv sync`.
  2. Open `labs/module2/notebooks/1_setup.ipynb`.
  3. Run cells top-to-bottom (the `%%bash` cell clones OpenSearch docs).
  4. The cell that does `chroma_os_docs_collection.add_chunks_to_collection(chunks)` raises the `ValueError`.

Scope

This is a pre-existing LlamaIndex ↔ Chroma compatibility issue, unrelated to the legacy-model-ID cleanup tracked in #50 / #51. Filing separately so the fix can be scoped cleanly.
