[labs] Chroma rejects None metadata from LlamaIndex chunking pipeline (ValueError in add_chunks_to_collection) #52

@ManojRamani

Description
Problem

Running the setup notebooks that build the ChromaDB vector store fails with:

```
ValueError: Expected metadata value to be a str, int, float or bool, got None which is a NoneType in add.
```

Triggered during `ChromaDBWrapperClient.add_chunks_to_collection(chunks)` → `collection.add(metadatas=[chunk.metadata ...])`.

Root cause

In the `SentenceSplitterChunkingStrategy.to_ragchunks(...)` helper, each `RAGChunk`'s metadata is built as:

```python
metadata={
    **node.metadata,
    'relative_path': self._extract_relative_path(node.metadata['file_path'])
}
```

Two sources of `None`:

  1. `_extract_relative_path(...)` explicitly returns `None` when the regex doesn't match the `input_dir` prefix.
  2. LlamaIndex `Document` / `Node` metadata for markdown files routinely carries `None` values for fields that can't be inferred (e.g. `creation_date`, `last_modified_date` depending on filesystem / loader version).

Chroma's `validate_metadata` rejects any `None` value, so the entire batch `add` fails.
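To see why a single `None` poisons the whole batch, here is a small standalone stand-in for that validation (mimicking the observed behavior, not Chroma's actual source):

```python
# Hypothetical stand-in for Chroma's metadata validation: only
# str/int/float/bool values are accepted, anything else raises.
def validate_metadata(metadata: dict) -> None:
    for key, value in metadata.items():
        if not isinstance(value, (str, int, float, bool)):
            raise ValueError(
                f"Expected metadata value to be a str, int, float or bool, "
                f"got {value} which is a {type(value).__name__} in add."
            )

# Metadata shaped like what LlamaIndex produces for a markdown file
# whose creation date could not be inferred:
meta = {"file_path": "docs/index.md", "creation_date": None}

try:
    validate_metadata(meta)
except ValueError as e:
    print(e)
```

Because `collection.add` validates every metadata dict in the batch up front, one `None` anywhere fails the entire `add` call.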

Affected notebooks

Both files share the same `SentenceSplitterChunkingStrategy` + `ChromaDBWrapperClient` code path:

  • `labs/module2/notebooks/1_setup.ipynb` (cell-6 builds metadata, cell-9/11 calls `add_chunks_to_collection`)
  • `labs/module3/notebooks/1_setup.ipynb` (same code)

Downstream labs (`module2/2_prompt_chaining` .. `6_evaluator_optimizer`, `module3/2_agent_memory` .. `4_agent_retrieval`) all expect the persisted `data/chroma` store this setup notebook creates, so the failure blocks most of modules 2 and 3 for students running from a clean clone.

Proposed fix

A one-line filter inside `to_ragchunks` that strips `None` values before handing metadata to Chroma:

```python
def to_ragchunks(self, nodes: List[Node]) -> List[RAGChunk]:
    chunks = []
    for node in nodes:
        metadata = {
            **node.metadata,
            'relative_path': self._extract_relative_path(node.metadata['file_path'])
        }
        # Chroma rejects None metadata values; drop them.
        metadata = {k: v for k, v in metadata.items() if v is not None}
        chunks.append(RAGChunk(id=node.node_id, text=node.text, metadata=metadata))
    return chunks
```

Alternatively, a tiny helper `_clean_metadata(m: dict) -> dict` could be added to `labs_common` so both module2 and module3 setup notebooks import it instead of duplicating the filter.

How to reproduce

  1. Fresh clone + `uv sync`.
  2. Open `labs/module2/notebooks/1_setup.ipynb`.
  3. Run cells top-to-bottom (the `%%bash` cell clones OpenSearch docs).
  4. The cell that does `chroma_os_docs_collection.add_chunks_to_collection(chunks)` raises the `ValueError`.

Scope

This is a pre-existing LlamaIndex ↔ Chroma compatibility issue, unrelated to the legacy-model-ID cleanup tracked in #50 / #51. Filing separately so the fix can be scoped cleanly.
