Unstructured cookbook#207
Hi @SIDDHAARTHAA, thank you very much for working on this PR. Were you able to resolve all the comments from the AI reviewer? Also, could you please record a short 1-minute video demonstrating the implementation?
Hey @yatharthk2, thanks for checking in. I'm working on it.
@SIDDHAARTHAA can you share a short 1-minute video demo as well?
Attached a short demo video covering the Unstructured cookbook flow, the new smoke tests, and the local validation run passing.
Screencast.From.2026-05-14.12-25-56.mp4
@SIDDHAARTHAA I don't see the ingestion pipeline in the demo, only the pytest run passing.
Pull request overview
Adds a new “Unstructured + Moss” cookbook that demonstrates parsing raw files with Unstructured, chunking extracted content, preserving metadata, and incrementally upserting stable chunk IDs into a Moss index (with an optional post-ingestion query).
Changes:
- Added a runnable ingestion script (`ingest.py`) that partitions files, chunks content, attaches normalized metadata, and upserts batches into Moss.
- Added cookbook documentation + environment template + sample documents for a quick start.
- Added a lightweight unittest-based smoke/integration test that stubs external dependencies to validate chunk IDs/metadata and batch upsert behavior.
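To make the "stable chunk IDs" and "upserts batches" points above concrete, here is a minimal sketch of one way such IDs and batches could be derived. It is not taken from `ingest.py`: the helper names `stable_chunk_id` and `batched`, and the choice to hash the source path plus chunk position, are assumptions for illustration only.

```python
import hashlib
from typing import Iterator, List


def stable_chunk_id(source_path: str, chunk_index: int) -> str:
    """Hypothetical helper: derive a deterministic ID from the source path and chunk position.

    The same file and position always produce the same ID, so re-running
    ingestion overwrites the earlier chunk instead of adding a duplicate.
    """
    key = f"{source_path}::{chunk_index}".encode("utf-8")
    return "chunk-" + hashlib.sha256(key).hexdigest()[:16]


def batched(items: List, size: int) -> Iterator[List]:
    """Hypothetical helper: yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Hashing only the path and position means an edited chunk keeps its ID and replaces the old content on upsert. Including the chunk text in the hash would instead give edited chunks new IDs and leave stale entries behind, so that trade-off depends on how the index should treat changed files.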
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| examples/cookbook/unstructured/ingest.py | Implements the end-to-end ingestion + upsert + query flow for Unstructured → Moss. |
| examples/cookbook/unstructured/README.md | Documents setup, usage, and stored document shape/ID strategy. |
| examples/cookbook/unstructured/pyproject.toml | Declares cookbook dependencies and build packaging metadata. |
| examples/cookbook/unstructured/.env.example | Provides required Moss environment variable template. |
| examples/cookbook/unstructured/test_integration.py | Adds smoke tests for metadata normalization/stable IDs and upsert batching behavior. |
| examples/cookbook/unstructured/sample_docs/onboarding.html | Sample HTML input document for ingestion. |
| examples/cookbook/unstructured/sample_docs/release-notes.txt | Sample text input document for ingestion. |
    _PREVIOUS_MODULES = _install_dependency_stubs()

    ingest = importlib.import_module("ingest")

    _restore_modules(_PREVIOUS_MODULES)
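The excerpt above installs stubs, imports `ingest`, then restores the original modules. The bodies of `_install_dependency_stubs` and `_restore_modules` are not shown in this thread, so the following is only a sketch of how such a pair might work using `sys.modules`. The function names and the module name `hypothetical_heavy_dep` are assumptions, not the PR's code.

```python
import importlib
import sys
import types

_MISSING = object()  # marks modules that were absent before stubbing


def install_stubs(names):
    """Replace each named module with an empty stub; return what was there before."""
    previous = {}
    for name in names:
        previous[name] = sys.modules.get(name, _MISSING)
        sys.modules[name] = types.ModuleType(name)
    return previous


def restore_modules(previous):
    """Undo install_stubs: put back the original modules or remove the stubs."""
    for name, module in previous.items():
        if module is _MISSING:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = module


if __name__ == "__main__":
    saved = install_stubs(["hypothetical_heavy_dep"])
    try:
        # The import resolves to the stub because it is already in sys.modules.
        dep = importlib.import_module("hypothetical_heavy_dep")
        print(type(dep).__name__)
    finally:
        restore_modules(saved)
```

Restoring in a `finally` block (or in a test teardown) keeps the stubs from leaking into other tests that expect the real packages.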
    try:
        await client.create_index(index_name, batches[0])
        print(f"Created index '{index_name}' with {len(batches[0])} chunks")
        start = 1
    except RuntimeError as exc:
        if "already exists" not in str(exc).lower():
            raise
        print(f"Index '{index_name}' already exists; upserting chunks")
        start = 0
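The excerpt above sets `start` but does not show how the remaining batches are sent. A self-contained sketch of the full create-or-upsert flow could look like the following. `FakeClient` is an in-memory stand-in, not the Moss SDK, and its `upsert` method name is hypothetical; only `create_index(index_name, batch)` and the "already exists" check come from the excerpt.

```python
import asyncio


class FakeClient:
    """In-memory stand-in for illustration only; NOT the Moss client."""

    def __init__(self):
        self.indexes = {}

    async def create_index(self, name, docs):
        if name in self.indexes:
            raise RuntimeError(f"Index '{name}' already exists")
        self.indexes[name] = {doc["id"]: doc for doc in docs}

    async def upsert(self, name, docs):  # hypothetical method name
        self.indexes[name].update({doc["id"]: doc for doc in docs})


async def ingest_batches(client, index_name, batches):
    """Create the index from the first batch, or fall back to upserting every batch."""
    if not batches:
        return 0
    try:
        await client.create_index(index_name, batches[0])
        start = 1  # first batch was consumed by index creation
    except RuntimeError as exc:
        if "already exists" not in str(exc).lower():
            raise
        start = 0  # index exists, so every batch must be upserted
    for batch in batches[start:]:
        await client.upsert(index_name, batch)
    return sum(len(batch) for batch in batches)


if __name__ == "__main__":
    demo_batches = [[{"id": "a"}, {"id": "b"}], [{"id": "c"}]]
    print(asyncio.run(ingest_batches(FakeClient(), "demo", demo_batches)))
```

Matching on the error message is fragile if the wording changes; where the real client exposes a dedicated exception type or an existence check, that would be a sturdier signal.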
Pull Request Checklist
Please ensure that your PR meets the following requirements:
Description
Adds a focused Unstructured + Moss cookbook for parsing raw files, chunking extracted content, preserving metadata, and incrementally upserting chunks into a Moss index for semantic search.
The cookbook includes a runnable ingestion script, sample documents, environment template, setup/usage documentation, and a lightweight smoke test for the ingestion pipeline.
Validation performed:
- `python -m py_compile examples/cookbook/unstructured/ingest.py`
- `cd examples/cookbook/unstructured && python -m unittest test_integration.py`

Fixes #203
Type of Change