
AI Assisted files #34

Open

pradeepto wants to merge 3 commits into redhat-data-and-ai:main from pradeepto:ai-assisted-files

Conversation

@pradeepto

  • Adding CLAUDE.md
  • Adding docs/ARCHITECHTURE.md - Claude analysed the unstructured data controller and generated an architecture description
  • Adding docs/LANGCHAIN_CHUNKING.md - Claude analysed and documented how langchain-go is used for the 3 chunking strategies


## System Architecture

The Unstructured Data Controller sits at the center of a larger Dataverse AI platform, orchestrating the end-to-end pipeline from raw document ingestion to processed, chunked data ready for retrieval-augmented generation (RAG) workflows.
Contributor

no platform

Users upload documents in various formats (DOC, PDF, PPT, HTML, and other unstructured data formats) to supported data sources:

- **S3 Buckets** — Primary ingestion source. Documents are uploaded to a configured S3 bucket with a prefix matching the data product name.
- **Google Drive** — Planned integration for ingesting documents from Google Drive.
Contributor

this is not there yet

- Chunks converted documents using LangChain text splitters
- Uploads processed chunks to destination storage (Snowflake internal stages)

#### BYO Document Processing
Contributor

not yet


The platform provides three core processing stages, each with multiple strategy options:

| Stage | Providers | Description |
| --- | --- | --- |
| **Vector Embeddings Generation** | Gemini, models.corp | Generates vector embeddings from chunks for similarity search |
| **Data Cleaning** | LangChain | Cleans and normalizes text data before processing |

Contributor

we don't have embedding and data cleaning; the other two stages also have a single provider each - Docling and LangChain

**Additional Processing Strategies** include Knowledge Graph construction and Agentic Graph RAG for more advanced document understanding and retrieval patterns.
Contributor

not yet


### Vector Database / Processed Documents Storage

Processed and chunked documents are stored in various backends:
Contributor

right now it is only s3 bucket
