
AI Assisted files #34

Open

pradeepto wants to merge 3 commits into redhat-data-and-ai:main from pradeepto:ai-assisted-files

Conversation

@pradeepto

  • Adding CLAUDE.md
  • Adding docs/ARCHITECHTURE.md - Claude analysed the unstructured data controller and generated an architecture description
  • Adding docs/LANGCHAIN_CHUNKING.md - Claude analysed and documented how langchain-go is used for the 3 chunking strategies


## System Architecture

The Unstructured Data Controller sits at the center of a larger Dataverse AI platform, orchestrating the end-to-end pipeline from raw document ingestion to processed, chunked data ready for retrieval-augmented generation (RAG) workflows.
Contributor

no platform

Users upload documents in various formats (DOC, PDF, PPT, HTML, and other unstructured data formats) to supported data sources:

- **S3 Buckets** — Primary ingestion source. Documents are uploaded to a configured S3 bucket with a prefix matching the data product name.
- **Google Drive** — Planned integration for ingesting documents from Google Drive.
Contributor

this is not there yet

- Chunks converted documents using LangChain text splitters
- Uploads processed chunks to destination storage (Snowflake internal stages)

#### BYO Document Processing
Contributor

not yet


The platform provides three core processing stages, each with multiple strategy options:

| Stage | Providers | Description |
| --- | --- | --- |
| **Vector Embeddings Generation** | Gemini, models.corp | Generates vector embeddings from chunks for similarity search |
| **Data Cleaning** | LangChain | Cleans and normalizes text data before processing |

Contributor

we don't have embedding and data cleaning; the other two stages also have a single provider each - Docling and LangChain

**Additional Processing Strategies** include Knowledge Graph construction and Agentic Graph RAG for more advanced document understanding and retrieval patterns.
Contributor

not yet


### Vector Database / Processed Documents Storage

Processed and chunked documents are stored in various backends:
Contributor

right now it is only s3 bucket
