Semantic Text Chunking Webapp

A web application for semantically chunking text files while preserving context for LLM training.

Features

Upload and process text files
Chunk text into segments of roughly 50 tokens
Maintain semantic coherence using DeepSeek
Preserve context with 2-sentence windows
Export processed chunks for LLM training

Project Structure

semantic-chunking-webapp/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes.py
│   │   │   └── chunking.py
│   │   ├── models/
│   │   │   └── chunk.py
│   │   └── utils/
│   │       ├── tokenizer.py
│   │       └── deepseek.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   └── utils/
│   ├── package.json
│   └── Dockerfile
└── docker-compose.yml

Technical Stack

Backend: FastAPI
Frontend: React
ML Model: DeepSeek for semantic analysis
Database: MongoDB for storing chunks
Containerization: Docker
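
As an illustration of what a stored chunk might look like, here is a minimal sketch of a chunk record that serializes to a plain dictionary, the shape MongoDB drivers accept as a document. The class name, field names, and `to_document` helper are hypothetical; the actual schema in `models/chunk.py` is not described in this README.

```python
from dataclasses import asdict, dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are assumptions, not the real schema."""

    source_file: str          # name of the uploaded text file
    index: int                # position of the chunk within the file
    text: str                 # the chunk body (roughly 50 tokens)
    token_count: int          # number of tokens in `text`
    context_before: str = ""  # preceding context sentences, if any
    context_after: str = ""   # following context sentences, if any

    def to_document(self) -> dict:
        """Return a plain dict suitable for insertion as a document."""
        return asdict(self)


doc = Chunk(source_file="notes.txt", index=0, text="Hello world.", token_count=2).to_document()
```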

Setup Instructions

Clone the repository:

git clone https://github.com/yourusername/semantic-chunking-webapp.git
cd semantic-chunking-webapp

Set up environment variables:

cp .env.example .env
# Configure your DeepSeek API key and other settings

Run with Docker Compose:

docker-compose up --build

Implementation Details

Text Processing Pipeline

File Upload: Frontend accepts text files
Tokenization: Splits text into chunks of roughly 50 tokens
Semantic Analysis: DeepSeek ensures chunk coherence
Context Window: Adds 2-sentence context to each chunk
Export: Generates training-ready dataset
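
The tokenization and context-window steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes a whitespace word count as a stand-in tokenizer, a naive regex sentence splitter, and that "2-sentence context" means up to two sentences before and after each chunk. The DeepSeek coherence step is omitted, and all function names are hypothetical.

```python
import re
from typing import Dict, List

MAX_TOKENS = 50        # target chunk size stated in this README
CONTEXT_SENTENCES = 2  # context window size stated in this README


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_tokens(sentence: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(sentence.split())


def chunk_with_context(
    text: str,
    max_tokens: int = MAX_TOKENS,
    context: int = CONTEXT_SENTENCES,
) -> List[Dict[str, str]]:
    """Greedily pack whole sentences into chunks, then attach neighbouring sentences."""
    sentences = split_sentences(text)

    # Group sentence indices so each group stays within max_tokens
    # (a single over-long sentence still forms its own group).
    groups: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for i, sentence in enumerate(sentences):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        groups.append(current)

    # Attach up to `context` sentences on each side of every chunk.
    chunks = []
    for group in groups:
        first, last = group[0], group[-1]
        chunks.append({
            "text": " ".join(sentences[first:last + 1]),
            "context_before": " ".join(sentences[max(0, first - context):first]),
            "context_after": " ".join(sentences[last + 1:last + 1 + context]),
        })
    return chunks
```

Packing whole sentences rather than cutting at exactly 50 tokens keeps each chunk readable on its own, which is one plausible reason the chunk size is described as approximate.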

API Endpoints

POST /api/upload: Upload text files
POST /api/process: Process uploaded files
GET /api/chunks: Retrieve processed chunks
GET /api/export: Download processed dataset
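
A typical session would presumably call these endpoints in the order listed. The sketch below only builds the requests with Python's standard library, without sending them. The base URL is an assumption (this README does not state a host or port), request and response payloads are not specified here and are left out, and `build_request` is a hypothetical helper.

```python
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: host and port are not given in this README

# Method and path for each endpoint, as listed above.
ENDPOINTS = {
    "upload": ("POST", "/api/upload"),
    "process": ("POST", "/api/process"),
    "chunks": ("GET", "/api/chunks"),
    "export": ("GET", "/api/export"),
}


def build_request(name: str, base_url: str = BASE_URL, body: Optional[bytes] = None) -> Request:
    """Build (but do not send) a request for the named endpoint."""
    method, path = ENDPOINTS[name]
    return Request(base_url + path, data=body, method=method)


# Expected call order: upload a file, process it, then read or export the chunks.
session = [build_request(name) for name in ("upload", "process", "chunks", "export")]
```

Passing a built request to `urllib.request.urlopen` would send it once the backend is running.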

Contributing

Fork the repository
Create a feature branch
Commit your changes
Open a pull request

License

MIT License
