domvmd/semantic-chunking-webapp

Semantic Text Chunking Webapp

A web application for semantically chunking text files while preserving context for LLM training.

Features

  • Upload and process text files
  • Chunk text into segments of roughly 50 tokens
  • Maintain semantic coherence using DeepSeek
  • Preserve context with 2-sentence windows
  • Export processed chunks for LLM training

Project Structure

semantic-chunking-webapp/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes.py
│   │   │   └── chunking.py
│   │   ├── models/
│   │   │   └── chunk.py
│   │   └── utils/
│   │       ├── tokenizer.py
│   │       └── deepseek.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   └── utils/
│   ├── package.json
│   └── Dockerfile
└── docker-compose.yml

Technical Stack

  • Backend: FastAPI
  • Frontend: React
  • ML Model: DeepSeek for semantic analysis
  • Database: MongoDB for storing chunks
  • Containerization: Docker

Setup Instructions

  1. Clone the repository:
git clone https://github.com/yourusername/semantic-chunking-webapp.git
cd semantic-chunking-webapp
  2. Set up environment variables:
cp .env.example .env
# Configure your DeepSeek API key and other settings
  3. Run with Docker Compose:
docker-compose up --build

Implementation Details

Text Processing Pipeline

  1. File Upload: Frontend accepts text files
  2. Tokenization: Splits text into chunks of roughly 50 tokens
  3. Semantic Analysis: DeepSeek ensures chunk coherence
  4. Context Window: Adds 2-sentence context to each chunk
  5. Export: Generates training-ready dataset
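
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `backend/app`: it assumes whitespace-separated words as tokens, a naive regex sentence splitter, and JSON Lines as the export format. The names `Chunk`, `chunk_text`, and `to_jsonl` are hypothetical, and the DeepSeek coherence check (step 3) is left out entirely, so chunk boundaries here simply follow sentence boundaries.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class Chunk:
    # Hypothetical record shape; the real models/chunk.py may differ.
    text: str
    context_before: str
    context_after: str


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_tokens(sentence):
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(sentence.split())


def chunk_text(text, target_tokens=50, context_sentences=2):
    sentences = split_sentences(text)

    # Greedily pack whole sentences until adding one would exceed the target.
    # A single sentence longer than the target becomes its own oversized chunk.
    groups, current, size = [], [], 0
    for index, sentence in enumerate(sentences):
        n = count_tokens(sentence)
        if current and size + n > target_tokens:
            groups.append(current)
            current, size = [], 0
        current.append(index)
        size += n
    if current:
        groups.append(current)

    # Attach up to `context_sentences` neighbouring sentences on each side.
    chunks = []
    for group in groups:
        start, end = group[0], group[-1] + 1
        before = sentences[max(0, start - context_sentences):start]
        after = sentences[end:end + context_sentences]
        chunks.append(
            Chunk(" ".join(sentences[start:end]), " ".join(before), " ".join(after))
        )
    return chunks


def to_jsonl(chunks):
    # One JSON object per line, a common layout for training datasets.
    return "\n".join(json.dumps(asdict(chunk)) for chunk in chunks)
```

Packing whole sentences keeps each chunk readable on its own, at the cost of chunk sizes that only approximate the 50-token target.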

API Endpoints

  • POST /api/upload: Upload text files
  • POST /api/process: Process uploaded files
  • GET /api/chunks: Retrieve processed chunks
  • GET /api/export: Download processed dataset
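
As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) standard-library request objects. Only the methods and paths come from the list above. The base URL `http://localhost:8000`, the `ENDPOINTS` table, and the `build_request` helper are assumptions for the example, and request bodies are not modelled because this README does not specify payload formats.

```python
from urllib.parse import urljoin
from urllib.request import Request

# Methods and paths as listed above; the short names are hypothetical.
ENDPOINTS = {
    "upload": ("POST", "/api/upload"),
    "process": ("POST", "/api/process"),
    "chunks": ("GET", "/api/chunks"),
    "export": ("GET", "/api/export"),
}


def build_request(base_url, name, body=None):
    # Returns an unsent Request; pass it to urllib.request.urlopen to send it.
    method, path = ENDPOINTS[name]
    return Request(urljoin(base_url, path), data=body, method=method)


# Assumed local address; adjust to wherever the backend is actually served.
request = build_request("http://localhost:8000", "chunks")
print(request.get_method(), request.full_url)
```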

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Open a pull request

License

MIT License
