A web application for semantically chunking text files while preserving context for LLM training.
- Upload and process text files
- Chunk text into segments of roughly 50 tokens
- Maintain semantic coherence using DeepSeek
- Preserve context with 2-sentence windows
- Export processed chunks for LLM training
semantic-chunking-webapp/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   ├── routes.py
│   │   │   └── chunking.py
│   │   ├── models/
│   │   │   └── chunk.py
│   │   └── utils/
│   │       ├── tokenizer.py
│   │       └── deepseek.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   └── utils/
│   ├── package.json
│   └── Dockerfile
└── docker-compose.yml
- Backend: FastAPI
- Frontend: React
- ML Model: DeepSeek for semantic analysis
- Database: MongoDB for storing chunks
- Containerization: Docker
- Clone the repository:
    git clone https://github.com/yourusername/semantic-chunking-webapp.git
    cd semantic-chunking-webapp
- Set up environment variables:
    cp .env.example .env
    # Configure your DeepSeek API key and other settings
- Run with Docker Compose:
    docker-compose up --build
- File Upload: Frontend accepts text files
- Tokenization: Splits text into chunks of roughly 50 tokens
- Semantic Analysis: DeepSeek ensures chunk coherence
- Context Window: Adds 2-sentence context to each chunk
- Export: Generates training-ready dataset
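The tokenization and context-window steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes whitespace-separated words stand in for tokens, that sentences are split naively on `.`, `!`, and `?`, and that "2-sentence context" means the two sentences preceding each chunk. The names `Chunk`, `split_sentences`, and `chunk_text` are hypothetical, and the DeepSeek coherence step is omitted entirely.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str     # the chunk itself (about max_tokens whitespace tokens)
    context: str  # up to two sentences that precede the chunk (assumption)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 50, context_sentences: int = 2) -> list[Chunk]:
    """Group whole sentences into chunks of at most max_tokens whitespace tokens.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    sentences = split_sentences(text)
    chunks: list[Chunk] = []
    current: list[str] = []
    current_tokens = 0
    start = 0  # index of the first sentence in the chunk being built

    def flush() -> None:
        # Context is the sentences immediately before the chunk's first sentence.
        context = " ".join(sentences[max(0, start - context_sentences):start])
        chunks.append(Chunk(" ".join(current), context))

    for i, sentence in enumerate(sentences):
        n = len(sentence.split())
        if current and current_tokens + n > max_tokens:
            flush()
            current, current_tokens, start = [], 0, i
        current.append(sentence)
        current_tokens += n

    if current:
        flush()
    return chunks
```

Keeping sentences whole is one simple way to avoid cutting a chunk mid-thought; a real tokenizer would replace the `sentence.split()` count, and the semantic check would decide whether adjacent sentences belong together.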
- POST /api/upload: Upload text files
- POST /api/process: Process uploaded files
- GET /api/chunks: Retrieve processed chunks
- GET /api/export: Download processed dataset
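As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) requests with Python's standard library. The base URL `http://localhost:8000` and the helper `build_request` are assumptions, since this README does not state the backend's address. Request bodies are left out because the expected payloads, including the upload form field, are not documented here.

```python
from urllib.request import Request

# Assumption: the backend address is not given in this README.
BASE_URL = "http://localhost:8000"


def build_request(method: str, path: str) -> Request:
    """Build, without sending, a request for one of the documented endpoints."""
    return Request(BASE_URL + path, method=method)


process_req = build_request("POST", "/api/process")
chunks_req = build_request("GET", "/api/chunks")
export_req = build_request("GET", "/api/export")

# To send one, with the backend running:
#   from urllib.request import urlopen
#   with urlopen(chunks_req) as resp:
#       body = resp.read()
```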
- Fork the repository
- Create a feature branch
- Commit your changes
- Open a pull request
MIT License