This project trains a machine learning model to provide real-time feedback on text to help users write more concise, "TLDR-style" summaries. As a user types into a textbox, words or phrases that the model considers "fluff" or non-essential are highlighted in red.
- Data Preprocessing: The project uses the `trl-lib/tldr` dataset. A one-time preprocessing script (`src/preprocess_data.py`) runs first. It uses `spaCy` to analyze each post and its summary, identifying key phrases (such as noun chunks) in the post that are semantically similar to the summary; these are considered "essential". This process generates a new labeled dataset saved as a CSV.
- Modeling: The problem is framed as a token classification task. We fine-tune a `DistilBERT` model on the preprocessed data to classify each word (token) of the input text as either `ESSENTIAL` or `FLUFF`.
- Application: A FastAPI backend serves the fine-tuned model. A simple HTML/JS frontend sends the user's text to the backend as they type. The backend returns chunks of the original text labeled `FLUFF` or `ESSENTIAL`, which the frontend then uses to apply highlighting.
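To make the preprocessing idea concrete, here is a minimal, dependency-free sketch of the labeling step. The actual `src/preprocess_data.py` uses spaCy noun chunks and vector similarity; this sketch approximates "semantically similar to the summary" with simple word overlap so it runs anywhere. All names and the threshold are illustrative, not the script's actual API.

```python
# Sketch of the phrase-labeling idea behind src/preprocess_data.py.
# The real script uses spaCy noun chunks + vector similarity; we approximate
# similarity with word overlap here so the example has no dependencies.

def overlap_score(phrase: str, summary: str) -> float:
    """Fraction of the phrase's words that also appear in the summary."""
    phrase_words = set(phrase.lower().split())
    summary_words = set(summary.lower().split())
    if not phrase_words:
        return 0.0
    return len(phrase_words & summary_words) / len(phrase_words)

def label_phrases(phrases, summary, threshold=0.5):
    """Label each candidate phrase ESSENTIAL or FLUFF by overlap with the summary."""
    return [
        (p, "ESSENTIAL" if overlap_score(p, summary) >= threshold else "FLUFF")
        for p in phrases
    ]

summary = "my cat destroyed the couch"
phrases = ["my cat", "a long story", "the couch"]
print(label_phrases(phrases, summary))
```

In the real pipeline, spaCy's vector-based similarity would also catch paraphrases ("sofa" vs. "couch") that plain word overlap misses.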
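One detail worth knowing about the token classification setup: DistilBERT's tokenizer splits words into subword pieces, so the word-level `ESSENTIAL`/`FLUFF` labels must be propagated to every subword token, with special tokens masked out of the loss. The sketch below shows that alignment; the `word_ids` mapping is what Hugging Face fast tokenizers provide, mocked here so the example runs without downloading a model.

```python
# Sketch of word-to-subword label alignment for token classification.
# Fast tokenizers expose a word_ids() mapping per token; we mock it here.

LABEL2ID = {"ESSENTIAL": 0, "FLUFF": 1}

def align_labels(word_labels, word_ids):
    """Propagate word-level labels to subword tokens.

    word_ids maps each token to its source word index (None = special token).
    Special tokens get -100 so the cross-entropy loss ignores them.
    """
    return [
        -100 if wi is None else LABEL2ID[word_labels[wi]]
        for wi in word_ids
    ]

word_labels = ["ESSENTIAL", "FLUFF", "FLUFF"]
# e.g. tokens: [CLS] essen ##tial some word [SEP]
word_ids = [None, 0, 0, 1, 2, None]
print(align_labels(word_labels, word_ids))  # [-100, 0, 0, 1, 1, -100]
```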
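On the application side, the backend has to turn per-token predictions into the labeled chunks the frontend highlights. A plausible sketch of that post-processing step, assuming a simple merge of consecutive same-label tokens (function and variable names are hypothetical, not the backend's actual API):

```python
# Hypothetical post-processing step: collapse consecutive tokens that share a
# label into (text, label) chunks for the frontend to highlight.

def merge_chunks(tokens, labels):
    """Merge consecutive tokens with the same label into (text, label) chunks."""
    chunks = []
    for tok, lab in zip(tokens, labels):
        if chunks and chunks[-1][1] == lab:
            chunks[-1] = (chunks[-1][0] + " " + tok, lab)
        else:
            chunks.append((tok, lab))
    return chunks

tokens = ["So", "basically", "my", "cat", "ate", "the", "couch"]
labels = ["FLUFF", "FLUFF", "ESSENTIAL", "ESSENTIAL", "ESSENTIAL", "ESSENTIAL", "ESSENTIAL"]
print(merge_chunks(tokens, labels))
```

Returning merged chunks rather than per-token labels keeps the JSON payload small and lets the frontend wrap each chunk in a single highlighted span.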
This is CRITICAL. The following must be installed on your host machine (your computer, not in Docker):
- An NVIDIA GPU.
- The latest NVIDIA Drivers for your GPU and OS.
- The NVIDIA Container Toolkit, which allows Docker to access your GPU. Follow the official installation guide for your Linux distribution.
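With the NVIDIA Container Toolkit installed, a Compose service requests the GPU through the `deploy.resources.reservations.devices` key. The snippet below is an illustrative fragment (service name and build context are hypothetical), not this project's actual `docker-compose.yml`:

```yaml
# Illustrative Compose fragment showing GPU access via the NVIDIA runtime.
services:
  backend:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```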
- Build and Start the Services:
Open your terminal in the project root and run:
docker-compose up --build
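Before starting the services, you can sanity-check that Docker can actually see the GPU. A common check (assuming the NVIDIA Container Toolkit is set up correctly) is:

```shell
# Should print your GPU's nvidia-smi table; if it errors, fix the
# driver / Container Toolkit installation before running docker-compose.
docker run --rm --gpus all ubuntu nvidia-smi
```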