🚇 Metro Document Classifier

A BERT-based document classification system for metro/railway organizations. Automatically categorizes PDF documents into six operational categories using a fine-tuned bert-base-uncased model, served through a Flask web application with a real-time processing queue.

🏷️ Categories

The model classifies documents into six categories:

#	Category
1	Technical & Engineering Documents
2	Passenger & Public-Facing Documents
3	Financial & Procurement Documents
4	Human Resources & Administrative Documents
5	Safety, Security & Regulatory Documents
6	Strategic & Project Management Documents
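
As a rough illustration of how a six-way model output could map onto these categories, the sketch below turns raw logits into a category name and a confidence percentage. The `CATEGORIES` order and the `predict_category` helper are assumptions made for this example; the label order actually stored in the model's config may differ.

```python
import math

# Assumed label order: mirrors the table above. The real model's label
# mapping may differ, so check its config before relying on this.
CATEGORIES = [
    "Technical & Engineering Documents",
    "Passenger & Public-Facing Documents",
    "Financial & Procurement Documents",
    "Human Resources & Administrative Documents",
    "Safety, Security & Regulatory Documents",
    "Strategic & Project Management Documents",
]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(logits):
    """Return (category name, confidence in percent) for one document."""
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], round(probs[best] * 100, 2)
```

Calling `predict_category` with a list of six scores returns the label with the highest probability together with that probability expressed as a percentage.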

⚙️ Setup & Installation

1. Clone the repository

```
git clone https://github.com/Kr1491/metro-doc-classifier.git
cd metro-doc-classifier
```

2. Create a virtual environment

```
python -m venv venv
source venv/bin/activate        # macOS/Linux
venv\Scripts\activate           # Windows
```

3. Install dependencies

```
pip install -r requirements.txt
```

4. Run the app

```
python app.py
```

Then open http://localhost:5000 in your browser.

The model is hosted on 🤗 HuggingFace Hub and is downloaded automatically on first run. No manual setup needed.

🖥️ How It Works

User uploads PDFs
      │
      ▼
Flask receives files → adds to thread-safe queue
      │
      ▼
Background worker picks up each file
      │
      ├─ Extracts text from page 1 via PyMuPDF (fitz)
      │
      ├─ Tokenizes with BERT tokenizer (max 512 tokens)
      │
      ├─ Runs inference → softmax → predicted class + confidence %
      │
      └─ Saves PDF into uploads/<category>/ folder
      │
      ▼
Frontend polls /file-status every 900ms → updates UI live
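
The queue-and-worker part of this flow can be sketched with the standard library alone. This is a minimal sketch, not the code in app.py: `classify_stub`, `folder_name`, `worker`, and `start_worker` are hypothetical names, and the stub stands in for the PyMuPDF extraction and BERT inference steps.

```python
import queue
import shutil
import threading
from pathlib import Path

file_queue = queue.Queue()       # queue.Queue is safe to share between threads
file_status = {}                 # filename -> result dict, read by the status endpoint
status_lock = threading.Lock()   # guards file_status across threads


def classify_stub(pdf_path):
    """Hypothetical placeholder for text extraction + BERT inference."""
    return "Technical & Engineering Documents", 99.0


def folder_name(category):
    """Turn a category label into a filesystem-friendly folder name."""
    return "".join(c if c.isalnum() else "_" for c in category).strip("_")


def worker(upload_root, classify=classify_stub):
    """Process queued files one at a time until a None sentinel arrives."""
    while True:
        pdf_path = file_queue.get()
        if pdf_path is None:
            file_queue.task_done()
            break
        try:
            category, confidence = classify(pdf_path)
            target_dir = Path(upload_root) / folder_name(category)
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(pdf_path), str(target_dir / Path(pdf_path).name))
            result = {"status": "done", "category": category, "confidence": confidence}
        except Exception as exc:
            # Record the failure so the UI can show it instead of waiting forever
            result = {"status": "error", "message": str(exc)}
        with status_lock:
            file_status[Path(pdf_path).name] = result
        file_queue.task_done()


def start_worker(upload_root):
    """Start the background worker as a daemon thread and return it."""
    thread = threading.Thread(target=worker, args=(upload_root,), daemon=True)
    thread.start()
    return thread
```

In this sketch an upload handler would call `file_queue.put(path)` and return immediately, while a status route would return a copy of `file_status` taken under `status_lock`.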

🚀 Deployment Notes

For production, replace app.run(debug=True) with a WSGI server like Gunicorn:
```
gunicorn -w 1 -b 0.0.0.0:8000 app:app
```
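Use -w 1 (a single worker process): each Gunicorn worker is a separate process, so additional workers would each start their own background thread with their own queue.
The file_status dict is held in memory only, so it resets on restart and is not shared across processes. For persistence, swap it for a SQLite database or Redis.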
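
One way the SQLite option could look is sketched below using Python's built-in sqlite3 module. The table layout, the `file_status.db` default path, and the function names are assumptions for illustration, not part of the existing app.

```python
import sqlite3


def open_status_db(path="file_status.db"):
    """Open (or create) the status database. The default path is an assumption."""
    # check_same_thread=False lets the worker thread and request handlers share
    # the connection; callers should still serialize access with a lock.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS file_status (
               filename   TEXT PRIMARY KEY,
               status     TEXT NOT NULL,
               category   TEXT,
               confidence REAL
           )"""
    )
    conn.commit()
    return conn


def set_status(conn, filename, status, category=None, confidence=None):
    """Insert or overwrite the row for one file."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO file_status VALUES (?, ?, ?, ?)",
            (filename, status, category, confidence),
        )


def get_all_statuses(conn):
    """Return every row as a dict shaped like the in-memory file_status."""
    rows = conn.execute(
        "SELECT filename, status, category, confidence FROM file_status"
    )
    return {
        name: {"status": status, "category": category, "confidence": confidence}
        for name, status, category, confidence in rows
    }
```

Because rows are keyed by filename, a later `set_status` call for the same file overwrites the earlier one, which matches how a dict entry would be updated.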

📓 Training

The model was fine-tuned on a custom metro railway document dataset using bert-base-uncased via HuggingFace Transformers. See notebooks/training.ipynb for the full training pipeline including:

Dataset preparation & label encoding
Tokenization & DataLoader setup
Fine-tuning loop with evaluation
Saving the model in HuggingFace format (save_pretrained)

The trained model is hosted on HuggingFace Hub at Kr1491/metro-bert-classifier.

📄 License

MIT License — see LICENSE for details.
