This repository contains the Natural Language Processing (NLP) subsystem for the larger Sutrava project. It automatically extracts, classifies, and structures software requirements from unstructured stakeholder communications (emails, Slack messages, Jira comments, meeting transcripts) using advanced Text Mining, NLP, and LLM techniques.
Our end-to-end pipeline processes raw communication and transforms it into actionable, prioritized, and grouped software requirements.
```
Raw Text (email / Slack / transcript)
        │
        ▼
[1] Preprocessing & Segmentation
        │
        ▼
[2] Requirement Detection          ← DistilBERT binary classifier (Req / Not Req)
        │  keep requirements only
        ▼
[3] Named Entity Recognition       ← spaCy 3 + BERT transformer NER
    (ACTOR, FEATURE, QUALITY_ATTRIBUTE, CONDITION, PRIORITY_INDICATOR)
        │
        ▼
[4] Clustering & Structuring       ← Semantic Embeddings
    (Groups similar requirements together)
        │
        ▼
[5] Summarization & Prioritization ← LLM Auditor & Semantic Corrector
    (Generates concise summaries & ranks by importance)
        │
        ▼
[6] Explainability & Output Generation
    (Provides transparency into model decisions & structured JSON/CSV)
```
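The stages above can be sketched as a simple orchestration function. This is an illustrative skeleton only: the stage implementations below are naive placeholders, not the real modules listed in the project layout.

```python
from typing import Dict, List

def preprocess(raw_text: str) -> List[str]:
    # [1] Naive sentence segmentation (the real module uses proper NLP tooling).
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def detect_requirements(sentences: List[str]) -> List[str]:
    # [2] Stand-in for the DistilBERT binary classifier:
    # keep only sentences that look like requirements.
    keywords = ("should", "must", "need")
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

def run_pipeline(raw_text: str) -> Dict:
    sentences = preprocess(raw_text)
    requirements = detect_requirements(sentences)
    # [3]-[6] NER, clustering, summarization, and output generation
    # would run here; collapsed into a stub result for illustration.
    return {"requirements": requirements, "total_sentences": len(sentences)}

result = run_pipeline("Login is slow. We should improve login speed.")
print(result)
```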
```
project/
├── preprocessing/            # Text cleaning and sentence segmentation
├── requirement_classifier/   # Part 1: DistilBERT Requirement Detection
├── ner_model/                # Part 2: Named Entity Recognition (spaCy + BERT)
├── clustering/               # Part 3: Requirement grouping & semantic embeddings
├── structuring/              # Part 4: Requirement structuring & formatting
├── summarization/            # Part 5: Cluster summarization
├── prioritization/           # Part 6: LLM-based prioritization & semantic correction
├── explainability/           # Part 7: Model decision transparency & insights
├── output_generator/         # Final report generation (JSON/CSV)
├── inference_pipeline/       # Full end-to-end pipeline execution
├── evaluation/               # Combined evaluation metrics & reports
├── data/                     # Datasets for classification and NER
├── requirements.txt          # Project dependencies
└── README.md                 # ← You are here
```
```bash
pip install -r requirements.txt

# Download spaCy English tokenizer (used for data conversion)
python -m spacy download en_core_web_sm
```

Train the requirement classifier:

```bash
python -m requirement_classifier.train \
    --train_csv data/requirement_classification/train.csv \
    --test_csv data/requirement_classification/test.csv \
    --output_dir requirement_classifier/saved_model \
    --epochs 5 --batch_size 16 --lr 2e-5
```

Train the NER model:

```bash
# Convert JSON annotations to spaCy binary format
python data/ner/convert_to_spacy.py

# Train NER model (CPU)
python -m spacy train ner_model/config.cfg \
    --output ner_model/output \
    --paths.train data/ner/train.spacy \
    --paths.dev data/ner/dev.spacy
```

(Add `--gpu-id 0` to train on GPU.)
Execute the full pipeline across all modules (from raw text to prioritized, clustered, and structured outputs):
```bash
# Validate end-to-end functionality
python run_e2e_validation.py

# Or run inference on sample text / mock Jira JSON
python -m inference_pipeline.example_run
python -m inference_pipeline.example_run_json
```

Run the combined evaluations:

```bash
python -m evaluation.run_all_evaluations
```

- Base model: `distilbert-base-uncased`
- Task: Binary classification (Requirement / Not Requirement)
- How it works: Transformer encoder layers produce contextualised embeddings; a linear classification head maps the `[CLS]` token to the predicted class.
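That classification head can be illustrated with a minimal numpy sketch. The weights and the 768-dimensional `[CLS]` embedding here are random stand-ins, not the fine-tuned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 768-dim [CLS] embedding produced by the encoder.
cls_embedding = rng.normal(size=768)

# Linear classification head: 2 classes (Not Requirement / Requirement).
W = rng.normal(size=(2, 768)) * 0.02
b = np.zeros(2)

# Logits, then a numerically stable softmax over the two classes.
logits = W @ cls_embedding + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

label = ["NOT_REQUIREMENT", "REQUIREMENT"][int(np.argmax(probs))]
print(label, probs)
```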
- Base model: `bert-base-uncased` via `spacy-transformers`
- Entities: `ACTOR`, `FEATURE`, `QUALITY_ATTRIBUTE`, `CONDITION`, `PRIORITY_INDICATOR`
- How it works: A TransformerListener pools sub-word representations back to spaCy token level and passes them to a transition-based NER parser.
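Training data for such a model pairs text with character-level entity spans, which must be aligned back to tokens. A simplified, dependency-free illustration of that alignment (spaCy's own `Doc.char_span` does this properly, including alignment checks; the annotation tuple below is hypothetical):

```python
from typing import List

def char_span_to_tokens(text: str, start: int, end: int) -> List[str]:
    """Return the whitespace tokens covered by a character span [start, end)."""
    tokens, pos = [], 0
    for tok in text.split():
        tok_start = text.index(tok, pos)  # locate token, skipping earlier copies
        tok_end = tok_start + len(tok)
        pos = tok_end
        if tok_start < end and tok_end > start:  # token overlaps the span
            tokens.append(tok)
    return tokens

sentence = "Users need faster login during peak hours"
# Hypothetical character-level annotations: (start, end, label)
print(char_span_to_tokens(sentence, 0, 5))    # ACTOR     -> ['Users']
print(char_span_to_tokens(sentence, 24, 41))  # CONDITION -> ['during', 'peak', 'hours']
```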
- Clustering: Generates dense vector embeddings for each extracted requirement and groups them using clustering algorithms to identify duplicated or related features.
- Prioritization: Uses LLM-based agents (`llm_auditor.py`, `semantic_corrector.py`, `final_arbiter.py`) to resolve conflicting requirements, assess business value, and rank requirements automatically.
- Explainability: Generates traceable rationales behind classification and prioritization decisions to build trust with human engineers.
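The clustering step can be illustrated with a toy greedy grouping over cosine similarity. Real embeddings would come from a sentence encoder; the 2-D vectors and the threshold below are hand-made stand-ins, and the real module may use a different algorithm:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def greedy_cluster(embeddings, threshold: float = 0.8):
    """Assign each item to the first cluster whose centroid is similar enough."""
    clusters = []  # list of lists of item indices
    for i, vec in enumerate(embeddings):
        for members in clusters:
            centroid = np.mean([embeddings[j] for j in members], axis=0)
            if cosine(vec, centroid) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

# Stand-in embeddings: items 0 and 1 nearly parallel, item 2 orthogonal.
emb = [np.array([1.0, 0.1]), np.array([0.9, 0.12]), np.array([0.0, 1.0])]
print(greedy_cluster(emb))  # → [[0, 1], [2]]
```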
Input (Raw Text):
"Users are complaining that login is too slow during peak hours. We should improve login speed."
Structured JSON Output:

```json
{
  "cluster_id": 1,
  "summary": "Improve system login performance",
  "priority": "HIGH",
  "requirements": [
    {
      "sentence": "Users are complaining that login is too slow during peak hours.",
      "entities": {
        "ACTOR": "Users",
        "FEATURE": "login",
        "QUALITY_ATTRIBUTE": "too slow",
        "CONDITION": "during peak hours"
      },
      "confidence": 0.9712
    }
  ]
}
```