NLP Engine for Sutrava: AI-driven Requirements Engineering

This repository contains the Natural Language Processing (NLP) subsystem of the larger Sutrava project. It automatically extracts, classifies, and structures software requirements from unstructured stakeholder communications (emails, Slack messages, Jira comments, meeting transcripts) using text mining, NLP, and LLM techniques.


Pipeline Architecture

Our end-to-end pipeline processes raw communication and transforms it into actionable, prioritized, and grouped software requirements.

Raw Text (email / Slack / transcript)
        │
        ▼
[1] Preprocessing & Segmentation
        │
        ▼
[2] Requirement Detection          ← DistilBERT binary classifier (Req / Not Req)
        │  keep requirements only
        ▼
[3] Named Entity Recognition       ← spaCy 3 + BERT transformer NER
  (ACTOR, FEATURE, QUALITY_ATTRIBUTE, CONDITION, PRIORITY_INDICATOR)
        │
        ▼
[4] Clustering & Structuring       ← Semantic Embeddings
  (Groups similar requirements together)
        │
        ▼
[5] Summarization & Prioritization ← LLM Auditor & Semantic Corrector
  (Generates concise summaries & ranks by importance)
        │
        ▼
[6] Explainability & Output Generation
  (Provides transparency into model decisions & structured JSON/CSV)
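
In illustrative Python, the stages chain together roughly as follows. Every function below is a hypothetical stand-in for the corresponding module, not the repository's actual API; the real entry point lives in inference_pipeline/ (see Quick Start, step 4).

# Illustrative sketch only: all functions are hypothetical stand-ins for the
# repository's modules, shown to make the data flow concrete.
def preprocess(raw: str) -> list[str]:
    # [1] Naive sentence segmentation stand-in for the preprocessing/ module.
    return [s.strip() + "." for s in raw.split(".") if s.strip()]

def is_requirement(sentence: str) -> bool:
    # [2] Keyword placeholder for the DistilBERT requirement classifier.
    return any(kw in sentence.lower() for kw in ("should", "must", "shall"))

def run_pipeline(raw_text: str) -> list[str]:
    sentences = preprocess(raw_text)
    requirements = [s for s in sentences if is_requirement(s)]
    # Stages [3]-[6] (NER, clustering, summarization, explainability) would
    # transform `requirements` further; see Model Details below.
    return requirements

print(run_pipeline("Login is slow during peak hours. We should improve login speed."))
# ['We should improve login speed.']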

Project Structure

project/
├── preprocessing/                 # Text cleaning and sentence segmentation
├── requirement_classifier/        # Part 1: DistilBERT Requirement Detection
├── ner_model/                     # Part 2: Named Entity Recognition (spaCy + BERT)
├── clustering/                    # Part 3: Requirement grouping & semantic embeddings
├── structuring/                   # Part 4: Requirement structuring & formatting
├── summarization/                 # Part 5: Cluster summarization
├── prioritization/                # Part 6: LLM-based prioritization & semantic correction
├── explainability/                # Part 7: Model decision transparency & insights
├── output_generator/              # Final report generation (JSON/CSV)
├── inference_pipeline/            # Full end-to-end pipeline execution
├── evaluation/                    # Combined evaluation metrics & reports
├── data/                          # Datasets for classification and NER
├── requirements.txt               # Project dependencies
└── README.md                      # ← You are here

Quick Start

1. Install dependencies

pip install -r requirements.txt
# Download the spaCy English model; its tokenizer is used for NER data conversion
python -m spacy download en_core_web_sm

2. Train the Requirement Classifier

python -m requirement_classifier.train \
    --train_csv data/requirement_classification/train.csv \
    --test_csv  data/requirement_classification/test.csv \
    --output_dir requirement_classifier/saved_model \
    --epochs 5 --batch_size 16 --lr 2e-5

3. Prepare NER Data and Train

# Convert JSON annotations to spaCy binary format
python data/ner/convert_to_spacy.py

# Train NER model (CPU)
python -m spacy train ner_model/config.cfg \
    --output ner_model/output \
    --paths.train data/ner/train.spacy \
    --paths.dev   data/ner/dev.spacy

(Add --gpu-id 0 to train on GPU)
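
For reference, a JSON-to-.spacy conversion typically looks like the sketch below. The annotation format [(text, {"entities": [(start, end, label)]})] is an assumption for illustration; the exact input schema of convert_to_spacy.py may differ.

# Minimal sketch of an annotation → DocBin conversion; the annotation tuples
# here are illustrative, not necessarily what convert_to_spacy.py expects.
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Users need faster login during peak hours.",
     {"entities": [(0, 5, "ACTOR"), (18, 23, "FEATURE"), (24, 41, "CONDITION")]}),
]

nlp = spacy.load("en_core_web_sm")  # only its tokenizer is used here (step 1)
db = DocBin()
for text, ann in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label, alignment_mode="contract")
             for start, end, label in ann["entities"]]
    doc.ents = [span for span in spans if span is not None]  # skip misaligned spans
    db.add(doc)
db.to_disk("data/ner/train.spacy")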

4. Run the Full End-to-End Pipeline

Execute the full pipeline across all modules (from raw text to prioritized, clustered, and structured outputs):

# Validate End-to-End functionality
python run_e2e_validation.py

# Or run inference on sample text / mock Jira JSON
python -m inference_pipeline.example_run
python -m inference_pipeline.example_run_json

5. Evaluate Models

python -m evaluation.run_all_evaluations

Model Details

1. Requirement Detection (DistilBERT)

  • Base model: distilbert-base-uncased
  • Task: Binary classification (Requirement / Not Requirement)
  • How it works: Transformer encoder layers produce contextualized embeddings; a linear classification head maps the [CLS] token representation to the two classes.
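
A minimal inference sketch, assuming the model was saved to requirement_classifier/saved_model (the output_dir from Quick Start, step 2) and that class index 1 means "Requirement":

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "requirement_classifier/saved_model"  # output_dir from training
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def is_requirement(sentence: str) -> tuple[bool, float]:
    """Return (is_requirement, confidence); assumes class 1 == Requirement."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    predicted = int(probs.argmax())
    return predicted == 1, float(probs[predicted])

print(is_requirement("We should improve login speed."))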

2. Named Entity Recognition (spaCy + BERT)

  • Base model: bert-base-uncased via spacy-transformers
  • Entities: ACTOR, FEATURE, QUALITY_ATTRIBUTE, CONDITION, PRIORITY_INDICATOR
  • How it works: A TransformerListener pools sub-word representations back to the spaCy token level and passes them to a transition-based NER parser.
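
A minimal usage sketch, assuming spaCy wrote the best checkpoint to ner_model/output/model-best (its default output layout for spacy train):

import spacy

# Load the trained NER pipeline and extract entities from one sentence.
nlp = spacy.load("ner_model/output/model-best")
doc = nlp("Users are complaining that login is too slow during peak hours.")
for ent in doc.ents:
    print(f"{ent.label_:<20} {ent.text}")
# e.g. ACTOR → "Users", FEATURE → "login", CONDITION → "during peak hours"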

3. Advanced Processing Modules

  • Clustering: Generates dense vector embeddings for each extracted requirement and groups them with a clustering algorithm to surface duplicate or related requirements (a minimal sketch follows this list).
  • Prioritization: Utilizes LLM-based agents (llm_auditor.py, semantic_corrector.py, final_arbiter.py) to resolve conflicting requirements, assess business value, and rank them automatically.
  • Explainability: Generates traceable rationale behind the classification and prioritization decisions to build trust with human engineers.
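
The clustering step can be sketched as below. The encoder (all-MiniLM-L6-v2 via sentence-transformers) and the agglomerative algorithm are assumptions for illustration; the clustering/ module may make different choices.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

requirements = [
    "We should improve login speed.",
    "Login takes too long during peak hours.",
    "Add an export-to-CSV button on the reports page.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
embeddings = encoder.encode(requirements, normalize_embeddings=True)

# Group requirements whose cosine distance falls below the threshold;
# scikit-learn >= 1.2 names this parameter `metric` (older releases: `affinity`).
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
)
labels = clusterer.fit_predict(embeddings)
for label, text in zip(labels, requirements):
    print(label, text)  # the two login requirements share a cluster id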

Example Output

Input (Raw Text):

"Users are complaining that login is too slow during peak hours. We should improve login speed."

Structured JSON Output:

{
  "cluster_id": 1,
  "summary": "Improve system login performance",
  "priority": "HIGH",
  "requirements": [
    {
      "sentence": "Users are complaining that login is too slow during peak hours.",
      "entities": {
        "ACTOR": "Users",
        "FEATURE": "login",
        "QUALITY_ATTRIBUTE": "too slow",
        "CONDITION": "during peak hours"
      },
      "confidence": 0.9712
    }
  ]
}
