
Human-in-the-Loop Framework for Bengali Abstractive Summarization

MVP / Proof-of-Concept System


1. Abstract

Problem: High-quality Bengali abstractive summarization data is scarce.

Approach: We present a human-in-the-loop (HITL) framework that collects multiple forms of human feedback on model-generated summaries.

Contribution: An extensible MVP system that supports preference learning, post-editing, and gold-summary creation.

Scope: This work focuses on system design and data generation, not model training.

Outcome: Demonstrates feasibility of scalable human-guided supervision for low-resource summarization.

This project establishes a proof-of-concept research infrastructure for systematically collecting structured human feedback to address the data bottleneck in Bengali summarization research.


2. Motivation

Bengali is a low-resource language in NLP research. While multilingual models like MT5 can generate summaries in Bengali, the quality is often compromised due to limited training data and evaluation benchmarks.

Existing abstractive models (e.g., MT5) often produce:

  • Factual drift: Summaries that introduce information not present in the source
  • Verbosity: Overly long summaries that fail to compress key information
  • Loss of key details: Important information from the source text is omitted

Improving model quality requires structured human feedback, not just more raw data. Traditional approaches rely on large-scale parallel corpora that are expensive and time-consuming to create. A more scalable approach is to leverage human feedback in multiple forms to guide model improvement.

Important framing: This project targets the data bottleneck, not model architecture. We focus on creating a systematic way to collect and structure human feedback that can enable future model training and evaluation.


3. Research Goal & Scope

3.1 Research Goal

To explore how different forms of human feedback can be systematically collected to support Bengali summarization improvement.

3.2 Scope (MVP Framing)

This project is intentionally scoped as:

  • a minimal viable system
  • a design and feasibility study
  • a proof-of-concept research infrastructure

The system is designed to be:

  • Extensible: Can accommodate new models, languages, and feedback types
  • Modular: Each annotation mode operates independently
  • Data-focused: Emphasizes structured data collection over immediate model improvement

4. System Overview

Our system is a web-based HITL annotation interface with a modular design that supports independent annotation modes and centralized storage of structured feedback signals.

4.1 Architecture

The system consists of three main components:

  1. Frontend Interface (React/TypeScript): Web-based UI for human annotators
  2. Backend API (Python/Bottle): RESTful API for data management and model integration
  3. Database (SQLite): Centralized storage for documents, summaries, and feedback

Input: Bengali news articles from a curated dataset

Models:

  • MT5 (baseline abstractive summarizer)
  • KMeans (extractive summarizer for comparison)

Human: Feedback via web UI in three distinct modes

Output: Structured dataset for downstream use (fine-tuning, evaluation, analysis)

System Flow

```mermaid
flowchart TD
    A[Bengali News Articles] --> B[Document Database]
    B --> C[Model Generation]
    C --> D[MT5 Abstractive]
    C --> E[KMeans Extractive]
    D --> F[Human Annotation Interface]
    E --> F
    F --> G[Summary Comparison]
    F --> H[Summary Modification]
    F --> I[Summary From Scratch]
    G --> J[Structured Feedback Database]
    H --> J
    I --> J
    J --> K[Export for Training/Evaluation]
```

The architecture is model-agnostic and language-agnostic by design. While the current implementation focuses on Bengali (language code 'bn'), the system can be extended to other languages by changing the language filter in the document retrieval logic.
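As an illustrative sketch of that parameterized language filter (the function name and table/column names here are assumptions based on the schema described in Section 6, not the repository's actual code):

```python
import sqlite3

def next_document(conn: sqlite3.Connection, language: str = "bn"):
    """Fetch one document for the given language code.

    Extending the system to another language would only require
    passing a different code (e.g. 'hi'); the query is unchanged.
    """
    return conn.execute(
        "SELECT doc_id, title, description FROM documents "
        "WHERE language = ? ORDER BY doc_id LIMIT 1",
        (language,),
    ).fetchone()

# Minimal in-memory demo
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, "
    "title TEXT, description TEXT, language TEXT)"
)
conn.execute(
    "INSERT INTO documents (title, description, language) "
    "VALUES ('শিরোনাম', 'বিবরণ', 'bn')"
)
print(next_document(conn, "bn"))  # -> (1, 'শিরোনাম', 'বিবরণ')
```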


5. Human-in-the-Loop Annotation Modes

This system supports three complementary feedback signals, each capturing different aspects of human judgment about summary quality.

5.1 Summary Comparison (Preference Signal)

Purpose: Collect pairwise preference judgments between two model-generated summaries.

Interface: Side-by-side comparison of two summaries with the original article visible.

Annotator selects:

  • Better summary: Left or right panel (Model A or Model B)
  • Confidence score: 1-5 scale indicating certainty of preference
  • Reason: One or more of:
    • Concise: More compact and efficient
    • Faithful: More accurate to source content
    • Readable: Better language quality and flow

Data captured:

  • Summary IDs being compared
  • Preference (1 = left preferred, 2 = right preferred)
  • Confidence level (1-5)
  • Reason for preference

Enables future:

  • Preference learning algorithms
  • Ranking-based fine-tuning (e.g., RLHF)
  • Model comparison studies

Implementation: The comparison interface (SummarizationComparison.tsx) allows annotators to select any combination of MT5 and KMeans models for comparison, providing flexibility in evaluation scenarios.
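A minimal sketch of how one preference record might be persisted, assuming a simplified version of the Summary Comparisons table from Section 6.1 (the function name and exact column names are illustrative, not the repository's actual code):

```python
import sqlite3

def record_comparison(conn, summary_id_1, summary_id_2,
                      preference, confidence, reason):
    """Store one pairwise preference judgment."""
    assert preference in (1, 2), "1 = left preferred, 2 = right preferred"
    assert 1 <= confidence <= 5, "confidence is a 1-5 scale"
    cur = conn.execute(
        "INSERT INTO summary_comparisons "
        "(summary_id_1, summary_id_2, preference, confidence, reason) "
        "VALUES (?, ?, ?, ?, ?)",
        (summary_id_1, summary_id_2, preference, confidence, reason),
    )
    conn.commit()
    return cur.lastrowid

# Minimal in-memory demo
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE summary_comparisons ("
    "comparison_id INTEGER PRIMARY KEY, summary_id_1 INTEGER, "
    "summary_id_2 INTEGER, preference INTEGER, confidence INTEGER, "
    "reason TEXT)"
)
comparison_id = record_comparison(conn, 10, 11, preference=1,
                                  confidence=4, reason="faithful")
```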

5.2 Summary Modification (Post-Edit Signal)

Purpose: Capture human corrections and improvements to model-generated summaries.

Interface: Annotator views the original article and a model-generated summary, then edits the summary directly.

Captures:

  • Correction patterns: What errors do models make?
  • Compression preferences: How do humans compress information differently?
  • Factual adjustments: What factual errors need correction?

Data captured:

  • Original document ID
  • Parent summary ID (the model-generated summary being edited)
  • Modified summary text
  • Origin type: human_modified
  • Edit flag: edit_based = 1

Research value: This mirrors post-editing workflows used in machine translation research, where human edits reveal systematic model weaknesses.

Implementation: The modification interface (SummaryModification.tsx) pre-populates the editable text area with the model output, allowing annotators to make targeted improvements while preserving the original for comparison.
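One simple way to quantify how heavily each summary was post-edited, which is not part of the current system but a natural analysis over the edit_based records, is a character-level similarity ratio between the model output and the human edit:

```python
from difflib import SequenceMatcher

def edit_similarity(model_summary: str, edited_summary: str) -> float:
    """Similarity ratio in [0, 1]; lower values mean heavier post-editing."""
    return SequenceMatcher(None, model_summary, edited_summary).ratio()

print(edit_similarity("same text", "same text"))  # identical -> 1.0
```

Aggregating this ratio per model would surface which model's outputs require the most correction.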

5.3 Summary From Scratch (Gold Signal)

Purpose: Collect high-quality human-written summaries as reference standards.

Interface: Annotator views only the original article and writes a summary from scratch.

Represents:

  • Near-gold supervision: High-quality training examples
  • Upper-bound reference quality: Best-case summary quality
  • Human compression patterns: How humans naturally summarize

Data captured:

  • Original document ID
  • Human-written summary text
  • Origin type: human_written
  • Gold flag: scratch_gold = 1

Research value: These summaries serve as:

  • Training targets for supervised learning
  • Evaluation references (e.g., ROUGE scores)
  • Quality benchmarks for model comparison

Implementation: The from-scratch interface (SummaryFromScratch.tsx) provides a clean writing environment with only the source text visible, minimizing bias from model outputs.


6. Data Logging & Extensibility

Each interaction is logged with comprehensive metadata to support multiple downstream use cases.

6.1 Data Schema

The database schema (schema.sql) is designed to support fine-tuning, evaluation, and multi-annotator extension:

Documents Table:

  • doc_id: Unique document identifier
  • title, description: Source article content
  • language: Language code (currently 'bn' for Bengali)

Summaries Table:

  • summary_id: Unique summary identifier
  • doc_id: Links to source document
  • summary_text: The summary content
  • origin_type: model_generated, human_written, or human_modified
  • model_name, model_version: For model-generated summaries
  • parent_summary_id: Links modified summaries to their source
  • edit_based: Flag for post-edited summaries
  • scratch_gold: Flag for human-written gold summaries

Summary Comparisons Table:

  • comparison_id: Unique comparison identifier
  • summary_id_1, summary_id_2: Summaries being compared
  • preference: 1 or 2 (which summary is preferred)
  • confidence: 1-5 confidence level
  • reason: concise, faithful, or readable

Annotation Sessions Table:

  • session_id: Unique session identifier
  • doc_id: Document being annotated
  • user_id: Annotator identifier (reserved for multi-annotator extension)
  • started_at, ended_at: Session timing
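The authoritative definitions live in schema.sql; the SQLite sketch below is reconstructed from the field lists above and may differ from the actual file in types and constraints:

```python
import sqlite3

# Reconstructed from the Section 6.1 field lists; illustrative only.
SCHEMA = """
CREATE TABLE documents (
    doc_id      INTEGER PRIMARY KEY,
    title       TEXT,
    description TEXT,
    language    TEXT NOT NULL
);
CREATE TABLE summaries (
    summary_id        INTEGER PRIMARY KEY,
    doc_id            INTEGER NOT NULL REFERENCES documents(doc_id),
    summary_text      TEXT NOT NULL,
    origin_type       TEXT CHECK (origin_type IN
                        ('model_generated', 'human_written', 'human_modified')),
    model_name        TEXT,
    model_version     TEXT,
    parent_summary_id INTEGER REFERENCES summaries(summary_id),
    edit_based        INTEGER DEFAULT 0,
    scratch_gold      INTEGER DEFAULT 0
);
CREATE TABLE summary_comparisons (
    comparison_id INTEGER PRIMARY KEY,
    summary_id_1  INTEGER NOT NULL REFERENCES summaries(summary_id),
    summary_id_2  INTEGER NOT NULL REFERENCES summaries(summary_id),
    preference    INTEGER CHECK (preference IN (1, 2)),
    confidence    INTEGER CHECK (confidence BETWEEN 1 AND 5),
    reason        TEXT
);
CREATE TABLE annotation_sessions (
    session_id INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES documents(doc_id),
    user_id    TEXT,
    started_at TEXT,
    ended_at   TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```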

6.2 Extensibility Features

Model-agnostic design:

  • The system accepts any model name and version string
  • Model integration requires only implementing a summarization function
  • Current models: MT5 (abstractive), KMeans (extractive)
  • Future models can be added without schema changes

Language-agnostic design:

  • Language filtering is parameterized (currently 'bn')
  • Database schema supports any language code
  • UI components are language-agnostic (display text as-is)

Multi-annotator support:

  • Schema includes user_id fields in annotation sessions
  • Can track inter-annotator agreement (future work)
  • Supports distributed annotation workflows

Export capabilities:

  • All tables can be exported as CSV files
  • ZIP archive generation for complete dataset export
  • Structured format ready for training pipelines
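A minimal sketch of the CSV-plus-ZIP export path, using only the standard library (the function name is hypothetical; the actual export code lives in the backend):

```python
import csv
import io
import sqlite3
import zipfile

def export_zip(conn: sqlite3.Connection, tables: list[str]) -> bytes:
    """Dump each table to a CSV file and bundle the CSVs into one ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for table in tables:
            # Table names come from a fixed internal list, not user input.
            cur = conn.execute(f"SELECT * FROM {table}")
            text = io.StringIO()
            writer = csv.writer(text)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur.fetchall())
            zf.writestr(f"{table}.csv", text.getvalue())
    return buf.getvalue()

# Minimal demo with one table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, language TEXT)")
conn.execute("INSERT INTO documents (language) VALUES ('bn')")
archive = export_zip(conn, ["documents"])
print(zipfile.ZipFile(io.BytesIO(archive)).namelist())  # ['documents.csv']
```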

Important emphasis: "The system is model-agnostic and language-agnostic by design."


7. Preliminary Analysis (Pilot Data)

Aggregated statistics from early interactions demonstrate that the end-to-end HITL pipeline functions as intended.

7.1 Statistics Dashboard

The Statistics component (Statistics.tsx) provides real-time visualization of collected data:

Tallies:

  • Total comparisons: Number of pairwise preference judgments
  • Total modifications: Number of post-edited summaries
  • From-scratch summaries: Number of human-written gold summaries

Model Comparison Analysis:

  • Model wins: Head-to-head comparison results between model pairs
  • Win distribution: Which models are preferred and in what contexts

Confidence Metrics:

  • Average confidence: Mean confidence level across all comparisons
  • Indicates annotator certainty in their judgments

Modification Patterns:

  • Modifications by parent origin: Which model outputs are most frequently edited
  • Reveals which models produce summaries requiring more correction
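The head-to-head win tally can be sketched in a few lines, assuming comparison rows joined back to the two models' names (field layout assumed for illustration):

```python
from collections import Counter

def model_wins(comparisons):
    """Tally wins per model.

    comparisons: iterable of (model_1, model_2, preference) rows,
    where preference is 1 (left/model_1 wins) or 2 (right/model_2 wins).
    """
    wins = Counter()
    for model_1, model_2, preference in comparisons:
        wins[model_1 if preference == 1 else model_2] += 1
    return wins

rows = [("mt5", "kmeans", 1), ("mt5", "kmeans", 2), ("mt5", "kmeans", 1)]
print(model_wins(rows))  # Counter({'mt5': 2, 'kmeans': 1})
```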

7.2 Pipeline Validation

These statistics are illustrative, not conclusive. They demonstrate:

  • Data collection is functioning correctly
  • All three annotation modes are operational
  • Feedback signals are being stored and retrievable
  • The system can aggregate and visualize collected data

The statistics serve as a proof-of-concept that the infrastructure can support larger-scale data collection and future empirical studies.


8. Limitations

The following limitations are explicitly acknowledged:

8.1 Annotation Scale

  • Single annotator: All feedback comes from one human annotator
  • Small pilot dataset: Initial data collection is limited in scope
  • No inter-annotator agreement: Cannot measure annotation consistency or reliability

8.2 Model Scope

  • Limited model comparison: Currently only MT5 and KMeans are integrated
  • No model training: Collected data has not been used for fine-tuning yet
  • Baseline models only: No state-of-the-art or custom models evaluated

8.3 Evaluation

  • No downstream training: Data has not been used to improve models
  • No evaluation metrics: ROUGE, BLEU, or other automatic metrics not computed
  • No human evaluation: No systematic human evaluation of model improvements

8.4 Generalizability

  • Single language focus: Currently limited to Bengali
  • Single domain: News articles only
  • Limited document types: No variation in article length or complexity

These limitations are expected for an MVP/proof-of-concept system and are explicitly acknowledged to set appropriate expectations for the scope of this work.


9. Future Work

9.1 Model Training & Evaluation

Fine-tuning experiments:

  • Fine-tune MT5 using collected signals:
    • Preference-only dataset (ranking loss)
    • Post-edit-only dataset (supervised learning from edits)
    • Gold-only dataset (standard supervised learning)
    • Combined signals (multi-task learning)

Evaluation framework:

  • Measure ROUGE scores against gold summaries
  • Compare human preference alignment with automatic metrics
  • Evaluate model improvements on held-out test set
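As a pointer to what that evaluation would compute, here is a self-contained ROUGE-1 F1 sketch; a real evaluation should use an established ROUGE implementation and a proper Bengali tokenizer rather than whitespace splitting:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a gold reference and a
    model summary (whitespace tokenization; illustrative only)."""
    ref = Counter(reference.split())
    cand = Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("a b c d", "a b x y"))  # 0.5
```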

9.2 Annotation Scale-Up

Multi-annotator support:

  • Deploy system to multiple annotators
  • Measure inter-annotator agreement
  • Aggregate preferences with confidence weighting
  • Identify systematic annotation patterns

Larger dataset:

  • Scale to hundreds or thousands of documents
  • Diversify document types and domains
  • Collect longitudinal data over time

9.3 System Extensions

Additional models:

  • Integrate GPT-based summarizers
  • Add more extractive models for comparison
  • Support custom fine-tuned models

Additional languages:

  • Extend to other low-resource languages
  • Compare annotation patterns across languages
  • Multilingual summarization evaluation

Advanced feedback types:

  • Sentence-level preference signals
  • Error type annotations (factual, coherence, etc.)
  • Quality score annotations (beyond binary preference)

9.4 Research Directions

Preference learning:

  • Implement RLHF using preference data
  • Compare different preference aggregation methods
  • Study confidence-weighted learning

Post-editing analysis:

  • Analyze edit patterns to identify model weaknesses
  • Use edits for targeted model improvement
  • Compare edit-based learning vs. gold-standard learning

Evaluation methodology:

  • Develop Bengali-specific evaluation metrics
  • Study correlation between human preference and automatic metrics
  • Design evaluation protocols for low-resource settings

10. Conclusion

This project demonstrates the feasibility of structured HITL supervision for Bengali summarization. The contribution is an extensible research platform, not a final model.

10.1 Key Contributions

  1. System Design: A modular, extensible architecture for collecting multiple forms of human feedback
  2. Data Infrastructure: Structured schema supporting preference, post-edit, and gold signals
  3. Proof-of-Concept: End-to-end pipeline validation demonstrating system functionality
  4. Research Foundation: Platform enabling future empirical studies

10.2 Research Impact

The system serves as a foundation for future empirical studies in:

  • Human-guided summarization for low-resource languages
  • Preference learning and RLHF for Bengali NLP
  • Post-editing workflows for abstractive summarization
  • Evaluation methodology for low-resource settings

10.3 Final Statement

The MVP demonstrates that systematic human feedback collection is feasible and can be scaled. The extensible design ensures the system can grow with research needs, supporting everything from small pilot studies to large-scale annotation campaigns. While this work does not claim performance improvements, it provides the essential infrastructure for future research that will.


References

  • Dataset: https://www.kaggle.com/datasets/towhidahmedfoysal/bangla-summarization-datasetprothom-alo?resource=download
  • Model: https://huggingface.co/tashfiq61/bengali-summarizer-mt5


Appendix: Technical Implementation

A.1 Frontend Components

  • SummarizationComparison.tsx: Side-by-side comparison interface
  • SummaryModification.tsx: Post-editing interface
  • SummaryFromScratch.tsx: Gold summary writing interface
  • Statistics.tsx: Data visualization dashboard

A.2 Backend API

  • controller.py: RESTful API endpoints
    • /next-article: Document retrieval with model summary generation
    • /submit-comparison: Preference signal storage
    • /submit-modification: Post-edit signal storage
    • /submit-from-scratch: Gold summary storage
    • /statistics: Aggregated data retrieval

A.3 Database Schema

  • database/schema.sql: Complete SQLite schema definition
  • database/database_service.py: Data access layer with business logic

A.4 Model Integration

  • services/mt5_summarization/mt5_service.py: MT5 abstractive summarization
  • services/kmeans_summarization/kmeans_service.py: KMeans extractive summarization

End of Report
