Problem: High-quality Bengali abstractive summarization data is scarce.
Approach: We present a human-in-the-loop (HITL) framework that collects multiple forms of human feedback on model-generated summaries.
Contribution: An extensible MVP system that supports preference learning, post-editing, and gold-summary creation.
Scope: This work focuses on system design and data generation, not model training.
Outcome: Demonstrates feasibility of scalable human-guided supervision for low-resource summarization.
This project establishes a proof-of-concept research infrastructure for systematically collecting structured human feedback to address the data bottleneck in Bengali summarization research.
Bengali is a low-resource language in NLP research. While multilingual models like MT5 can generate Bengali summaries, output quality is often compromised by limited training data, and progress is difficult to measure given the scarcity of evaluation benchmarks.
Existing abstractive models (e.g., MT5) often produce:
- Factual drift: Summaries that introduce information not present in the source
- Verbosity: Overly long summaries that fail to compress key information
- Loss of key details: Important information from the source text is omitted
Improving model quality requires structured human feedback, not just more raw data. Traditional approaches rely on large-scale parallel corpora that are expensive and time-consuming to create. A more scalable approach is to leverage human feedback in multiple forms to guide model improvement.
Important framing: This project targets the data bottleneck, not model architecture. We focus on creating a systematic way to collect and structure human feedback that can enable future model training and evaluation.
To explore how different forms of human feedback can be systematically collected to support Bengali summarization improvement.
This project is intentionally scoped as:
- a minimal viable system
- a design and feasibility study
- a proof-of-concept research infrastructure
The system is designed to be:
- Extensible: Can accommodate new models, languages, and feedback types
- Modular: Each annotation mode operates independently
- Data-focused: Emphasizes structured data collection over immediate model improvement
Our system is a web-based HITL annotation interface with a modular design that supports independent annotation modes and centralized storage of structured feedback signals.
The system consists of three main components (a route-level sketch of the backend follows this list):
- Frontend Interface (React/TypeScript): Web-based UI for human annotators
- Backend API (Python/Bottle): RESTful API for data management and model integration
- Database (SQLite): Centralized storage for documents, summaries, and feedback
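To make the component boundaries concrete, here is a minimal sketch of the backend's route layout in Python/Bottle. The endpoint names follow the controller.py listing in the appendix; the handler bodies and the port are placeholders, not the actual implementation.

```python
# Minimal sketch of the backend route layout (Python/Bottle).
# Endpoint names follow controller.py; handler bodies are placeholders.
from bottle import Bottle

app = Bottle()

@app.get("/next-article")
def next_article():
    """Serve the next document plus freshly generated model summaries."""

@app.post("/submit-comparison")
def submit_comparison():
    """Persist a pairwise preference judgment (Mode A)."""

@app.post("/submit-modification")
def submit_modification():
    """Persist a post-edited summary (Mode B)."""

@app.post("/submit-from-scratch")
def submit_from_scratch():
    """Persist a human-written gold summary (Mode C)."""

@app.get("/statistics")
def statistics():
    """Return aggregated annotation statistics."""

if __name__ == "__main__":
    app.run(host="localhost", port=8080)  # port is an assumption
```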
Input: Bengali news articles from a curated dataset
Models:
- MT5 (baseline abstractive summarizer)
- KMeans (extractive summarizer for comparison)
Human: Feedback via web UI in three distinct modes
Output: Structured dataset for downstream use (fine-tuning, evaluation, analysis)
```mermaid
flowchart TD
    A[Bengali News Articles] --> B[Document Database]
    B --> C[Model Generation]
    C --> D[MT5 Abstractive]
    C --> E[KMeans Extractive]
    D --> F[Human Annotation Interface]
    E --> F
    F --> G[Summary Comparison]
    F --> H[Summary Modification]
    F --> I[Summary From Scratch]
    G --> J[Structured Feedback Database]
    H --> J
    I --> J
    J --> K[Export for Training/Evaluation]
```
The architecture is model-agnostic and language-agnostic by design. While the current implementation focuses on Bengali (language code 'bn'), the system can be extended to other languages by modifying the language filter in the document retrieval logic.
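As an illustration, document retrieval reduces to a single parameterized query; the function and column names below are assumptions based on the schema section, not code from the repository.

```python
# Sketch of parameterized document retrieval; swapping the language
# code is all that is needed to target another language.
import sqlite3

def next_document(conn: sqlite3.Connection, language: str = "bn"):
    """Fetch one document in the requested language."""
    return conn.execute(
        "SELECT doc_id, title, description FROM documents WHERE language = ?",
        (language,),
    ).fetchone()
```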
This system supports three complementary feedback signals, each capturing different aspects of human judgment about summary quality.
Purpose: Collect pairwise preference judgments between two model-generated summaries.
Interface: Side-by-side comparison of two summaries with the original article visible.
Annotator selects:
- Better summary: Left or right panel (Model A or Model B)
- Confidence score: 1-5 scale indicating certainty of preference
- Reason: One or more of:
  - Concise: More compact and efficient
  - Faithful: More accurate to source content
  - Readable: Better language quality and flow
Data captured:
- Summary IDs being compared
- Preference (1 = left preferred, 2 = right preferred)
- Confidence level (1-5)
- Reason for preference
Enables future work on:
- Preference learning algorithms
- Ranking-based fine-tuning (e.g., RLHF)
- Model comparison studies
Implementation: The comparison interface (SummarizationComparison.tsx) allows annotators to select any combination of MT5 and KMeans models for comparison, providing flexibility in evaluation scenarios.
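For concreteness, a single comparison record might look as follows; the field names match the "Data captured" list above, while the concrete values are invented for illustration.

```python
# Illustrative Mode A record; values are made up.
comparison = {
    "summary_id_1": 101,   # summary shown in the left panel (e.g., MT5)
    "summary_id_2": 102,   # summary shown in the right panel (e.g., KMeans)
    "preference": 1,       # 1 = left preferred, 2 = right preferred
    "confidence": 4,       # 1-5 certainty scale
    "reason": "faithful",  # one of: concise, faithful, readable
}
```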
Purpose: Capture human corrections and improvements to model-generated summaries.
Interface: Annotator views the original article and a model-generated summary, then edits the summary directly.
Captures:
- Correction patterns: What errors do models make?
- Compression preferences: How do humans compress information differently?
- Factual adjustments: What factual errors need correction?
Data captured:
- Original document ID
- Parent summary ID (the model-generated summary being edited)
- Modified summary text
- Origin type: `human_modified`
- Edit flag: `edit_based = 1`
Research value: This mirrors post-editing workflows used in machine translation research, where human edits reveal systematic model weaknesses.
Implementation: The modification interface (SummaryModification.tsx) pre-populates the editable text area with the model output, allowing annotators to make targeted improvements while preserving the original for comparison.
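A post-edit record, by analogy, might look like this; field names follow the summaries table described later, and the values are invented.

```python
# Illustrative Mode B record; values are made up.
modified = {
    "doc_id": 17,
    "parent_summary_id": 101,         # the model summary that was edited
    "summary_text": "<corrected Bengali summary text>",
    "origin_type": "human_modified",
    "edit_based": 1,                  # flags this row as a post-edit
}
```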
Purpose: Collect high-quality human-written summaries as reference standards.
Interface: Annotator views only the original article and writes a summary from scratch.
Represents:
- Near-gold supervision: High-quality training examples
- Upper-bound reference quality: Best-case summary quality
- Human compression patterns: How humans naturally summarize
Data captured:
- Original document ID
- Human-written summary text
- Origin type: `human_written`
- Gold flag: `scratch_gold = 1`
Research value: These summaries serve as:
- Training targets for supervised learning
- Evaluation references (e.g., ROUGE scores)
- Quality benchmarks for model comparison
Implementation: The from-scratch interface (SummaryFromScratch.tsx) provides a clean writing environment with only the source text visible, minimizing bias from model outputs.
Each interaction is logged with comprehensive metadata to support multiple downstream use cases.
The database schema (schema.sql) is designed to support fine-tuning, evaluation, and multi-annotator extension; a condensed DDL sketch of the summaries table follows the table descriptions:
Documents Table:
- `doc_id`: Unique document identifier
- `title`, `description`: Source article content
- `language`: Language code (currently 'bn' for Bengali)
Summaries Table:
- `summary_id`: Unique summary identifier
- `doc_id`: Links to source document
- `summary_text`: The summary content
- `origin_type`: `model_generated`, `human_written`, or `human_modified`
- `model_name`, `model_version`: For model-generated summaries
- `parent_summary_id`: Links modified summaries to their source
- `edit_based`: Flag for post-edited summaries
- `scratch_gold`: Flag for human-written gold summaries
Summary Comparisons Table:
- `comparison_id`: Unique comparison identifier
- `summary_id_1`, `summary_id_2`: Summaries being compared
- `preference`: 1 or 2 (which summary is preferred)
- `confidence`: 1-5 confidence level
- `reason`: `concise`, `faithful`, or `readable`
Annotation Sessions Table:
- `session_id`: Unique session identifier
- `doc_id`: Document being annotated
- `user_id`: Annotator identifier (reserved for multi-annotator extension)
- `started_at`, `ended_at`: Session timing
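The sketch below condenses the summaries table into executable DDL; the exact types, constraints, and any additional columns in schema.sql may differ.

```python
# Condensed sketch of the summaries table; schema.sql is authoritative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE summaries (
    summary_id        INTEGER PRIMARY KEY,
    doc_id            INTEGER NOT NULL REFERENCES documents(doc_id),
    summary_text      TEXT NOT NULL,
    origin_type       TEXT CHECK (origin_type IN
                          ('model_generated', 'human_written', 'human_modified')),
    model_name        TEXT,              -- set for model-generated rows
    model_version     TEXT,
    parent_summary_id INTEGER REFERENCES summaries(summary_id),
    edit_based        INTEGER DEFAULT 0, -- 1 for Mode B post-edits
    scratch_gold      INTEGER DEFAULT 0  -- 1 for Mode C gold summaries
)
""")
```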
Model-agnostic design (sketched after this list):
- The system accepts any model name and version string
- Model integration requires only implementing a summarization function
- Current models: MT5 (abstractive), KMeans (extractive)
- Future models can be added without schema changes
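One way to picture the integration point: any callable from source text to summary can be registered under a name/version pair. The names below (`register_model`, `SUMMARIZERS`) are illustrative, not from the codebase.

```python
# Hypothetical registry showing the model-agnostic integration point.
from typing import Callable, Dict

Summarizer = Callable[[str], str]  # source text -> summary

SUMMARIZERS: Dict[str, Summarizer] = {}

def register_model(name: str, version: str, fn: Summarizer) -> None:
    """Adding a model requires only a name/version string and a function."""
    SUMMARIZERS[f"{name}:{version}"] = fn

# e.g. register_model("mt5", "v1", mt5_summarize) for a wrapper around
# the MT5 service; no schema changes are needed.
```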
Language-agnostic design:
- Language filtering is parameterized (currently 'bn')
- Database schema supports any language code
- UI components are language-agnostic (display text as-is)
Multi-annotator support:
- Schema includes `user_id` fields in annotation sessions
- Can track inter-annotator agreement (future work)
- Supports distributed annotation workflows
Export capabilities (sketched after this list):
- All tables can be exported as CSV files
- ZIP archive generation for complete dataset export
- Structured format ready for training pipelines
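A minimal export routine could stream each table into an in-memory CSV and bundle the results into a ZIP; the table list and file layout below are assumptions.

```python
# Sketch of CSV + ZIP export; table names follow the schema section.
import csv
import io
import sqlite3
import zipfile

TABLES = ["documents", "summaries", "summary_comparisons", "annotation_sessions"]

def export_dataset(db_path: str, zip_path: str) -> None:
    conn = sqlite3.connect(db_path)
    with zipfile.ZipFile(zip_path, "w") as archive:
        for table in TABLES:
            cursor = conn.execute(f"SELECT * FROM {table}")
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow([col[0] for col in cursor.description])  # header
            writer.writerows(cursor)                                 # data rows
            archive.writestr(f"{table}.csv", buf.getvalue())
    conn.close()
```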
Important emphasis: "The system is model-agnostic and language-agnostic by design."
Aggregated statistics from early interactions demonstrate that the end-to-end HITL pipeline functions correctly.
The Statistics component (Statistics.tsx) provides real-time visualization of collected data:
Tallies:
- Total comparisons: Number of pairwise preference judgments
- Total modifications: Number of post-edited summaries
- From-scratch summaries: Number of human-written gold summaries
Model Comparison Analysis:
- Model wins: Head-to-head comparison results between model pairs
- Win distribution: Which models are preferred and in what contexts
Confidence Metrics:
- Average confidence: Mean confidence level across all comparisons
- Indicates annotator certainty in their judgments
Modification Patterns:
- Modifications by parent origin: Which model outputs are most frequently edited
- Reveals which models produce summaries requiring more correction
These statistics are illustrative, not conclusive. They demonstrate that:
- Data collection is functioning correctly
- All three annotation modes are operational
- Feedback signals are being stored and retrievable
- The system can aggregate and visualize collected data
The statistics serve as a proof-of-concept that the infrastructure can support larger-scale data collection and future empirical studies.
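The tallies above reduce to simple aggregations over the feedback tables; the queries below are a sketch, and the actual logic in database_service.py may differ.

```python
# Illustrative aggregations behind the statistics dashboard.
import sqlite3

def dashboard_counts(conn: sqlite3.Connection) -> dict:
    def one(sql: str):
        return conn.execute(sql).fetchone()[0]
    return {
        "total_comparisons": one("SELECT COUNT(*) FROM summary_comparisons"),
        "average_confidence": one("SELECT AVG(confidence) FROM summary_comparisons"),
        "total_modifications": one("SELECT COUNT(*) FROM summaries WHERE edit_based = 1"),
        "from_scratch_summaries": one("SELECT COUNT(*) FROM summaries WHERE scratch_gold = 1"),
    }
```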
The following limitations are explicitly acknowledged:
- Single annotator: All feedback comes from one human annotator
- Small pilot dataset: Initial data collection is limited in scope
- No inter-annotator agreement: Cannot measure annotation consistency or reliability
- Limited model comparison: Currently only MT5 and KMeans are integrated
- No model training: Collected data has not been used for fine-tuning yet
- Baseline models only: No state-of-the-art or custom models evaluated
- No downstream training: Data has not been used to improve models
- No evaluation metrics: ROUGE, BLEU, or other automatic metrics not computed
- No human evaluation: No systematic human evaluation of model improvements
- Single language focus: Currently limited to Bengali
- Single domain: News articles only
- Limited document types: No variation in article length or complexity
These limitations are expected for an MVP/proof-of-concept system and are explicitly acknowledged to set appropriate expectations for the scope of this work.
Fine-tuning experiments (a loss sketch follows this list):
- Fine-tune MT5 using collected signals:
- Preference-only dataset (ranking loss)
- Post-edit-only dataset (supervised learning from edits)
- Gold-only dataset (standard supervised learning)
- Combined signals (multi-task learning)
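As one concrete direction, the preference-only setting could use a Bradley-Terry-style ranking loss, optionally weighted by annotator confidence. This is a hypothetical sketch in PyTorch, not an implemented trainer.

```python
# Hypothetical confidence-weighted pairwise ranking loss (PyTorch).
import torch
import torch.nn.functional as F

def preference_loss(score_pref: torch.Tensor,
                    score_rej: torch.Tensor,
                    confidence: torch.Tensor) -> torch.Tensor:
    """Push the preferred summary's score above the rejected one's,
    weighting each pair by annotator confidence (1-5 scale)."""
    weight = confidence.float() / 5.0
    return -(weight * F.logsigmoid(score_pref - score_rej)).mean()
```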
Evaluation framework (see the sketch after this list):
- Measure ROUGE scores against gold summaries
- Compare human preference alignment with automatic metrics
- Evaluate model improvements on held-out test set
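For the ROUGE measurements, the rouge-score package offers a straightforward starting point; note that its default tokenization is English-oriented, so Bengali scores should be interpreted with caution. The example strings are placeholders.

```python
# Sketch of ROUGE evaluation against a collected gold summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    target="<human-written gold summary>",   # Mode C reference
    prediction="<model-generated summary>",  # system output
)
print(scores["rougeL"].fmeasure)
```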
Multi-annotator support:
- Deploy system to multiple annotators
- Measure inter-annotator agreement
- Aggregate preferences with confidence weighting
- Identify systematic annotation patterns
Larger dataset:
- Scale to hundreds or thousands of documents
- Diversify document types and domains
- Collect longitudinal data over time
Additional models:
- Integrate GPT-based summarizers
- Add more extractive models for comparison
- Support custom fine-tuned models
Additional languages:
- Extend to other low-resource languages
- Compare annotation patterns across languages
- Multi-lingual summarization evaluation
Advanced feedback types:
- Sentence-level preference signals
- Error type annotations (factual, coherence, etc.)
- Quality score annotations (beyond binary preference)
Preference learning:
- Implement RLHF using preference data
- Compare different preference aggregation methods
- Study confidence-weighted learning
Post-editing analysis:
- Analyze edit patterns to identify model weaknesses
- Use edits for targeted model improvement
- Compare edit-based learning vs. gold-standard learning
Evaluation methodology:
- Develop Bengali-specific evaluation metrics
- Study correlation between human preference and automatic metrics
- Design evaluation protocols for low-resource settings
This project demonstrates the feasibility of structured HITL supervision for Bengali summarization. The contribution is an extensible research platform, not a final model.
- System Design: A modular, extensible architecture for collecting multiple forms of human feedback
- Data Infrastructure: Structured schema supporting preference, post-edit, and gold signals
- Proof-of-Concept: End-to-end pipeline validation demonstrating system functionality
- Research Foundation: Platform enabling future empirical studies
The system serves as a foundation for future empirical studies in:
- Human-guided summarization for low-resource languages
- Preference learning and RLHF for Bengali NLP
- Post-editing workflows for abstractive summarization
- Evaluation methodology for low-resource settings
The MVP demonstrates that systematic human feedback collection is feasible and can be scaled. The extensible design ensures the system can grow with research needs, supporting everything from small pilot studies to large-scale annotation campaigns. While this work does not claim performance improvements, it provides the essential infrastructure for future research that will.
- Dataset: https://www.kaggle.com/datasets/towhidahmedfoysal/bangla-summarization-datasetprothom-alo?resource=download
- Model: https://huggingface.co/tashfiq61/bengali-summarizer-mt5
- `SummarizationComparison.tsx`: Side-by-side comparison interface
- `SummaryModification.tsx`: Post-editing interface
- `SummaryFromScratch.tsx`: Gold summary writing interface
- `Statistics.tsx`: Data visualization dashboard
- `controller.py`: RESTful API endpoints
  - `/next-article`: Document retrieval with model summary generation
  - `/submit-comparison`: Preference signal storage
  - `/submit-modification`: Post-edit signal storage
  - `/submit-from-scratch`: Gold summary storage
  - `/statistics`: Aggregated data retrieval
- `database/schema.sql`: Complete SQLite schema definition
- `database/database_service.py`: Data access layer with business logic
- `services/mt5_summarization/mt5_service.py`: MT5 abstractive summarization
- `services/kmeans_summarization/kmeans_service.py`: KMeans extractive summarization
End of Report