Problem: High-quality Bengali abstractive summarization data is scarce.
Approach: We present a human-in-the-loop (HITL) framework that collects multiple forms of human feedback on model-generated summaries.
Contribution: An extensible MVP system that supports preference learning, post-editing, and gold-summary creation.
Scope: This work focuses on system design and data generation, not model training.
Outcome: Demonstrates feasibility of scalable human-guided supervision for low-resource summarization.
This project establishes a proof-of-concept research infrastructure for systematically collecting structured human feedback to address the data bottleneck in Bengali summarization research.
Bengali is a low-resource language in NLP research. While multilingual models like MT5 can generate Bengali summaries, output quality is often compromised by limited training data, and progress is difficult to measure given the scarcity of evaluation benchmarks.
Existing abstractive models (e.g., MT5) often produce:
- Factual drift: Summaries that introduce information not present in the source
- Verbosity: Overly long summaries that fail to compress key information
- Loss of key details: Important information from the source text is omitted
Improving model quality requires structured human feedback, not just more raw data. Traditional approaches rely on large-scale parallel corpora that are expensive and time-consuming to create. A more scalable approach is to leverage human feedback in multiple forms to guide model improvement.
Important framing: This project targets the data bottleneck, not model architecture. We focus on creating a systematic way to collect and structure human feedback that can enable future model training and evaluation.
To explore how different forms of human feedback can be systematically collected to support Bengali summarization improvement.
This project is intentionally scoped as:
- a minimal viable system
- a design and feasibility study
- a proof-of-concept research infrastructure
The system is designed to be:
- Extensible: Can accommodate new models, languages, and feedback types
- Modular: Each annotation mode operates independently
- Data-focused: Emphasizes structured data collection over immediate model improvement
Our system is a web-based HITL annotation interface with a modular design that supports independent annotation modes and centralized storage of structured feedback signals.
The system consists of three main components (a route-level sketch of the backend follows this list):
- Frontend Interface (React/TypeScript): Web-based UI for human annotators
- Backend API (Python/Bottle): RESTful API for data management and model integration
- Database (SQLite): Centralized storage for documents, summaries, and feedback
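To make the component boundaries concrete, here is a minimal sketch of the backend's route layout in Python/Bottle. The endpoint names follow the controller.py listing in the appendix; the handler bodies and the port are placeholders, not the actual implementation.

```python
# Minimal sketch of the backend route layout (Python/Bottle).
# Endpoint names follow controller.py; handler bodies are placeholders.
from bottle import Bottle

app = Bottle()

@app.get("/next-article")
def next_article():
    """Serve the next document plus freshly generated model summaries."""

@app.post("/submit-comparison")
def submit_comparison():
    """Persist a pairwise preference judgment (Mode A)."""

@app.post("/submit-modification")
def submit_modification():
    """Persist a post-edited summary (Mode B)."""

@app.post("/submit-from-scratch")
def submit_from_scratch():
    """Persist a human-written gold summary (Mode C)."""

@app.get("/statistics")
def statistics():
    """Return aggregated annotation statistics."""

if __name__ == "__main__":
    app.run(host="localhost", port=8080)  # port is an assumption
```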
Input: Bengali news articles from a curated dataset
Models:
- MT5 (baseline abstractive summarizer)
- KMeans (extractive summarizer for comparison)
Human: Feedback via web UI in three distinct modes
Output: Structured dataset for downstream use (fine-tuning, evaluation, analysis)
```mermaid
flowchart TD
    A[Bengali News Articles] --> B[Document Database]
    B --> C[Model Generation]
    C --> D[MT5 Abstractive]
    C --> E[KMeans Extractive]
    D --> F[Human Annotation Interface]
    E --> F
    F --> G[Summary Comparison]
    F --> H[Summary Modification]
    F --> I[Summary From Scratch]
    G --> J[Structured Feedback Database]
    H --> J
    I --> J
    J --> K[Export for Training/Evaluation]
```
The architecture is model-agnostic and language-agnostic by design. While the current implementation focuses on Bengali (language code 'bn'), the system can be extended to other languages by modifying the language filter in the document retrieval logic.
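As an illustration, document retrieval reduces to a single parameterized query; the function and column names below are assumptions based on the schema section, not code from the repository.

```python
# Sketch of parameterized document retrieval; swapping the language
# code is all that is needed to target another language.
import sqlite3

def next_document(conn: sqlite3.Connection, language: str = "bn"):
    """Fetch one document in the requested language."""
    return conn.execute(
        "SELECT doc_id, title, description FROM documents WHERE language = ?",
        (language,),
    ).fetchone()
```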
This system supports three complementary feedback signals, each capturing different aspects of human judgment about summary quality.
Purpose: Collect pairwise preference judgments between two model-generated summaries.
Interface: Side-by-side comparison of two summaries with the original article visible.
Annotator selects:
- Better summary: Left or right panel (Model A or Model B)
- Confidence score: 1-5 scale indicating certainty of preference
- Reason: One or more of:
  - Concise: More compact and efficient
  - Faithful: More accurate to source content
  - Readable: Better language quality and flow
Data captured:
- Summary IDs being compared
- Preference (1 = left preferred, 2 = right preferred)
- Confidence level (1-5)
- Reason for preference
Enables future work on:
- Preference learning algorithms
- Ranking-based fine-tuning (e.g., RLHF)
- Model comparison studies
Implementation: The comparison interface (SummarizationComparison.tsx) allows annotators to select any combination of MT5 and KMeans models for comparison, providing flexibility in evaluation scenarios.
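For concreteness, a single comparison record might look as follows; the field names match the "Data captured" list above, while the concrete values are invented for illustration.

```python
# Illustrative Mode A record; values are made up.
comparison = {
    "summary_id_1": 101,   # summary shown in the left panel (e.g., MT5)
    "summary_id_2": 102,   # summary shown in the right panel (e.g., KMeans)
    "preference": 1,       # 1 = left preferred, 2 = right preferred
    "confidence": 4,       # 1-5 certainty scale
    "reason": "faithful",  # one of: concise, faithful, readable
}
```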
Purpose: Capture human corrections and improvements to model-generated summaries.
Interface: Annotator views the original article and a model-generated summary, then edits the summary directly.
Captures:
- Correction patterns: What errors do models make?
- Compression preferences: How do humans compress information differently?
- Factual adjustments: What factual errors need correction?
Data captured:
- Original document ID
- Parent summary ID (the model-generated summary being edited)
- Modified summary text
- Origin type: `human_modified`
- Edit flag: `edit_based = 1`
Research value: This mirrors post-editing workflows used in machine translation research, where human edits reveal systematic model weaknesses.
Implementation: The modification interface (SummaryModification.tsx) pre-populates the editable text area with the model output, allowing annotators to make targeted improvements while preserving the original for comparison.
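A post-edit record, by analogy, might look like this; field names follow the summaries table described later, and the values are invented.

```python
# Illustrative Mode B record; values are made up.
modified = {
    "doc_id": 17,
    "parent_summary_id": 101,         # the model summary that was edited
    "summary_text": "<corrected Bengali summary text>",
    "origin_type": "human_modified",
    "edit_based": 1,                  # flags this row as a post-edit
}
```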
Purpose: Collect high-quality human-written summaries as reference standards.
Interface: Annotator views only the original article and writes a summary from scratch.
Represents:
- Near-gold supervision: High-quality training examples
- Upper-bound reference quality: Best-case summary quality
- Human compression patterns: How humans naturally summarize
Data captured:
- Original document ID
- Human-written summary text
- Origin type: `human_written`
- Gold flag: `scratch_gold = 1`
Research value: These summaries serve as:
- Training targets for supervised learning
- Evaluation references (e.g., ROUGE scores)
- Quality benchmarks for model comparison
Implementation: The from-scratch interface (SummaryFromScratch.tsx) provides a clean writing environment with only the source text visible, minimizing bias from model outputs.
Each interaction is logged with comprehensive metadata to support multiple downstream use cases.
The database schema (schema.sql) is designed to support fine-tuning, evaluation, and multi-annotator extension; a condensed DDL sketch of the summaries table follows the table descriptions:
Documents Table:
- `doc_id`: Unique document identifier
- `title`, `description`: Source article content
- `language`: Language code (currently 'bn' for Bengali)
Summaries Table:
- `summary_id`: Unique summary identifier
- `doc_id`: Links to source document
- `summary_text`: The summary content
- `origin_type`: `model_generated`, `human_written`, or `human_modified`
- `model_name`, `model_version`: For model-generated summaries
- `parent_summary_id`: Links modified summaries to their source
- `edit_based`: Flag for post-edited summaries
- `scratch_gold`: Flag for human-written gold summaries
Summary Comparisons Table:
- `comparison_id`: Unique comparison identifier
- `summary_id_1`, `summary_id_2`: Summaries being compared
- `preference`: 1 or 2 (which summary is preferred)
- `confidence`: 1-5 confidence level
- `reason`: `concise`, `faithful`, or `readable`
Annotation Sessions Table:
- `session_id`: Unique session identifier
- `doc_id`: Document being annotated
- `user_id`: Annotator identifier (reserved for multi-annotator extension)
- `started_at`, `ended_at`: Session timing
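The sketch below condenses the summaries table into executable DDL; the exact types, constraints, and any additional columns in schema.sql may differ.

```python
# Condensed sketch of the summaries table; schema.sql is authoritative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE summaries (
    summary_id        INTEGER PRIMARY KEY,
    doc_id            INTEGER NOT NULL REFERENCES documents(doc_id),
    summary_text      TEXT NOT NULL,
    origin_type       TEXT CHECK (origin_type IN
                          ('model_generated', 'human_written', 'human_modified')),
    model_name        TEXT,              -- set for model-generated rows
    model_version     TEXT,
    parent_summary_id INTEGER REFERENCES summaries(summary_id),
    edit_based        INTEGER DEFAULT 0, -- 1 for Mode B post-edits
    scratch_gold      INTEGER DEFAULT 0  -- 1 for Mode C gold summaries
)
""")
```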
Model-agnostic design (sketched after this list):
- The system accepts any model name and version string
- Model integration requires only implementing a summarization function
- Current models: MT5 (abstractive), KMeans (extractive)
- Future models can be added without schema changes
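One way to picture the integration point: any callable from source text to summary can be registered under a name/version pair. The names below (`register_model`, `SUMMARIZERS`) are illustrative, not from the codebase.

```python
# Hypothetical registry showing the model-agnostic integration point.
from typing import Callable, Dict

Summarizer = Callable[[str], str]  # source text -> summary

SUMMARIZERS: Dict[str, Summarizer] = {}

def register_model(name: str, version: str, fn: Summarizer) -> None:
    """Adding a model requires only a name/version string and a function."""
    SUMMARIZERS[f"{name}:{version}"] = fn

# e.g. register_model("mt5", "v1", mt5_summarize) for a wrapper around
# the MT5 service; no schema changes are needed.
```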
Language-agnostic design:
- Language filtering is parameterized (currently 'bn')
- Database schema supports any language code
- UI components are language-agnostic (display text as-is)
Multi-annotator support:
- Schema includes `user_id` fields in annotation sessions
- Can track inter-annotator agreement (future work)
- Supports distributed annotation workflows
Export capabilities (sketched after this list):
- All tables can be exported as CSV files
- ZIP archive generation for complete dataset export
- Structured format ready for training pipelines
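A minimal export routine could stream each table into an in-memory CSV and bundle the results into a ZIP; the table list and file layout below are assumptions.

```python
# Sketch of CSV + ZIP export; table names follow the schema section.
import csv
import io
import sqlite3
import zipfile

TABLES = ["documents", "summaries", "summary_comparisons", "annotation_sessions"]

def export_dataset(db_path: str, zip_path: str) -> None:
    conn = sqlite3.connect(db_path)
    with zipfile.ZipFile(zip_path, "w") as archive:
        for table in TABLES:
            cursor = conn.execute(f"SELECT * FROM {table}")
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow([col[0] for col in cursor.description])  # header
            writer.writerows(cursor)                                 # data rows
            archive.writestr(f"{table}.csv", buf.getvalue())
    conn.close()
```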
Important emphasis: "The system is model-agnostic and language-agnostic by design."
Aggregated statistics from early interactions demonstrate that the end-to-end HITL pipeline functions correctly.
The Statistics component (Statistics.tsx) provides real-time visualization of collected data:
Tallies:
- Total comparisons: Number of pairwise preference judgments
- Total modifications: Number of post-edited summaries
- From-scratch summaries: Number of human-written gold summaries
Model Comparison Analysis:
- Model wins: Head-to-head comparison results between model pairs
- Win distribution: Which models are preferred and in what contexts
Confidence Metrics:
- Average confidence: Mean confidence level across all comparisons
- Indicates annotator certainty in their judgments
Modification Patterns:
- Modifications by parent origin: Which model outputs are most frequently edited
- Reveals which models produce summaries requiring more correction
These statistics are illustrative, not conclusive. They demonstrate that:
- Data collection is functioning correctly
- All three annotation modes are operational
- Feedback signals are being stored and retrievable
- The system can aggregate and visualize collected data
The statistics serve as a proof-of-concept that the infrastructure can support larger-scale data collection and future empirical studies.
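The tallies above reduce to simple aggregations over the feedback tables; the queries below are a sketch, and the actual logic in database_service.py may differ.

```python
# Illustrative aggregations behind the statistics dashboard.
import sqlite3

def dashboard_counts(conn: sqlite3.Connection) -> dict:
    def one(sql: str):
        return conn.execute(sql).fetchone()[0]
    return {
        "total_comparisons": one("SELECT COUNT(*) FROM summary_comparisons"),
        "average_confidence": one("SELECT AVG(confidence) FROM summary_comparisons"),
        "total_modifications": one("SELECT COUNT(*) FROM summaries WHERE edit_based = 1"),
        "from_scratch_summaries": one("SELECT COUNT(*) FROM summaries WHERE scratch_gold = 1"),
    }
```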
The following limitations are explicitly acknowledged:
- Single annotator: All feedback comes from one human annotator
- Small pilot dataset: Initial data collection is limited in scope
- No inter-annotator agreement: Cannot measure annotation consistency or reliability
- Limited model comparison: Currently only MT5 and KMeans are integrated
- No model training: Collected data has not been used for fine-tuning yet
- Baseline models only: No state-of-the-art or custom models evaluated
- No downstream training: Data has not been used to improve models
- No evaluation metrics: ROUGE, BLEU, or other automatic metrics not computed
- No human evaluation: No systematic human evaluation of model improvements
- Single language focus: Currently limited to Bengali
- Single domain: News articles only
- Limited document types: No variation in article length or complexity
These limitations are expected for an MVP/proof-of-concept system and are explicitly acknowledged to set appropriate expectations for the scope of this work.
Fine-tuning experiments (a loss sketch follows this list):
- Fine-tune MT5 using collected signals:
- Preference-only dataset (ranking loss)
- Post-edit-only dataset (supervised learning from edits)
- Gold-only dataset (standard supervised learning)
- Combined signals (multi-task learning)
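As one concrete direction, the preference-only setting could use a Bradley-Terry-style ranking loss, optionally weighted by annotator confidence. This is a hypothetical sketch in PyTorch, not an implemented trainer.

```python
# Hypothetical confidence-weighted pairwise ranking loss (PyTorch).
import torch
import torch.nn.functional as F

def preference_loss(score_pref: torch.Tensor,
                    score_rej: torch.Tensor,
                    confidence: torch.Tensor) -> torch.Tensor:
    """Push the preferred summary's score above the rejected one's,
    weighting each pair by annotator confidence (1-5 scale)."""
    weight = confidence.float() / 5.0
    return -(weight * F.logsigmoid(score_pref - score_rej)).mean()
```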
Evaluation framework (see the sketch after this list):
- Measure ROUGE scores against gold summaries
- Compare human preference alignment with automatic metrics
- Evaluate model improvements on held-out test set
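For the ROUGE measurements, the rouge-score package offers a straightforward starting point; note that its default tokenization is English-oriented, so Bengali scores should be interpreted with caution. The example strings are placeholders.

```python
# Sketch of ROUGE evaluation against a collected gold summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    target="<human-written gold summary>",   # Mode C reference
    prediction="<model-generated summary>",  # system output
)
print(scores["rougeL"].fmeasure)
```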
Multi-annotator support:
- Deploy system to multiple annotators
- Measure inter-annotator agreement
- Aggregate preferences with confidence weighting
- Identify systematic annotation patterns
Larger dataset:
- Scale to hundreds or thousands of documents
- Diversify document types and domains
- Collect longitudinal data over time
Additional models:
- Integrate GPT-based summarizers
- Add more extractive models for comparison
- Support custom fine-tuned models
Additional languages:
- Extend to other low-resource languages
- Compare annotation patterns across languages
- Multi-lingual summarization evaluation
Advanced feedback types:
- Sentence-level preference signals
- Error type annotations (factual, coherence, etc.)
- Quality score annotations (beyond binary preference)
Preference learning:
- Implement RLHF using preference data
- Compare different preference aggregation methods
- Study confidence-weighted learning
Post-editing analysis:
- Analyze edit patterns to identify model weaknesses
- Use edits for targeted model improvement
- Compare edit-based learning vs. gold-standard learning
Evaluation methodology:
- Develop Bengali-specific evaluation metrics
- Study correlation between human preference and automatic metrics
- Design evaluation protocols for low-resource settings
This project demonstrates the feasibility of structured HITL supervision for Bengali summarization. The contribution is an extensible research platform, not a final model.
- System Design: A modular, extensible architecture for collecting multiple forms of human feedback
- Data Infrastructure: Structured schema supporting preference, post-edit, and gold signals
- Proof-of-Concept: End-to-end pipeline validation demonstrating system functionality
- Research Foundation: Platform enabling future empirical studies
The system serves as a foundation for future empirical studies in:
- Human-guided summarization for low-resource languages
- Preference learning and RLHF for Bengali NLP
- Post-editing workflows for abstractive summarization
- Evaluation methodology for low-resource settings
The MVP demonstrates that systematic human feedback collection is feasible and can be scaled. The extensible design ensures the system can grow with research needs, supporting everything from small pilot studies to large-scale annotation campaigns. While this work does not claim performance improvements, it provides the essential infrastructure for future research that will.
- Dataset: https://www.kaggle.com/datasets/towhidahmedfoysal/bangla-summarization-datasetprothom-alo?resource=download
- Model: https://huggingface.co/tashfiq61/bengali-summarizer-mt5
- `SummarizationComparison.tsx`: Side-by-side comparison interface
- `SummaryModification.tsx`: Post-editing interface
- `SummaryFromScratch.tsx`: Gold summary writing interface
- `Statistics.tsx`: Data visualization dashboard
- `controller.py`: RESTful API endpoints
  - `/next-article`: Document retrieval with model summary generation
  - `/submit-comparison`: Preference signal storage
  - `/submit-modification`: Post-edit signal storage
  - `/submit-from-scratch`: Gold summary storage
  - `/statistics`: Aggregated data retrieval
- `database/schema.sql`: Complete SQLite schema definition
- `database/database_service.py`: Data access layer with business logic
- `services/mt5_summarization/mt5_service.py`: MT5 abstractive summarization
- `services/kmeans_summarization/kmeans_service.py`: KMeans extractive summarization
End of Report