IntelliTag - Product Requirements Document (PRD)
| Field | Value |
| --- | --- |
| Product Name | IntelliTag |
| Version | 2.0 (Portfolio Edition) |
| Author | Thomas Mebarki |
| Client | Stack Overflow Inc. |
| Status | Delivered |
| Last Updated | October 2023 |

This PRD defines the functional and non-functional requirements for IntelliTag, an intelligent tag suggestion system developed for Stack Overflow. The system leverages multiple NLP techniques to analyze question content and suggest relevant tags with high accuracy.
This document covers:
- Functional requirements for data processing, feature extraction, and prediction
- Non-functional requirements (performance, scalability, maintainability)
- Technical constraints and dependencies
- Acceptance criteria
Stack Overflow's tag system is critical for question discoverability. With 60,000+ tags, users often struggle to select appropriate tags, leading to:
- Reduced question visibility
- Increased moderation overhead
- Poor user experience for new contributors
IntelliTag addresses this by providing intelligent, context-aware tag suggestions.
IntelliTag is a machine learning pipeline that:
1. Ingests question data (title + body)
2. Preprocesses text using NLP techniques
3. Extracts features using multiple approaches (BoW, Word2Vec, BERT, USE)
4. Predicts relevant tags using multi-label classification
5. Returns ranked suggestions with confidence scores
| User Type | Description | Primary Need |
| --- | --- | --- |
| Question Authors | Users posting new questions | Accurate tag suggestions |
| Moderators | Community moderators | Bulk tagging tools |
| API Consumers | Third-party integrations | Programmatic access |
| Data Scientists | Internal ML team | Model interpretability |

- For Users: Reduced friction, better question visibility
- For Moderators: Lower retagging workload
- For Platform: Improved content organization and searchability
3. Functional Requirements
3.1 Data Ingestion (FR-100)
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-101 | System SHALL load data from CSV format | P0 |
| FR-102 | System SHALL parse Title, Body, Tags, and Id fields | P0 |
| FR-103 | System SHALL handle missing values gracefully | P1 |
| FR-104 | System SHALL support incremental data loading | P2 |
| FR-105 | System SHALL log ingestion statistics | P1 |

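
The ingestion requirements above (FR-101, FR-102, FR-103, FR-105) can be sketched with the standard library alone. This is a minimal illustration, not the delivered loader: the names `load_questions`, `parse_tags`, and `IngestionStats` are hypothetical, and skipping rows that lack a title or tags is only one possible reading of "handle missing values gracefully". The `Tags` column is assumed to use the `<tag1><tag2><tag3>` format given in the data specification.

```python
import csv
import io
import re
from dataclasses import dataclass
from typing import TextIO

REQUIRED_FIELDS = ("Id", "Title", "Body", "Tags")
TAG_RE = re.compile(r"<([^<>]+)>")


@dataclass
class IngestionStats:
    """Counters that a caller could log after loading (FR-105)."""
    rows_read: int = 0
    rows_kept: int = 0
    rows_skipped: int = 0


def parse_tags(raw: str) -> list[str]:
    """Split '<python><c++>' into ['python', 'c++']; empty or None gives []."""
    return TAG_RE.findall(raw or "")


def load_questions(stream: TextIO) -> tuple[list[dict], IngestionStats]:
    """Read question rows from CSV text, skipping rows without a title or tags."""
    reader = csv.DictReader(stream)
    missing = [name for name in REQUIRED_FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    stats = IngestionStats()
    questions: list[dict] = []
    for row in reader:
        stats.rows_read += 1
        title = (row["Title"] or "").strip()
        tags = parse_tags(row["Tags"])
        if not title or not tags:
            # Assumption: such rows are unusable for training and are dropped
            stats.rows_skipped += 1
            continue
        # A missing body is tolerated and replaced by an empty string
        questions.append(
            {"id": row["Id"], "title": title, "body": row["Body"] or "", "tags": tags}
        )
        stats.rows_kept += 1
    return questions, stats


sample = io.StringIO(
    "Id,Title,Body,Tags\n"
    "1,Merge two dicts,<p>How?</p>,<python><dictionary>\n"
    "2,,<p>No title</p>,<python>\n"
    "3,No tags,<p>x</p>,\n"
)
questions, stats = load_questions(sample)
print(stats)
```

Returning the counters alongside the rows keeps logging policy out of the loader; the caller decides where the statistics go.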
3.2 Text Preprocessing (FR-200)
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-201 | System SHALL remove HTML tags from Body content | P0 |
| FR-202 | System SHALL tokenize text into word units | P0 |
| FR-203 | System SHALL remove English stop words | P0 |
| FR-204 | System SHALL preserve technical terms (code, libraries) | P1 |
| FR-205 | System SHALL apply lemmatization | P1 |
| FR-206 | System SHALL normalize text to lowercase | P0 |
| FR-207 | System SHALL handle special characters (+, #, .) | P1 |
| FR-208 | System SHALL support configurable preprocessing pipelines | P2 |

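
To make the interaction between FR-201, FR-202, FR-203, FR-206, and FR-207 concrete, here is a small standard-library sketch. It is an illustration under stated assumptions: the stop-word set is a tiny stand-in for a real list (NLTK and spaCy both ship one), the token pattern is one possible way to keep terms such as `c++`, `c#`, and `.net` intact, and lemmatization (FR-205) is deliberately left out. The names `strip_html` and `preprocess` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word set; a real pipeline would load a full list.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "i", "of", "and", "with"}

# A token may start with "." (.net) and may contain +, #, . or - (c++, c#, node.js).
TOKEN_RE = re.compile(r"\.?[a-z0-9][a-z0-9+#.\-]*")


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Drop tags and keep text content, separated by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def preprocess(title: str, body: str) -> list[str]:
    """Lowercase, strip HTML, tokenize, and remove stop words."""
    text = f"{title} {strip_html(body)}".lower()
    tokens: list[str] = []
    for raw in TOKEN_RE.findall(text):
        token = raw.rstrip(".-")  # remove sentence-final punctuation
        if token not in STOP_WORDS:
            tokens.append(token)
    return tokens


print(preprocess("How do I parse JSON in C#?", "<p>I use <code>Json.NET</code> with .NET 6.</p>"))
```

Lowercasing before tokenizing lets one pattern cover both `C#` and `c#`; a configurable pipeline (FR-208) would expose each of these steps as a switchable stage.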
3.3 Feature Extraction (FR-300)
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-301 | System SHALL generate Bag-of-Words features using TF-IDF | P0 |
| FR-302 | System SHALL generate Word2Vec embeddings | P1 |
| FR-303 | System SHALL generate BERT embeddings | P1 |
| FR-304 | System SHALL generate Universal Sentence Encoder embeddings | P1 |
| FR-305 | System SHALL support configurable vocabulary size | P1 |
| FR-306 | System SHALL handle out-of-vocabulary words | P1 |
| FR-307 | System SHALL cache computed embeddings for efficiency | P2 |

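
The sketch below shows how FR-301, FR-305, and FR-306 fit together for the Bag-of-Words baseline: a vocabulary capped at `max_features`, TF-IDF weighting, and out-of-vocabulary tokens silently ignored at transform time. It is written in plain Python for readability and is not the project's implementation, which would more plausibly use scikit-learn; `fit_tfidf` and `transform` are hypothetical names. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is the variant scikit-learn documents as its default.

```python
import math
from collections import Counter


def fit_tfidf(docs: list[list[str]], max_features: int) -> tuple[dict[str, int], dict[str, float]]:
    """Learn a vocabulary capped at max_features plus smoothed IDF weights."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    # Keep the terms found in the most documents; ties are broken alphabetically
    kept = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:max_features]
    vocab = {term: index for index, term in enumerate(sorted(kept))}

    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform(tokens: list[str], vocab: dict[str, int], idf: dict[str, float]) -> list[float]:
    """Return an L2-normalised TF-IDF vector; unknown tokens are ignored."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0.0] * len(vocab)
    for term, count in counts.items():
        vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


corpus = [["python", "list"], ["python", "dict"], ["java", "list"]]
vocab, idf = fit_tfidf(corpus, max_features=3)
print(vocab)
print(transform(["python", "rust"], vocab, idf))  # "rust" is out of vocabulary
```

A document made only of unknown tokens yields an all-zero vector, so downstream code should be prepared for that case.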
3.4 Topic Modeling (FR-400)
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-401 | System SHALL train LDA models with configurable topic count | P1 |
| FR-402 | System SHALL compute topic coherence scores | P1 |
| FR-403 | System SHALL extract topic distributions per document | P1 |
| FR-404 | System SHALL visualize topic-word distributions | P2 |

3.5 Classification (FR-500)
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-501 | System SHALL support multi-label classification | P0 |
| FR-502 | System SHALL output probability scores per tag | P0 |
| FR-503 | System SHALL support configurable prediction threshold | P1 |
| FR-504 | System SHALL handle class imbalance | P1 |
| FR-505 | System SHALL support model ensemble strategies | P2 |
| FR-506 | System SHALL limit predictions to top-k tags | P0 |

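
FR-502, FR-503, and FR-506 combine into a simple post-processing step on the classifier output: rank tags by probability, drop those below a threshold, and keep at most k. The function below is a sketch of that step only; `select_tags` and its default values (`k=5`, `threshold=0.2`) are assumptions for illustration, not values taken from the delivered system.

```python
def select_tags(scores: dict[str, float], k: int = 5, threshold: float = 0.2) -> list[tuple[str, float]]:
    """Return up to k (tag, score) pairs with score >= threshold, best first."""
    # Sort by descending score; equal scores fall back to alphabetical order
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(tag, score) for tag, score in ranked if score >= threshold][:k]


scores = {"python": 0.92, "pandas": 0.81, "csv": 0.35, "numpy": 0.35, "java": 0.04}
print(select_tags(scores, k=3))
```

Applying the threshold before the top-k cut means a question with few confident predictions returns fewer than k tags instead of padding the list with weak ones.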
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-601 | System SHALL compute Precision@k metric | P0 |
| FR-602 | System SHALL compute Recall@k metric | P0 |
| FR-603 | System SHALL compute F1-score | P0 |
| FR-604 | System SHALL support cross-validation | P1 |
| FR-605 | System SHALL generate classification reports | P1 |
| FR-606 | System SHALL compare multiple model configurations | P1 |

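
Precision@k and Recall@k (FR-601, FR-602) for a single question can be sketched as follows, matching the glossary definitions in this document. Two conventions are assumed here and may differ from the delivered evaluation code: precision divides by k even when fewer than k tags were predicted, and a question with no true tags gets a recall of 0.

```python
def precision_at_k(predicted: list[str], actual: set[str], k: int) -> float:
    """Share of the top-k predictions that are true tags."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for tag in predicted[:k] if tag in actual)
    return hits / k


def recall_at_k(predicted: list[str], actual: set[str], k: int) -> float:
    """Share of the true tags that appear in the top-k predictions."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not actual:
        return 0.0
    hits = sum(1 for tag in predicted[:k] if tag in actual)
    return hits / len(actual)


predicted = ["python", "pandas", "numpy", "csv", "dataframe"]
actual = {"python", "pandas", "dataframe"}
print(precision_at_k(predicted, actual, 5), recall_at_k(predicted, actual, 5))
```

Corpus-level figures would then be the mean of these per-question values over the evaluation set.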
| ID | Requirement | Priority |
| --- | --- | --- |
| FR-701 | System SHALL expose REST API for predictions | P0 |
| FR-702 | API SHALL accept JSON input (title, body) | P0 |
| FR-703 | API SHALL return JSON output (tags, scores) | P0 |
| FR-704 | API SHALL support health check endpoint | P1 |
| FR-705 | API SHALL validate input format | P1 |
| FR-706 | API SHALL handle errors gracefully with proper codes | P1 |

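
The request-handling rules in FR-702, FR-703, FR-705, and FR-706 can be illustrated independently of any web framework. The sketch below takes a raw request body and returns a status code with a JSON-serialisable payload. Everything specific in it is an assumption: the function name `handle_predict`, the response shape, and the choice of 400, 422, and 500 as status codes are illustrative, since the source only requires "proper codes".

```python
import json
from typing import Callable

Predictor = Callable[[str, str], list[tuple[str, float]]]


def handle_predict(raw_body: str, predict_fn: Predictor) -> tuple[int, dict]:
    """Validate a prediction request and return (HTTP status, response payload)."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 400, {"error": "request body is not valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}

    title = payload.get("title")
    body = payload.get("body", "")
    if not isinstance(title, str) or not title.strip():
        return 422, {"error": "'title' must be a non-empty string"}
    if not isinstance(body, str):
        return 422, {"error": "'body' must be a string"}

    try:
        suggestions = predict_fn(title, body)
    except Exception:
        # Hide internal details from the client; the caller would log the traceback
        return 500, {"error": "prediction failed"}

    return 200, {"tags": [{"tag": tag, "score": score} for tag, score in suggestions]}


def stub_model(title: str, body: str) -> list[tuple[str, float]]:
    """Hypothetical stand-in for the trained classifier."""
    return [("python", 0.91), ("pandas", 0.5)]


print(handle_predict('{"title": "Read a CSV file", "body": "<p>How?</p>"}', stub_model))
print(handle_predict('{"body": "no title"}', stub_model))
```

Keeping validation in a plain function makes it testable without starting a server; a Flask or FastAPI route would only translate the returned tuple into a response.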
4. Non-Functional Requirements
4.1 Performance (NFR-100)
| ID | Requirement | Target |
| --- | --- | --- |
| NFR-101 | API response time | < 200 ms (p95) |
| NFR-102 | Preprocessing throughput | > 1000 docs/sec |
| NFR-103 | Model inference time | < 100 ms per prediction |
| NFR-104 | Batch processing capacity | 50,000 docs in < 10 min |

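
A p95 target such as NFR-101 is checked against a distribution of measurements, not a single call. The sketch below times a callable over a set of inputs and applies the nearest-rank percentile definition. It is one reasonable way to verify the target, not the project's benchmark harness; the helper names are hypothetical, and other percentile definitions (for example, linear interpolation) give slightly different values.

```python
import math
import time
from typing import Callable, Iterable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(fn: Callable[[str], object], inputs: Iterable[str]) -> list[float]:
    """Call fn once per input and record wall-clock latency in milliseconds."""
    latencies: list[float] = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def meets_latency_target(latencies_ms: list[float], target_ms: float = 200.0) -> bool:
    """True when the 95th-percentile latency is strictly below the target."""
    return percentile(latencies_ms, 95) < target_ms


latencies = measure_latencies_ms(lambda text: text.lower().split(), ["Some question"] * 200)
print(meets_latency_target(latencies))
```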
4.2 Scalability (NFR-200)
| ID | Requirement | Target |
| --- | --- | --- |
| NFR-201 | Concurrent API requests | 100 requests/sec |
| NFR-202 | Dataset size support | Up to 10M questions |
| NFR-203 | Model size | < 500 MB RAM |

4.3 Reliability (NFR-300)
| ID | Requirement | Target |
| --- | --- | --- |
| NFR-301 | API uptime | 99.5% |
| NFR-302 | Error rate | < 0.1% |
| NFR-303 | Graceful degradation | Fallback to BoW if DL fails |

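
The graceful-degradation target in NFR-303 amounts to wrapping the deep-learning predictor so that any failure routes the request to the Bag-of-Words model. A minimal sketch, assuming both models share the same call signature; `predict_with_fallback` and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("intellitag")

Predictor = Callable[[str, str], list[tuple[str, float]]]


def predict_with_fallback(
    title: str, body: str, primary: Predictor, fallback: Predictor
) -> tuple[list[tuple[str, float]], str]:
    """Try the primary (deep-learning) model; on any error use the BoW fallback."""
    try:
        return primary(title, body), "primary"
    except Exception:
        # Record the traceback, then degrade instead of failing the request
        logger.exception("primary model failed; falling back to BoW")
        return fallback(title, body), "fallback"


def broken_bert(title: str, body: str) -> list[tuple[str, float]]:
    raise RuntimeError("embedding service unavailable")


def bow_model(title: str, body: str) -> list[tuple[str, float]]:
    return [("python", 0.7)]


print(predict_with_fallback("Read a CSV file", "", broken_bert, bow_model))
```

Returning which path served the request lets monitoring track how often the fallback is used, which bears on the error-rate target in NFR-302.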
4.4 Maintainability (NFR-400)
| ID | Requirement | Target |
| --- | --- | --- |
| NFR-401 | Code coverage | > 80% |
| NFR-402 | Documentation coverage | 100% public functions |
| NFR-403 | Code style | PEP 8 compliant |
| NFR-404 | Type annotations | All functions typed |

| ID | Requirement | Target |
| --- | --- | --- |
| NFR-501 | Input sanitization | All user inputs validated |
| NFR-502 | No data persistence | Questions not stored |
| NFR-503 | Rate limiting | 100 requests/min per client |

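
One way to enforce a per-client limit such as NFR-503 is a sliding-window counter held in memory. The class below is a sketch under assumptions: the source does not say which algorithm was used, `SlidingWindowLimiter` is a hypothetical name, and an in-memory store only works for a single-instance deployment.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_seconds` span."""

    def __init__(
        self,
        limit: int = 100,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have left the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=100, window_seconds=60.0)
print(all(limiter.allow("client-a") for _ in range(100)), limiter.allow("client-a"))
```

Only timestamps are kept per client, never question text, so the limiter does not conflict with the no-persistence requirement in NFR-502.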
5. Technical Specifications
| Component | Technology | Version |
| --- | --- | --- |
| Language | Python | 3.9+ |
| ML Framework | scikit-learn | 1.0+ |
| Deep Learning | TensorFlow | 2.x |
| NLP | NLTK, spaCy | Latest |
| Embeddings | transformers (BERT) | 4.x |
| Topic Modeling | gensim | 4.x |
| API Framework | Flask/FastAPI | Latest |
| Deployment | Heroku | - |

| Attribute | Specification |
| --- | --- |
| Input Format | CSV (UTF-8) |
| Required Fields | Title, Body, Tags, Id |
| Sample Size | 50,000 questions (showcase) |
| Tag Format | `<tag1><tag2><tag3>` |

| Model | Dimensions | Use Case |
| --- | --- | --- |
| TF-IDF | 10,000 features | Baseline |
| Word2Vec | 300 dimensions | Semantic similarity |
| BERT | 768 dimensions | Deep understanding |
| USE | 512 dimensions | Efficiency |
| LDA | 10-15 topics | Topic discovery |

6. Constraints & Assumptions
Constraints:
- Data Privacy: Production data cannot be included in the portfolio
- Model Weights: Proprietary models excluded from showcase
- Infrastructure: Demo limited to single-instance deployment
- Language: English-only in this version

Assumptions:
- Questions have at least one valid tag
- Tag vocabulary is finite and known
- Question body is more informative than title
- Technical jargon follows consistent patterns

| Dependency | Type | Risk Level |
| --- | --- | --- |
| Stack Exchange Data Explorer | Data Source | Low |
| TensorFlow Hub | Model Hub | Medium |
| Hugging Face | Model Hub | Medium |
| Heroku | Deployment | Low |

7.1 Functional Acceptance
| Criteria | Test |
| --- | --- |
| Data loads successfully | 50,000 rows parsed without error |
| Preprocessing works | HTML removed, tokens generated |
| Features extracted | All 4 methods produce valid vectors |
| Model predicts | Returns top-5 tags with scores |
| API responds | Valid JSON response in < 200 ms |

| Metric | Minimum | Target | Achieved |
| --- | --- | --- | --- |
| Precision@5 | 65% | 70% | 78% |
| Recall@5 | 45% | 50% | 62% |
| F1-Score | 0.55 | 0.60 | 0.69 |
| Code Coverage | 70% | 80% | 85% |

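
As a sanity check on the table above, and assuming the F1-Score row is the harmonic mean of the Precision@5 and Recall@5 rows (the source does not state how F1 was averaged), the achieved figures are mutually consistent:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.78 \times 0.62}{0.78 + 0.62}
    = \frac{0.9672}{1.40}
    \approx 0.69
```

The same formula applied to the Minimum and Target columns gives roughly 0.53 and 0.58, slightly below the listed 0.55 and 0.60, which suggests those F1 thresholds were set independently of the precision and recall thresholds.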
| Term | Definition |
| --- | --- |
| BoW | Bag-of-Words: Text representation counting word occurrences |
| TF-IDF | Term Frequency-Inverse Document Frequency weighting |
| Word2Vec | Neural network producing word embeddings |
| BERT | Bidirectional Encoder Representations from Transformers |
| USE | Universal Sentence Encoder by Google |
| LDA | Latent Dirichlet Allocation for topic modeling |
| Multi-label | Classification where items can have multiple labels |
| Precision@k | Proportion of relevant tags in top-k predictions |
| Recall@k | Proportion of actual tags found in top-k predictions |

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2023-03 | Thomas Mebarki | Initial PRD |
| 1.1 | 2023-04 | Thomas Mebarki | Added NFRs |
| 2.0 | 2023-10 | Thomas Mebarki | Portfolio edition |

This PRD documents requirements as delivered to Stack Overflow. Certain specifications have been generalized for portfolio presentation.