| Field | Value |
|---|---|
| System | IntelliTag |
| Version | 2.0 |
| Author | Thomas Mebarki |
| Role | Solution Architect |
| Status | Delivered |
┌─────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SYSTEMS │
├─────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Stack │ │ Data │ │ Model Hubs │ │
│ │ Overflow │ │ Explorer │ │ (HuggingFace, TF Hub) │ │
│ │ Platform │ │ (SQL) │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └───────────┬─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ INTELLITAG SYSTEM │ │
│ │ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌──────────────┐ │ │
│ │ │ API │◄─┤ Predictor│◄─┤ Models │◄─┤ Feature │ │ │
│ │ │ Layer │ │ Service │ │ Layer │ │ Extraction │ │ │
│ │ └─────────┘ └──────────┘ └─────────┘ └──────────────┘ │ │
│ │ ▲ ▲ │ │
│ │ │ │ │ │
│ │ ┌─────────┐ ┌──────────────┐ │ │
│ │ │ Health │ │ Data │ │ │
│ │ │ Monitor │ │ Pipeline │ │ │
│ │ └─────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
IntelliTag follows a layered architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ REST API (Flask/FastAPI) ││
│ │ • POST /predict - Tag predictions ││
│ │ • GET /health - Health check ││
│ └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│ SERVICE LAYER │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Prediction Service ││
│ │ • Orchestrates preprocessing → features → prediction ││
│ │ • Handles model selection and ensemble ││
│ └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│ DOMAIN LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Preprocessor │ │ Feature │ │ Classifier │ │
│ │ │ │ Extractor │ │ │ │
│ │ • Tokenize │ │ • TF-IDF │ │ • Multi-label│ │
│ │ • Clean │ │ • Word2Vec │ │ • Ensemble │ │
│ │ • Lemmatize │ │ • BERT │ │ • Threshold │ │
│ │ │ │ • USE │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ INFRASTRUCTURE LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Data Loader │ │ Model Store │ │ Config │ │
│ │ │ │ │ │ │ │
│ │ • CSV I/O │ │ • Serialize │ │ • Settings │ │
│ │ • Validation │ │ • Versioning │ │ • Env vars │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
intellitag/
│
├── README.md # Project overview and quick start
├── LICENSE # MIT License
├── setup.py # Package installation
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── pyproject.toml # Modern Python config
├── .gitignore # Git ignore rules
├── .env.example # Environment template
├── Makefile # Common commands
│
├── src/ # Source code (package: intellitag)
│ └── intellitag/
│ ├── __init__.py # Package init with version
│ │
│ ├── config/ # Configuration management
│ │ ├── __init__.py
│ │ └── settings.py # Settings and constants
│ │
│ ├── data/ # Data handling
│ │ ├── __init__.py
│ │ ├── loader.py # Data loading utilities
│ │ └── preprocessor.py # Text preprocessing
│ │
│ ├── features/ # Feature extraction
│ │ ├── __init__.py
│ │ ├── base.py # Base extractor interface
│ │ ├── bow.py # Bag-of-Words (TF-IDF)
│ │ ├── word2vec.py # Word2Vec embeddings
│ │ ├── bert.py # BERT embeddings
│ │ └── use.py # Universal Sentence Encoder
│ │
│ ├── models/ # ML models
│ │ ├── __init__.py
│ │ ├── classifier.py # Multi-label classifier
│ │ ├── lda.py # Topic modeling
│ │ └── ensemble.py # Model ensemble
│ │
│ ├── api/ # API layer
│ │ ├── __init__.py
│ │ ├── app.py # Flask/FastAPI app
│ │ ├── routes.py # API routes
│ │ └── schemas.py # Request/Response schemas
│ │
│ └── utils/ # Utilities
│ ├── __init__.py
│ ├── logging.py # Logging configuration
│ └── metrics.py # Evaluation metrics
│
├── notebooks/ # Jupyter notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_model_evaluation.ipynb
│
├── tests/ # Test suite
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures
│ ├── unit/
│ │ ├── __init__.py
│ │ ├── test_preprocessor.py
│ │ ├── test_features.py
│ │ └── test_classifier.py
│ └── integration/
│ ├── __init__.py
│ └── test_api.py
│
├── docs/ # Documentation
│ ├── PRODUCT_VISION.md
│ ├── USER_STORIES.md
│ ├── PRD.md
│ ├── DATA_DICTIONARY.md
│ ├── ARCHITECTURE.md
│ └── API.md
│
├── models/ # Trained models (gitignored)
│ └── .gitkeep
│
├── data/ # Data files (gitignored)
│ ├── raw/
│ ├── processed/
│ └── .gitkeep
│
└── scripts/ # Utility scripts
├── train.py # Training script
└── evaluate.py # Evaluation script
intellitag/
├── config/ ◄── No dependencies (leaf)
├── utils/ ◄── config
├── data/ ◄── config, utils
├── features/ ◄── config, data, utils
├── models/ ◄── config, features, utils
└── api/ ◄── config, models, utils
class DataLoader:
    """Handles data loading and validation."""
    def load_csv(self, path: str) -> pd.DataFrame
    def validate_schema(self, df: pd.DataFrame) -> bool
    def split_data(self, df: pd.DataFrame, test_size: float) -> Tuple[pd.DataFrame, pd.DataFrame]

class TextPreprocessor:
"""Text preprocessing pipeline."""
def __init__(self, config: PreprocessConfig)
    def clean_html(self, text: str) -> str
    def tokenize(self, text: str) -> List[str]
    def remove_stop_words(self, tokens: List[str]) -> List[str]
    def lemmatize(self, tokens: List[str]) -> List[str]
    def process(self, text: str) -> str  # Full pipeline

class BaseFeatureExtractor(ABC):
"""Abstract base for all feature extractors."""
@abstractmethod
def fit(self, texts: List[str]) -> None
@abstractmethod
def transform(self, texts: List[str]) -> np.ndarray
def fit_transform(self, texts: List[str]) -> np.ndarray
@abstractmethod
def save(self, path: str) -> None
@classmethod
@abstractmethod
    def load(cls, path: str) -> 'BaseFeatureExtractor'

| Class | Description | Output Shape |
|---|---|---|
| BowExtractor | TF-IDF vectorization | (n, 10000) |
| Word2VecExtractor | Word2Vec embeddings | (n, 300) |
| BertExtractor | BERT [CLS] embeddings | (n, 768) |
| UseExtractor | USE sentence embeddings | (n, 512) |
class MultiLabelClassifier:
"""Multi-label tag classifier."""
def __init__(self,
base_estimator: str = 'logistic',
threshold: float = 0.3)
def fit(self, X: np.ndarray, y: np.ndarray) -> None
def predict(self, X: np.ndarray) -> List[List[str]]
def predict_proba(self, X: np.ndarray) -> np.ndarray
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]

class LDATopicModel:
    """Latent Dirichlet Allocation for topic modeling."""
    def __init__(self, n_topics: int = 10)
    def fit(self, texts: List[str]) -> None
    def get_topics(self) -> List[List[Tuple[str, float]]]
    def transform(self, texts: List[str]) -> np.ndarray

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/predict | Get tag predictions |
| GET | /api/v1/health | Health check |
| GET | /api/v1/info | Model information |
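The thresholded multi-label prediction that `MultiLabelClassifier` exposes (and that `/predict` serves) can be sketched with scikit-learn; the one-vs-rest internals here are illustrative, while the 0.3 threshold mirrors the default above:

```python
from typing import List

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelClassifier:
    """One-vs-rest logistic regression with a per-tag probability threshold."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold
        self._binarizer = MultiLabelBinarizer()
        self._model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

    def fit(self, X: np.ndarray, tags: List[List[str]]) -> None:
        # Binarize tag lists into an (n_samples, n_tags) indicator matrix.
        y = self._binarizer.fit_transform(tags)
        self._model.fit(X, y)

    def predict(self, X: np.ndarray) -> List[List[str]]:
        # Keep every tag whose estimated probability clears the threshold,
        # so a question can receive zero, one, or several tags.
        proba = self._model.predict_proba(X)
        classes = self._binarizer.classes_
        return [
            [tag for tag, p in zip(classes, row) if p >= self.threshold]
            for row in proba
        ]
```

Thresholding per tag, rather than taking an argmax, is what makes the classifier genuinely multi-label.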
# Request
class PredictionRequest(BaseModel):
title: str = Field(..., min_length=10, max_length=300)
body: str = Field(..., min_length=30, max_length=30000)
top_k: int = Field(default=5, ge=1, le=10)
# Response
class TagPrediction(BaseModel):
tag: str
confidence: float
class PredictionResponse(BaseModel):
status: str
predictions: List[TagPrediction]
model_version: str
    processing_time_ms: int

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────┐
│ Raw Data │───►│ Preprocessor │───►│ Feature │───►│ Classifier│
│ (CSV) │ │ │ │ Extractor │ │ Training │
└──────────┘ └──────────────┘ └─────────────┘ └───────────┘
│
▼
┌───────────┐
│ Model │
│ Artifacts │
└───────────┘
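The training flow above can be sketched end to end, with scikit-learn standing in for the feature and model layers and stdlib `pickle` for the model store (all names here are illustrative, not the actual `scripts/train.py`):

```python
import pickle
from pathlib import Path
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer


def train(texts: List[str], tag_lists: List[List[str]], artifact_dir: Path) -> None:
    """Raw text -> features -> classifier -> persisted model artifacts."""
    # Feature extraction: fit the vectorizer on the training corpus.
    vectorizer = TfidfVectorizer(max_features=10000)
    X = vectorizer.fit_transform(texts)

    # Target encoding: tag lists become a binary indicator matrix.
    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(tag_lists)

    # Classifier training: one binary model per tag.
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    model.fit(X, y)

    # Persist every artifact needed to reproduce predictions at serving time.
    artifact_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in [("vectorizer", vectorizer), ("binarizer", binarizer), ("model", model)]:
        with open(artifact_dir / f"{name}.pkl", "wb") as fh:
            pickle.dump(obj, fh)
```

Persisting the vectorizer alongside the model matters: the serving path must transform incoming text with exactly the vocabulary learned at training time.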
┌──────────┐ ┌──────────────┐ ┌─────────────┐ ┌───────────┐
│ API │───►│ Preprocessor │───►│ Feature │───►│ Classifier│
│ Request │ │ │ │ Extractor │ │ │
└──────────┘ └──────────────┘ └─────────────┘ └───────────┘
│
▼
┌───────────┐
│ API │
│ Response │
└───────────┘
| Component | Technology | Version | Rationale |
|---|---|---|---|
| Language | Python | 3.9+ | ML ecosystem maturity |
| ML Framework | scikit-learn | 1.0+ | Multi-label classification |
| Deep Learning | TensorFlow | 2.x | USE, BERT support |
| NLP | NLTK | 3.8+ | Preprocessing utilities |
| Data | pandas | 2.0+ | Data manipulation |
| Numerical | NumPy | 1.24+ | Array operations |
| API | FastAPI | 0.100+ | Modern, async, auto-docs |
| Validation | Pydantic | 2.x | Request/response schemas |
| Tool | Purpose |
|---|---|
| pytest | Testing framework |
| black | Code formatting |
| flake8 | Linting |
| mypy | Type checking |
| pre-commit | Git hooks |
| sphinx | Documentation |
| Component | Technology |
|---|---|
| Deployment | Heroku / Docker |
| CI/CD | GitHub Actions |
| Monitoring | Logging + Prometheus |
Context: Multiple feature extraction methods need to be supported interchangeably.
Decision: Use Strategy pattern with abstract base class.
Consequences:
- ✅ Easy to add new extractors
- ✅ Consistent interface
- ✅ Supports ensemble methods
- ⚠️ Slight overhead for simple cases
Context: Different models need different preprocessing (BoW vs. DL).
Decision: Configurable preprocessing with sensible defaults.
Consequences:
- ✅ Flexibility for different use cases
- ✅ Reproducible pipelines
- ⚠️ More configuration to manage
Context: Need REST API for serving predictions.
Decision: FastAPI over Flask.
Consequences:
- ✅ Automatic OpenAPI documentation
- ✅ Built-in validation with Pydantic
- ✅ Async support for scalability
- ⚠️ Newer framework with fewer community resources
- All API inputs validated with Pydantic schemas
- Maximum text length enforced (30KB body)
- HTML sanitization in preprocessing
- 100 requests/minute per client (configurable)
- Implemented at API gateway level
- No question content persisted
- Logs anonymized (no PII)
- Model artifacts contain no training data
| Component | Cache Type | TTL |
|---|---|---|
| Vectorizers | Memory | Session |
| Models | Memory | Session |
| Embeddings | Optional disk | 24h |
- Support batch predictions for efficiency
- Configurable batch size (default: 32)
- Quantization for BERT (optional)
- Lazy loading for heavy models
┌─────────────────────────────────────┐
│ Heroku Dyno │
│ ┌─────────────────────────────────┐│
│ │ FastAPI App ││
│ │ ┌───────┐ ┌────────────────┐ ││
│ │ │ API │ │ ML Models │ ││
│ │ │ Routes│ │ (in memory) │ ││
│ │ └───────┘ └────────────────┘ ││
│ └─────────────────────────────────┘│
└─────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Load Balancer │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ API │ │ API │ │ API │
│ Instance│ │ Instance│ │ Instance│
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└──────────────┼──────────────┘
▼
┌─────────────────┐
│ Model Store │
│ (S3/GCS) │
└─────────────────┘
# Structured logging format
{
"timestamp": "2023-10-15T10:30:00Z",
"level": "INFO",
"service": "intellitag",
"event": "prediction_completed",
"duration_ms": 45,
"tags_count": 5
}

| Metric | Type | Description |
|---|---|---|
| prediction_latency_ms | Histogram | Prediction response time |
| prediction_count | Counter | Total predictions |
| error_count | Counter | Failed predictions |
| model_load_time_ms | Gauge | Model initialization time |
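The structured log format shown above can be produced with a stdlib `logging.Formatter`; a sketch, where the field names match the example entry and extra context rides along via `extra=`:

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with the fields used above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "intellitag",
            "event": record.getMessage(),
        }
        # Merge structured context attached to the record via `extra=`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


logger = logging.getLogger("intellitag")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction_completed", extra={"context": {"duration_ms": 45, "tags_count": 5}})
```

One JSON object per line keeps the logs trivially parseable by log aggregators, and the `context` dict carries per-event fields without changing the formatter.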
- Model Improvements
  - Fine-tuned BERT on Stack Overflow data
  - Ensemble with learned weights
- Infrastructure
  - Kubernetes deployment
  - Model versioning with MLflow
- Features
  - Multi-language support
  - User feedback integration
  - Real-time model updates
| Item | Priority | Effort |
|---|---|---|
| Add async support for embedding generation | Medium | Medium |
| Implement model A/B testing | Low | High |
| Add comprehensive integration tests | High | Medium |
This architecture document reflects the system as designed and delivered to Stack Overflow.