IntelliTag - Architecture Document

Document Information

| Field | Value |
|-------|-------|
| System | IntelliTag |
| Version | 2.0 |
| Author | Thomas Mebarki |
| Role | Solution Architect |
| Status | Delivered |

1. Architecture Overview

1.1 System Context

┌─────────────────────────────────────────────────────────────────────┐
│                         EXTERNAL SYSTEMS                            │
├─────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Stack       │    │ Data        │    │ Model Hubs              │ │
│  │ Overflow    │    │ Explorer    │    │ (HuggingFace, TF Hub)   │ │
│  │ Platform    │    │ (SQL)       │    │                         │ │
│  └──────┬──────┘    └──────┬──────┘    └───────────┬─────────────┘ │
│         │                  │                       │               │
│         ▼                  ▼                       ▼               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                      INTELLITAG SYSTEM                       │   │
│  │  ┌─────────┐  ┌──────────┐  ┌─────────┐  ┌──────────────┐   │   │
│  │  │   API   │◄─┤ Predictor│◄─┤ Models  │◄─┤ Feature      │   │   │
│  │  │ Layer   │  │ Service  │  │ Layer   │  │ Extraction   │   │   │
│  │  └─────────┘  └──────────┘  └─────────┘  └──────────────┘   │   │
│  │       ▲                                         ▲            │   │
│  │       │                                         │            │   │
│  │  ┌─────────┐                           ┌──────────────┐      │   │
│  │  │ Health  │                           │ Data         │      │   │
│  │  │ Monitor │                           │ Pipeline     │      │   │
│  │  └─────────┘                           └──────────────┘      │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

1.2 High-Level Architecture

IntelliTag follows a layered architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                     PRESENTATION LAYER                       │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  REST API (FastAPI)                                     ││
│  │  • POST /predict - Tag predictions                      ││
│  │  • GET /health - Health check                           ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                      SERVICE LAYER                           │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  Prediction Service                                     ││
│  │  • Orchestrates preprocessing → features → prediction   ││
│  │  • Handles model selection and ensemble                 ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                       DOMAIN LAYER                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Preprocessor │  │ Feature      │  │ Classifier   │      │
│  │              │  │ Extractor    │  │              │      │
│  │ • Tokenize   │  │ • TF-IDF     │  │ • Multi-label│      │
│  │ • Clean      │  │ • Word2Vec   │  │ • Ensemble   │      │
│  │ • Lemmatize  │  │ • BERT       │  │ • Threshold  │      │
│  │              │  │ • USE        │  │              │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
├─────────────────────────────────────────────────────────────┤
│                   INFRASTRUCTURE LAYER                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Data Loader  │  │ Model Store  │  │ Config       │      │
│  │              │  │              │  │              │      │
│  │ • CSV I/O    │  │ • Serialize  │  │ • Settings   │      │
│  │ • Validation │  │ • Versioning │  │ • Env vars   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘

2. Project Structure

2.1 Directory Layout

intellitag/
│
├── README.md                 # Project overview and quick start
├── LICENSE                   # MIT License
├── setup.py                  # Package installation
├── requirements.txt          # Production dependencies
├── requirements-dev.txt      # Development dependencies
├── pyproject.toml           # Modern Python config
├── .gitignore               # Git ignore rules
├── .env.example             # Environment template
├── Makefile                 # Common commands
│
├── src/                     # Source code (package: intellitag)
│   └── intellitag/
│       ├── __init__.py      # Package init with version
│       │
│       ├── config/          # Configuration management
│       │   ├── __init__.py
│       │   └── settings.py  # Settings and constants
│       │
│       ├── data/            # Data handling
│       │   ├── __init__.py
│       │   ├── loader.py    # Data loading utilities
│       │   └── preprocessor.py  # Text preprocessing
│       │
│       ├── features/        # Feature extraction
│       │   ├── __init__.py
│       │   ├── base.py      # Base extractor interface
│       │   ├── bow.py       # Bag-of-Words (TF-IDF)
│       │   ├── word2vec.py  # Word2Vec embeddings
│       │   ├── bert.py      # BERT embeddings
│       │   └── use.py       # Universal Sentence Encoder
│       │
│       ├── models/          # ML models
│       │   ├── __init__.py
│       │   ├── classifier.py    # Multi-label classifier
│       │   ├── lda.py           # Topic modeling
│       │   └── ensemble.py      # Model ensemble
│       │
│       ├── api/             # API layer
│       │   ├── __init__.py
│       │   ├── app.py       # FastAPI app
│       │   ├── routes.py    # API routes
│       │   └── schemas.py   # Request/Response schemas
│       │
│       └── utils/           # Utilities
│           ├── __init__.py
│           ├── logging.py   # Logging configuration
│           └── metrics.py   # Evaluation metrics
│
├── notebooks/               # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_evaluation.ipynb
│
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── conftest.py          # Pytest fixtures
│   ├── unit/
│   │   ├── __init__.py
│   │   ├── test_preprocessor.py
│   │   ├── test_features.py
│   │   └── test_classifier.py
│   └── integration/
│       ├── __init__.py
│       └── test_api.py
│
├── docs/                    # Documentation
│   ├── PRODUCT_VISION.md
│   ├── USER_STORIES.md
│   ├── PRD.md
│   ├── DATA_DICTIONARY.md
│   ├── ARCHITECTURE.md
│   └── API.md
│
├── models/                  # Trained models (gitignored)
│   └── .gitkeep
│
├── data/                    # Data files (gitignored)
│   ├── raw/
│   ├── processed/
│   └── .gitkeep
│
└── scripts/                 # Utility scripts
    ├── train.py             # Training script
    └── evaluate.py          # Evaluation script

2.2 Package Dependencies

intellitag/
├── config/     ◄── No dependencies (leaf)
├── utils/      ◄── config
├── data/       ◄── config, utils
├── features/   ◄── config, data, utils
├── models/     ◄── config, features, utils
└── api/        ◄── config, models, utils
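The dependency listing above implies a strict layering with no cycles. A minimal sketch of how that could be checked, assuming the listing is transcribed by hand into a dictionary (the `DEPENDENCIES` name and `import_order` helper are illustrative, not part of the package):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Package -> packages it depends on, transcribed from the listing above.
DEPENDENCIES = {
    "config": set(),
    "utils": {"config"},
    "data": {"config", "utils"},
    "features": {"config", "data", "utils"},
    "models": {"config", "features", "utils"},
    "api": {"config", "models", "utils"},
}


def import_order(deps):
    """Return a valid bottom-up order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = import_order(DEPENDENCIES)
print(order)
```

Because each package depends on the one directly beneath it, only one ordering is possible, which is a quick way to see that the layers form a single chain from `config` up to `api`.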

3. Component Design

3.1 Data Layer (src/intellitag/data/)

3.1.1 DataLoader

from typing import Tuple

import pandas as pd

class DataLoader:
    """Handles data loading and validation."""

    def load_csv(self, path: str) -> pd.DataFrame: ...
    def validate_schema(self, df: pd.DataFrame) -> bool: ...
    def split_data(self, df: pd.DataFrame, test_size: float) -> Tuple[pd.DataFrame, pd.DataFrame]: ...

3.1.2 TextPreprocessor

from typing import List

class TextPreprocessor:
    """Text preprocessing pipeline."""

    def __init__(self, config: "PreprocessConfig") -> None: ...
    def clean_html(self, text: str) -> str: ...
    def tokenize(self, text: str) -> List[str]: ...
    def remove_stop_words(self, tokens: List[str]) -> List[str]: ...
    def lemmatize(self, tokens: List[str]) -> List[str]: ...
    def process(self, text: str) -> str: ...  # Full pipeline
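To make the order of the preprocessing steps concrete, here is a small standard-library sketch of the same four stages. It is not the project's implementation: the stop-word list is a tiny illustrative sample, the regular expressions are assumptions, and the "lemmatizer" is a naive plural stripper standing in for a real one such as NLTK's.

```python
import html
import re
from typing import List

# Tiny illustrative stop-word list (assumption); a real pipeline would use a full one.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "how", "do", "i"}


def clean_html(text: str) -> str:
    """Replace tags with spaces, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric tokens, preserving names like c++ and c#."""
    return re.findall(r"[a-z0-9]+[+#]*", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens: List[str]) -> List[str]:
    """Naive stand-in for a lemmatizer: strip a trailing plural 's'."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def process(text: str) -> str:
    """Full pipeline: clean -> tokenize -> remove stop words -> lemmatize."""
    return " ".join(lemmatize(remove_stop_words(tokenize(clean_html(text)))))


print(process("<p>How do I parse JSON files in C++?</p>"))  # parse json file c++
```

Tags are stripped before entities are decoded so that escaped markup inside a question (for example `&lt;div&gt;`) survives as text rather than being removed as a tag.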

3.2 Features Layer (src/intellitag/features/)

3.2.1 Base Interface

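from abc import ABC, abstractmethod
from typing import List

import numpy as np

class BaseFeatureExtractor(ABC):
    """Abstract base for all feature extractors."""

    @abstractmethod
    def fit(self, texts: List[str]) -> None: ...

    @abstractmethod
    def transform(self, texts: List[str]) -> np.ndarray: ...

    def fit_transform(self, texts: List[str]) -> np.ndarray:
        # Conventional default: fit, then transform the same texts.
        self.fit(texts)
        return self.transform(texts)

    @abstractmethod
    def save(self, path: str) -> None: ...

    @classmethod
    @abstractmethod
    def load(cls, path: str) -> 'BaseFeatureExtractor': ...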
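As an illustration of how a concrete extractor might plug into this interface, the sketch below defines a simplified stand-in for the base class and one toy implementation. Everything here is hypothetical: `Extractor` returns plain lists instead of `np.ndarray` to stay dependency-free, and `CountExtractor` is a word counter invented for the example, not one of the shipped extractors.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class Extractor(ABC):
    """Simplified stand-in for BaseFeatureExtractor (lists instead of arrays)."""

    @abstractmethod
    def fit(self, texts: List[str]) -> None: ...

    @abstractmethod
    def transform(self, texts: List[str]) -> List[List[float]]: ...

    def fit_transform(self, texts: List[str]) -> List[List[float]]:
        self.fit(texts)
        return self.transform(texts)


class CountExtractor(Extractor):
    """Hypothetical bag-of-words counter used only to exercise the interface."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, texts: List[str]) -> None:
        # Sorted vocabulary gives a stable column order.
        words = sorted({w for t in texts for w in t.split()})
        self.vocab = {w: i for i, w in enumerate(words)}

    def transform(self, texts: List[str]) -> List[List[float]]:
        rows = []
        for text in texts:
            row = [0.0] * len(self.vocab)
            for word in text.split():
                if word in self.vocab:  # unseen words are ignored
                    row[self.vocab[word]] += 1.0
            rows.append(row)
        return rows

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vocab, f)

    @classmethod
    def load(cls, path: str) -> "CountExtractor":
        extractor = cls()
        with open(path, encoding="utf-8") as f:
            extractor.vocab = json.load(f)
        return extractor
```

Callers only depend on `fit`, `transform`, `save`, and `load`, so swapping one extractor for another should not require changes in the calling code.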

3.2.2 Implementations

| Class | Description | Output Shape |
|-------|-------------|--------------|
| BowExtractor | TF-IDF vectorization | (n, 10000) |
| Word2VecExtractor | Word2Vec embeddings | (n, 300) |
| BertExtractor | BERT [CLS] embeddings | (n, 768) |
| UseExtractor | USE sentence embeddings | (n, 512) |

3.3 Models Layer (src/intellitag/models/)

3.3.1 MultiLabelClassifier

from typing import Dict, List

import numpy as np

class MultiLabelClassifier:
    """Multi-label tag classifier."""

    def __init__(self,
                 base_estimator: str = 'logistic',
                 threshold: float = 0.3) -> None: ...

    def fit(self, X: np.ndarray, y: np.ndarray) -> None: ...
    def predict(self, X: np.ndarray) -> List[List[str]]: ...
    def predict_proba(self, X: np.ndarray) -> np.ndarray: ...
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]: ...
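The `threshold` parameter suggests that per-tag probabilities are cut off before tags are returned. A minimal sketch of that decision step for a single question, assuming an inclusive cutoff and a `top_k` cap like the one in the API request (both the `select_tags` helper and the way the two limits combine are assumptions):

```python
from typing import List, Sequence


def select_tags(
    probs: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.3,
    top_k: int = 5,
) -> List[str]:
    """Keep labels with probability >= threshold, highest first, at most top_k."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    return [label for prob, label in ranked if prob >= threshold][:top_k]


tags = select_tags([0.9, 0.1, 0.35, 0.3], ["python", "java", "pandas", "numpy"])
print(tags)  # ['python', 'pandas', 'numpy']
```

A lower threshold trades precision for recall, which is a common reason to pick a value below 0.5 in multi-label settings where each item carries only a few of many possible tags.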

3.3.2 LDATopicModel

from typing import List, Tuple

import numpy as np

class LDATopicModel:
    """Latent Dirichlet Allocation for topic modeling."""

    def __init__(self, n_topics: int = 10) -> None: ...
    def fit(self, texts: List[str]) -> None: ...
    def get_topics(self) -> List[List[Tuple[str, float]]]: ...
    def transform(self, texts: List[str]) -> np.ndarray: ...

3.4 API Layer (src/intellitag/api/)

3.4.1 Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/predict | Get tag predictions |
| GET | /api/v1/health | Health check |
| GET | /api/v1/info | Model information |

3.4.2 Request/Response Schemas

from typing import List

from pydantic import BaseModel, Field

# Request
class PredictionRequest(BaseModel):
    title: str = Field(..., min_length=10, max_length=300)
    body: str = Field(..., min_length=30, max_length=30000)
    top_k: int = Field(default=5, ge=1, le=10)

# Response
class TagPrediction(BaseModel):
    tag: str
    confidence: float

class PredictionResponse(BaseModel):
    status: str
    predictions: List[TagPrediction]
    model_version: str
    processing_time_ms: int
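To spell out what the request bounds accept and reject, the sketch below mirrors them in plain Python. It is only an illustration of the constraints, not a replacement for the Pydantic validation; the `validation_errors` helper and its error messages are hypothetical.

```python
from typing import Any, Dict, List


def validation_errors(payload: Dict[str, Any]) -> List[str]:
    """Return violations of the PredictionRequest bounds for a raw payload."""
    errors: List[str] = []
    title = payload.get("title")
    body = payload.get("body")
    top_k = payload.get("top_k", 5)  # same default as the schema

    if not isinstance(title, str) or not 10 <= len(title) <= 300:
        errors.append("title: expected 10 to 300 characters")
    if not isinstance(body, str) or not 30 <= len(body) <= 30000:
        errors.append("body: expected 30 to 30000 characters")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 10:
        errors.append("top_k: expected an integer from 1 to 10")
    return errors


example = {
    "title": "How do I reverse a list?",
    "body": "I have a Python list and want it reversed in place.",
}
print(validation_errors(example))  # []
```

Note that the limits are expressed in characters, so a body containing multi-byte characters can exceed 30,000 bytes and still pass.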

4. Data Flow

4.1 Training Pipeline

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────┐
│ Raw Data │───►│ Preprocessor │───►│ Feature     │───►│ Classifier│
│ (CSV)    │    │              │    │ Extractor   │    │ Training  │
└──────────┘    └──────────────┘    └─────────────┘    └───────────┘
                                                              │
                                                              ▼
                                                       ┌───────────┐
                                                       │ Model     │
                                                       │ Artifacts │
                                                       └───────────┘

4.2 Inference Pipeline

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────┐
│ API      │───►│ Preprocessor │───►│ Feature     │───►│ Classifier│
│ Request  │    │              │    │ Extractor   │    │           │
└──────────┘    └──────────────┘    └─────────────┘    └───────────┘
                                                              │
                                                              ▼
                                                       ┌───────────┐
                                                       │ API       │
                                                       │ Response  │
                                                       └───────────┘
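The inference flow above can be sketched as a single orchestration function that receives the three stages as callables. This is an assumption-laden outline, not the Prediction Service itself: `predict_tags`, the concatenation of title and body, the `"success"` status value, and the `"unknown"` default version are all illustrative choices.

```python
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple

Scored = List[Tuple[str, float]]  # (tag, confidence) pairs for one question


def predict_tags(
    title: str,
    body: str,
    preprocess: Callable[[str], str],
    extract: Callable[[List[str]], Sequence[Any]],
    classify: Callable[[Sequence[Any]], List[Scored]],
    top_k: int = 5,
    model_version: str = "unknown",
) -> Dict[str, Any]:
    """Run preprocess -> extract -> classify and shape a response dict."""
    start = time.perf_counter()
    text = preprocess(f"{title} {body}")   # assumption: title and body are joined
    features = extract([text])             # batch of one
    scored = classify(features)[0]
    best = sorted(scored, key=lambda tc: tc[1], reverse=True)[:top_k]
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {
        "status": "success",
        "predictions": [{"tag": t, "confidence": c} for t, c in best],
        "model_version": model_version,
        "processing_time_ms": elapsed_ms,
    }
```

Passing the stages in as parameters keeps the function testable with simple stubs and leaves the choice of extractor and classifier to the caller.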

5. Technology Stack

5.1 Core Technologies

| Component | Technology | Version | Rationale |
|-----------|------------|---------|-----------|
| Language | Python | 3.9+ | ML ecosystem maturity |
| ML Framework | scikit-learn | 1.0+ | Multi-label classification |
| Deep Learning | TensorFlow | 2.x | USE, BERT support |
| NLP | NLTK | 3.8+ | Preprocessing utilities |
| Data | pandas | 2.0+ | Data manipulation |
| Numerical | NumPy | 1.24+ | Array operations |
| API | FastAPI | 0.100+ | Modern, async, auto-docs |
| Validation | Pydantic | 2.x | Request/response schemas |

5.2 Development Tools

| Tool | Purpose |
|------|---------|
| pytest | Testing framework |
| black | Code formatting |
| flake8 | Linting |
| mypy | Type checking |
| pre-commit | Git hooks |
| sphinx | Documentation |

5.3 Infrastructure

| Component | Technology |
|-----------|------------|
| Deployment | Heroku / Docker |
| CI/CD | GitHub Actions |
| Monitoring | Logging + Prometheus |

6. Design Decisions

6.1 ADR-001: Feature Extractor Strategy Pattern

Context: Multiple feature extraction methods need to be supported interchangeably.

Decision: Use Strategy pattern with abstract base class.

Consequences:

  • ✅ Easy to add new extractors
  • ✅ Consistent interface
  • ✅ Supports ensemble methods
  • ⚠️ Slight overhead for simple cases

6.2 ADR-002: Preprocessing Pipeline Configuration

Context: Different models need different preprocessing (BoW vs. DL).

Decision: Configurable preprocessing with sensible defaults.

Consequences:

  • ✅ Flexibility for different use cases
  • ✅ Reproducible pipelines
  • ⚠️ More configuration to manage

6.3 ADR-003: API Framework Selection

Context: Need REST API for serving predictions.

Decision: FastAPI over Flask.

Consequences:

  • ✅ Automatic OpenAPI documentation
  • ✅ Built-in validation with Pydantic
  • ✅ Async support for scalability
  • ⚠️ Newer framework, fewer community resources

7. Security Considerations

7.1 Input Validation

  • All API inputs validated with Pydantic schemas
  • Maximum text length enforced (30,000 characters for the body, 300 for the title)
  • HTML sanitization in preprocessing

7.2 Rate Limiting

  • 100 requests/minute per client (configurable)
  • Implemented at API gateway level
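The limit above is enforced at the gateway, but the policy itself is easy to illustrate in-process. The sketch below is a sliding-window limiter written for this document under stated assumptions: the `SlidingWindowLimiter` class, its per-client deque of timestamps, and the injectable clock are hypothetical and say nothing about how the actual gateway is configured.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window_s` seconds."""

    def __init__(
        self,
        limit: int = 100,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A rejected request is not recorded, so a client that keeps retrying while throttled does not extend its own lockout.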

7.3 Data Privacy

  • No question content persisted
  • Logs anonymized (no PII)
  • Model artifacts contain no training data

8. Performance Optimization

8.1 Caching Strategy

| Component | Cache Type | TTL |
|-----------|------------|-----|
| Vectorizers | Memory | Session |
| Models | Memory | Session |
| Embeddings | Optional disk | 24h |

8.2 Batch Processing

  • Support batch predictions for efficiency
  • Configurable batch size (default: 32)
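Batching amounts to slicing the inputs into fixed-size chunks before they reach the feature extractor. A minimal sketch, assuming the default of 32 from the list above (the `batched` helper name is illustrative):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


sizes = [len(batch) for batch in batched(list(range(70)))]
print(sizes)  # [32, 32, 6]
```

The last batch is simply shorter when the input length is not a multiple of the batch size, so no padding is needed.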

8.3 Model Optimization

  • Quantization for BERT (optional)
  • Lazy loading for heavy models
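Lazy loading can be as simple as memoizing the loader so a heavy model is built on first use and then reused for the rest of the session. The sketch below uses a placeholder dictionary in place of a real model; `get_model` and `LOAD_CALLS` are hypothetical names introduced for the example.

```python
from functools import lru_cache
from typing import Dict, List

# Records each real load so it is visible that loading happens only once per name.
LOAD_CALLS: List[str] = []


@lru_cache(maxsize=None)
def get_model(name: str) -> Dict[str, str]:
    """Load a model on first request and cache it in memory afterwards."""
    LOAD_CALLS.append(name)
    return {"name": name}  # placeholder for an expensive load from disk or a hub
```

The first call for a given name pays the loading cost; later calls return the same cached object, which matches the in-memory, session-lifetime caching listed for models in section 8.1.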

9. Deployment Architecture

9.1 Single Instance (Demo)

┌─────────────────────────────────────┐
│            Heroku Dyno              │
│  ┌─────────────────────────────────┐│
│  │         FastAPI App             ││
│  │  ┌───────┐  ┌────────────────┐  ││
│  │  │ API   │  │ ML Models      │  ││
│  │  │ Routes│  │ (in memory)    │  ││
│  │  └───────┘  └────────────────┘  ││
│  └─────────────────────────────────┘│
└─────────────────────────────────────┘

9.2 Production (Reference)

┌─────────────────────────────────────────────────────┐
│                   Load Balancer                      │
└──────────────────────┬──────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ API     │    │ API     │    │ API     │
   │ Instance│    │ Instance│    │ Instance│
   └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
              ┌─────────────────┐
              │  Model Store    │
              │  (S3/GCS)       │
              └─────────────────┘

10. Monitoring & Observability

10.1 Logging

# Structured logging format
{
    "timestamp": "2023-10-15T10:30:00Z",
    "level": "INFO",
    "service": "intellitag",
    "event": "prediction_completed",
    "duration_ms": 45,
    "tags_count": 5
}
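One way to emit records in this shape is a custom formatter for the standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class and the convention of passing extra fields through `extra={"fields": {...}}` are illustrative, not taken from `utils/logging.py`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "intellitag",
            "event": record.getMessage(),
        }
        # Extra fields arrive via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("intellitag.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("prediction_completed", extra={"fields": {"duration_ms": 45, "tags_count": 5}})
```

Only counts and durations are logged here, in keeping with the rule in section 7.3 that question content is not persisted.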

10.2 Metrics

| Metric | Type | Description |
|--------|------|-------------|
| prediction_latency_ms | Histogram | Prediction response time |
| prediction_count | Counter | Total predictions |
| error_count | Counter | Failed predictions |
| model_load_time_ms | Gauge | Model initialization time |

11. Future Considerations

11.1 Potential Enhancements

  1. Model Improvements
    • Fine-tuned BERT on Stack Overflow data
    • Ensemble with learned weights
  2. Infrastructure
    • Kubernetes deployment
    • Model versioning with MLflow
  3. Features
    • Multi-language support
    • User feedback integration
    • Real-time model updates

11.2 Technical Debt

| Item | Priority | Effort |
|------|----------|--------|
| Add async support for embedding generation | Medium | Medium |
| Implement model A/B testing | Low | High |
| Add comprehensive integration tests | High | Medium |

This architecture document reflects the system as designed and delivered to Stack Overflow.