IntelliTag - Architecture Document

Document Information

Field	Value
System	IntelliTag
Version	2.0
Author	Thomas Mebarki
Role	Solution Architect
Status	Delivered

1. Architecture Overview

1.1 System Context

┌─────────────────────────────────────────────────────────────────────┐
│                         EXTERNAL SYSTEMS                            │
├─────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Stack       │    │ Data        │    │ Model Hubs              │ │
│  │ Overflow    │    │ Explorer    │    │ (HuggingFace, TF Hub)   │ │
│  │ Platform    │    │ (SQL)       │    │                         │ │
│  └──────┬──────┘    └──────┬──────┘    └───────────┬─────────────┘ │
│         │                  │                       │               │
│         ▼                  ▼                       ▼               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                      INTELLITAG SYSTEM                       │   │
│  │  ┌─────────┐  ┌──────────┐  ┌─────────┐  ┌──────────────┐   │   │
│  │  │   API   │◄─┤ Predictor│◄─┤ Models  │◄─┤ Feature      │   │   │
│  │  │ Layer   │  │ Service  │  │ Layer   │  │ Extraction   │   │   │
│  │  └─────────┘  └──────────┘  └─────────┘  └──────────────┘   │   │
│  │       ▲                                         ▲            │   │
│  │       │                                         │            │   │
│  │  ┌─────────┐                           ┌──────────────┐      │   │
│  │  │ Health  │                           │ Data         │      │   │
│  │  │ Monitor │                           │ Pipeline     │      │   │
│  │  └─────────┘                           └──────────────┘      │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

1.2 High-Level Architecture

IntelliTag follows a layered architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                     PRESENTATION LAYER                       │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  REST API (FastAPI)                                     ││
│  │  • POST /api/v1/predict - Tag predictions               ││
│  │  • GET /api/v1/health - Health check                    ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                      SERVICE LAYER                           │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  Prediction Service                                     ││
│  │  • Orchestrates preprocessing → features → prediction   ││
│  │  • Handles model selection and ensemble                 ││
│  └─────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│                       DOMAIN LAYER                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Preprocessor │  │ Feature      │  │ Classifier   │      │
│  │              │  │ Extractor    │  │              │      │
│  │ • Tokenize   │  │ • TF-IDF     │  │ • Multi-label│      │
│  │ • Clean      │  │ • Word2Vec   │  │ • Ensemble   │      │
│  │ • Lemmatize  │  │ • BERT       │  │ • Threshold  │      │
│  │              │  │ • USE        │  │              │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
├─────────────────────────────────────────────────────────────┤
│                   INFRASTRUCTURE LAYER                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Data Loader  │  │ Model Store  │  │ Config       │      │
│  │              │  │              │  │              │      │
│  │ • CSV I/O    │  │ • Serialize  │  │ • Settings   │      │
│  │ • Validation │  │ • Versioning │  │ • Env vars   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘

2. Project Structure

2.1 Directory Layout

intellitag/
│
├── README.md                 # Project overview and quick start
├── LICENSE                   # MIT License
├── setup.py                  # Package installation
├── requirements.txt          # Production dependencies
├── requirements-dev.txt      # Development dependencies
├── pyproject.toml           # Modern Python config
├── .gitignore               # Git ignore rules
├── .env.example             # Environment template
├── Makefile                 # Common commands
│
├── src/                     # Source code (package: intellitag)
│   └── intellitag/
│       ├── __init__.py      # Package init with version
│       │
│       ├── config/          # Configuration management
│       │   ├── __init__.py
│       │   └── settings.py  # Settings and constants
│       │
│       ├── data/            # Data handling
│       │   ├── __init__.py
│       │   ├── loader.py    # Data loading utilities
│       │   └── preprocessor.py  # Text preprocessing
│       │
│       ├── features/        # Feature extraction
│       │   ├── __init__.py
│       │   ├── base.py      # Base extractor interface
│       │   ├── bow.py       # Bag-of-Words (TF-IDF)
│       │   ├── word2vec.py  # Word2Vec embeddings
│       │   ├── bert.py      # BERT embeddings
│       │   └── use.py       # Universal Sentence Encoder
│       │
│       ├── models/          # ML models
│       │   ├── __init__.py
│       │   ├── classifier.py    # Multi-label classifier
│       │   ├── lda.py           # Topic modeling
│       │   └── ensemble.py      # Model ensemble
│       │
│       ├── api/             # API layer
│       │   ├── __init__.py
│       │   ├── app.py       # FastAPI app
│       │   ├── routes.py    # API routes
│       │   └── schemas.py   # Request/Response schemas
│       │
│       └── utils/           # Utilities
│           ├── __init__.py
│           ├── logging.py   # Logging configuration
│           └── metrics.py   # Evaluation metrics
│
├── notebooks/               # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_evaluation.ipynb
│
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── conftest.py          # Pytest fixtures
│   ├── unit/
│   │   ├── __init__.py
│   │   ├── test_preprocessor.py
│   │   ├── test_features.py
│   │   └── test_classifier.py
│   └── integration/
│       ├── __init__.py
│       └── test_api.py
│
├── docs/                    # Documentation
│   ├── PRODUCT_VISION.md
│   ├── USER_STORIES.md
│   ├── PRD.md
│   ├── DATA_DICTIONARY.md
│   ├── ARCHITECTURE.md
│   └── API.md
│
├── models/                  # Trained models (gitignored)
│   └── .gitkeep
│
├── data/                    # Data files (gitignored)
│   ├── raw/
│   ├── processed/
│   └── .gitkeep
│
└── scripts/                 # Utility scripts
    ├── train.py             # Training script
    └── evaluate.py          # Evaluation script

2.2 Package Dependencies

intellitag/
├── config/     ◄── No dependencies (leaf)
├── utils/      ◄── config
├── data/       ◄── config, utils
├── features/   ◄── config, data, utils
├── models/     ◄── config, features, utils
└── api/        ◄── config, models, utils

3. Component Design

3.1 Data Layer (`src/intellitag/data/`)

3.1.1 DataLoader

from typing import Tuple
import pandas as pd

class DataLoader:
    """Handles data loading and validation."""

    def load_csv(self, path: str) -> pd.DataFrame: ...
    def validate_schema(self, df: pd.DataFrame) -> bool: ...
    def split_data(self, df: pd.DataFrame, test_size: float) -> Tuple[pd.DataFrame, pd.DataFrame]: ...

3.1.2 TextPreprocessor

from typing import List

class TextPreprocessor:
    """Text preprocessing pipeline."""

    def __init__(self, config: "PreprocessConfig") -> None: ...
    def clean_html(self, text: str) -> str: ...
    def tokenize(self, text: str) -> List[str]: ...
    def remove_stop_words(self, tokens: List[str]) -> List[str]: ...
    def lemmatize(self, tokens: List[str]) -> List[str]: ...
    def process(self, text: str) -> str: ...  # Full pipeline
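
The method list suggests a pipeline that cleans HTML, tokenizes, drops stop words, and lemmatizes. The sketch below illustrates that order with the standard library only. It is not the delivered implementation: the stop-word set, the regular expressions, and the suffix-stripping `naive_lemma` helper are hypothetical stand-ins for what a library such as NLTK would provide.

```python
import html
import re
from typing import List

# Tiny illustrative stop-word set (assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "how", "do", "i", "my"}

TAG_RE = re.compile(r"<[^>]+>")
# Keeps technical tokens such as "c++", "c#", and "node.js" in one piece.
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9+#.\-]*")


def clean_html(text: str) -> str:
    """Replace tags with spaces, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def tokenize(text: str) -> List[str]:
    """Lowercase and split, trimming trailing sentence punctuation."""
    return [tok.rstrip(".-") for tok in TOKEN_RE.findall(text.lower())]


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [tok for tok in tokens if tok not in STOP_WORDS]


def naive_lemma(token: str) -> str:
    """Placeholder lemmatizer: strips a plural 's' from longer alphabetic words."""
    if token.isalpha() and len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def process(text: str) -> str:
    """Full pipeline: clean -> tokenize -> remove stop words -> lemmatize."""
    tokens = remove_stop_words(tokenize(clean_html(text)))
    return " ".join(naive_lemma(tok) for tok in tokens)
```

Keeping each stage as a separate function mirrors the method list and lets a configuration object switch individual stages off, for example skipping lemmatization for transformer-based extractors.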

3.2 Features Layer (`src/intellitag/features/`)

3.2.1 Base Interface

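from abc import ABC, abstractmethod
from typing import List

import numpy as np

class BaseFeatureExtractor(ABC):
    """Abstract base for all feature extractors."""

    @abstractmethod
    def fit(self, texts: List[str]) -> None: ...

    @abstractmethod
    def transform(self, texts: List[str]) -> np.ndarray: ...

    def fit_transform(self, texts: List[str]) -> np.ndarray:
        # Concrete helper: fit on the texts, then transform the same texts.
        self.fit(texts)
        return self.transform(texts)

    @abstractmethod
    def save(self, path: str) -> None: ...

    @classmethod
    @abstractmethod
    def load(cls, path: str) -> 'BaseFeatureExtractor': ...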

3.2.2 Implementations

Class	Description	Output Shape
`BowExtractor`	TF-IDF vectorization	(n, 10000)
`Word2VecExtractor`	Word2Vec embeddings	(n, 300)
`BertExtractor`	BERT [CLS] embeddings	(n, 768)
`UseExtractor`	USE sentence embeddings	(n, 512)
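
To illustrate how a concrete extractor plugs into the base interface from 3.2.1, here is a dependency-free sketch. `CountExtractor` and `extract` are hypothetical names, not part of the delivered package. The sketch returns plain lists rather than `np.ndarray` and omits `save`/`load` to stay short.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List


class BaseFeatureExtractor(ABC):
    """Trimmed copy of the interface so the sketch is self-contained."""

    @abstractmethod
    def fit(self, texts: List[str]) -> None: ...

    @abstractmethod
    def transform(self, texts: List[str]) -> List[List[int]]: ...

    def fit_transform(self, texts: List[str]) -> List[List[int]]:
        self.fit(texts)
        return self.transform(texts)


class CountExtractor(BaseFeatureExtractor):
    """Hypothetical toy extractor: raw term counts over a learned vocabulary."""

    def __init__(self, max_features: int = 10000) -> None:
        self.max_features = max_features
        self.vocab: Dict[str, int] = {}

    def fit(self, texts: List[str]) -> None:
        counts = Counter(tok for text in texts for tok in text.split())
        # Most frequent terms first; ties broken alphabetically for determinism.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))[: self.max_features]
        self.vocab = {term: idx for idx, term in enumerate(ranked)}

    def transform(self, texts: List[str]) -> List[List[int]]:
        rows = []
        for text in texts:
            row = [0] * len(self.vocab)
            for tok in text.split():
                if tok in self.vocab:
                    row[self.vocab[tok]] += 1
            rows.append(row)
        return rows


def extract(extractor: BaseFeatureExtractor, train: List[str], new: List[str]) -> List[List[int]]:
    """Caller code depends only on the base interface, not on a concrete class."""
    extractor.fit(train)
    return extractor.transform(new)
```

Because `extract` only sees the base type, swapping in a TF-IDF or embedding-based extractor would not require changes to the calling code.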

3.3 Models Layer (`src/intellitag/models/`)

3.3.1 MultiLabelClassifier

from typing import Dict, List

import numpy as np

class MultiLabelClassifier:
    """Multi-label tag classifier."""

    def __init__(self,
                 base_estimator: str = 'logistic',
                 threshold: float = 0.3) -> None: ...

    def fit(self, X: np.ndarray, y: np.ndarray) -> None: ...
    def predict(self, X: np.ndarray) -> List[List[str]]: ...
    def predict_proba(self, X: np.ndarray) -> np.ndarray: ...
    def evaluate(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]: ...
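
The `threshold` parameter presumably turns the per-label probabilities from `predict_proba` into the tag lists returned by `predict`. A minimal sketch of that step follows. It assumes a label is kept when its probability is at or above the threshold. The function name `apply_threshold` and the inclusive comparison are assumptions, not confirmed behavior.

```python
from typing import List, Sequence


def apply_threshold(
    probas: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.3,
) -> List[List[str]]:
    """Return, per sample, every label whose probability is >= threshold."""
    return [
        [label for label, p in zip(labels, row) if p >= threshold]
        for row in probas
    ]
```

A low default such as 0.3 favors recall: a question can receive several tags, and a sample whose probabilities all fall below the threshold receives none.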

3.3.2 LDATopicModel

from typing import List, Tuple
import numpy as np

class LDATopicModel:
    """Latent Dirichlet Allocation for topic modeling."""

    def __init__(self, n_topics: int = 10) -> None: ...
    def fit(self, texts: List[str]) -> None: ...
    def get_topics(self) -> List[List[Tuple[str, float]]]: ...
    def transform(self, texts: List[str]) -> np.ndarray: ...

3.4 API Layer (`src/intellitag/api/`)

3.4.1 Endpoints

Method	Endpoint	Description
POST	`/api/v1/predict`	Get tag predictions
GET	`/api/v1/health`	Health check
GET	`/api/v1/info`	Model information

3.4.2 Request/Response Schemas

from typing import List

from pydantic import BaseModel, Field

# Request
class PredictionRequest(BaseModel):
    title: str = Field(..., min_length=10, max_length=300)
    body: str = Field(..., min_length=30, max_length=30000)
    top_k: int = Field(default=5, ge=1, le=10)

# Response
class TagPrediction(BaseModel):
    tag: str
    confidence: float

class PredictionResponse(BaseModel):
    status: str
    predictions: List[TagPrediction]
    model_version: str
    processing_time_ms: int
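
As an illustration of how `top_k` might shape a response, the sketch below ranks per-tag scores and fills the fields of `PredictionResponse` as a plain dictionary. The function name `build_response`, the `"success"` status value, and the `"example"` model version are hypothetical. Only the field names come from the schemas above.

```python
import time
from typing import Any, Dict, Optional


def build_response(
    scores: Dict[str, float],
    top_k: int = 5,
    model_version: str = "example",  # placeholder value, not a real version
    started: Optional[float] = None,
) -> Dict[str, Any]:
    """Keep the top_k highest-scoring tags and wrap them in the response shape."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # `started` is a time.perf_counter() reading taken when the request arrived.
    elapsed_ms = 0 if started is None else int((time.perf_counter() - started) * 1000)
    return {
        "status": "success",
        "predictions": [{"tag": tag, "confidence": conf} for tag, conf in ranked],
        "model_version": model_version,
        "processing_time_ms": elapsed_ms,
    }
```

For example, `build_response({"python": 0.91, "pandas": 0.62, "java": 0.05}, top_k=2)` keeps only the `python` and `pandas` entries, in that order.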

4. Data Flow

4.1 Training Pipeline

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────┐
│ Raw Data │───►│ Preprocessor │───►│ Feature     │───►│ Classifier│
│ (CSV)    │    │              │    │ Extractor   │    │ Training  │
└──────────┘    └──────────────┘    └─────────────┘    └───────────┘
                                                              │
                                                              ▼
                                                       ┌───────────┐
                                                       │ Model     │
                                                       │ Artifacts │
                                                       └───────────┘

4.2 Inference Pipeline

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌───────────┐
│ API      │───►│ Preprocessor │───►│ Feature     │───►│ Classifier│
│ Request  │    │              │    │ Extractor   │    │           │
└──────────┘    └──────────────┘    └─────────────┘    └───────────┘
                                                              │
                                                              ▼
                                                       ┌───────────┐
                                                       │ API       │
                                                       │ Response  │
                                                       └───────────┘
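
The inference flow above can be sketched as a small orchestrator that chains three callables. This is an illustration under assumptions: the class name `PredictionService` echoes the service layer in 1.2, but the constructor arguments, the concatenation of title and body, and the score-dictionary return type are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Preprocess = Callable[[str], str]
Extract = Callable[[List[str]], Sequence[Sequence[float]]]
Classify = Callable[[Sequence[Sequence[float]]], List[Dict[str, float]]]


class PredictionService:
    """Hypothetical orchestrator: preprocess -> extract features -> classify."""

    def __init__(self, preprocess: Preprocess, extract: Extract, classify: Classify) -> None:
        self._preprocess = preprocess
        self._extract = extract
        self._classify = classify

    def predict(self, title: str, body: str) -> Dict[str, float]:
        # Assumption: title and body are joined into one text before preprocessing.
        text = self._preprocess(f"{title} {body}")
        features = self._extract([text])
        # One input text yields one row of per-tag scores.
        return self._classify(features)[0]
```

Passing the stages in as callables keeps the service independent of any particular extractor or classifier, which also makes it straightforward to test with stubs.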

5. Technology Stack

5.1 Core Technologies

Component	Technology	Version	Rationale
Language	Python	3.9+	ML ecosystem maturity
ML Framework	scikit-learn	1.0+	Multi-label classification
Deep Learning	TensorFlow	2.x	USE, BERT support
NLP	NLTK	3.8+	Preprocessing utilities
Data	pandas	2.0+	Data manipulation
Numerical	NumPy	1.24+	Array operations
API	FastAPI	0.100+	Modern, async, auto-docs
Validation	Pydantic	2.x	Request/response schemas

5.2 Development Tools

Tool	Purpose
pytest	Testing framework
black	Code formatting
flake8	Linting
mypy	Type checking
pre-commit	Git hooks
sphinx	Documentation

5.3 Infrastructure

Component	Technology
Deployment	Heroku / Docker
CI/CD	GitHub Actions
Monitoring	Logging + Prometheus

6. Design Decisions

6.1 ADR-001: Feature Extractor Strategy Pattern

Context: Multiple feature extraction methods need to be supported interchangeably.

Decision: Use Strategy pattern with abstract base class.

Consequences:

✅ Easy to add new extractors
✅ Consistent interface
✅ Supports ensemble methods
⚠️ Slight overhead for simple cases

6.2 ADR-002: Preprocessing Pipeline Configuration

Context: Different models need different preprocessing (BoW vs. DL).

Decision: Configurable preprocessing with sensible defaults.

Consequences:

✅ Flexibility for different use cases
✅ Reproducible pipelines
⚠️ More configuration to manage

6.3 ADR-003: API Framework Selection

Context: Need REST API for serving predictions.

Decision: FastAPI over Flask.

Consequences:

✅ Automatic OpenAPI documentation
✅ Built-in validation with Pydantic
✅ Async support for scalability
⚠️ Newer framework, fewer community resources

7. Security Considerations

7.1 Input Validation

- All API inputs validated with Pydantic schemas
- Maximum text length enforced (300 characters for the title, 30,000 for the body)
- HTML sanitization in preprocessing

7.2 Rate Limiting

- 100 requests/minute per client (configurable)
- Implemented at API gateway level
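
Rate limiting is enforced at the gateway, so the application itself need not contain this logic. The policy (100 requests per minute per client) can still be illustrated with a sliding-window counter. The class name, the injected clock, and the choice of a sliding window over other algorithms are assumptions made for this sketch.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allows at most `limit` requests per client within any `window_s` seconds."""

    def __init__(
        self,
        limit: int = 100,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_s = window_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.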

7.3 Data Privacy

- No question content persisted
- Logs anonymized (no PII)
- Model artifacts contain no training data

8. Performance Optimization

8.1 Caching Strategy

Component	Cache Type	TTL
Vectorizers	Memory	Session
Models	Memory	Session
Embeddings	Optional disk	24h

8.2 Batch Processing

- Support batch predictions for efficiency
- Configurable batch size (default: 32)
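
A batching helper along these lines would let feature extraction and classification run on fixed-size chunks. The function name `iter_batches` is hypothetical. The default of 32 matches the value stated above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 70 inputs and the default size, this yields batches of 32, 32, and 6 items.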

8.3 Model Optimization

- Quantization for BERT (optional)
- Lazy loading for heavy models
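
Lazy loading can be sketched as a wrapper that defers an expensive load until first use and then reuses the result. `LazyModel` and its `loader` argument are hypothetical names. The sketch is not thread-safe, so a multi-worker server would need a lock around the first load.

```python
from typing import Any, Callable

_UNSET = object()  # sentinel, so a loader that returns None is still cached


class LazyModel:
    """Defers an expensive load until the model is first requested."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Any = _UNSET

    @property
    def loaded(self) -> bool:
        return self._model is not _UNSET

    def get(self) -> Any:
        if self._model is _UNSET:
            self._model = self._loader()
        return self._model
```

Application startup then costs nothing for extractors that are never requested, at the price of a slower first prediction for each heavy model.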

9. Deployment Architecture

9.1 Single Instance (Demo)

┌─────────────────────────────────────┐
│            Heroku Dyno              │
│  ┌─────────────────────────────────┐│
│  │         FastAPI App             ││
│  │  ┌───────┐  ┌────────────────┐  ││
│  │  │ API   │  │ ML Models      │  ││
│  │  │ Routes│  │ (in memory)    │  ││
│  │  └───────┘  └────────────────┘  ││
│  └─────────────────────────────────┘│
└─────────────────────────────────────┘

9.2 Production (Reference)

┌─────────────────────────────────────────────────────┐
│                   Load Balancer                      │
└──────────────────────┬──────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ API     │    │ API     │    │ API     │
   │ Instance│    │ Instance│    │ Instance│
   └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
              ┌─────────────────┐
              │  Model Store    │
              │  (S3/GCS)       │
              └─────────────────┘

10. Monitoring & Observability

10.1 Logging

# Structured logging format
{
    "timestamp": "2023-10-15T10:30:00Z",
    "level": "INFO",
    "service": "intellitag",
    "event": "prediction_completed",
    "duration_ms": 45,
    "tags_count": 5
}
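
One way to produce records in this shape with the standard `logging` module is a custom formatter. The sketch below is an assumption about how it could be wired up, not the delivered `utils/logging.py`. The `JsonFormatter` class, the `configure_logging` helper, and the `fields` key passed through `extra` are hypothetical names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "intellitag",
            "event": record.getMessage(),
        }
        # Extra fields passed as extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def configure_logging() -> logging.Logger:
    """Attach the JSON formatter to the 'intellitag' logger."""
    logger = logging.getLogger("intellitag")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.info("prediction_completed", extra={"fields": {"duration_ms": 45, "tags_count": 5}})` would then emit a record with the same keys as the example above.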

10.2 Metrics

Metric	Type	Description
`prediction_latency_ms`	Histogram	Prediction response time
`prediction_count`	Counter	Total predictions
`error_count`	Counter	Failed predictions
`model_load_time_ms`	Gauge	Model initialization time

11. Future Considerations

11.1 Potential Enhancements

Model Improvements
- Fine-tuned BERT on Stack Overflow data
- Ensemble with learned weights
Infrastructure
- Kubernetes deployment
- Model versioning with MLflow
Features
- Multi-language support
- User feedback integration
- Real-time model updates

11.2 Technical Debt

Item	Priority	Effort
Add async support for embedding generation	Medium	Medium
Implement model A/B testing	Low	High
Add comprehensive integration tests	High	Medium

This architecture document reflects the system as designed and delivered to Stack Overflow.
