
# IntelliTag - Data Dictionary

## Overview

This document defines all data structures, fields, and transformations used in the IntelliTag system. It serves as the single source of truth for data specifications.


## 1. Source Data

### 1.1 Raw Dataset: `QueryResults.csv`

Data extracted from Stack Exchange Data Explorer.

| Field | Type | Description | Example | Nullable |
|-------|------|-------------|---------|----------|
| Id | Integer | Unique question identifier | 40101130 | No |
| Title | String | Question title | "How do I calculate a rolling idxmax" | No |
| Body | String (HTML) | Question body with HTML formatting | `<p>consider the <code>pd.Series</code>...` | No |
| Tags | String | Tags in chevron format | `<python><pandas><numpy>` | No |
| Score | Integer | Question vote score | 9 | Yes |
| ViewCount | Integer | Number of views | 7584 | Yes |
| FavoriteCount | Float | Number of favorites | 0.0 | Yes |
| AnswerCount | Integer | Number of answers | 6 | Yes |

**Statistics:**

- Total Records: 50,000
- Unique Titles: 49,999
- Unique Tag Combinations: 49,190

## 2. Processed Data

### 2.1 Bag-of-Words Dataset: `data_bow.csv`

Preprocessed data optimized for Bag-of-Words feature extraction.

| Field | Type | Description | Transformation Applied |
|-------|------|-------------|------------------------|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Cleaned title | Tokenized, lowercased, lemmatized, stop words removed |
| Body | String | Cleaned body | HTML removed, tokenized, lowercased, lemmatized, stop words removed |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |

**Preprocessing Pipeline (BoW):**

```
Raw Text → HTML Removal → Tokenization → Stop Word Removal →
Lowercase → Lemmatization → Join Tokens
```
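The stages above can be sketched as a single function. This is a minimal, illustrative version: `preprocess_bow`, the tiny stop-word set, and `naive_lemmatize` are hypothetical stand-ins, since the production pipeline would use a full NLP toolkit (e.g. NLTK's stop-word list and `WordNetLemmatizer`).

```python
import re

# Toy stop-word set; a real pipeline would load a full list from an NLP toolkit.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "do", "i", "how"}

def naive_lemmatize(token: str) -> str:
    # Stand-in for real lemmatization: strips a trailing "s" from longer words.
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def preprocess_bow(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                      # HTML removal
    tokens = re.findall(r"[a-zA-Z][a-zA-Z0-9+#.]*", text)    # tokenization
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop words
    tokens = [t.lower() for t in tokens]                     # lowercase
    tokens = [naive_lemmatize(t) for t in tokens]            # lemmatization
    return " ".join(tokens)                                  # join tokens
```

The output is a plain space-joined string, which is the format the BoW vectorizers consume.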

### 2.2 Sentence Embedding Dataset: `data_se.csv`

Preprocessed data optimized for deep learning embeddings (BERT, USE).

| Field | Type | Description | Transformation Applied |
|-------|------|-------------|------------------------|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Lightly cleaned title | Tokenized, lowercased (no lemmatization) |
| Body | String | Lightly cleaned body | HTML removed, tokenized, lowercased |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |

**Preprocessing Pipeline (DL):**

```
Raw Text → HTML Removal → Tokenization → Lowercase → Join Tokens
```

**Rationale:** Deep learning models like BERT handle morphological variations internally, so lemmatization is skipped to preserve semantic richness.


## 3. Feature Representations

### 3.1 TF-IDF Features

| Attribute | Specification |
|-----------|---------------|
| Vectorizer | TfidfVectorizer |
| Max Features | 10,000 |
| N-gram Range | (1, 2) - unigrams and bigrams |
| Min DF | 2 (minimum document frequency) |
| Max DF | 0.95 (maximum document frequency) |
| Output Shape | (n_samples, 10000) |
| Data Type | scipy.sparse.csr_matrix |
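The specification above maps directly onto scikit-learn's `TfidfVectorizer` parameters. A sketch with a tiny illustrative corpus (in the real pipeline, the input would be the `text` column of `data_bow.csv`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Configuration taken from the specification table.
vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # drop terms in fewer than 2 documents
    max_df=0.95,          # drop terms in more than 95% of documents
)

# Tiny illustrative corpus standing in for the preprocessed question texts.
corpus = [
    "python pandas dataframe question",
    "python pandas series question",
    "java spring question",
    "javascript react",
]

# Returns a scipy.sparse.csr_matrix of shape (n_samples, n_features).
X = vectorizer.fit_transform(corpus)
```

Note that with `min_df=2`, terms unique to a single question (like `dataframe` here) are pruned, while the bigram `python pandas` survives because it appears in two documents.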

### 3.2 Word2Vec Features

| Attribute | Specification |
|-----------|---------------|
| Model | Pre-trained Google News or custom |
| Dimensions | 300 |
| Aggregation | Mean of word vectors |
| OOV Handling | Zero vector |
| Output Shape | (n_samples, 300) |
| Data Type | numpy.ndarray |
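The mean-of-word-vectors aggregation with zero-vector OOV handling can be sketched as follows. The `embeddings` dict is a toy stand-in for a pretrained 300-dimensional Word2Vec model (e.g. the Google News vectors loaded via gensim), and `embed_text` is an illustrative helper name:

```python
import numpy as np

DIM = 300

# Toy embedding table; values are illustrative, not real Word2Vec weights.
embeddings = {
    "python": np.full(DIM, 0.1),
    "pandas": np.full(DIM, 0.3),
}

def embed_text(text: str) -> np.ndarray:
    """Mean of word vectors; out-of-vocabulary words map to the zero vector."""
    vectors = [embeddings.get(tok, np.zeros(DIM)) for tok in text.split()]
    if not vectors:
        return np.zeros(DIM)  # empty text also maps to the zero vector
    return np.mean(vectors, axis=0)
```

Because OOV words contribute zero vectors, they still count in the denominator of the mean, which dilutes the embedding of texts with many unknown words.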

### 3.3 BERT Features

| Attribute | Specification |
|-----------|---------------|
| Model | bert-base-uncased |
| Tokenizer | BertTokenizer |
| Max Length | 512 tokens |
| Truncation | True (from end) |
| Embedding | `[CLS]` token output |
| Dimensions | 768 |
| Output Shape | (n_samples, 768) |
| Data Type | numpy.ndarray |

### 3.4 Universal Sentence Encoder Features

| Attribute | Specification |
|-----------|---------------|
| Model | TensorFlow Hub USE |
| Version | universal-sentence-encoder/4 |
| Dimensions | 512 |
| Input | Raw text (model handles preprocessing) |
| Output Shape | (n_samples, 512) |
| Data Type | numpy.ndarray |

## 4. Label Data

### 4.1 Tag Structure

**Original Format:**

```
<python><pandas><numpy><dataframe><series>
```

**Parsed Format:**

```python
['python', 'pandas', 'numpy', 'dataframe', 'series']
```
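A possible implementation of the `parse_tags` function documented in section 7.3, using a single regular expression to pull tag names out of the chevron format:

```python
import re

def parse_tags(tag_string: str) -> list[str]:
    """Extract tag names from the chevron format used in the raw export.

    "<python><pandas>" -> ["python", "pandas"]
    """
    return re.findall(r"<([^<>]+)>", tag_string)
```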

### 4.2 Multi-Label Encoding

| Encoding Method | Description | Use Case |
|-----------------|-------------|----------|
| MultiLabelBinarizer | Binary matrix (n_samples, n_tags) | Classification training |
| List of Lists | Python list format | Data processing |
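The round-trip between the two encodings is handled by scikit-learn's `MultiLabelBinarizer`; a minimal sketch with illustrative tag lists:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# List-of-lists format, as produced by tag parsing.
tag_lists = [["python", "pandas"], ["javascript"], ["python"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)    # binary matrix, shape (n_samples, n_tags)

decoded = mlb.inverse_transform(Y)  # back to tuples of tag names
```

The columns of `Y` follow `mlb.classes_`, which is sorted alphabetically; the fitted binarizer is what gets persisted as `mlb.pkl` (section 5.2).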

**Tag Statistics:**

| Metric | Value |
|--------|-------|
| Total Unique Tags | ~5,000 (in sample) |
| Avg Tags per Question | 3.2 |
| Max Tags per Question | 5 |
| Min Tags per Question | 1 |
| Most Common Tag | javascript |

### 4.3 Top 50 Tags (Sample)

| Rank | Tag | Frequency |
|------|-----|-----------|
| 1 | javascript | 8,234 |
| 2 | python | 7,891 |
| 3 | java | 6,543 |
| 4 | c# | 5,678 |
| 5 | php | 4,321 |
| 6 | android | 3,987 |
| 7 | html | 3,876 |
| 8 | jquery | 3,654 |
| 9 | css | 3,432 |
| 10 | c++ | 3,210 |
| ... | ... | ... |

## 5. Model Artifacts

### 5.1 LDA Models

| File Pattern | Description |
|--------------|-------------|
| `lda_model.model` | Main LDA model (default topics) |
| `lda_model_{n}.model` | LDA model with n topics |
| `*.model.expElogbeta.npy` | Topic-word distribution matrix |
| `*.model.id2word` | Dictionary mapping |
| `*.model.state` | Model state for updates |

**LDA Configuration:**

| Parameter | Value |
|-----------|-------|
| Topics Range | 5-20 |
| Passes | 15 |
| Chunksize | 2000 |
| Alpha | 'auto' |
| Eta | 'auto' |

### 5.2 Vectorizers

| Artifact | Format | Content |
|----------|--------|---------|
| `tfidf_vectorizer.pkl` | Pickle | Fitted TfidfVectorizer |
| `mlb.pkl` | Pickle | MultiLabelBinarizer |

### 5.3 Classifiers

| Artifact | Format | Content |
|----------|--------|---------|
| `classifier_bow.pkl` | Pickle | BoW-based classifier |
| `classifier_w2v.pkl` | Pickle | Word2Vec-based classifier |
| `classifier_bert.pkl` | Pickle | BERT-based classifier |
| `classifier_use.pkl` | Pickle | USE-based classifier |

## 6. API Data Structures

### 6.1 Prediction Request

```json
{
  "title": "string (required)",
  "body": "string (required)",
  "top_k": "integer (optional, default=5)",
  "threshold": "float (optional, default=0.1)"
}
```

**Validation Rules:**

- `title`: 10-300 characters
- `body`: 30-30000 characters
- `top_k`: 1-10
- `threshold`: 0.0-1.0
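The rules above translate into a straightforward range check. `validate_request` is a hypothetical helper shown for illustration; the production service may implement these checks differently (e.g. via a schema library):

```python
def validate_request(payload: dict) -> list[str]:
    """Return a list of validation error messages; an empty list means valid."""
    errors = []
    title = payload.get("title", "")
    body = payload.get("body", "")
    top_k = payload.get("top_k", 5)          # optional, default=5
    threshold = payload.get("threshold", 0.1)  # optional, default=0.1

    if not 10 <= len(title) <= 300:
        errors.append("Title must be between 10 and 300 characters")
    if not 30 <= len(body) <= 30000:
        errors.append("Body must be between 30 and 30000 characters")
    if not 1 <= top_k <= 10:
        errors.append("top_k must be between 1 and 10")
    if not 0.0 <= threshold <= 1.0:
        errors.append("threshold must be between 0.0 and 1.0")
    return errors
```

A non-empty result would be surfaced as a `VALIDATION_ERROR` response (section 6.3).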

### 6.2 Prediction Response

```json
{
  "status": "success",
  "predictions": [
    {
      "tag": "python",
      "confidence": 0.92
    },
    {
      "tag": "pandas",
      "confidence": 0.87
    },
    {
      "tag": "dataframe",
      "confidence": 0.73
    }
  ],
  "model_version": "1.0.0",
  "processing_time_ms": 45
}
```

### 6.3 Error Response

```json
{
  "status": "error",
  "error_code": "VALIDATION_ERROR",
  "message": "Title must be between 10 and 300 characters",
  "timestamp": "2023-10-15T10:30:00Z"
}
```

**Error Codes:**

| Code | Description |
|------|-------------|
| VALIDATION_ERROR | Input validation failed |
| MODEL_ERROR | Model inference failed |
| INTERNAL_ERROR | Unexpected server error |
| RATE_LIMIT_ERROR | Too many requests |

## 7. Transformation Functions

### 7.1 Text Cleaning Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `clean_html(text)` | HTML string | Plain text | Removes HTML tags |
| `tokenize(text)` | String | List[String] | Splits into tokens |
| `remove_stop_words(tokens)` | List[String] | List[String] | Filters stop words |
| `lemmatize(tokens)` | List[String] | List[String] | Reduces to base form |
| `normalize_case(tokens)` | List[String] | List[String] | Lowercase conversion |

### 7.2 Feature Extraction Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `extract_tfidf(texts)` | List[String] | Sparse Matrix | TF-IDF vectors |
| `extract_word2vec(texts)` | List[String] | ndarray | Word2Vec embeddings |
| `extract_bert(texts)` | List[String] | ndarray | BERT embeddings |
| `extract_use(texts)` | List[String] | ndarray | USE embeddings |

### 7.3 Tag Processing Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `parse_tags(tag_string)` | `"<a><b><c>"` | `['a','b','c']` | Extracts tags |
| `encode_tags(tag_lists)` | List[List[String]] | Binary Matrix | Multi-label encoding |
| `decode_tags(binary)` | Binary Matrix | List[List[String]] | Reverse encoding |

## 8. Data Quality Rules

### 8.1 Validation Rules

| Rule | Field | Condition |
|------|-------|-----------|
| Not Null | Title, Body, Tags | Cannot be empty |
| Min Length | Title | >= 10 characters |
| Min Length | Body | >= 30 characters |
| Valid Tags | Tags | At least 1 tag |
| Max Tags | Tags | <= 5 tags |
| UTF-8 | All text | Valid UTF-8 encoding |

### 8.2 Data Cleaning Rules

| Issue | Action |
|-------|--------|
| HTML entities | Decode (e.g., `&amp;` → `&`) |
| Code blocks | Preserve content, remove formatting |
| URLs | Remove (strings starting with `http`) |
| Mentions | Remove (strings starting with `@`) |
| Extra whitespace | Normalize to a single space |
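Most of these rules can be sketched with the standard library alone. `apply_cleaning_rules` is an illustrative helper name, and the code-block rule (preserve content, strip formatting) is omitted here since it depends on the HTML-removal step:

```python
import html
import re

def apply_cleaning_rules(text: str) -> str:
    """Illustrative implementation of the data cleaning rules above."""
    text = html.unescape(text)                # decode HTML entities (&amp; -> &)
    text = re.sub(r"http\S+", " ", text)      # remove URLs
    text = re.sub(r"@\w+", " ", text)         # remove mentions
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text
```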

## 9. File Locations

### 9.1 Data Files

```
data/
├── raw/
│   └── QueryResults.csv      # Original data
├── processed/
│   ├── data_bow.csv          # BoW-ready data
│   └── data_se.csv           # DL-ready data
└── .gitkeep
```

### 9.2 Model Files

```
models/
├── lda/
│   └── lda_model*.model      # LDA models
├── vectorizers/
│   ├── tfidf_vectorizer.pkl
│   └── mlb.pkl
├── classifiers/
│   └── classifier_*.pkl
└── .gitkeep
```

## 10. Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2023-03 | Initial data dictionary |
| 1.1 | 2023-04 | Added API structures |
| 2.0 | 2023-10 | Portfolio anonymization |

This data dictionary documents the IntelliTag data architecture as delivered to Stack Overflow.