
# IntelliTag - Data Dictionary

## Overview

This document defines all data structures, fields, and transformations used in the IntelliTag system. It serves as the single source of truth for data specifications.


## 1. Source Data

### 1.1 Raw Dataset: `QueryResults.csv`

Data extracted from Stack Exchange Data Explorer.

| Field | Type | Description | Example | Nullable |
|-------|------|-------------|---------|----------|
| Id | Integer | Unique question identifier | 40101130 | No |
| Title | String | Question title | "How do I calculate a rolling idxmax" | No |
| Body | String (HTML) | Question body with HTML formatting | `<p>consider the <code>pd.Series</code>...` | No |
| Tags | String | Tags in chevron format | `<python><pandas><numpy>` | No |
| Score | Integer | Question vote score | 9 | Yes |
| ViewCount | Integer | Number of views | 7584 | Yes |
| FavoriteCount | Float | Number of favorites | 0.0 | Yes |
| AnswerCount | Integer | Number of answers | 6 | Yes |

**Statistics:**

- Total Records: 50,000
- Unique Titles: 49,999
- Unique Tag Combinations: 49,190

## 2. Processed Data

### 2.1 Bag-of-Words Dataset: `data_bow.csv`

Preprocessed data optimized for Bag-of-Words feature extraction.

| Field | Type | Description | Transformation Applied |
|-------|------|-------------|------------------------|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Cleaned title | Tokenized, lowercased, lemmatized, stop words removed |
| Body | String | Cleaned body | HTML removed, tokenized, lowercased, lemmatized, stop words removed |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |

**Preprocessing Pipeline (BoW):**

```
Raw Text → HTML Removal → Tokenization → Stop Word Removal →
Lowercase → Lemmatization → Join Tokens
```
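The stages above can be sketched as a single function. This is a minimal, illustrative version: `preprocess_bow`, the tiny stop-word set, and `naive_lemmatize` are hypothetical stand-ins, since the production pipeline would use a full NLP toolkit (e.g. NLTK's stop-word list and `WordNetLemmatizer`).

```python
import re

# Toy stop-word set; a real pipeline would load a full list from an NLP toolkit.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "do", "i", "how"}

def naive_lemmatize(token: str) -> str:
    # Stand-in for real lemmatization: strips a trailing "s" from longer words.
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def preprocess_bow(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                      # HTML removal
    tokens = re.findall(r"[a-zA-Z][a-zA-Z0-9+#.]*", text)    # tokenization
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop words
    tokens = [t.lower() for t in tokens]                     # lowercase
    tokens = [naive_lemmatize(t) for t in tokens]            # lemmatization
    return " ".join(tokens)                                  # join tokens
```

The output is a plain space-joined string, which is the format the BoW vectorizers consume.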

### 2.2 Sentence Embedding Dataset: `data_se.csv`

Preprocessed data optimized for deep learning embeddings (BERT, USE).

| Field | Type | Description | Transformation Applied |
|-------|------|-------------|------------------------|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Lightly cleaned title | Tokenized, lowercased (no lemmatization) |
| Body | String | Lightly cleaned body | HTML removed, tokenized, lowercased |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |

**Preprocessing Pipeline (DL):**

```
Raw Text → HTML Removal → Tokenization → Lowercase → Join Tokens
```

**Rationale:** Deep learning models like BERT handle morphological variations internally, so lemmatization is skipped to preserve semantic richness.


## 3. Feature Representations

### 3.1 TF-IDF Features

| Attribute | Specification |
|-----------|---------------|
| Vectorizer | TfidfVectorizer |
| Max Features | 10,000 |
| N-gram Range | (1, 2) - unigrams and bigrams |
| Min DF | 2 (minimum document frequency) |
| Max DF | 0.95 (maximum document frequency) |
| Output Shape | (n_samples, 10000) |
| Data Type | scipy.sparse.csr_matrix |
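The specification above maps directly onto scikit-learn's `TfidfVectorizer` parameters. A sketch with a tiny illustrative corpus (in the real pipeline, the input would be the `text` column of `data_bow.csv`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Configuration taken from the specification table.
vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # drop terms in fewer than 2 documents
    max_df=0.95,          # drop terms in more than 95% of documents
)

# Tiny illustrative corpus standing in for the preprocessed question texts.
corpus = [
    "python pandas dataframe question",
    "python pandas series question",
    "java spring question",
    "javascript react",
]

# Returns a scipy.sparse.csr_matrix of shape (n_samples, n_features).
X = vectorizer.fit_transform(corpus)
```

Note that with `min_df=2`, terms unique to a single question (like `dataframe` here) are pruned, while the bigram `python pandas` survives because it appears in two documents.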

### 3.2 Word2Vec Features

| Attribute | Specification |
|-----------|---------------|
| Model | Pre-trained Google News or custom |
| Dimensions | 300 |
| Aggregation | Mean of word vectors |
| OOV Handling | Zero vector |
| Output Shape | (n_samples, 300) |
| Data Type | numpy.ndarray |
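The mean-of-word-vectors aggregation with zero-vector OOV handling can be sketched as follows. The `embeddings` dict is a toy stand-in for a pretrained 300-dimensional Word2Vec model (e.g. the Google News vectors loaded via gensim), and `embed_text` is an illustrative helper name:

```python
import numpy as np

DIM = 300

# Toy embedding table; values are illustrative, not real Word2Vec weights.
embeddings = {
    "python": np.full(DIM, 0.1),
    "pandas": np.full(DIM, 0.3),
}

def embed_text(text: str) -> np.ndarray:
    """Mean of word vectors; out-of-vocabulary words map to the zero vector."""
    vectors = [embeddings.get(tok, np.zeros(DIM)) for tok in text.split()]
    if not vectors:
        return np.zeros(DIM)  # empty text also maps to the zero vector
    return np.mean(vectors, axis=0)
```

Because OOV words contribute zero vectors, they still count in the denominator of the mean, which dilutes the embedding of texts with many unknown words.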

### 3.3 BERT Features

| Attribute | Specification |
|-----------|---------------|
| Model | bert-base-uncased |
| Tokenizer | BertTokenizer |
| Max Length | 512 tokens |
| Truncation | True (from end) |
| Embedding | `[CLS]` token output |
| Dimensions | 768 |
| Output Shape | (n_samples, 768) |
| Data Type | numpy.ndarray |

### 3.4 Universal Sentence Encoder Features

| Attribute | Specification |
|-----------|---------------|
| Model | TensorFlow Hub USE |
| Version | universal-sentence-encoder/4 |
| Dimensions | 512 |
| Input | Raw text (model handles preprocessing) |
| Output Shape | (n_samples, 512) |
| Data Type | numpy.ndarray |

## 4. Label Data

### 4.1 Tag Structure

**Original Format:**

```
<python><pandas><numpy><dataframe><series>
```

**Parsed Format:**

```python
['python', 'pandas', 'numpy', 'dataframe', 'series']
```
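A possible implementation of the `parse_tags` function documented in section 7.3, using a single regular expression to pull tag names out of the chevron format:

```python
import re

def parse_tags(tag_string: str) -> list[str]:
    """Extract tag names from the chevron format used in the raw export.

    "<python><pandas>" -> ["python", "pandas"]
    """
    return re.findall(r"<([^<>]+)>", tag_string)
```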

### 4.2 Multi-Label Encoding

| Encoding Method | Description | Use Case |
|-----------------|-------------|----------|
| MultiLabelBinarizer | Binary matrix (n_samples, n_tags) | Classification training |
| List of Lists | Python list format | Data processing |
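The round-trip between the two encodings is handled by scikit-learn's `MultiLabelBinarizer`; a minimal sketch with illustrative tag lists:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# List-of-lists format, as produced by tag parsing.
tag_lists = [["python", "pandas"], ["javascript"], ["python"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)    # binary matrix, shape (n_samples, n_tags)

decoded = mlb.inverse_transform(Y)  # back to tuples of tag names
```

The columns of `Y` follow `mlb.classes_`, which is sorted alphabetically; the fitted binarizer is what gets persisted as `mlb.pkl` (section 5.2).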

**Tag Statistics:**

| Metric | Value |
|--------|-------|
| Total Unique Tags | ~5,000 (in sample) |
| Avg Tags per Question | 3.2 |
| Max Tags per Question | 5 |
| Min Tags per Question | 1 |
| Most Common Tag | javascript |

### 4.3 Top 50 Tags (Sample)

| Rank | Tag | Frequency |
|------|-----|-----------|
| 1 | javascript | 8,234 |
| 2 | python | 7,891 |
| 3 | java | 6,543 |
| 4 | c# | 5,678 |
| 5 | php | 4,321 |
| 6 | android | 3,987 |
| 7 | html | 3,876 |
| 8 | jquery | 3,654 |
| 9 | css | 3,432 |
| 10 | c++ | 3,210 |
| ... | ... | ... |

## 5. Model Artifacts

### 5.1 LDA Models

| File Pattern | Description |
|--------------|-------------|
| `lda_model.model` | Main LDA model (default topics) |
| `lda_model_{n}.model` | LDA model with n topics |
| `*.model.expElogbeta.npy` | Topic-word distribution matrix |
| `*.model.id2word` | Dictionary mapping |
| `*.model.state` | Model state for updates |

**LDA Configuration:**

| Parameter | Value |
|-----------|-------|
| Topics Range | 5-20 |
| Passes | 15 |
| Chunksize | 2000 |
| Alpha | 'auto' |
| Eta | 'auto' |

### 5.2 Vectorizers

| Artifact | Format | Content |
|----------|--------|---------|
| `tfidf_vectorizer.pkl` | Pickle | Fitted TfidfVectorizer |
| `mlb.pkl` | Pickle | MultiLabelBinarizer |

### 5.3 Classifiers

| Artifact | Format | Content |
|----------|--------|---------|
| `classifier_bow.pkl` | Pickle | BoW-based classifier |
| `classifier_w2v.pkl` | Pickle | Word2Vec-based classifier |
| `classifier_bert.pkl` | Pickle | BERT-based classifier |
| `classifier_use.pkl` | Pickle | USE-based classifier |

## 6. API Data Structures

### 6.1 Prediction Request

```json
{
  "title": "string (required)",
  "body": "string (required)",
  "top_k": "integer (optional, default=5)",
  "threshold": "float (optional, default=0.1)"
}
```

**Validation Rules:**

- `title`: 10-300 characters
- `body`: 30-30000 characters
- `top_k`: 1-10
- `threshold`: 0.0-1.0
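The rules above translate into a straightforward range check. `validate_request` is a hypothetical helper shown for illustration; the production service may implement these checks differently (e.g. via a schema library):

```python
def validate_request(payload: dict) -> list[str]:
    """Return a list of validation error messages; an empty list means valid."""
    errors = []
    title = payload.get("title", "")
    body = payload.get("body", "")
    top_k = payload.get("top_k", 5)          # optional, default=5
    threshold = payload.get("threshold", 0.1)  # optional, default=0.1

    if not 10 <= len(title) <= 300:
        errors.append("Title must be between 10 and 300 characters")
    if not 30 <= len(body) <= 30000:
        errors.append("Body must be between 30 and 30000 characters")
    if not 1 <= top_k <= 10:
        errors.append("top_k must be between 1 and 10")
    if not 0.0 <= threshold <= 1.0:
        errors.append("threshold must be between 0.0 and 1.0")
    return errors
```

A non-empty result would be surfaced as a `VALIDATION_ERROR` response (section 6.3).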

### 6.2 Prediction Response

```json
{
  "status": "success",
  "predictions": [
    {
      "tag": "python",
      "confidence": 0.92
    },
    {
      "tag": "pandas",
      "confidence": 0.87
    },
    {
      "tag": "dataframe",
      "confidence": 0.73
    }
  ],
  "model_version": "1.0.0",
  "processing_time_ms": 45
}
```

### 6.3 Error Response

```json
{
  "status": "error",
  "error_code": "VALIDATION_ERROR",
  "message": "Title must be between 10 and 300 characters",
  "timestamp": "2023-10-15T10:30:00Z"
}
```

**Error Codes:**

| Code | Description |
|------|-------------|
| VALIDATION_ERROR | Input validation failed |
| MODEL_ERROR | Model inference failed |
| INTERNAL_ERROR | Unexpected server error |
| RATE_LIMIT_ERROR | Too many requests |

## 7. Transformation Functions

### 7.1 Text Cleaning Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `clean_html(text)` | HTML string | Plain text | Removes HTML tags |
| `tokenize(text)` | String | List[String] | Splits into tokens |
| `remove_stop_words(tokens)` | List[String] | List[String] | Filters stop words |
| `lemmatize(tokens)` | List[String] | List[String] | Reduces to base form |
| `normalize_case(tokens)` | List[String] | List[String] | Lowercase conversion |

### 7.2 Feature Extraction Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `extract_tfidf(texts)` | List[String] | Sparse Matrix | TF-IDF vectors |
| `extract_word2vec(texts)` | List[String] | ndarray | Word2Vec embeddings |
| `extract_bert(texts)` | List[String] | ndarray | BERT embeddings |
| `extract_use(texts)` | List[String] | ndarray | USE embeddings |

### 7.3 Tag Processing Functions

| Function | Input | Output | Description |
|----------|-------|--------|-------------|
| `parse_tags(tag_string)` | `"<a><b><c>"` | `['a','b','c']` | Extracts tags |
| `encode_tags(tag_lists)` | List[List[String]] | Binary Matrix | Multi-label encoding |
| `decode_tags(binary)` | Binary Matrix | List[List[String]] | Reverse encoding |

## 8. Data Quality Rules

### 8.1 Validation Rules

| Rule | Field | Condition |
|------|-------|-----------|
| Not Null | Title, Body, Tags | Cannot be empty |
| Min Length | Title | >= 10 characters |
| Min Length | Body | >= 30 characters |
| Valid Tags | Tags | At least 1 tag |
| Max Tags | Tags | <= 5 tags |
| UTF-8 | All text | Valid UTF-8 encoding |

### 8.2 Data Cleaning Rules

| Issue | Action |
|-------|--------|
| HTML entities | Decode (e.g., `&amp;` → `&`) |
| Code blocks | Preserve content, remove formatting |
| URLs | Remove (strings starting with `http`) |
| Mentions | Remove (strings starting with `@`) |
| Extra whitespace | Normalize to a single space |
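Most of these rules can be sketched with the standard library alone. `apply_cleaning_rules` is an illustrative helper name, and the code-block rule (preserve content, strip formatting) is omitted here since it depends on the HTML-removal step:

```python
import html
import re

def apply_cleaning_rules(text: str) -> str:
    """Illustrative implementation of the data cleaning rules above."""
    text = html.unescape(text)                # decode HTML entities (&amp; -> &)
    text = re.sub(r"http\S+", " ", text)      # remove URLs
    text = re.sub(r"@\w+", " ", text)         # remove mentions
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text
```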

## 9. File Locations

### 9.1 Data Files

```
data/
├── raw/
│   └── QueryResults.csv      # Original data
├── processed/
│   ├── data_bow.csv          # BoW-ready data
│   └── data_se.csv           # DL-ready data
└── .gitkeep
```

### 9.2 Model Files

```
models/
├── lda/
│   └── lda_model*.model      # LDA models
├── vectorizers/
│   ├── tfidf_vectorizer.pkl
│   └── mlb.pkl
├── classifiers/
│   └── classifier_*.pkl
└── .gitkeep
```

## 10. Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2023-03 | Initial data dictionary |
| 1.1 | 2023-04 | Added API structures |
| 2.0 | 2023-10 | Portfolio anonymization |

This data dictionary documents the IntelliTag data architecture as delivered to Stack Overflow.