This document defines all data structures, fields, and transformations used in the IntelliTag system. It serves as the single source of truth for data specifications.
1. Source Data
1.1 Raw Dataset: QueryResults.csv
Data extracted from Stack Exchange Data Explorer.
| Field | Type | Description | Example | Nullable |
|---|---|---|---|---|
| Id | Integer | Unique question identifier | 40101130 | No |
| Title | String | Question title | `"How do I calculate a rolling idxmax"` | No |
| Body | String (HTML) | Question body with HTML formatting | `"<p>consider the <code>pd.Series</code>..."` | No |
| Tags | String | Tags in chevron format | `"<python><pandas><numpy>"` | No |
| Score | Integer | Question vote score | 9 | Yes |
| ViewCount | Integer | Number of views | 7584 | Yes |
| FavoriteCount | Float | Number of favorites | 0.0 | Yes |
| AnswerCount | Integer | Number of answers | 6 | Yes |
Statistics:
- Total Records: 50,000
- Unique Titles: 49,999
- Unique Tag Combinations: 49,190
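The chevron-formatted `Tags` field can be parsed with a simple regular expression. This is a minimal sketch (the function name `parse_tags` is illustrative, not from the codebase):

```python
import re

def parse_tags(raw: str) -> list[str]:
    """Extract tag names from the chevron format used by Stack Exchange,
    e.g. "<python><pandas>" -> ["python", "pandas"]."""
    return re.findall(r"<([^<>]+)>", raw)

print(parse_tags("<python><pandas><numpy>"))  # ['python', 'pandas', 'numpy']
```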
2. Processed Data
2.1 Bag-of-Words Dataset: data_bow.csv
Preprocessed data optimized for Bag-of-Words feature extraction.
| Field | Type | Description | Transformation Applied |
|---|---|---|---|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Cleaned title | Tokenized, lowercased, lemmatized, stop words removed |
| Body | String | Cleaned body | HTML removed, tokenized, lowercased, lemmatized, stop words removed |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |
Preprocessing Pipeline (BoW):
Raw Text → HTML Removal → Tokenization → Stop Word Removal →
Lowercase → Lemmatization → Join Tokens
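The BoW pipeline above can be sketched as follows. This is a simplified illustration: the regex tokenizer, the tiny stop-word set, and the dictionary-based lemma map are placeholders for whatever tokenizer, stop-word list, and lemmatizer the actual pipeline uses (e.g. NLTK components), which this document does not specify:

```python
import re

STOP_WORDS = {"the", "a", "an", "i", "do", "how", "to", "of", "in"}  # illustrative subset
LEMMA_MAP = {"rolling": "roll", "views": "view"}  # placeholder for a real lemmatizer

def preprocess_bow(raw_html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw_html)                          # HTML removal
    tokens = re.findall(r"[A-Za-z]+", text)                           # tokenization
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]       # stop word removal
    tokens = [t.lower() for t in tokens]                              # lowercase
    tokens = [LEMMA_MAP.get(t, t) for t in tokens]                    # lemmatization (stub)
    return " ".join(tokens)                                           # join tokens

print(preprocess_bow("<p>How do I calculate a rolling idxmax</p>"))
# calculate roll idxmax
```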
2.2 Sentence Embedding Dataset: data_se.csv
Preprocessed data optimized for deep learning embeddings (BERT, USE).
| Field | Type | Description | Transformation Applied |
|---|---|---|---|
| Id | Integer (Index) | Question identifier | Set as index |
| Title | String | Lightly cleaned title | Tokenized, lowercased (no lemmatization) |
| Body | String | Lightly cleaned body | HTML removed, tokenized, lowercased |
| Tags | List[String] | Parsed tags | Extracted from chevron format |
| text | String | Combined text | `Title + " " + Body` |
Preprocessing Pipeline (DL):
Raw Text → HTML Removal → Tokenization → Lowercase → Join Tokens
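The lighter DL pipeline can be sketched the same way, with the stop-word and lemmatization steps dropped. As above, the regex tokenizer is an illustrative stand-in for the actual tokenizer:

```python
import re

def preprocess_se(raw_html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw_html)        # HTML removal
    tokens = re.findall(r"[A-Za-z]+", text)         # tokenization
    return " ".join(t.lower() for t in tokens)      # lowercase + join tokens

print(preprocess_se("<p>consider the <code>pd.Series</code></p>"))
# consider the pd series
```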
Rationale: Deep learning models like BERT handle morphological variations internally, so lemmatization is skipped to preserve semantic richness.