John Hooker, Eileen Chen, Davy Wang
This repository holds the data-cleaning, model training, and evaluation scripts for our CMSC723 Final Project. The project trains a model to predict author attribute information (sentiment score, location, and gender) from tweet text. It compares a baseline model trained to predict all 3 attributes at once against a model whose prediction task is restricted, on each example, to a single randomly chosen attribute.
We found the data on data.world, which included a dataset with tweet text, sentiment score, location, gender, and other attributes (the latter removed in the cleaned data). We were unable to push our data to this GitHub repository because git's secret scanning flagged AWS keys in it. We could not verify whether these were false positives caused by the data formatting or real keys, so for this security reason we do not publish our data or model weights here. To reproduce our results, download the 3 relevant tables from the link and run the /data_cleaning/join_clean_data.py script to rebuild the data.
We note that most of the /data_cleaning scripts were written with substantial AI assistance. The scripts in the /data_cleaning folder (all runnable with python once the required packages are installed) are:
- join_clean_data.py: The main script used to merge and clean the data tables. It first joins the tables to add the country and gender attributes to the main table, drops all columns other than tweet text, sentiment, country, and gender, then drops rows with missing values, prints statistics, and saves the results to merged_table.json and merged_table.csv.
- data_analysis_text.py: Script used to print statistics about the text content, including metrics about the length and the most common uni-, bi-, and trigrams. Also creates a histogram that was used in our report.
- data_analysis_sentiment.py: Script used to print statistics about the sentiment values, show the top values, and generate a histogram of the values.
- data_analysis_location.py: Script used to print statistics about the number of and most common locations tweets originated from.
- data_analysis_gender.py: Script used to print statistics about the gender distribution of the authors.
- tweetid_to_date.py: Script used to convert the tweet IDs from the data to the datetimes they encode.
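The conversion in tweetid_to_date.py relies on the fact that tweet IDs issued after late 2010 are Snowflake IDs, whose upper bits encode a millisecond timestamp relative to Twitter's custom epoch. A minimal sketch of that conversion (the function name here is illustrative, not necessarily what the script uses):

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter's Snowflake epoch (2010-11-04 UTC)

def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Extract the creation time encoded in a Snowflake tweet ID.

    The bits above the low 22 (worker/sequence bits) hold milliseconds
    elapsed since Twitter's custom epoch.
    """
    ms_since_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms_since_epoch / 1000, tz=timezone.utc)
```

Note this only applies to IDs generated after Snowflake was deployed; older sequential IDs carry no timestamp.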
main.py holds all the core logic for training and evaluation. It can be run using python3 main.py after installing the required packages.
- Data Ingestion and Splitting: The script loads the cleaned dataset, final_merged_table_cut.json, validates that all required columns (text, sentiment, country, and gender) are present, drops any rows with missing values, and performs a deterministic split using a fixed random seed: 78% for the training pool and 22% for the test set.
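The loading and splitting step can be sketched as follows with pandas. This is an illustrative reconstruction, not the script's exact code; the seed value and helper name are assumptions:

```python
import pandas as pd

REQUIRED = ["text", "sentiment", "country", "gender"]

def load_and_split(path: str, seed: int = 42):
    """Load the cleaned table, validate columns, drop rows with
    missing values, and split 78/22 deterministically via a fixed seed.
    (Sketch: the seed and function name are illustrative.)"""
    df = pd.read_json(path)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    df = df.dropna(subset=REQUIRED)
    train = df.sample(frac=0.78, random_state=seed)  # 78% training pool
    test = df.drop(train.index)                      # remaining 22%
    return train, test
```

Because the seed is fixed, repeated runs produce identical train/test partitions.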
- Training: Model A performs joint extraction, where all attributes are output in a single response string. This represents the standard multi-task learning approach, in which the model shares its context across all labels. Model B implements the random single-masking strategy: the script randomly selects one target attribute and masks the loss for the other two. At inference time, the model is sampled three separate times, once for each attribute.
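The single-masking strategy for Model B can be sketched as below. This is a simplified illustration of the idea, not main.py's actual implementation; the prompt template and function names are assumptions:

```python
import random

ATTRIBUTES = ["sentiment", "country", "gender"]

def make_single_target_example(tweet: str, labels: dict, rng: random.Random):
    """Model B training sketch: sample one attribute per example and
    supervise only that attribute; the other two contribute no loss.
    (Illustrative prompt format.)"""
    attr = rng.choice(ATTRIBUTES)
    prompt = f"Tweet: {tweet}\nPredict the author's {attr}:"
    target = str(labels[attr])
    return prompt, target

def predict_all(model, tweet: str) -> dict:
    """Inference sketch: query the model once per attribute."""
    return {attr: model(f"Tweet: {tweet}\nPredict the author's {attr}:")
            for attr in ATTRIBUTES}
```

In the joint setup (Model A), by contrast, a single prompt would supervise one response string containing all three labels.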
- Metrics and Evaluation: The script performs evaluation on a 1,000-sample test subset:
- Attribute-level accuracy: Calculates the exact match for sentiment, country, and gender independently
- Macro-average accuracy: Computes the mean performance across all three tasks
- Normalization: Ensures that case sensitivity and whitespace do not penalize correct predictions
- Outputs: Model checkpoints and LoRA adapter weights are saved to the Tinker cloud service. The script generates error_analysis_results.csv, which contains results for a 200-sample test subset to allow qualitative error analysis. All training runs, including cross-entropy loss and training negative log-likelihood, are logged to WandB for tracking and debugging.
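The normalization and accuracy metrics described above can be sketched as follows. Function names are illustrative; the real script may structure this differently:

```python
def normalize(s: str) -> str:
    """Strip whitespace and case-fold so e.g. 'Female ' matches 'female'."""
    return s.strip().lower()

def attribute_accuracy(preds: list, golds: list) -> float:
    """Exact-match accuracy for a single attribute, after normalization."""
    return sum(normalize(p) == normalize(g)
               for p, g in zip(preds, golds)) / len(golds)

def macro_accuracy(acc_by_attr: dict) -> float:
    """Unweighted mean of the per-attribute accuracies
    (sentiment, country, gender)."""
    return sum(acc_by_attr.values()) / len(acc_by_attr)
```

Macro-averaging weights the three tasks equally, so a model cannot inflate its score by excelling at only the easiest attribute.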