diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_reddit_data.csv b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_reddit_data.csv new file mode 100644 index 00000000..8ad0657d --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_reddit_data.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3b5071d8525b5e227127c719f0d971387be79fe079f42173d6e9a4f7aaf3eb48 +size 8820071 diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_comments.csv b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_comments.csv new file mode 100644 index 00000000..2d243af3 --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_comments.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:caa88bd35e45d73cd059bc94db0340c60c77f277f5456abc09982f324bda1859 +size 3231079 diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_posts.csv b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_posts.csv new file mode 100644 index 00000000..1b5c9625 --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/Data/r_ubc_posts.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4cf41652b9371f332b35d4541ecfe270c35dd332de9ecabfc7a11f0f16106b7e +size 547342 diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.ipynb b/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.ipynb new file mode 100644 index 00000000..1ec7c62f --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.ipynb @@ -0,0 +1,3809 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "bWTvogDqdEiX" + }, + "source": [ + "# Encoder-only Architecture for Text Classification\n", + "\n", + "Recall that the chief reason for using machine learning models over traditional, lexicon-based sentiment analysis is that like 
we noted before, lexicon-based approaches to text classification rely on fixed dictionaries, so they miss context, irony, and slang. ML models, especially deep learning models, learn complex patterns from data, which makes them particularly useful for modern text such as social media posts and online discussions.\n",
+    "\n",
+    "\n",
+    "Generally, for text classification tasks, it is faster to use an encoder-only model like BERT, since we don't need to generate an output sequence: all we are doing is assigning pre-defined labels to text. In other words, our focus is on *understanding* text rather than *generating* text, so encoder-only models like BERT are both faster and less computationally expensive."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 319
+    },
+    "id": "lOn2QpIxdEid",
+    "outputId": "77da072f-d675-4796-bd0a-31c7e6696d94"
+   },
+   "outputs": [],
+   "source": [
+    "from advanced_encoder_classifier_tests import Tests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "uLDjxqMndEih"
+   },
+   "source": [
+    "## 0. Data Importing and Cleaning\n",
+    "\n",
+    "As mentioned, ML-based text classification works best for nuanced text with lots of slang. 
Hence, we'll test it on a dataset of recently scraped *r/UBC* posts.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "9oMu6RGIdEii"
+   },
+   "source": [
+    "```bash\n",
+    "!pip install transformers # you likely don't have this installed.\n",
+    "!pip install torch # the PyPI package for PyTorch is named torch, not pytorch.\n",
+    "!pip install datasets # you likely don't have this installed.\n",
+    "!pip install evaluate # you likely don't have this installed.\n",
+    "!pip install sentencepiece # you likely don't have this installed.\n",
+    "!pip install accelerate # you likely don't have this installed.\n",
+    "!pip install numpy\n",
+    "!pip install pandas\n",
+    "!pip install seaborn\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "GuKsxdADdEij"
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer, AutoModelForMaskedLM\n",
+    "import torch\n",
+    "import re\n",
+    "import pandas as pd\n",
+    "from transformers import pipeline\n",
+    "import matplotlib.pyplot as plt\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "import numpy as np\n",
+    "from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments\n",
+    "from datasets import Dataset\n",
+    "from sklearn.preprocessing import MultiLabelBinarizer\n",
+    "import os\n",
+    "from numpy import array\n",
+    "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
+    "from scipy.special import expit\n",
+    "import ast"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "GWryxyb2dEil",
+    "outputId": "be729f1a-f241-45ee-ce68-db5dc89faee0"
+   },
+   "outputs": [],
+   "source": [
+    "tokenizer = AutoTokenizer.from_pretrained(\"google-bert/bert-base-uncased\") # to break our reddit posts down into tokens\n",
+    "model = 
AutoModelForMaskedLM.from_pretrained(\"google-bert/bert-base-uncased\") # our base model (we'll swap in a classification head later)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ENm8PeNSdEim"
+   },
+   "source": [
+    "The Reddit dataset was scraped in late February 2025; however, you are free to scrape newer posts locally by `cd`-ing into the correct directory and running `python project/docs/4_Advanced/advanced_encoder_classifier/r_ubc_scraper.py` in your terminal. This will generate a fresh set of post/comment datasets for our analysis."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "KrTucVDQdEir"
+   },
+   "outputs": [],
+   "source": [
+    "# Load the dataset\n",
+    "reddit_comments = pd.read_csv('Data/r_ubc_comments.csv')\n",
+    "reddit_posts = pd.read_csv(\"Data/r_ubc_posts.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "sHoTFP51dEit"
+   },
+   "outputs": [],
+   "source": [
+    "# Merging datasets: join each post with the concatenated text of its comments\n",
+    "reddit_data = reddit_posts.merge(reddit_comments.groupby(\"post_id\")[\"body\"].apply(lambda x: \" \".join(x)), left_on=\"id\", right_on='post_id', how=\"left\")\n",
+    "reddit_data[\"full_text\"] = reddit_data['title'] + ' ' + reddit_data[\"selftext\"].fillna('') + ' ' + reddit_data[\"body\"].fillna('')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "n_IAXST1dEiv"
+   },
+   "outputs": [],
+   "source": [
+    "# Convert from UTC to detailed month, date, day of week and hour\n",
+    "reddit_data['created_utc'] = pd.to_datetime(\n",
+    "    reddit_data['created_utc'],\n",
+    "    unit='s',  # Timestamp is in seconds\n",
+    "    utc=True   # Enforce UTC timezone\n",
+    ")\n",
+    "\n",
+    "reddit_data['month'] = reddit_data['created_utc'].dt.month_name()\n",
+    "reddit_data['date'] = reddit_data['created_utc'].dt.day\n",
+    "reddit_data['day_of_week'] = reddit_data['created_utc'].dt.day_name()\n",
+    "reddit_data['hour'] = reddit_data['created_utc'].dt.hour\n",
+    "\n",
+    "def clean_text(text):\n",
+    "    
if isinstance(text, str):\n",
+    "        text = text.lower()\n",
+    "        text = re.sub(r\"http\\S+\", \"\", text)  # strip URLs\n",
+    "        text = re.sub(r\"[^a-zA-Z0-9\\s]\", \"\", text)  # strip punctuation and other symbols\n",
+    "        return text.strip()\n",
+    "    else:\n",
+    "        return \"\"\n",
+    "\n",
+    "reddit_data[\"clean_text\"] = reddit_data[\"full_text\"].apply(clean_text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "3KgTS5XgdEiw"
+   },
+   "source": [
+    "## 1. Labelling using a pre-trained RoBERTa model\n",
+    "\n",
+    "Here we use a pre-trained model from Hugging Face to generate a sentiment label for each post. Specifically, we'll be using `cardiffnlp/twitter-roberta-base-sentiment-latest`, a model pre-trained on English Twitter text. It classifies sentiment into three categories: 0 -> Negative; 1 -> Neutral; 2 -> Positive. \n",
+    "\n",
+    "If you want to work on datasets containing multiple languages, you can use `google-bert/bert-base-multilingual-uncased`, a model pretrained with a masked language modeling (MLM) objective on the 102 languages with the largest Wikipedias. You can find more information on its Hugging Face model card."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "Twc2jPuadEi0",
+    "outputId": "fdb112c4-2c10-4c02-f8c0-1972d2819256"
+   },
+   "outputs": [],
+   "source": [
+    "emotion_classifier = pipeline(\"sentiment-analysis\", model=\"cardiffnlp/twitter-roberta-base-sentiment-latest\", tokenizer=\"cardiffnlp/twitter-roberta-base-sentiment-latest\")\n",
+    "\n",
+    "reddit_data[\"sentiment\"] = reddit_data[\"clean_text\"].apply(lambda x: emotion_classifier(x[:512])[0][\"label\"] if pd.notna(x) else \"neutral\")  # x[:512] truncates by character count to keep inputs within the model's limit"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "9TOMGlYqdEi1"
+   },
+   "source": [
+    "Here we use another pre-trained model to generate topic labels for each post. 
Specifically, we are using `cardiffnlp/tweet-topic-base-multilingual`, a model pre-trained on roughly 198M multilingual tweets and fine-tuned for English, Spanish, Japanese, and Greek. You can find information about its labels on its Hugging Face model card.\n",
+    "\n",
+    "As of February 2025 there is still no \"good\" multilingual topic-classification model for some languages, but you can usually find something that fits your needs on the Hugging Face Hub."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 476
+    },
+    "id": "iV5stw0BdEi2",
+    "outputId": "f8b4ac6b-12db-4284-92db-e5eab1375c6a"
+   },
+   "outputs": [],
+   "source": [
+    "\n",
+    "MODEL = \"cardiffnlp/tweet-topic-base-multilingual\"\n",
+    "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n",
+    "model = AutoModelForSequenceClassification.from_pretrained(MODEL)\n",
+    "\n",
+    "class_mapping = model.config.id2label\n",
+    "\n",
+    "def classify_text(text):\n",
+    "    tokens = tokenizer(text, return_tensors=\"pt\", truncation=True, padding=True, max_length=512)\n",
+    "    output = model(**tokens)\n",
+    "\n",
+    "    scores = output[0][0].detach().numpy()\n",
+    "    scores = expit(scores)  # sigmoid: turns logits into independent per-label probabilities\n",
+    "\n",
+    "    predictions = (scores >= 0.5) * 1\n",
+    "\n",
+    "    # Get predicted labels\n",
+    "    predicted_labels = [class_mapping[i] for i in range(len(predictions)) if predictions[i]]\n",
+    "\n",
+    "    return \", 
\".join(predicted_labels) if predicted_labels else \"other\"\n",
+    "\n",
+    "reddit_data[\"topic\"] = reddit_data[\"clean_text\"].apply(classify_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "_n6eAeofdEi3"
+   },
+   "outputs": [],
+   "source": [
+    "def clean_topics(topic_list):\n",
+    "    if isinstance(topic_list, list):\n",
+    "        return [topic.strip() for topic in topic_list]\n",
+    "    return topic_list\n",
+    "\n",
+    "reddit_data[\"topics\"] = reddit_data[\"topic\"].apply(lambda x: x.split(\",\"))\n",
+    "\n",
+    "reddit_data[\"topics\"] = reddit_data[\"topics\"].apply(clean_topics)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "ipS2vXLcdEi4"
+   },
+   "outputs": [],
+   "source": [
+    "reddit_data[\"multi_labels\"] = reddit_data.apply(lambda row: row[\"topics\"] + [row[\"sentiment\"]], axis=1)\n",
+    "\n",
+    "# ast.literal_eval is a safer alternative to eval for parsing stringified lists\n",
+    "reddit_data[\"multi_labels\"] = reddit_data[\"multi_labels\"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)\n",
+    "\n",
+    "reddit_data.to_csv('Data/r_reddit_data.csv', index=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "mgMh0ImGdEi5"
+   },
+   "source": [
+    "Since we saved the dataframe as a CSV, we can resume from this step later by reading the file back in directly."
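Because `to_csv` stores list-valued cells as their string representation, the reload step parses them back with `ast.literal_eval`. A minimal sketch of that round-trip, with hypothetical label values:

```python
import ast

# What read_csv hands back for a list-valued cell: the list's string repr.
saved = "['sports', 'negative']"  # hypothetical labels

labels = ast.literal_eval(saved)  # parses Python literals only
print(labels)  # ['sports', 'negative']

# Unlike eval(), literal_eval refuses anything that is not a plain literal.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    print("rejected non-literal input")
```

This is why `ast.literal_eval` is preferred over `eval` for untrusted CSV contents: it can reconstruct lists, dicts, and numbers, but cannot execute arbitrary code.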
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "30C03J50dEi6", + "outputId": "3b2ce9cf-1907-4a53-de4b-f59f0c5e4fcb" + }, + "outputs": [], + "source": [ + "# Split training and testing data\n", + "# You can start from this step if you don't want to re-generate emotion and topic labels\n", + "reddit_data = pd.read_csv(\"Data/r_reddit_data.csv\")\n", + "reddit_data[\"multi_labels\"] = reddit_data[\"multi_labels\"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x) \n", + "\n", + "train_indices, test_indices = train_test_split(reddit_data.index, test_size=0.2, random_state=114514) # We set seed to ensure reproducible results\n", + "\n", + "train_data = reddit_data.loc[train_indices].reset_index(drop=True)\n", + "test_data = reddit_data.loc[test_indices].reset_index(drop=True)\n", + "\n", + "print(train_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YmNywh_GdEi7" + }, + "source": [ + "We can visualize the share of different labels with a pie chart." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 708 + }, + "id": "vTIhSNw8dEi7", + "outputId": "c4f15ec8-8f36-48d5-864a-4efc9542a36d" + }, + "outputs": [], + "source": [ + "# Print the distribution table of labels\n", + "\n", + "train_data[\"topics\"] = train_data[\"topics\"].apply(lambda x: eval(x) if isinstance(x, str) else x)\n", + "test_data[\"topics\"] = test_data[\"topics\"].apply(lambda x: eval(x) if isinstance(x, str) else x)\n", + "\n", + "train_data[\"topics\"] = train_data[\"topics\"].apply(clean_topics)\n", + "test_data[\"topics\"] = test_data[\"topics\"].apply(clean_topics)\n", + "\n", + "train_exploded = train_data.explode(\"topics\")\n", + "test_exploded = test_data.explode(\"topics\")\n", + "\n", + "train_counts = train_exploded[\"topics\"].value_counts()\n", + "test_counts = test_exploded[\"topics\"].value_counts()\n", + "\n", + "distribution_df = pd.DataFrame({\n", + " \"train_count\": train_counts,\n", + " \"test_count\": test_counts\n", + "}).fillna(0).astype(int)\n", + "\n", + "distribution_df[\"train %\"] = (distribution_df[\"train_count\"]/distribution_df[\"train_count\"].sum() * 100).round(2)\n", + "distribution_df[\"test %\"] = (distribution_df[\"test_count\"]/distribution_df[\"test_count\"].sum() * 100).round(2)\n", + "\n", + "distribution_df.sort_values(by=\"train_count\", ascending=False, inplace=True)\n", + "\n", + "display(distribution_df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "VgnBUxojdEi8", + "outputId": "bca20f7a-20bd-4a5a-ba53-553b05e31776" + }, + "outputs": [], + "source": [ + "# Plot the Pie Chart\n", + "train_emotion_counts = pd.Series(train_data[\"sentiment\"]).value_counts()\n", + "test_emotion_counts = pd.Series(test_data[\"sentiment\"]).value_counts()\n", + "\n", + "plt.figure(figsize=(12,8))\n", + 
"plt.subplot(1,2,1)\n",
+    "train_emotion_counts.plot.pie(autopct='%1.1f%%', startangle=140, title=\"Train Set Labels (Emotion)\")\n",
+    "plt.ylabel('')\n",
+    "\n",
+    "plt.subplot(1,2,2)\n",
+    "test_emotion_counts.plot.pie(autopct='%1.1f%%', startangle=140, title=\"Test Set Labels (Emotion)\")\n",
+    "plt.ylabel('')\n",
+    "\n",
+    "plt.show()\n",
+    "\n",
+    "plt.figure(figsize=(12, 9))\n",
+    "distribution_df.plot.bar(y=\"train_count\", title=\"Train Set Labels (Topic)\")\n",
+    "plt.ylabel('')\n",
+    "\n",
+    "plt.show()\n",
+    "\n",
+    "plt.figure(figsize=(12, 9))\n",
+    "distribution_df.plot.bar(y=\"test_count\", title=\"Test Set Labels (Topic)\")\n",
+    "plt.ylabel('')\n",
+    "\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "cjZGwnF3dEjB"
+   },
+   "source": [
+    "Then we can preprocess our data by binarizing the unique labels and tokenizing our text column for classification."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "_Y-Uy2VZdEjB"
+   },
+   "source": [
+    "### Question 1\n",
+    "\n",
+    "Out of the options below, which one is not a strength of using an encoder-only model for classification?\n",
+    "\n",
+    "- A) It is relatively cheap to train.\n",
+    "- B) It can generate labels itself even if no pre-defined label is attached to the training texts.\n",
+    "- C) It can overcome missing context, irony or slang.\n",
+    "- D) It works well with multi-label tasks.\n",
+    "- E) None of the above.\n",
+    "\n",
+    "*Enter your answer below as a string with one of A, B, C, D, E, i.e. \"A\"*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "YIlSOP35dEjC"
+   },
+   "outputs": [],
+   "source": [
+    "answer1 = # Your answer here\n",
+    "\n",
+    "Tests.test1(answer1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "w_8tuz61dEjC"
+   },
+   "source": [
+    "## 2. Training Classifiers"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "BXoiWEF-dEjD"
+   },
+   "source": [
+    "### 1. 
OneVsRestClassifier\n",
+    "\n",
+    "This is our first classification strategy: it fits one binary classifier per class.\n",
+    "\n",
+    "For example, if you want to classify 3 types of fruit (apple, banana, and orange) with this strategy, you will need to train 3 binary classifiers: the first determines whether a sample is an apple, the second whether it is a banana, and the third whether it is an orange.\n",
+    "\n",
+    "The 3 classifiers then each score a new observation, with each vote expressed as a probability. Say we feed in an observation whose fruit type is unknown: the \"Apple\" classifier claims it is 80% likely an apple, the \"Banana\" classifier claims it is 10% likely a banana, and the \"Orange\" classifier claims it is 50% likely an orange. We classify based on the highest probability, so our new object is an apple.\n",
+    "\n",
+    "While it is simple and effective, this strategy has its drawbacks. What if there is a large number of labels to classify? What if the labels form a hierarchy? In the first situation, training one classifier per label can take a long time; in the second, independent binary classifiers ignore the hierarchy and can make many mistakes. In those cases we are better off with an alternative strategy.\n",
+    "\n",
+    "Regardless, this is a good way to begin encoder classification.\n",
+    "\n",
+    "### 2. MultiOutputClassifier\n",
+    "\n",
+    "This strategy consists of fitting one classifier per target. It is a simple way of extending classifiers that do not natively support multi-target classification.\n",
+    "\n",
+    "It is somewhat similar to the *OneVsRest* strategy we just described, but takes a different approach. Let's go back to the fruit classification example, where in the previous setup we trained 3 different classifiers for 3 different types of fruit. 
Now, suppose that we instead determine the type of fruit by looking at multiple targets shared by these fruits (e.g., color, smoothness, and flavor). The *MultiOutput* strategy would then train one classifier to specialize in predicting each target.\n",
+    "\n",
+    "You may have noticed that this overcomes the shortcomings of the *OneVsRest* strategy to a certain extent, especially when there are more labels than targets to predict. However, when there are more targets than labels, this strategy is less efficient than *OneVsRest*, so we need to be cautious when choosing a strategy.\n",
+    "\n",
+    "### 3. ClassifierChain\n",
+    "\n",
+    "Our last strategy trains a multi-label model that arranges binary classifiers into a chain. \n",
+    "\n",
+    "Sticking with the fruit categorization example (because it's simple and straightforward), assume we've used a *MultiOutput* strategy, which gives us 3 classifiers handling 3 different targets. How can we make them work together rather than individually? A *ClassifierChain* treats the prediction of each classifier as an extra feature for the next one in the chain, improving classification quality by passing evidence down the chain.\n",
+    "\n",
+    "While this strategy sounds smarter than the first two, it has its drawbacks. First, it is more prone to spurious correlations and over-fitting. Second, it requires more resources to train. Third, when the labels are largely independent of one another, the improvement it brings may not be significant. Special care should be taken when deciding to adopt this strategy!\n",
+    "\n",
+    "Now that we've finished this short tutorial on 3 different classification strategies, we will show how to train a basic encoder classifier for a multi-label task with an example."
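The three strategies map directly onto scikit-learn wrappers. A minimal sketch on invented toy data (not part of the Reddit pipeline; the data and base estimator are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Toy multi-label data: 6 samples, 4 features, 3 binary labels (invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])

base = LogisticRegression(max_iter=1000)

ovr = OneVsRestClassifier(base).fit(X, Y)                 # one binary classifier per label
multi = MultiOutputClassifier(base).fit(X, Y)             # one classifier per target column
chain = ClassifierChain(base, order=[0, 1, 2]).fit(X, Y)  # each link also sees earlier labels

for clf in (ovr, multi, chain):
    print(clf.predict(X).shape)  # each predicts a (6, 3) binary label matrix
```

All three produce a label matrix of the same shape; the difference is only in how (and whether) information is shared between the per-label classifiers.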
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "RAXGujRXdEjE"
+   },
+   "source": [
+    "### Question 2\n",
+    "\n",
+    "Suppose that a hospital is developing a system to automatically classify patient symptoms into multiple diagnostic categories. Each patient might have 0-3 simultaneous conditions from 15 possible diseases. Doctors observe that certain conditions often co-occur (e.g., diabetes and hypertension). Which strategy would BEST leverage these inter-label relationships while maintaining reasonable computational efficiency?\n",
+    "\n",
+    "- A) *OneVsRest*\n",
+    "- B) *MultiOutput*\n",
+    "- C) *ClassifierChain*\n",
+    "- D) None of the above\n",
+    "\n",
+    "*Enter your answer below as a string with one of A, B, C, D, i.e. \"A\"*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "MwZaN59HdEjF"
+   },
+   "outputs": [],
+   "source": [
+    "answer2 = # Your answer here\n",
+    "\n",
+    "Tests.test2(answer2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "m_bi6T_xdEjG"
+   },
+   "source": [
+    "## 3. Training a Multi-Label Classifier on Reddit Data\n",
+    "\n",
+    "Now that we've gone over the various classification strategies, we'll train a new classifier on the labels produced by `cardiffnlp/tweet-topic-base-multilingual`. Specifically, we'll train a BERT model on the synthetic labels we generated earlier for the Reddit dataset. Training on synthetic data (data generated by other ML models) is a valid technique. In fact, [researchers estimate that new human-generated data for training ML models will run out within the next 2 to 8 years, forcing the use of synthetic data for training](https://arxiv.org/pdf/2211.04325). There are, of course, obvious downsides; in a classification context, the biases present in the classifications of one model will be transferred to any model trained on the upstream model's outputs. 
However, for learning purposes this method works well and is cost-effective, and the code below can easily be modified and reused for human-labeled datasets.\n",
+    "\n",
+    "Our workflow will consist of the following:\n",
+    "\n",
+    "1) Convert labels into matrices of binary values.\n",
+    "2) Tokenize our `clean_text` column into tokens for classification.\n",
+    "3) Define and discuss accuracy and recall, and create a function for evaluating our model's predictions.\n",
+    "4) Train and test our model!\n",
+    "\n",
+    "We'll accomplish this primarily with the use of two libraries: `scikit-learn` and `transformers`. `scikit-learn` provides a convenient way of transforming our labels into binary vectors, while the `transformers` library abstracts most of the training and evaluation into simpler code.\n",
+    "\n",
+    "### 3.1 Data-preprocessing\n",
+    "\n",
+    "We'll start by doing some simple data preprocessing and setting up our GPU.\n",
+    "\n",
+    "\n",
+    "You'll probably want a CUDA-compatible GPU for this. Consider Google Colab for smaller datasets.\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "z6-UopDNdEjH",
+    "outputId": "891a9091-6e53-4ce6-f577-37496f17bd35"
+   },
+   "outputs": [],
+   "source": [
+    "os.environ[\"WANDB_DISABLED\"] = \"true\"  # disables the Weights & Biases API key prompt\n",
+    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Best to have a GPU.\n",
+    "model.to(device)\n",
+    "\n",
+    "train_dataset = train_data[['clean_text', 'multi_labels']]\n",
+    "test_dataset = test_data[['clean_text', 'multi_labels']]  # clean_text is our textual input, 'multi_labels' is our list of labels.\n",
+    "print(type(train_data['multi_labels'].iloc[0]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "KlQTOiJudEjI"
+   },
+   "source": [
+    "Notice that the multi-label column is composed of lists of labels. 
Each list corresponds to the set of labels assigned to each Reddit post. While this is intuitive for us to read, it is not so easy for our model to consume. So, we'll need to convert it into a format easily interpretable by a computer: a binary vector. Each value in the vector corresponds to the presence of one of the 23 labels in our dataset. Repeating this for all $n$ rows of the dataset, we end up with a matrix of size $n \\times 23$ (here, $700 \\times 23$). We'll do this conversion using `sklearn`'s [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) class."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "1L9xr1BbkKi-"
+   },
+   "outputs": [],
+   "source": [
+    "# for example, we can convert this list of sets containing labels into a binary matrix.\n",
+    "# each set contains labels corresponding to a column in a hypothetical dataset.\n",
+    "\n",
+    "mlb = MultiLabelBinarizer()\n",
+    "mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "f_m6oQkHkzUd"
+   },
+   "source": [
+    "The first row of the matrix corresponds to the first set of labels, and is interpreted as \"`comedy` is absent, `sci-fi` is present, `thriller` is present.\" 
Likewise, the second row is interpreted as \"`comedy` is present, `sci-fi` is absent, `thriller` is absent.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ikDlv-Gtkbra", + "outputId": "2bfefefd-e452-4cf8-d0bc-9e121f5c4fef" + }, + "outputs": [], + "source": [ + "list(mlb.classes_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hFtbpcB0dEjK", + "outputId": "c112f55d-5ea7-48a3-df96-90023460de6b" + }, + "outputs": [], + "source": [ + "mlb = MultiLabelBinarizer()\n", + "\n", + "train_labels = mlb.fit_transform(train_data['multi_labels']) #converting training data labels to binary matrix\n", + "test_labels = mlb.transform(test_data['multi_labels']) # converting testing data labels to binary matrix\n", + "num_labels = train_labels.shape[1] #.shape[1] gives us the number of entries in each row of the matrix\n", + "print(num_labels)\n", + "print(list(mlb.classes_))\n", + "\n", + "\n", + "train_data = train_data.copy() #creates an independent copy to avoid changing the original dataframe\n", + "test_data = test_data.copy()\n", + "train_data['labels_encoded'] = list(train_labels) # assigns labels to new column\n", + "test_data['labels_encoded'] = list(test_labels)\n", + "# Converting the pandas DataFrames into HuggingFace Datasets, in order to use the trainer.\n", + "train_dataset = Dataset.from_pandas(train_data[['clean_text', 'labels_encoded']])\n", + "test_dataset = Dataset.from_pandas(test_data[['clean_text', 'labels_encoded']])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 196 + }, + "id": "tsyfCYfM4NCG", + "outputId": "40731985-bf87-4f76-a2fc-f3caed9e8120" + }, + "outputs": [], + "source": [ + "print(array(train_dataset['labels_encoded']))\n", + 
"print(array(test_dataset['labels_encoded']))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "P439OQYv4hXb"
+   },
+   "source": [
+    "Next, we need to tokenize our text to pass through the model. We define a custom tokenizing function below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "30565I0rdEjO"
+   },
+   "outputs": [],
+   "source": [
+    "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n",
+    "\n",
+    "def tokenize_function(batch):\n",
+    "    \"\"\"\n",
+    "    Given a batch dictionary, tokenizes text using the global tokenizer with:\n",
+    "    - Truncation: cutting off texts longer than 128 tokens.\n",
+    "    - Padding: ensuring every text is exactly 128 tokens long.\n",
+    "    It then converts each label in \"labels_encoded\" to a float and adds the processed labels\n",
+    "    to the tokenized output under the key \"labels\".\n",
+    "    \"\"\"\n",
+    "    encoding = tokenizer(batch[\"clean_text\"], truncation=True, padding='max_length', max_length=128)  # converts raw text into tokens\n",
+    "    labels = []  # initializing the main empty list of labels\n",
+    "\n",
+    "    for label_list in batch[\"labels_encoded\"]:  # outer loop: iterates over each list of labels in the batch\n",
+    "        converted_labels = []  # initializing an inner list of labels for each example\n",
+    "        for x in label_list:  # iterates over each individual label in the current label list\n",
+    "            converted_labels.append(float(x))  # converts each label to a float and appends it to `converted_labels`\n",
+    "        labels.append(converted_labels)  # appends the resulting list to the main list of labels\n",
+    "\n",
+    "    encoding[\"labels\"] = labels\n",
+    "    return encoding"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "WA_bmlDN_7Gd"
+   },
+   "source": [
+    "Now that our tokenizer function has been written, we can map it onto the training and testing datasets."
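One detail worth calling out: the float conversion exists because HuggingFace's multi-label classification head trains with a binary-cross-entropy loss, which expects float targets rather than integers. A tiny sketch of just that step, with hypothetical label values:

```python
# 0/1 integer label vectors become 0.0/1.0 float vectors,
# as required by BCE-style losses.
batch = {"labels_encoded": [[1, 0, 1], [0, 0, 1]]}

labels = [[float(x) for x in row] for row in batch["labels_encoded"]]
print(labels)  # [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
```

The nested loop in `tokenize_function` does the same thing, one label list per example in the batch.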
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81, + "referenced_widgets": [ + "2d4f64953c8a48948da7357104fccc84", + "141e8ddafd0f4d45b3ec18a79f856c09", + "4110084eb8eb4597905a8d8ae6646e98", + "569edb26e9b94220a26ffa8b145c7f2c", + "8f517e1061744e4e8ffdcc0c3f600b92", + "614b423d37144268919d8acd3a3837a4", + "22b4074c6ebf4536b8a0695f7263e1af", + "a4d54f0de5e84d1b959b9deeb6a1c76d", + "cd40bb71897847ca8319eb8266b76105", + "cd1770f47899474d9cc2d13d4b077008", + "83c6475f03be4806b0b99f36b9c38540", + "0b8bd669ceb3413bb1fde381ea0e1825", + "825a496aaced4a3f9a75fa4ec573dd34", + "ace161e28a294d5c86307ab4a9d31e34", + "4a161d6667d34518b51910160a542ace", + "fe7494966ec7437baf3b2eaf82a325e4", + "0a24b679037242309af41bf77e1dcd92", + "86749ecfd3cf4133ab075582168c2366", + "866647207caf4eeabfad30088758c5fd", + "ae26ce55afc44c9584ff567f350c7ba7", + "da6b01047dc6407aa4222c331012aa8d", + "bb06655a789e424e8cc8c7cd3296f423" + ] + }, + "id": "G8iTy8V5dEjQ", + "outputId": "1e3b09d0-fdcb-45d4-9fb8-1382c28dac42" + }, + "outputs": [], + "source": [ + "train_dataset = train_dataset.map(tokenize_function, batched=True)\n", + "test_dataset = test_dataset.map(tokenize_function, batched=True)\n", + "\n", + "train_dataset = train_dataset.remove_columns([\"clean_text\", \"labels_encoded\"]) #removing the old columns\n", + "test_dataset = test_dataset.remove_columns([\"clean_text\", \"labels_encoded\"])\n", + "\n", + "train_dataset.set_format(\"torch\") # changes the output format of the dataset so that each sample is returned as PyTorch tensors\n", + "test_dataset.set_format(\"torch\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9HxiG1eFBPdL" + }, + "source": [ + "Note that we haven't actually done any training yet; All we've done is preprocessed our data to a format that is legible by a model. 
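\n", + "\n", + "As a quick illustration of that fixed-length format, here is a self-contained toy sketch of what truncation and padding do (a stand-in for illustration only, not BERT's real WordPiece tokenizer):\n", + "\n", + "```python\n", + "def toy_tokenize(text, max_length=8, pad_id=0):\n", + "    # Illustrative stand-in for the real tokenizer: split on whitespace,\n", + "    # map each word to a fake id, then truncate/pad to exactly max_length.\n", + "    ids = [hash(w) % 1000 + 1 for w in text.split()][:max_length]\n", + "    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))  # 1 = real token, 0 = padding\n", + "    ids = ids + [pad_id] * (max_length - len(ids))\n", + "    return {'input_ids': ids, 'attention_mask': attention_mask}\n", + "\n", + "enc = toy_tokenize('the midterm was brutal')\n", + "print(len(enc['input_ids']))   # 8: every text becomes exactly max_length ids\n", + "print(enc['attention_mask'])   # [1, 1, 1, 1, 0, 0, 0, 0]\n", + "```\n", + "\n", + "The real `tokenize_function` above does the same job with `max_length=128` and genuine subword vocabulary ids.\n", + "\n", + "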
We can now go ahead and create an instance of a pretrained BERT model from HuggingFace by specifying the base model and the problem type (multi-label classification)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7V71-KlNdEjR", + "outputId": "205bac6d-ea80-4156-e5ed-8d8bdd4fe84b" + }, + "outputs": [], + "source": [ + "model = BertForSequenceClassification.from_pretrained(\n", + " 'bert-base-uncased',\n", + " num_labels=num_labels,\n", + " problem_type=\"multi_label_classification\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EN1_0g1LH_nk" + }, + "source": [ + "It's one thing to train a model, but how do we go about actually evaluating its performance? We need a set of concrete metrics. Before diving into writing an evaluation function, let's define some new terminology:\n", + "\n", + "**Accuracy** is defined as the total number of correct predictions over the total number of predictions.\n", + "\n", + "$$\\text{Accuracy} = \\frac{\\text{Total \\# of Correct Predictions}}{\\text{Total \\# of Predictions}}$$\n", + "\n", + "This metric is simple and intuitive; however, as we will see, in scenarios where one class dominates, a high accuracy can be misleading because the model might simply be predicting the majority class most of the time.\n", + "\n", + "**Recall** is the proportion of actual positives (binary label equals 1) that were correctly identified.\n", + "\n", + "$$\\text{Recall} = \\frac{\\text{True Positive}}{\\text{True Positive + False Negative}}$$\n", + "\n", + "Recall is prioritized in situations where missing an actual positive (a false negative) is extremely costly: better to flag a few extra candidates than to overlook a genuine case. Conversely, when false positives are the costly error (for instance, a model flagging students' assignments as AI-generated, where a false accusation would have lasting impacts on innocent students), we would instead prioritize **precision**, the proportion of predicted positives that are actually positive.\n", + "\n", + "We'll be evaluating both recall and accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DqKPxf6-dEjS" + }, + "outputs": [], + "source": [ + "def compute_metrics(p):\n", + " \"\"\"\n", + " Given a tuple containing an array of logits and an array of labels, compute accuracy and recall by:\n", + " 1) Applying a sigmoid activation function to the logits, to convert them into probabilities\n", + " 2) Converting the resulting probabilities into binary predictions\n", + " 3) Calculating the resulting accuracy and recall\n", + " \"\"\"\n", + " logits, labels = p\n", + " probs = 1 / (1 + np.exp(-logits)) # sigmoid activation function\n", + " preds = (probs >= 0.5).astype(int) # threshold at 0.5 to get binary predictions\n", + " binary_accuracy = np.mean(preds == labels) # binary accuracy\n", + " true_positives = np.logical_and(preds == 1, labels == 1).sum()\n", + " total_actual = (labels == 1).sum()\n", + " micro_recall = true_positives / total_actual if total_actual > 0 else 0\n", + "\n", + " return {\"binary_accuracy\": binary_accuracy, \"micro_recall\": micro_recall}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ln7fnGMgMlMR" + }, + "source": [ + "Furthermore, we need to specify the training arguments and the trainer. 
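\n", + "\n", + "Before that, to make the metrics concrete, here is a self-contained sketch of the arithmetic inside `compute_metrics`, using made-up logits rather than real model output:\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "# Toy logits for 2 texts x 3 labels (made-up numbers, not model output)\n", + "logits = np.array([[ 2.0, -1.0,  0.5],\n", + "                   [-3.0,  1.5, -0.5]])\n", + "labels = np.array([[1, 0, 0],\n", + "                   [0, 1, 1]])\n", + "\n", + "probs = 1 / (1 + np.exp(-logits))   # sigmoid squashes each logit into (0, 1)\n", + "preds = (probs >= 0.5).astype(int)  # threshold each probability independently\n", + "print(preds)                        # [[1 0 1] [0 1 0]]\n", + "print(np.mean(preds == labels))     # binary accuracy: 4 of 6 label slots correct\n", + "tp = np.logical_and(preds == 1, labels == 1).sum()\n", + "print(tp / (labels == 1).sum())     # micro recall: 2 of 3 actual positives found\n", + "```\n", + "\n", + "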
We do this using HuggingFace's `Trainer` and `TrainingArguments` classes, which are handy APIs for easily training a model while abstracting away all of the complicated details.\n", + "\n", + "The details here aren't important; you can read more about what each argument does [here](https://huggingface.co/docs/transformers/en/main_classes/trainer) and [here](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/trainer#transformers.TrainingArguments)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SdXWETCFdEjS", + "outputId": "e4eec329-97e8-4033-b27b-1cd4abea5ee8" + }, + "outputs": [], + "source": [ + "training_args = TrainingArguments(\n", + " output_dir=\"./results\",\n", + " evaluation_strategy=\"epoch\",\n", + " num_train_epochs=3,\n", + " per_device_train_batch_size=3,\n", + " per_device_eval_batch_size=3,\n", + " learning_rate=2e-5,\n", + " weight_decay=0.01,\n", + " logging_steps=10,\n", + " load_best_model_at_end=True,\n", + " metric_for_best_model=\"binary_accuracy\",\n", + " save_total_limit=2,\n", + " save_strategy=\"epoch\"\n", + ")\n", + "\n", + "\n", + "trainer = Trainer(\n", + " model=model,\n", + " args=training_args,\n", + " train_dataset=train_dataset,\n", + " eval_dataset=test_dataset,\n", + " compute_metrics=compute_metrics,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_rX6-RRoNYuT" + }, + "source": [ + "Lastly, the easiest part is actually training the model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 227 + }, + "id": "gTXARKxgdEjX", + "outputId": "630272be-dda4-4f19-e47b-50710b8f2b0d" + }, + "outputs": [], + "source": [ + "# This call will run the full training loop\n", + "trainer.train()\n", + "\n", + "results = trainer.evaluate()\n", + "print(results)" + ] + }, + { + "cell_type": "markdown", 
"metadata": { + "id": "wzZzPP3cOeQ-" + }, + "source": [ + "Wow! What a high accuracy, after being trained on just 700 examples! The flaw here is obvious: we are training a BERT classifier on synthetic data generated by a BERT-based model. Furthermore, as we mentioned earlier, our labels are not evenly distributed across the training and evaluation data. Many labels appear in almost every row, while some appear fewer than 10 times.\n", + "\n", + "\n", + "## Test your understanding\n", + "\n", + "1) (easy) Suppose you are given a list of sets like: `[{1,2},{1,2,3},{2,3},{3,4}]`. What will be the resulting binary matrix after running `mlb.fit_transform()`?\n", + "\n", + "<details>\n", + "<summary> Hide/Show Solution</summary>\n", + "\n", + "$$\n", + "\\begin{bmatrix}\n", + "1 & 1 & 0 & 0 \\\\\n", + "1 & 1 & 1 & 0 \\\\\n", + "0 & 1 & 1 & 0 \\\\\n", + "0 & 0 & 1 & 1\n", + "\\end{bmatrix}\n", + "$$\n", + "\n", + "</details>\n", + "\n", + "2) (easy) Another method of calculating accuracy is **total accuracy**, which is the number of rows where every label is correctly classified over the total number of rows. For instance, if a model outputs `[0,0,1]` as its classification for a given text and the correct answer is `[1,0,1]`, the binary accuracy would be 2/3, while the total accuracy would be 0.\n", + "\n", + "What would you expect to happen to our accuracy score if we used total accuracy as our accuracy metric instead of binary accuracy?\n", + "\n", + "<details>\n", + "<summary> Hide/Show Solution</summary>\n", + "<p> Our accuracy would be much lower, because we would now be checking whether each row of the predicted matrix of binary labels is identical to the corresponding row of the actual matrix, instead of checking whether each individual value in the predicted matrix matches the corresponding value in the actual matrix.</p>\n", + "</details>
\n", + "\n", + "3) (Hard) Head over to kaggle and check out the [jigsaw classification challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data). Note that you'll need an [api key](https://www.kaggle.com/docs/api) to download the dataset below. Using the classification code above, can you train a BERT model on the dataset?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zOK1Axh8X2N7", + "outputId": "9e590a7c-cda7-4435-d283-0e43cf6339b0" + }, + "outputs": [], + "source": [ + "!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge\n", + "\n", + "# you might need to do !pip install kaggle first!\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "00e6fe2c23204fc6adae3fe8ee241c20": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_dbfd3f0cf30546efad85646f6c594291", + "placeholder": "​", + "style": "IPY_MODEL_7d81208cf6f44e9e92c7c027ad3d9820", + "value": " 1.18k/1.18k [00:00<00:00, 149kB/s]" + } + }, + 
"0699b4d1645246e19d27ecb3597bf5e6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "07da26df269c466ca4b6126cf55d275f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_a06c1b2fcb4b46f7adf6e3fca132cfe9", + "IPY_MODEL_8d9ff3d8226e46f98c4b98ef22632ed4", + "IPY_MODEL_620abe669ad44aff95df37a314b1c42a" + ], + "layout": "IPY_MODEL_3fc62e98fc9949ac91a65874c71ca60f" + } + }, + "0a24b679037242309af41bf77e1dcd92": { + "model_module": 
"@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "0b8bd669ceb3413bb1fde381ea0e1825": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_825a496aaced4a3f9a75fa4ec573dd34", + "IPY_MODEL_ace161e28a294d5c86307ab4a9d31e34", + "IPY_MODEL_4a161d6667d34518b51910160a542ace" + ], + "layout": "IPY_MODEL_fe7494966ec7437baf3b2eaf82a325e4" + } + }, + "0f80eab8aa3c482ab1102c6d287dcf17": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + 
"model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_e441df716d72449eaa097167885b5b53", + "IPY_MODEL_a23e47196e5c47459a08bca806417067", + "IPY_MODEL_1126401817004cb0bb6c71209a9493f2" + ], + "layout": "IPY_MODEL_8802353acc1a4fb187f995aab868c520" + } + }, + "10d0a68fc55b4e49bcb6d9846688d41d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1126401817004cb0bb6c71209a9493f2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + 
"_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7218c9d168f346f7824e177f8a0ed9a7", + "placeholder": "​", + "style": "IPY_MODEL_d74a6517e8334f47a0f7c3039a2cf57c", + "value": " 964/964 [00:00<00:00, 89.8kB/s]" + } + }, + "1292527b6bbb47e8bbbf3696214fb6d8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "13ce2c9f579f4b5aa6d4ca0c01be4423": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": 
"LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "141e8ddafd0f4d45b3ec18a79f856c09": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_614b423d37144268919d8acd3a3837a4", + "placeholder": "​", + "style": "IPY_MODEL_22b4074c6ebf4536b8a0695f7263e1af", + "value": "Map: 100%" + } + }, + "1848061e49b540908833ca2d4241c5d3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_64efb2ce9d5245d29e3e1e0016652a5f", + "max": 17082756, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_9a79b6d7d6654062b0c8de281d9811e2", + "value": 17082756 + } + }, + "1ca4b36939a04d9cb7c0ed645433cfb2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "1d60ea3aa6834deaa52b7b1cd4cd606b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "22b4074c6ebf4536b8a0695f7263e1af": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "22c47742278543c19739f00b88d9618f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2cbfeb95e309491ba7acb88d21f671e9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2d4f64953c8a48948da7357104fccc84": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_141e8ddafd0f4d45b3ec18a79f856c09", + "IPY_MODEL_4110084eb8eb4597905a8d8ae6646e98", + "IPY_MODEL_569edb26e9b94220a26ffa8b145c7f2c" + ], + "layout": "IPY_MODEL_8f517e1061744e4e8ffdcc0c3f600b92" + } + }, + "318bc0fc6c8047ffb3c248e1a53891a3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "351429314b604e048dd8f1fee3a786b6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": 
"1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3ae8becb4e774557aa6aeb9ead040202": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_0699b4d1645246e19d27ecb3597bf5e6", + "placeholder": "​", + "style": "IPY_MODEL_2cbfeb95e309491ba7acb88d21f671e9", + "value": "tokenizer_config.json: 100%" + } + }, + "3b4ab23581a044d9b652f921f70ad3ea": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": 
[], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8ac0e1298a17492fa69a667441dbbb51", + "max": 1175, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_9d8e68792cb04edca4bd60961cefa16f", + "value": 1175 + } + }, + "3d330197949b4df28b7fe9e0e0500308": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_5f05ea88a9d847c8b4135c2b5fef1122", + "IPY_MODEL_d142c16f55ba40b4971cfa4bd39302e3", + "IPY_MODEL_9e76ac51592847ad8d8d8ae532a33739" + ], + "layout": "IPY_MODEL_b3fdd531164641e1921c30fd5774467c" + } + }, + "3e7849275f78471d908ac2a107654b15": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "3fc62e98fc9949ac91a65874c71ca60f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": 
null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4110084eb8eb4597905a8d8ae6646e98": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a4d54f0de5e84d1b959b9deeb6a1c76d", + "max": 790, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_cd40bb71897847ca8319eb8266b76105", + "value": 790 + } + }, + "4475ab1ef21f40c781029cb07331aef2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": 
null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_351429314b604e048dd8f1fee3a786b6", + "placeholder": "​", + "style": "IPY_MODEL_22c47742278543c19739f00b88d9618f", + "value": "tokenizer.json: 100%" + } + }, + "45050ac6f75e4933a1a288f9c18741ff": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "4a161d6667d34518b51910160a542ace": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_da6b01047dc6407aa4222c331012aa8d", + "placeholder": "​", + "style": "IPY_MODEL_bb06655a789e424e8cc8c7cd3296f423", + "value": " 198/198 [00:01<00:00, 153.11 examples/s]" + } + }, + "4c5837e6a5a847a2a1a3528c8d350c95": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + 
"border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4d48534b7330466db8ee94faba52d6b8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": 
null + } + }, + "569edb26e9b94220a26ffa8b145c7f2c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_cd1770f47899474d9cc2d13d4b077008", + "placeholder": "​", + "style": "IPY_MODEL_83c6475f03be4806b0b99f36b9c38540", + "value": " 790/790 [00:06<00:00, 123.05 examples/s]" + } + }, + "59eb47688845405585a8f9ab532081b1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5c708772344045a1be691188eb57fb66": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "5f05ea88a9d847c8b4135c2b5fef1122": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c97b5ea8b5b94f30976a2cf75751d149", + "placeholder": "​", + "style": "IPY_MODEL_6b40d0fc9bb84f71816d3c7aec0ba01e", + "value": "sentencepiece.bpe.model: 100%" + } + }, + "5f6888c8b7b0463e8377b23afce3d382": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "60d1fffa2d8c4df0bccaa48d3d5f6ac2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", 
+ "box_style": "", + "children": [ + "IPY_MODEL_3ae8becb4e774557aa6aeb9ead040202", + "IPY_MODEL_3b4ab23581a044d9b652f921f70ad3ea", + "IPY_MODEL_00e6fe2c23204fc6adae3fe8ee241c20" + ], + "layout": "IPY_MODEL_8cd62148f3a443ce86f46534c9e4b023" + } + }, + "614b423d37144268919d8acd3a3837a4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "620abe669ad44aff95df37a314b1c42a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_59eb47688845405585a8f9ab532081b1", + "placeholder": "​", + "style": "IPY_MODEL_c7a03e6fdda547ed8f6a861f6e86682f", + "value": " 1.97k/1.97k [00:00<00:00, 154kB/s]" + } + }, + "64efb2ce9d5245d29e3e1e0016652a5f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "69eb254ef8d6474f8c46d4a727375119": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a9bae5c716bd44ea968bec85f28c9a83", + "placeholder": "​", + 
"style": "IPY_MODEL_8b38c7006abf49cb826cb57e22a893ac", + "value": " 17.1M/17.1M [00:00<00:00, 99.6MB/s]" + } + }, + "6b2259ecb37f4d04a5b6ed9d3e211e2d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6b40d0fc9bb84f71816d3c7aec0ba01e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "7218c9d168f346f7824e177f8a0ed9a7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + 
"_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "777eaa8f8e8647e7b051b798b30feb63": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_83aeb27ef845444d9f0277d4a03875eb", + "placeholder": "​", + "style": "IPY_MODEL_5c708772344045a1be691188eb57fb66", + "value": " 1.11G/1.11G [00:05<00:00, 214MB/s]" + } + }, + "7d81208cf6f44e9e92c7c027ad3d9820": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + 
"_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "80afd851d6784dedb9f63926bdbba5ce": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "825a496aaced4a3f9a75fa4ec573dd34": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_0a24b679037242309af41bf77e1dcd92", + "placeholder": "​", + "style": "IPY_MODEL_86749ecfd3cf4133ab075582168c2366", + "value": "Map: 100%" + } + }, + "83aeb27ef845444d9f0277d4a03875eb": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "83c6475f03be4806b0b99f36b9c38540": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "866647207caf4eeabfad30088758c5fd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + 
"model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "86749ecfd3cf4133ab075582168c2366": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "8802353acc1a4fb187f995aab868c520": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + 
"align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8ac0e1298a17492fa69a667441dbbb51": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": 
null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8b38c7006abf49cb826cb57e22a893ac": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "8cd62148f3a443ce86f46534c9e4b023": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8d9ff3d8226e46f98c4b98ef22632ed4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + 
"_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_10d0a68fc55b4e49bcb6d9846688d41d", + "max": 1965, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_1ca4b36939a04d9cb7c0ed645433cfb2", + "value": 1965 + } + }, + "8f517e1061744e4e8ffdcc0c3f600b92": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9a79b6d7d6654062b0c8de281d9811e2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", 
+ "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "9ac2a923cd064d3aa66fe39aa70847d5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9d8e68792cb04edca4bd60961cefa16f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + 
"9e76ac51592847ad8d8d8ae532a33739": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4d48534b7330466db8ee94faba52d6b8", + "placeholder": "​", + "style": "IPY_MODEL_318bc0fc6c8047ffb3c248e1a53891a3", + "value": " 5.07M/5.07M [00:00<00:00, 78.5MB/s]" + } + }, + "a06c1b2fcb4b46f7adf6e3fca132cfe9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_6b2259ecb37f4d04a5b6ed9d3e211e2d", + "placeholder": "​", + "style": "IPY_MODEL_5f6888c8b7b0463e8377b23afce3d382", + "value": "config.json: 100%" + } + }, + "a23e47196e5c47459a08bca806417067": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_13ce2c9f579f4b5aa6d4ca0c01be4423", + "max": 964, + "min": 0, + "orientation": "horizontal", + "style": 
"IPY_MODEL_3e7849275f78471d908ac2a107654b15", + "value": 964 + } + }, + "a4d54f0de5e84d1b959b9deeb6a1c76d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a9bae5c716bd44ea968bec85f28c9a83": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": 
null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ace161e28a294d5c86307ab4a9d31e34": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_866647207caf4eeabfad30088758c5fd", + "max": 198, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_ae26ce55afc44c9584ff567f350c7ba7", + "value": 198 + } + }, + "ae26ce55afc44c9584ff567f350c7ba7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "ae39a79497cc443aa65441a47d43a6bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_4475ab1ef21f40c781029cb07331aef2", + "IPY_MODEL_1848061e49b540908833ca2d4241c5d3", + "IPY_MODEL_69eb254ef8d6474f8c46d4a727375119" + ], + "layout": "IPY_MODEL_afd72a3d2a064d88bc2094cf247af7e3" + } + }, + "afd72a3d2a064d88bc2094cf247af7e3": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b11f88b0f5c641528ca26b7f476f7077": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + 
"_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_cb2afa95f28f47f182c9744ef0d09454", + "IPY_MODEL_cea7534319b7415f84d0b2b6088a2257", + "IPY_MODEL_777eaa8f8e8647e7b051b798b30feb63" + ], + "layout": "IPY_MODEL_80afd851d6784dedb9f63926bdbba5ce" + } + }, + "b3fdd531164641e1921c30fd5774467c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "bb06655a789e424e8cc8c7cd3296f423": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c013b77cfb8449f2b5b37dc2b61a101f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c7a03e6fdda547ed8f6a861f6e86682f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c97b5ea8b5b94f30976a2cf75751d149": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + 
"max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "cb2afa95f28f47f182c9744ef0d09454": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9ac2a923cd064d3aa66fe39aa70847d5", + "placeholder": "​", + "style": "IPY_MODEL_c013b77cfb8449f2b5b37dc2b61a101f", + "value": "model.safetensors: 100%" + } + }, + "cd1770f47899474d9cc2d13d4b077008": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + 
"object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "cd40bb71897847ca8319eb8266b76105": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "cea7534319b7415f84d0b2b6088a2257": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4c5837e6a5a847a2a1a3528c8d350c95", + "max": 1112257300, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_d9315c53637c4286ab98c6f95e6a6974", + "value": 1112257300 + } + }, + "d142c16f55ba40b4971cfa4bd39302e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_efb43112d53e4e36bc44d33bcd8d28c4", + "max": 5069051, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_45050ac6f75e4933a1a288f9c18741ff", + "value": 5069051 + } + }, + "d74a6517e8334f47a0f7c3039a2cf57c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "d9315c53637c4286ab98c6f95e6a6974": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "da6b01047dc6407aa4222c331012aa8d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + 
"justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "dbfd3f0cf30546efad85646f6c594291": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e441df716d72449eaa097167885b5b53": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_1292527b6bbb47e8bbbf3696214fb6d8", + "placeholder": "​", + "style": "IPY_MODEL_1d60ea3aa6834deaa52b7b1cd4cd606b", + "value": "special_tokens_map.json: 100%" + } + }, + "efb43112d53e4e36bc44d33bcd8d28c4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fe7494966ec7437baf3b2eaf82a325e4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + 
"align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.qmd b/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.qmd new file mode 100644 index 00000000..88f64a88 --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/Encoder_Classifier.qmd @@ -0,0 +1,540 @@ +--- +title: Encoder-only Architecture for Text Classification +author: Irene Berezin, Kaiyan Zhang +date: 10 March 2025 +format: + html: default + ipynb: + jupyter: + kernelspec: + display_name: "Python 3 (ipykernel)" + language: python3 + name: python3 +--- + + +Recall that the chief reason for using machine learning models over traditional, lexicon-based sentiment analysis is that like lexicon-based approaches to text classifcation rely on fixed dictionaries, which results in them missing context, irony, or slang. ML models, especially deep learning models, learn complex patterns from data, making them particularly useful for more modern text, such as social media posts and online discussions. 
+ + +Generally, for text classification tasks, an encoder-only model like BERT is the better fit: we don't need to worry about generating some kind of output sequence; all we are doing is classifying text with pre-defined labels. In other words, our focus is on *understanding* text rather than *generating* text, so encoder-only models like BERT are both faster and less computationally expensive. + + +```{python} +from advanced_encoder_classifier_tests import Tests +``` + +## 0. Data Importing and Cleaning + +As mentioned, ML-based text classification works best for nuanced text with lots of slang. Hence, we'll test it on a dataset of recently scraped *r/UBC* posts. + +```bash +!pip install transformers # you likely don't have this installed. +!pip install torch # the PyPI package for PyTorch is "torch", not "pytorch". +!pip install datasets # you likely don't have this installed. +!pip install evaluate # you likely don't have this installed. +!pip install sentencepiece # you likely don't have this installed. +!pip install accelerate # you likely don't have this installed.
+!pip install numpy +!pip install pandas +!pip install seaborn +``` + +```{python} +from transformers import ( + AutoTokenizer, + AutoModelForMaskedLM, + AutoModelForSequenceClassification, + BertTokenizer, + BertForSequenceClassification, + Trainer, + TrainingArguments, + pipeline, +) +import torch +import re +import pandas as pd +import matplotlib.pyplot as plt +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import MultiLabelBinarizer +import numpy as np +from numpy import array +from datasets import Dataset +from scipy.special import expit +import os +import ast +``` + +```{python} +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") # to break our reddit posts down into tokens +model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased") # our model +``` + +The reddit dataset is dated around late February 2025; however, you are free to scrape newer posts locally by `cd`-ing into the repository root and running `python project/docs/4_Advanced/advanced_encoder_classifier/r_ubc_scraper.py` in your terminal. This will generate a fresh set of post/comment datasets for our analysis.
+ +```{python} +# Load the dataset +reddit_comments = pd.read_csv('Data/r_ubc_comments.csv') +reddit_posts = pd.read_csv("Data/r_ubc_posts.csv") +``` + +```{python} +# Merging datasets: concatenate each post's comments, then join them onto the posts. +# .reset_index() turns the grouped Series back into a frame with a "post_id" column, +# which is what right_on="post_id" expects. +reddit_data = reddit_posts.merge(reddit_comments.groupby("post_id")["body"].apply(lambda x: " ".join(x)).reset_index(), left_on="id", right_on="post_id", how="left") +reddit_data["full_text"] = reddit_data['title'] + ' ' + reddit_data["selftext"].fillna('') + ' ' + reddit_data["body"].fillna('') +``` + +```{python} +# Convert from UTC to detailed month, date, day of week and hour +reddit_data['created_utc'] = pd.to_datetime( + reddit_data['created_utc'], + unit='s', # Timestamp is in seconds + utc=True # Enforce UTC timezone +) + +reddit_data['month'] = reddit_data['created_utc'].dt.month_name() +reddit_data['date'] = reddit_data['created_utc'].dt.day +reddit_data['day_of_week'] = reddit_data['created_utc'].dt.day_name() +reddit_data['hour'] = reddit_data['created_utc'].dt.hour + +def clean_text(text): + if isinstance(text, str): + text = text.lower() + text = re.sub(r"http\S+", "", text) + text = re.sub(r"[^a-zA-Z0-9\s]", "", text) + return text.strip() + else: + return "" + +reddit_data["clean_text"] = reddit_data["full_text"].apply(clean_text) +``` + +## 1. Labelling using a pre-trained roBERTa model + +Here we use a pre-trained model from Hugging Face to generate a sentiment label for each post. Specifically, we'll be using `cardiffnlp/twitter-roberta-base-sentiment-latest`, a model pre-trained on English tweets. It classifies sentiment into three categories: 0 -> Negative; 1 -> Neutral; 2 -> Positive. + +If you want to work on datasets containing multiple languages, you can use `google-bert/bert-base-multilingual-uncased`, a model pretrained on the top 102 languages with the largest Wikipedias using a masked language modeling (MLM) objective. You can find more information about it on its Hugging Face model card.
+ +```{python} +emotion_classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest", tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest") + +reddit_data["sentiment"] = reddit_data["clean_text"].apply(lambda x: emotion_classifier(x[:512])[0]["label"] if pd.notna(x) else "neutral") +``` + +Here we use another pre-trained model to generate topic labels for each post. Specifically, we are using `cardiffnlp/tweet-topic-base-multilingual`, a model pre-trained on ~198M multilingual tweets and fine-tuned for English, Spanish, Japanese, and Greek. You can find information about its labels on its model card. + +Unfortunately, as of February 2025 there is still no "good" multilingual topic-classification model for some languages, but you can usually find something that fits your needs on the Hugging Face Hub. + +```{python} + +MODEL = "cardiffnlp/tweet-topic-base-multilingual" +tokenizer = AutoTokenizer.from_pretrained(MODEL) +model = AutoModelForSequenceClassification.from_pretrained(MODEL) + +class_mapping = model.config.id2label + +def classify_text(text): + tokens = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512) + output = model(**tokens) + + scores = output[0][0].detach().numpy() + scores = expit(scores) + + predictions = (scores >= 0.5) * 1 + + # Get predicted labels + predicted_labels = [class_mapping[i] for i in range(len(predictions)) if predictions[i]] + + return ", ".join(predicted_labels) if predicted_labels else "other" + +reddit_data["topic"] = reddit_data["clean_text"].apply(classify_text) +``` + +```{python} +def clean_topics(topic_list): + if isinstance(topic_list, list): + return [topic.strip() for topic in topic_list] + return topic_list + +reddit_data["topics"] = reddit_data["topic"].apply(lambda x: x.split(",")) + +reddit_data["topics"] = reddit_data["topics"].apply(clean_topics) +``` + +```{python} +reddit_data["multi_labels"] = reddit_data.apply(lambda row:
row["topics"] + [row["sentiment"]], axis=1) + +reddit_data["multi_labels"] = reddit_data["multi_labels"].apply(lambda x: eval(x) if isinstance(x, str) else x) + +reddit_data.to_csv('Data/r_reddit_data.csv', index=False) +``` + +Since we saved the dataframe as a csv, we are now able to begin from this step by reading the csv file directly. + +```{python} +# Split training and testing data +# You can start from this step if you don't want to re-generate emotion and topic labels +reddit_data = pd.read_csv("Data/r_reddit_data.csv") +reddit_data["multi_labels"] = reddit_data["multi_labels"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x) + +train_indices, test_indices = train_test_split(reddit_data.index, test_size=0.2, random_state=114514) # We set seed to ensure reproducible results + +train_data = reddit_data.loc[train_indices].reset_index(drop=True) +test_data = reddit_data.loc[test_indices].reset_index(drop=True) + +print(train_data) +``` + +We can visualize the share of different labels with a pie chart. 
+ +```{python} +# Print the distribution table of labels + +# ast.literal_eval is safer than eval for parsing list-valued strings from CSV. +train_data["topics"] = train_data["topics"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x) +test_data["topics"] = test_data["topics"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x) + +train_data["topics"] = train_data["topics"].apply(clean_topics) +test_data["topics"] = test_data["topics"].apply(clean_topics) + +train_exploded = train_data.explode("topics") +test_exploded = test_data.explode("topics") + +train_counts = train_exploded["topics"].value_counts() +test_counts = test_exploded["topics"].value_counts() + +distribution_df = pd.DataFrame({ + "train_count": train_counts, + "test_count": test_counts +}).fillna(0).astype(int) + +distribution_df["train %"] = (distribution_df["train_count"]/distribution_df["train_count"].sum() * 100).round(2) +distribution_df["test %"] = (distribution_df["test_count"]/distribution_df["test_count"].sum() * 100).round(2) + +distribution_df.sort_values(by="train_count", ascending=False, inplace=True) + +display(distribution_df) +``` + +```{python} +# Plot the Pie Chart +train_emotion_counts = pd.Series(train_data["sentiment"]).value_counts() +test_emotion_counts = pd.Series(test_data["sentiment"]).value_counts() + +plt.figure(figsize=(12,8)) +plt.subplot(1,2,1) +train_emotion_counts.plot.pie(autopct='%1.1f%%', startangle=140, title="Train Set Labels (Emotion)") +plt.ylabel('') + +plt.subplot(1,2,2) +test_emotion_counts.plot.pie(autopct='%1.1f%%', startangle=140, title="Test Set Labels (Emotion)") +plt.ylabel('') + +plt.show() + +plt.figure(figsize=(12, 9)) +distribution_df.plot.bar(y = "train_count", title="Train Set Labels (Topic)") +plt.ylabel('') + +plt.show() + +plt.figure(figsize=(12, 9)) +distribution_df.plot.bar(y = "test_count", title="Test Set Labels (Topic)") +plt.ylabel('') + +plt.show() +``` + +Then we can preprocess our data by binarizing the unique labels and tokenizing our text column for classification.
+ +### Question 1 + +Out of the options below, which one is not a strength of using an encoder-only model for classification? + +- A) It is relatively cheaper to train. +- B) It can generate labels itself even if no pre-defined label is attached to training texts. +- C) It can overcome missing context, irony or slang. +- D) It works well with multi-label tasks. +- E) None of the above. + +*Enter your answer below as a string with one of A, B, C, D, E, i.e. "A"* + +```{python} +answer1 = # Your answer here + +Tests.test1(answer1) +``` + +## 2. Training Classifiers + +### 1. OneVsRestClassifier + +This is our first classifying strategy, which consists of fitting one binary classifier per class. + +For example, if you want to classify 3 types of fruit -- apple, banana and orange -- with this strategy, you will need to train 3 binary classifiers. The first of these classifiers determines whether the fruit is an apple, the second whether it is a banana, and the third whether it is an orange. + +The 3 classifiers then vote on each new observation, with the voting result expressed as probabilities. Let's say we have a new observation whose fruit type we don't know, so we feed it to our classifiers. The "Apple" classifier claims it's 80% likely an apple, the "Banana" classifier claims it's 10% likely a banana, and the "Orange" classifier claims it's 50% likely an orange. We then take the highest probability and classify the new object as an apple. + +While it is simple and effective, this strategy has its drawbacks -- what if there is a large number of labels to classify? What if the labels have a specific hierarchy? In the first situation, it may take a lot of time to train that many classifiers; in the second, independent binary classifiers can make a lot of mistakes. In those cases, we are better off with alternative strategies.
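As a rough sketch (not part of the Reddit pipeline), the fruit example above might look like the following with scikit-learn's `OneVsRestClassifier`; the feature values and class names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: two invented features (say, redness and roundness) per fruit.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.2], [0.5, 0.5], [0.6, 0.4]])
y = np.array(["apple", "apple", "banana", "banana", "orange", "orange"])

# One binary logistic regression is fit per class; prediction returns the
# class whose binary classifier reports the highest probability.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(clf.classes_)
print(clf.predict_proba([[0.85, 0.85]]))  # one probability per class
print(clf.predict([[0.85, 0.85]]))
```

Under the hood, `OneVsRestClassifier` fits `len(clf.classes_)` independent binary problems, which is exactly the "one classifier per fruit" setup described above.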
+ +Anyhow, this is a good way to begin encoder classification. + +### 2. MultiOutputClassifier + +This strategy consists of fitting one classifier per target, and is a simple way of extending classifiers that do not natively support multi-target classification. + +It is somewhat similar to the *OneVsRest* strategy we just described, but takes a different approach. Let's go back to the fruit classification example, where in the previous setup we trained 3 different classifiers for 3 different types of fruit. Now, suppose we instead determine the type of fruit by predicting multiple attributes shared by these fruits (e.g., color, smoothness, and flavor). The *MultiOutput* strategy would then train one classifier specializing in each of these target variables. + +You may have noticed that this overcomes the shortcomings of the *OneVsRest* strategy to a certain extent, especially when there are more labels than target variables. However, when we have more target variables than labels, this strategy is less efficient than the *OneVsRest* strategy. Therefore, we need to be cautious when choosing our strategy. + +### 3. ClassifierChain + +We finally come to our last strategy -- training a multi-label model that arranges binary classifiers into a chain. + +I'll stick with the fruit categorization example (because it's simple and straightforward), and assume that we've used a *MultiOutput* strategy, which gives us 3 classifiers that each handle a different attribute of the fruit. How can we make them work more efficiently? We want them to work in tandem, rather than individually! A *ClassifierChain* treats the prediction of each classifier as an additional feature for the next one in the chain, improving the quality of the classification by sharing common knowledge and "passing on" evidence along the chain. + +While this strategy sounds smarter than the first two, it has its drawbacks. First, it is more prone to spurious correlations and over-fitting.
Second, it requires more resources to train. Third, when there are fewer covariates, the improvement this strategy brings may not be significant. Therefore, special care should be taken when deciding to adopt this strategy! + +Now that we've finished this short tutorial on 3 different classification strategies, we will show how to train a basic encoder classifier for a multi-label task with an example. + +### Question 2 + +Suppose that a hospital is developing a system to automatically classify patient symptoms into multiple diagnostic categories. Each patient might have 0-3 simultaneous conditions from 15 possible diseases. Doctors observe that certain conditions often co-occur (e.g., diabetes and hypertension). Which strategy would BEST leverage these inter-label relationships while maintaining reasonable computational efficiency? + +- A) *OneVsRest* +- B) *MultiOutput* +- C) *ClassifierChain* +- D) None of the above + +*Enter your answer below as a string with one of A, B, C, D, i.e. "A"* + +```{python} +answer2 = # Your answer here + +Tests.test2(answer2) +``` + +## 3. Training a Multi-Label Classifier on Reddit Data + +Now that we've gone over the various classifying strategies, we'll train a new classifier on the labels produced by `cardiffnlp/tweet-topic-base-multilingual`. Specifically, we'll train a BERT model on the synthetic labels we generated earlier for the Reddit dataset. Training on synthetic data (data generated by other ML models) is a valid technique. In fact, [researchers estimate that new human-generated data for training ML models will run out within the next 2 to 8 years, forcing the use of synthetic data for training](https://arxiv.org/pdf/2211.04325). There are, of course, obvious downsides; in a classification context, the biases present in the classifications of one model will be transferred over to any models trained on the upstream model's classifications.
However, for learning purposes, this method works well and is cost-effective, and the code below can easily be modified and reused for human-labeled datasets.

Our workflow will consist of the following:

1) Convert labels into matrices of binary values.
2) Tokenize our `clean_text` column into tokens for classification.
3) Define and discuss accuracy and recall, and create a function for evaluating our model's predictions.
4) Train and test our model!

We'll accomplish this primarily with the use of two libraries: `scikit-learn` and `transformers`. `scikit-learn` provides us a convenient way of transforming our labels into binary vectors, while the `transformers` library abstracts most of the training and evaluation into simpler code.

### 3.1 Data-preprocessing

We'll start by doing some simple data preprocessing and setting up our GPU.


You'll probably want a CUDA-compatible GPU for this. Consider Google Colab for smaller datasets.


```{python}
os.environ["WANDB_DISABLED"] = "true" # disables annoying API key requirement
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Best to have a GPU.
model.to(device) # moves the model loaded earlier onto the GPU (or falls back to the CPU)

train_dataset = train_data[['clean_text', 'multi_labels']]
test_dataset = test_data[['clean_text', 'multi_labels']] # 'clean_text' is our textual input, 'multi_labels' is our list of labels.
print(type(train_data['multi_labels'].iloc[0]))
```

Notice that the multi-label column is comprised of lists of labels. Each list corresponds to the series of labels assigned to each Reddit post. While this is intuitive for us to understand, it is not so intuitive for our model. So, we'll need to convert it into a format easily interpretable by a computer: a binary vector. Each value in the vector corresponds to the presence of one of the 23 labels in our dataset. Repeating this for all $n$ rows of the dataset, we end up with a matrix of size $700 \times 23$.
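Concretely, one such binary row can be built by hand. The snippet below is a toy illustration (the label names and the `all_labels`/`post_labels` variables are invented for this example):

```{python}
all_labels = sorted(["news", "sports", "gaming"])  # the full label vocabulary, in a fixed order
post_labels = {"gaming", "news"}                   # labels assigned to one hypothetical post

# 1 if the label applies to this post, 0 otherwise
row = [1 if label in post_labels else 0 for label in all_labels]
print(row)  # [1, 1, 0] -> gaming present, news present, sports absent
```

Doing this by hand for every row would be tedious, which is exactly what `MultiLabelBinarizer` automates (including fixing a consistent label order).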
We'll do this conversion using `sklearn`'s [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) class.

```{python}
# for example, we can convert this list of sets containing labels into a binary matrix.
# each set contains labels corresponding to a column in a hypothetical dataset.

mlb = MultiLabelBinarizer()
mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
```

The first row of the matrix corresponds to the first set of labels, and is interpreted as "`comedy` is absent, `sci-fi` is present, `thriller` is present." Likewise, the second row is interpreted as "`comedy` is present, `sci-fi` is absent, `thriller` is absent."

```{python}
list(mlb.classes_)
```

```{python}
mlb = MultiLabelBinarizer()

train_labels = mlb.fit_transform(train_data['multi_labels']) # converting training data labels to a binary matrix
test_labels = mlb.transform(test_data['multi_labels']) # converting testing data labels to a binary matrix
num_labels = train_labels.shape[1] # .shape[1] gives us the number of entries in each row of the matrix
print(num_labels)
print(list(mlb.classes_))


train_data = train_data.copy() # creates an independent copy to avoid changing the original dataframe
test_data = test_data.copy()
train_data['labels_encoded'] = list(train_labels) # assigns labels to a new column
test_data['labels_encoded'] = list(test_labels)
# Converting the pandas DataFrames into HuggingFace Datasets, in order to use the Trainer.
train_dataset = Dataset.from_pandas(train_data[['clean_text', 'labels_encoded']])
test_dataset = Dataset.from_pandas(test_data[['clean_text', 'labels_encoded']])
```

```{python}
print(np.array(train_dataset['labels_encoded']))
print(np.array(test_dataset['labels_encoded']))
```

Next, we need to tokenize our text to pass through the model.
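Two details of tokenization matter here: texts longer than a maximum length are *truncated*, and shorter texts are *padded* so every example ends up the same length. The mechanics can be sketched with a toy whitespace "tokenizer" (purely illustrative: `toy_encode` and its five-word vocabulary are invented, not the WordPiece tokenizer BERT actually uses):

```{python}
def toy_encode(text, max_length=8, pad_id=0, unk_id=1):
    # a fake five-word vocabulary; real BERT uses WordPiece with ~30k entries
    vocab = {"the": 2, "reddit": 3, "post": 4, "is": 5, "long": 6}
    ids = [vocab.get(w, unk_id) for w in text.lower().split()]
    ids = ids[:max_length]                                 # truncation: drop tokens past max_length
    mask = [1] * len(ids) + [0] * (max_length - len(ids))  # attention mask: 1 = real token, 0 = padding
    ids = ids + [pad_id] * (max_length - len(ids))         # padding: fill up to max_length
    return {"input_ids": ids, "attention_mask": mask}

print(toy_encode("the reddit post is long"))
# -> {'input_ids': [2, 3, 4, 5, 6, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}
```

The real tokenizer additionally inserts special tokens such as `[CLS]` and `[SEP]`, but the truncation and padding behaviour is the same idea.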
We define a custom tokenizing function below:

```{python}
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(batch):
    """
    Given a batch dictionary, tokenizes text using the global tokenizer with:
    - Truncation: cutting off texts longer than 128 tokens.
    - Padding: ensuring every text is exactly 128 tokens long.
    Next it converts each label in "labels_encoded" to a float and adds the processed labels to the
    tokenized output under the key "labels".
    """
    encoding = tokenizer(batch["clean_text"], truncation=True, padding='max_length', max_length=128) # converts raw text into tokens
    labels = [] # initializing the main empty list of labels

    for label_list in batch["labels_encoded"]: # outer loop: iterates over each list of labels in the batch
        converted_labels = [] # initializing an inner list of labels for the current example
        for x in label_list: # iterates over each individual label in the current label list
            converted_labels.append(float(x)) # converts each label to a float and appends it to `converted_labels`
        labels.append(converted_labels) # appends the resulting list to the main list of labels

    encoding["labels"] = labels
    return encoding
```

Now that our tokenizer function has been written, we can map the function onto the training and testing datasets.

```{python}
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

train_dataset = train_dataset.remove_columns(["clean_text", "labels_encoded"]) # removing the old columns
test_dataset = test_dataset.remove_columns(["clean_text", "labels_encoded"])

train_dataset.set_format("torch") # changes the output format of the dataset so that each sample is returned as PyTorch tensors
test_dataset.set_format("torch")
```

Note that we haven't actually done any training yet; all we've done is preprocess our data into a format that is legible to a model.
We can now go ahead and create an instance of a BERT model to train from HuggingFace, by specifying our model and the problem type (multi-label classification).

```{python}
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_labels,
    problem_type="multi_label_classification"
)
```

It's one thing to train a model, but how do we go about actually evaluating its performance? We need a set of concrete metrics. Before diving into writing an evaluation function, let's define some new terminology:

**Accuracy** is defined as the total number of correct predictions over the total number of predictions.

$$\text{Accuracy} = \frac{\text{Total \# of Correct Predictions}}{\text{Total \# of Predictions}}$$

This metric is simple and intuitive; however, as we will see, in scenarios where one class dominates, a high accuracy can be misleading because the model might simply be predicting the majority class most of the time.

**Recall** is the proportion of actual positives (binary label equals 1) that were correctly identified.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$

Recall is used primarily in situations where missing a positive (a false negative) is extremely costly. For instance, if we were training a model to screen patients for a disease, we'd heavily prioritize maximizing recall, as a missed case could have far more serious consequences than a false alarm.

We'll be evaluating both recall and accuracy.
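To make these definitions concrete, we can compute both metrics by hand on a tiny example of 2 posts and 3 labels (the numbers are invented purely for illustration):

```{python}
import numpy as np

# hypothetical binary predictions and true labels: 2 samples x 3 labels
preds  = np.array([[1, 0, 1],
                   [0, 1, 0]])
labels = np.array([[1, 0, 1],
                   [1, 1, 0]])

binary_accuracy = np.mean(preds == labels)                      # 5 of the 6 entries match -> 5/6
true_positives = np.logical_and(preds == 1, labels == 1).sum()  # 3 correctly identified positives
total_actual = (labels == 1).sum()                              # 4 actual positives in `labels`
micro_recall = true_positives / total_actual                    # 3/4

print(binary_accuracy, micro_recall)
```

The evaluation function we write next follows exactly this recipe, with one extra step: the model outputs raw logits, which must first be squashed into probabilities with a sigmoid and thresholded at 0.5.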
```{python}
def compute_metrics(p):
    """
    Given a tuple containing an array of logits and an array of labels, compute accuracy and recall by:
    1) Applying a sigmoid activation function to the logits, to convert them into probabilities
    2) Converting the resulting probabilities into binary predictions
    3) Calculating the resulting accuracy and recall
    """
    logits, labels = p
    probs = 1 / (1 + np.exp(-logits)) # sigmoid activation function
    preds = (probs >= 0.5).astype(int) # threshold at 0.5 to get binary predictions
    binary_accuracy = np.mean(preds == labels) # binary accuracy
    true_positives = np.logical_and(preds == 1, labels == 1).sum()
    total_actual = (labels == 1).sum()
    micro_recall = true_positives / total_actual if total_actual > 0 else 0

    return {"binary_accuracy": binary_accuracy, "micro_recall": micro_recall}
```

Furthermore, we need to specify the training arguments and the trainer. We do this using HuggingFace's `Trainer` and `TrainingArguments` classes, which are handy APIs for easily training a model while abstracting away all of the complicated details.

The details here aren't important; you can read more about what each argument does [here](https://huggingface.co/docs/transformers/en/main_classes/trainer) and [here](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/trainer#transformers.TrainingArguments).
```{python}
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="binary_accuracy",
    save_total_limit=2,
    save_strategy="epoch"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
```

Lastly, the easiest part is actually training the model:

```{python}
# This call will run the full training loop
trainer.train()

results = trainer.evaluate()
print(results)
```

Wow! What a high accuracy, after being trained on just 700 examples! The flaw here is obvious: we are training a BERT classifier on synthetic data generated by a BERT-based model. Furthermore, as we mentioned earlier, our labels are not evenly distributed across the training and evaluation data. Many labels appear in almost every row, while some appear fewer than 10 times.


## Test your understanding

1) (easy) Suppose you are given a list of sets like: `[{1,2},{1,2,3},{2,3},{3,4}]`. What will be the resulting binary matrix after running `mlb.fit_transform()`?

<details>
+ Hide/Show Solution +

+\begin{bmatrix} +1 & 1 & 0 & 0 \\ +1 & 1 & 1 & 0 \\ +0 & 1 & 1 & 0 \\ +0 & 0 & 1 & 1 +\end{bmatrix} +

+
2) (easy) Another method of calculating accuracy is **total accuracy**, which is the number of rows where every label is correctly classified, divided by the total number of rows. For instance, if a model outputs `[0,0,1]` as its classification for a given text and the correct answer is `[1,0,1]`, the binary accuracy would be 2/3, while the total accuracy would be 0.

What would you expect to happen to our accuracy score if we used total accuracy as our accuracy metric instead of binary accuracy?

<details>
+ Hide/Show Solution +

Our accuracy would be much lower, since we would now be checking whether each entire row of the predicted matrix of binary labels is identical to the corresponding row of the actual matrix, rather than checking whether each individual value in the predicted matrix matches the corresponding value in the actual matrix.

+
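The gap between binary accuracy and total accuracy is easy to verify numerically (toy arrays, invented for illustration):

```{python}
import numpy as np

preds  = np.array([[0, 0, 1],
                   [1, 0, 1]])
labels = np.array([[1, 0, 1],
                   [1, 0, 1]])

binary_accuracy = np.mean(preds == labels)               # 5 of the 6 individual entries match -> 5/6
total_accuracy = np.mean((preds == labels).all(axis=1))  # only the second row matches exactly -> 1/2

print(binary_accuracy, total_accuracy)
```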
+ +3) (Hard) Head over to kaggle and check out the [jigsaw classification challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data). Note that you'll need an [api key](https://www.kaggle.com/docs/api) to download the dataset below. Using the classification code above, can you train a BERT model on the dataset? + +```{python} +!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge + +# you might need to do !pip install kaggle first! +``` + diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/advanced_encoder_classifier_tests.py b/project/docs/4_Advanced/advanced_encoder_classifier/advanced_encoder_classifier_tests.py new file mode 100644 index 00000000..b8c36db7 --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/advanced_encoder_classifier_tests.py @@ -0,0 +1,26 @@ +import hashlib +from hashlib import sha256 + +def hash(data): + h=hashlib.new("SHA256") + h.update(data.encode()) + return h.hexdigest() + +class Tests(): + + def test1(answer): + if str(hash(answer)) == "df7e70e5021544f4834bbee64a9e3789febc4be81470df629cad6ddb03320a5c": + return print ("Correct!") + + else: + return print("Incorrect, see the introduction part to find what the strengths of encoder classification are.") + + def test2(answer): + if str(hash(answer)) == "6b23c0d5f35d1b11f9b683f0b0a617355deb11277d91ae091d399c655b87940d": + return print ("Correct!") + + elif str(hash(answer)) == "df7e70e5021544f4834bbee64a9e3789febc4be81470df629cad6ddb03320a5c": + return print("Incorrect, recall the description in the question. Is there a way for us to make better use of the given pattern? ") + + else: + return print("Incorrect, revisit the concept and consider what is the strategy that can cope with large amounts of labels? 
") \ No newline at end of file diff --git a/project/docs/4_Advanced/advanced_encoder_classifier/r_ubc_scraper.ipynb b/project/docs/4_Advanced/advanced_encoder_classifier/r_ubc_scraper.ipynb new file mode 100644 index 00000000..3031cfdb --- /dev/null +++ b/project/docs/4_Advanced/advanced_encoder_classifier/r_ubc_scraper.ipynb @@ -0,0 +1,1184 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import praw\n", + "import time\n", + "import pandas as pd\n", + "import csv\n", + "from praw.models import Submission\n", + "import os\n", + "from tenacity import retry, stop_after_attempt, wait_exponential\n", + "from tqdm import tqdm " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Define necessary names\n", + "SUBREDDIT_NAME = \"UBC\" \n", + "MAX_POSTS = 10000 \n", + "BATCH_SIZE = 100 \n", + "CHECKPOINT_FILE = \"checkpoint.txt\" \n", + "POSTS_CSV = \"r_ubc_posts.csv\" \n", + "COMMENTS_CSV = \"r_ubc_comments.csv\" \n", + "REQUEST_DELAY = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Load reddit api throgh praw\n", + "reddit = praw.Reddit(\n", + " client_id=\"qg6qmpLE5NHZ98F71GBiiA\",\n", + " client_secret=\"5flQrSnLQjGwIoLQIrV_YFluxej9GQ\",\n", + " user_agent=\"script:my_script:v1.0 (by /u/NathanPalaiologos)\"\n", + ")\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Set csv file\n", + "\n", + "pd.DataFrame(columns=[\n", + " \"post_id\", \"title\", \"author\", \"content\",\n", + " \"score\", \"num_comments\", \"created_utc\", \"url\"\n", + "]).to_csv(POSTS_CSV, index=False)\n", + "\n", + "\n", + "pd.DataFrame(columns=[\n", + " \"post_id\", \"comment_id\", \"author\", \"body\", \n", + " \"score\", \"created_utc\"\n", + "]).to_csv(COMMENTS_CSV, index=False)" + ] + }, + { + "cell_type": "code", + 
"execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "posts_file = open(\"r_ubc_posts.csv\", \"w\", newline=\"\", encoding=\"utf-8\")\n", + "comments_file = open(\"r_ubc_comments.csv\", \"w\", newline=\"\", encoding=\"utf-8\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "50" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "posts_writer = csv.writer(posts_file)\n", + "comments_writer = csv.writer(comments_file)\n", + "posts_writer.writerow([\"id\", \"title\", \"author\", \"selftext\", \"score\", \"num_comments\", \"created_utc\", \"url\"])\n", + "comments_writer.writerow([\"post_id\", \"comment_id\", \"author\", \"body\", \"score\", \"created_utc\"])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "subreddit_name = \"UBC\" \n", + "subreddit = reddit.subreddit(subreddit_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Posts: pottery workshop tonight +1 needed\n", + "Posts: Year around housing contract\n", + "Posts: Is there a way to get my transcript for free\n", + "Posts: Does the STAT dept offer graduating class composites?\n", + "Posts: Can You Choose a Specific Building in Totem Park?\n", + "Posts: Can you take courses as an elective then use them as a minor afterwards\n", + "Posts: Has anyone received a response for undergraduate summer math research applications this year?\n", + "Posts: Mid Year Transfer\n", + "Posts: Hydrogen becomes a superfluid at nanoscale, confirming 50-year-old prediction\n", + "Posts: Is there a surefired way to get a work learn position?\n", + "Posts: Help: where to get Grad dresses\n", + "Posts: Free Conference Opportunity in Calgary!\n", + "Posts: BCampus resources\n", + "Posts: UBC 
Accelerated BSN program inquires\n", + "Posts: Do i need a tutor for cpsc210\n", + "Posts: tuition payment\n", + "Posts: YRH housing offers?\n", + "Posts: $24k stipend in Vancouver\n", + "Posts: midterm today BOOO\n", + "Posts: MY SLEEP SCHEDULE IS FUCKED\n", + "Posts: Grad Application Issue?\n", + "Posts: Do profs/instructors get in trouble and get suspended if students complain about something?\n", + "Posts: Why does the results for the presidential scholarship awards release only a couple of days before the deadline to accept or decline UBC?\n", + "Posts: May I use UBC as one of the settings in my crappy book, I am writing?\n", + "Posts: Housing Extension for Graduating Students\n", + "Posts: I love you guys\n", + "Posts: Why do people hate Sauder students so much\n", + "Posts: 11:38 pm Feb 23\n", + "Posts: Hi does anyone know what is lfs 302-international field studies.\n", + "Posts: Chem 205 midterms\n", + "Posts: groups for incoming WT students\n", + "Posts: to whoever stole my sushi at md\n", + "Posts: looking for tips on being more active in group project\n", + "Posts: Acceptances and rejection period\n", + "Posts: Final 2 participants needed for UBC muscle growth & strength study ❗️❗️❗️\n", + "Posts: Fall Graduating Class Composites\n", + "Posts: Library is kinda packed tonight\n", + "Posts: To people posting relationship posts\n", + "Posts: NOT EXCITED TO SEE Y'ALL TOMORROW\n", + "Posts: Where should I go for math tutoring help?\n", + "Posts: Have all the UBC final exam dates been released?\n", + "Posts: i’m not sure what to do\n", + "Posts: 28 months ago I looked at this grey ocean and smartened up\n", + "Posts: How often do UBC students use cannabis?\n", + "Posts: MIDTERM HELP PLEASE AND THANK YOU: PATH and MICB\n", + "Posts: What do you do on weekends while living on campus?\n", + "Posts: Is anyone else least favourite part of Webwork typing the answers in?\n", + "Posts: Looking for a tutor to teach programming/coding.\n", + "Posts: Saw This Outside 
H-Mart and remembered it from ENGL 245. Did something happen to Professor McNeilly?\n", + "Posts: to drop a course or no\n", + "Posts: Art history classes at UBC as an elective\n", + "Posts: UBC Student Storage\n", + "Posts: aftershocks last 2 days\n", + "Posts: UBC OT MASTERS PREP\n", + "Posts: PharmD Application\n", + "Posts: Is Fraser Hall worth it?\n", + "Posts: Tons of cops/security arounf the busloop?\n", + "Posts: [Master degree] UCL Creative and Collaborative Enterprise (CCE) vs. UBC Management (MM)\n", + "Posts: i haven’t made a single lasting friend since high school\n", + "Posts: You want a bf/gf until your relationship drains you\n", + "Posts: I’ve stayed up until 3 am every night during reading break\n", + "Posts: WT 2025-2026 schedule options\n", + "Posts: 0% productivity reading break\n", + "Posts: Anyone like Monster Hunter?\n", + "Posts: Is there a Pakistan vs India match screening tonight at the Pit?\n", + "Posts: My job search Sankey diagram!\n", + "Posts: Help with understanding financial awards/scholarships.\n", + "Posts: Rant about measly $20 per physio visit “health coverage” as a student\n", + "Posts: What if we create a dress code to wear for single people at UBC?\n", + "Posts: Thinking about creating a UBC Dating Subreddit\n", + "Posts: I want a girlfriend\n", + "Posts: Whoever was sucking people off in the Brock gloryhole you gotta drink some water\n", + "Posts: The person that stole my uber eats order\n", + "Posts: people are talking about bfs and gfs...what about breakups\n", + "Posts: buses keep getting stopped today?\n", + "Posts: Atp are yall down for a 10 Vs 10 Dating show? UBC only\n", + "Posts: Yo can all of u stfu with the posts\n", + "Posts: T2202 tax form 2024\n", + "Posts: I want an In and Out Burger 😭\n", + "Posts: Am I gonna be alone forever? 
Or at least for a long time?\n", + "Posts: Where the ABGs at\n", + "Posts: Yo can all of u stfu with the boyfriend/girlfriend posts\n", + "Posts: What do you think about the “flipped classroom” courses\n", + "Posts: 60863 students at UBC Vancouver and you are still single\n", + "Posts: Ngl atp our rank on \"Down Bad Universities\" list is definitely going up 💀\n", + "Posts: orchard dining hall- can you leave the dining hall with their dishes??😭\n", + "Posts: What was the worst grade on your transcript and what faculty did you apply for?\n", + "Posts: I want a TA 😭😭😭😭\n", + "Posts: do most people have a coop for the summer by now?\n", + "Posts: Can I still join UBC clubs?\n", + "Posts: Tax Slips for recent grads after Workday update\n", + "Posts: Any couple planning to go NIFTY clothing optional swim tonight?\n", + "Posts: trolling that fake ubc womens health account\n", + "Posts: Anyone know what was happening in KWTQ this morning ~9am?\n", + "Posts: in 4th year and still no idea what i’m doing\n", + "Posts: Idea for new club\n", + "Posts: Interlibrary Loan Requests??\n", + "Posts: Tuition fees from withdrawn courses ?\n", + "Posts: 2nd year transfer credits question 24 vs 30?\n", + "Posts: WEEPING, CRYING, BAWLING\n", + "Posts: low effort arts courses\n", + "Posts: Tf is wrong with the homeless woman in front of bus number 14\n", + "Posts: Relationship post\n", + "Posts: Accepting Girlfriend Applications. 
Serious Inquiries Only!\n", + "Posts: Econ courses to take\n", + "Posts: Part Time Jobs Near or Off Campus\n", + "Posts: When do we get our own university AI?\n", + "Posts: Any Design Teams in the Summer with ML positions\n", + "Posts: Winter Housing Waitlist\n", + "Posts: my girlfriend broke up with me a month ago and i feel like a shell of the person i was\n", + "Posts: I want a boyfriend\n", + "Posts: Gyms in Graduate housing\n", + "Posts: Recommendation for summer semester?\n", + "Posts: how to cancel ubc housing contract renewal?\n", + "Posts: Forgot to add my spouse to AMS. What now?\n", + "Posts: Every quad machine broken\n", + "Posts: UBC Forest app UNITEEE\n", + "Posts: Working while in nursing\n", + "Posts: What do I actually do with my life... LOL\n", + "Posts: How do you guys survive the campus with all the rain\n", + "Posts: Update regarding ASIA 326 grades\n", + "Posts: my dad lost his job but ubc is my dream school\n", + "Posts: How much documentation is enough for CFA accommodations?\n", + "Posts: When to tell my supervisor that I want to quit my PhD\n", + "Posts: Who has done a semester or year abroad?\n", + "Posts: I think we had 3 earthquakes just north of Vancouver (Sechelt / Sunshine Coast).\n", + "Posts: I guess we’re Earth quake survivors now?\n", + "Posts: Guys crazy earthquake damage. I dropped my pen :(\n", + "Posts: Anyone else's first time feeling an earthquake??\n", + "Posts: Should I leave my dorm and go outside?\n", + "Posts: EARTHQUAKKKEEEEE 🫨🫨🫨\n", + "Posts: Confirmed “An earthquake of magnitude 4.7 has occurred 27 km NNE of Sechelt”\n", + "Posts: i applied 16th nov, but they haven’t asked for high school transcript (transfer)\n", + "Posts: EARTHQUAKE!!!!!\n", + "Posts: EARTHQUAKE AT UBC\n", + "Posts: Did anyone feel the earthquake\n", + "Posts: Was that Earthquakes?!?!? 
Anyone felt it?!?!\n", + "Posts: Genuinely not feeling like I'm worth anything out here\n", + "Posts: Anyone taken CONS127 online over the summer?\n", + "Posts: I wish death upon this little shit\n", + "Posts: changing summer semester registration time?\n", + "Posts: Exchange student to Japan\n", + "Posts: Connecting with research supervisors\n", + "Posts: Travel insurance coverage\n", + "Posts: Life sciences parking share\n", + "Posts: Does Combined Major have good recognition in job market?\n", + "Posts: How is your Arts co-op search term going?\n", + "Posts: CPSC 304 In the Summer\n", + "Posts: Taking CHEM 123 during summer advice?\n", + "Posts: Final Exam at 7pm?\n", + "Posts: Need optometrist @ UBC\n", + "Posts: Looking for lost a bracelet in/around IRC\n", + "Posts: getting a job at blue chip\n", + "Posts: I applied in december and still haven't gotten my response back\n", + "Posts: Consequences of credit/d/fail?\n", + "Posts: Is it harder to get in science or engineering?\n", + "Posts: LETSSSS GOOO CANADAAAAAAAAAAAAAAAAAAAAAAA\n", + "Posts: how to deal with not ideal grades (first year)\n", + "Posts: Signed a petition and started feeling nervous\n", + "Posts: Made this throwaway 3 years ago exactly to talk about TA/Student relationships as a TA\n", + "Posts: blue chip jobss\n", + "Posts: Delay graduation/not?\n", + "Posts: Winter session housing waitlist\n", + "Posts: How are you investing??? and where?\n", + "Posts: September 2024 New Lit Requirements\n", + "Posts: Weird survey about circumcision?\n", + "Posts: Automated Storage and Retrieval System (ASRS) at IKB\n", + "Posts: PhD in French program\n", + "Posts: Summer Registration Times\n", + "Posts: My prof used chatgpt for my reference letter ! Confront or not?\n", + "Posts: Why Tf does it smell like that\n", + "Posts: approaching smonee\n", + "Posts: YRH Two months pay\n", + "Posts: Feeling Lost After Graduation: Struggling with Career, Friendships, and My Past\n", + "Posts: Someone stop this break!! 
It's ending too fast. 😭😭😭\n", + "Posts: Why does the bus loop smell like a compost bin this week?\n", + "Posts: UBC Career Centre Networking Night for International Students\n", + "Posts: Some interesting course data from UBCFinder.com\n", + "Posts: English 110 summer\n", + "Posts: Canadians are nice, but we're 'not going to roll over,' says UBC prof on booing and boycotting\n", + "Posts: Never eating at Tim Hortons again and where can I get coffee\n", + "Posts: I FINALLY received my Trek award\n", + "Posts: ECON355 Midterm, Vaney\n", + "Posts: Too late to apply to masters programs?\n", + "Posts: A familiar name on LSData\n", + "Posts: UBC Grade Calculator\n", + "Posts: Are libraries open 24/7 during exam season\n", + "Posts: Check your LFSM for bursaries\n", + "Posts: Urgent care on campus?\n", + "Posts: spare studio key\n", + "Posts: What is this buzzing noise at Ponderosa that has been going on all night ???\n", + "Posts: Need some guidance\n", + "Posts: Winter Session Term 1 Dates\n", + "Posts: ARTMS Concert anyone?\n", + "Posts: Coop portal keeps kicking me out\n", + "Posts: Whats the best way to find tutors on campus?\n", + "Posts: Kwtq 4 br or marine 4 br\n", + "Posts: Are Exam schedules finalized?\n", + "Posts: Is this Exam hardships ?\n", + "Posts: Summer Online Electives\n", + "Posts: When does summer/spring-course registration for someone going into second year in September open?\n", + "Posts: Acceptance letter workday\n", + "Posts: How much do personal trainers make on campus\n", + "Posts: Hello everyone what are some fun or easy 3rd level summer courses for second semester that are not restricted.\n", + "Posts: I am so sad at UBC\n", + "Posts: Feb 19th Senate Updates\n", + "Posts: Is choral union difficult to get into?\n", + "Posts: Upper Level Courses\n", + "Posts: Webmail not working\n", + "Posts: Comm 191 midterm\n", + "Posts: Update on \"I think I'm falling in love with a girl in a class I'm TA'ing for\"\n", + "Posts: Lost Prescription Glasses\n", 
+ "Posts: Dorm washroom problem\n", + "Posts: i am an international student and i forgot to apply for msp\n", + "Posts: Friends application is messed up and mixed responses from UBC\n", + "Posts: Nursing program\n", + "Posts: Transfer acceptance\n", + "Posts: cop one of these hoodies\n", + "Posts: PANEL: UBC Alumnus & TIFF Winner Filmmaker Bruce Sweeney!\n", + "Posts: The Ubussy: The Gold Standard in Journalism, According to Us\n", + "Posts: To the Girl with a Fibonacci Tattoo\n", + "Posts: How are you guys avoiding doomscrolling throughout the break?\n", + "Posts: Anyone from the class of 2020. How are you?\n", + "Posts: Medical insurance for partner of student\n", + "Posts: Best Residence For Couples\n", + "Posts: What is class standing on workday\n", + "Posts: UBC Graduate Offers\n", + "Posts: Finding engineering co-op in Canada with a criminal record\n", + "Posts: MEd Counselling Psychology Results?\n", + "Posts: Is It Too Early to Book Off Work for Finals?\n", + "Posts: see something, say something..\n", + "Posts: Has anyone got their grades back from ASIA 326 last sem\n", + "Posts: Housing 5% increase?\n", + "Posts: UBC Residence Ranked\n", + "Posts: Everyone in this whole AMS beef just looks bad\n", + "Posts: 9 credits is part time student??\n", + "Posts: Science Peer Academic Coach position worth it?\n", + "Posts: Best Residence Studio\n", + "Posts: How do i stop this feeling?\n", + "Posts: Body image and uni stuffs\n", + "Posts: Anyone in Tokyo rn?\n", + "Posts: any jobs, part time preferred but full time works\n", + "Posts: Are you able to login to student email?\n", + "Posts: Surrounded by thousands, still feeling alone\n", + "Posts: How to get checked for skin cancer??? i'm scared\n", + "Posts: The AMS situation is so stupid.\n", + "Posts: YRH Housing Acceptance Fee\n", + "Posts: UBC in the snow (on film)\n", + "Posts: i'm just a (random girl) 🎀\n", + "Posts: Anybody living in private campus housing? 
Have you heard of Capstone realty?\n", + "Posts: The Real UBC Experience Stairmaster 3000 Edition\n", + "Posts: COMM 295 wizeprep or PASS/CMP final exam help\n", + "Posts: Evo drivers... wtf\n", + "Posts: UBC featured in Veritasium video\n", + "Posts: UBC overnight parking\n", + "Posts: Be there or be square\n", + "Posts: How common is it for UBC grads to be overqualified for jobs they are doing?\n", + "Posts: To all my big backs...\n", + "Posts: biol 200 final april\n", + "Posts: For those who took exams in wrong exam halls, what happened?\n", + "Posts: Stuff stolen on campus (LSK)\n", + "Posts: The Ubyssey....\n", + "Posts: Final exam schedule is posted!\n", + "Posts: Wow, this article really increased my confidence in the Ubyssey....\n", + "Posts: Survey on off campus house hunting and roommate experiences\n", + "Posts: Year round housing cancellation\n", + "Posts: UBC Gym/Aquatics\n", + "Posts: to those who delete their piazza posts with \"resolved\" after their question was solved\n", + "Posts: Response to recent criticisms on Ubyssey reporting regarding the removal of the AMS VP AUA\n", + "Posts: 2025 Summer Classes\n", + "Posts: I made a free ai resume copilot for ubc co-op students\n", + "Posts: Anyone applied for PhD in computer science UBC 2025?\n", + "Posts: ACCOUNTING ANYONE?\n", + "Posts: LING 201 with Ryan Bochnak/ LING 200 and 201 prof reqs\n", + "Posts: I need help looking for a Job?!\n", + "Posts: How does the Jumpstart Orientation Leader hiring happen?\n", + "Posts: How Healthy is this Traditional Form of Success?\n", + "Posts: Whoever did this...\n", + "Posts: Veritasium filmed at UBC\n", + "Posts: A rant about AI and CS.\n", + "Posts: My friend didn’t realize she had to take an english proficiency test and the deadline was Feb 15.\n", + "Posts: Not able to access UBC email\n", + "Posts: Looking for an idea guy\n", + "Posts: New jake lore dropped\n", + "Posts: Does anyone know if the UBC track is still covered in snow?\n", + "Posts: explain like 
im 5: the ams drama\n", + "Posts: What the hell is happening with AMS and Ubyssey\n", + "Posts: time for a recap of the OTHER situation\n", + "Posts: Allegations of The Ubyssey Colluding with ex-AMS VPAUA\n", + "Posts: Summer course registration\n", + "Posts: post valentines discounts\n", + "Posts: I HATE GROUP PROJECTS\n", + "Posts: How do you guys end friendships?\n", + "Posts: How to enjoy and be efficient with studying again?\n", + "Posts: Looking for other Basketball Card Collectors to Trade With\n", + "Posts: Do BA Econ students use Iona as well?\n", + "Posts: Been posting old photos from my travels... here's one I snapped at Wreck Beach in 1998.\n", + "Posts: Its been a month UBC are you going to accept me?\n", + "Posts: I love galentines\n", + "Posts: lsat study group?\n", + "Posts: UBC students - how would you feel dating a trans girl?\n", + "Posts: I have a lot of anxiety about starting school again.\n", + "Posts: final exam schedule\n", + "Posts: Does anyone here make birthday cakes?\n", + "Posts: UBC CS vs Sauder Business and CS: Which is the better choice?\n", + "Posts: Resources for Queer Asians?\n", + "Posts: Burnt out and a chronic hater\n", + "Posts: What happens when you drop below the minimum courses while on student loan?\n", + "Posts: chem 205 midterm\n", + "Posts: T5 Tax on a GIC investment account\n", + "Posts: ARC, BIRDCOOP gym during break\n", + "Posts: IKB is so packed\n", + "Posts: AMS Word Salad Buffet\n", + "Posts: time for a recap of the situation\n", + "Posts: Confused about going from Science to Computer Science at UBC\n", + "Posts: UBC researcher develops sustainable bamboo containers free of forever chemicals\n", + "Posts: Fairview Crescent Couches\n", + "Posts: How do I ask out a classmate I find really cute and smart?\n", + "Posts: DROPPING OUT AND TAKING A BREAK IS ALL I NEED RN\n", + "Posts: Data Science job opportunities\n", + "Posts: UBC staff: Bad working environment\n", + "Posts: Is there a water hose around UBC to wash my 
vehicle??\n", + "Posts: How does a PhD at UBC work?\n", + "Posts: I (Drédyn Fontana) Made this Crème Brulée Last Night. AMA (About the Crème Brulée)\n", + "Posts: How could Newly Minted CS/Programmers Adapt to AI?\n", + "Posts: where is everyone finding suitable off campus housing?\n", + "Posts: Any feedback on the Ph.D in School and Applied Child Psychology?\n", + "Posts: What are you guys doing for the reading break???\n", + "Posts: Frozen fountain from last week\n", + "Posts: I should be the next AMS VP\n", + "Posts: When you live with children in the same building\n", + "Posts: Dear AMS executives,\n", + "Posts: My Response to TMZ's Media Request\n", + "Posts: More Deceptive Ubyssey Coverage: That Time Dredyn Offered to Impeach CK and Make Me AMS President\n", + "Posts: so when's this ck super bowl halftime performance comin?\n", + "Posts: chat is this real\n", + "Posts: Release The Report\n", + "Posts: Why is it impossible to get a bus to UBC on time?\n", + "Posts: To all the single baddies\n", + "Posts: CK and Dredyn should box each other\n", + "Posts: Where are all the hot men on campus\n", + "Posts: At least I don't lie about my bias\n", + "Posts: Online Petition for Palestine 🇵🇸\n", + "Posts: More deceptive Ubyssey coverage of Dredyn Fontana's removal\n", + "Posts: GoGlobal Awards\n", + "Posts: I missed CPSC 210 phase 1 deadline\n", + "Posts: NEW FRIENDS? ANYONE? DROPPED OUT OF FRAT FOR A REASON\n", + "Posts: Life is going good except I’m lonely even though I have three friend groups\n", + "Posts: me waiting for an update on the girl flirting with her TA\n", + "Posts: Just took a quiz worth 10% of my grade and got 93%!! 
(At least on the part that was graded so far)\n", + "Posts: UBC visitor wifi not working\n", + "Posts: How bad is the commute to Downtown Vancouver in the Summer?\n", + "Posts: affordable haircuts near ubc?\n", + "Posts: Ikb museum & free snacks!\n", + "Posts: Arts International Scholarship\n", + "Posts: REMINDER: It's time to renew your U-Pass! [February 2025]\n", + "Posts: What is going on with the AMS and does it matter?\n", + "Posts: What happens if you join a Prairielearn course that you are not in\n", + "Posts: In Japan over reading week and was happy to see our favourite study spot well represented abroad!\n", + "Posts: R4 got sent to a parallel dimension? (that's not Wesbrook mall)\n", + "Posts: tax forms workday\n", + "Posts: summer courses registration\n", + "Posts: any dance rooms\n", + "Posts: Is the new UBC gym open?\n", + "Posts: 4 cop cars lights and sirens blaring cornering the grass field next to Doug Mitchel Rink?\n", + "Posts: Jacob Elordi Look Alikes\n", + "Posts: Need a low-key gym near UBC—struggling with consistency & gym anxiety\n", + "Posts: Looking for insights on the MDS program at UBC - What's your experience been like so far?\n", + "Posts: Opinion: Translink buses should not have heaters in the back\n", + "Posts: Badminton Buddy\n", + "Posts: MICB 308 Midterm\n", + "Posts: Can I join the UBC surf club without ever surfing before?\n", + "Posts: It’s that time of year again\n", + "Posts: Lost Phone Found by MacInnes Field\n", + "Posts: Feeling like an imposter\n", + "Posts: Printing in scale with Pay for Print\n", + "Posts: Will I get fined\n", + "Posts: ‘Biology’s cookbook’: UBC team discovers new kind of brain cell linked to memory\n", + "Posts: Wow this Valentine’s Day was painful 😂😭\n", + "Posts: ENG 111 Week 6 Notes\n", + "Posts: Be discreet people\n", + "Posts: Laundry Wars are back\n", + "Posts: Every week I go to office hours for help with the course material, but the TA is always busy flirting with another student\n", + "Posts: 
HOLY LORD CAN SMONE PLS TELL ME HOW TO DO THE FINANCIAL ANALYSIS FOR 101, THERES NTH IN THE CASE BUT THEY STILL WANT US TO GIVE NUMBERS BUT AT THE SAME TIME NOT MAKE ASSUMPTIONS\n", + "Posts: Housing for Graduate students\n", + "Posts: ‘I’m not the first victim’: Removed VP AUA calls out toxic culture, plans to sue AMS for wrongful termination\n", + "Posts: AMS…what’s happening\n", + "Posts: Am I cooked for my major?\n", + "Posts: Dredyn is suing!!\n", + "Posts: I can't trust anyone rn\n", + "Posts: the grind never stops...\n", + "Posts: AMS Controversy\n", + "Posts: exam schedule??\n", + "Posts: Messed up CPSC107 midterm for unclear instructions or I’m dumb\n", + "Posts: Single’s meet up at Wreck Beach for sunset ☀️🌊✨\n", + "Posts: fire alarm at koerner\n", + "Posts: Math & Education Dual Degree\n", + "Posts: How to access Jupyter after CPSC 103\n", + "Posts: Caught bro having a smoke outside of IKB\n", + "Posts: KIN 335 Final Grade Procedures\n", + "Posts: Happy v day to the beloved 49\n", + "Posts: What is the equivalent of arts student center for science\n", + "Posts: What would happen to Canadian universities like UBC if we get annexed by the United States?\n", + "Posts: I think my TA is weirdly uncomfortable around me? Kind of concerned about my grade tbh\n", + "Posts: Look up! Wave at the police! There’s a drone flying above AMS Nest building\n", + "Posts: You guys need to learn how to lock bikes\n", + "Posts: A realistic review of Gold's Gym UBC\n", + "Posts: Will grad school care about two withdrawals?\n", + "Posts: Can I borrow your can opener?\n", + "Posts: Why do guys not ask girls out anymore? Seems like I don't get asked out by normal guys except weirdos\n", + "Posts: Massive drones flying above campus?\n", + "Posts: CAPS 391 Advice\n", + "Posts: Title: The Mating Rituals of the Endangered Computer Science Major\n", + "Posts: odds prince harry is on campus today for invictus games?\n", + "Posts: Name of student file-sharing system/intranet circa. 
2012?\n", + "Posts: B.C.'s minimum wage to increase in June to keep pace with inflation.\n", + "Posts: To All Girls of UBC: Happy Valentines Day\n", + "Posts: I NEED FLOWERS!\n", + "Posts: Single as a pringle\n", + "Posts: Interested international student\n", + "Posts: Aphrodite project\n", + "Posts: CAPS 391 midterm 1\n", + "Posts: Blue Chip Matcha Brand\n", + "Posts: Fail a 25% midterm\n", + "Posts: why are there sooo many police vehicles campus?\n", + "Posts: Gotta do it one last time before graduating\n", + "Posts: me on Valentine’s Day every year\n", + "Posts: Turns out neuroscience major is a gpa booster\n", + "Posts: How are the basketball courts like in SRC?\n", + "Posts: GUYS I ASKED HER OUT\n", + "Posts: Reasoning behind this sign?\n", + "Posts: Graduation photos booking ? Did yalll do it\n", + "Posts: chem203 no longer offered?\n", + "Posts: Places or bathrooms to comfortably put on makeup?\n", + "Posts: Creating Affordable Food Options\n", + "Posts: I think I'm falling in love with a girl in a class I'm TA'ing for\n", + "Posts: Bioc 302 midterm\n", + "Posts: reading week plans?\n", + "Posts: Geos 270 midterm\n", + "Posts: TAs: how do you react to seeing someone cheating in an exam?\n", + "Posts: Jealousy on couples\n", + "Posts: Smh men go buy flowers for your girl\n", + "Posts: ECON102 Midterm with Gateman\n", + "Posts: Lost dog? 
Wesbrook Village by Blenz Coffee\n", + "Posts: Does Yung Kai go here?\n", + "Posts: Alumni plz help me: Does anyone have the pic from when someone spray painted FUCK SOCRATES & DESCARTES on the Buchanan building ~8 years ago?\n", + "Posts: Places to yell on campus\n", + "Posts: Absolutely bombed the Econ398 midterm\n", + "Posts: Regarding conditional offer letter\n", + "Posts: Pharmacy competitive average\n", + "Posts: Saw some guy hanging a tree from ESC\n", + "Posts: I did so bad on econ101\n", + "Posts: UBC to ban winter session subletting in year-round housing starting 2025\n", + "Posts: Opinion: Is IKB the ideal study space? It’s complicated\n", + "Posts: Steve Aoki Headlines AMS Block Party 2025\n", + "Posts: LING 209 this term?\n", + "Posts: Steve Aoki as this year’s block party headliner\n", + "Posts: Police Presence at UBC???\n", + "Posts: Have you ever Googled former classmates, alumni, to see what they are up to now?\n", + "Posts: Do you think that the AMS executives will be more accountable with the EPA outlined in the email?\n", + "Posts: Today’s pamphlet! 
Hit the Grind!\n", + "Posts: UBC Divest Petitioners - do better, please\n", + "Posts: UBC MDS Scholarship Contact\n", + "Posts: Does anyone have experience with the bachelor of Social Work\n", + "Posts: AMS Petition for Unique Building Addresses (please sign!)\n", + "Posts: transfer courses equivalency\n", + "Posts: Have you ever been contacted by a former classmate years after or would that be weird?\n", + "Posts: Was this always in FSC?\n", + "Posts: How does one get fired from the ams as an executive\n", + "Posts: Student health services dgaf 😭\n", + "Posts: Fraser Parkade this year is a disaster, anyone struggling?\n", + "Posts: AMS Block Party Headliner Announcement Tomorrow\n", + "Posts: Aphrodite Project\n", + "Posts: Fuck exams!!!!’\n", + "Posts: chances I'll get winter housing with this waitlist number?\n", + "Posts: AMS EMAIL TLDR?\n", + "Posts: Chem 123 MT 2 preparing\n", + "Posts: For those who didn’t get the email\n", + "Posts: I don’t know whats going on but I heard another student talk about something that sounded AMS related really loudly to her friend when I went for Lebanese food in October AMA!\n", + "Posts: Vision Insurance\n", + "Posts: PSYC 218 Tutoring Services\n", + "Posts: Anyone else feel like it’s the end of term?\n", + "Posts: Why do i get email from ams\n", + "Posts: My personal experience with VP AUA Dredyn\n", + "Posts: What did homie do 💀\n", + "Posts: TF is going on in the AMS???\n", + "Posts: Removal of VP of Academic and University Affairs?\n", + "Posts: How's everone's psyc101 mt, especially Pro Clark\n", + "Posts: I like to study…..\n", + "Posts: I have 3 midterms tomorrow\n", + "Posts: Math 101 rip me\n", + "Posts: Applying for Jobs Sucks, So I Built My Own Resume Generator With Automatic Formatting\n", + "Posts: Weird loud pop or bang outside near brock commons?\n", + "Posts: How do we feel about MATH 220 mt?\n", + "Posts: Chem 123 MT1 results\n", + "Posts: how bad will i struggle with math 100 without calculus 
knowledge???\n", + "Posts: chem 123 mt1 grades\n", + "Posts: Help! Lost Sunglasses\n", + "Posts: in the gym rn :)\n", + "Posts: Any advice for foreign postgraduate to get a research opportunities in UBC?\n", + "Posts: CPSC 210 team W\n", + "Posts: in lecture rn :)\n", + "Posts: stat 201 Midterm\n", + "Posts: Late Credit/D/Fail??\n", + "Posts: How do you find undergrad volunteers for your lab?\n", + "Posts: I made a chrome extension for calculating grades like SSC used to do\n", + "Posts: Reading break ideas\n", + "Posts: police everywhere on campus?\n", + "Posts: how long does it take to get UBCcard replacement\n", + "Posts: Can anyone tell me about UBC hiring solutions temp pool recorded interview?\n", + "Posts: blank sheets of paper\n", + "Posts: LING 100 Midterm\n", + "Posts: Good morning everyone! Here’s today’s Marcus pamphlet\n", + "Posts: Should UBC install cameras in parkades??\n", + "Posts: New MacGregor Pamphlet from today\n", + "Posts: Is the Master of Data Science (MDS) program at UBC Worth It?\n", + "Posts: poetry lovers, interested in an on-campus reading during February?\n", + "Posts: Getting to the mountains from UBC\n", + "Posts: Exchange student\n", + "Posts: Don't worry, I've got it.\n", + "Posts: Shoutout to the guy who ordered bubble tea around 7:45 tn!\n", + "Posts: Roommate Change\n", + "Posts: letting it all out today so i can go back to being a calm and happy soul tomorrow. 
(rant)\n", + "Posts: any tips for fnh 160 midterm?\n", + "Posts: Only one day left - Disabilities Advocacy Survey\n", + "Posts: Petition for IKB microwave\n", + "Posts: Any tips for the Comm 394 midterm?\n", + "Posts: going to piss watch my stuff for me thanks\n", + "Posts: Psyc 101 Quiz Results\n", + "Posts: Anxiety before exam\n", + "Posts: More weird posters popping up around campus\n", + "Posts: Take off your backpack on busses\n", + "Posts: My experience of successfully appealing for re admission after two failed years\n", + "Posts: LFS 252 Midterm\n", + "Posts: ubc cine discord\n", + "Posts: Income tax from part time work\n", + "Posts: Chem courses and labs\n", + "Posts: math 101 midterm\n", + "Posts: What are you doing for housing\n", + "Posts: Good healthcare provider\n", + "Posts: cpsc 210 help part 2\n", + "Posts: Mid semester workday crashout\n", + "Posts: “Whores are for serfs” -MacGregor\n", + "Posts: UBC offer Acceptance and deposit-When Do you actually pay ?\n", + "Posts: film production acceptance\n", + "Posts: What is the easiest course that can fulfill the science requirement for an arts student?\n", + "Posts: Graduate application results\n", + "Posts: cpsc 110 rant .\n", + "Posts: Is UberEats eating your wallet too?\n", + "Posts: What are y’all bringing for grad photos\n", + "Posts: RESEARCH OPPORTUNITY: Participants with Depression needed for UBC & SFU Department of Psychiatry Virtual Reality study\n", + "Posts: lost wallet on campus\n", + "Posts: Summer Sublet Rent (M)\n", + "Posts: CAN MOTHERFUCKERS JUST STOP STEALING ICE CREAM?!!\n", + "Posts: What are they expecting from us? 
(CPSC 210)\n", + "Posts: Affiliation scholarships\n", + "Posts: Lowkey the research I am doing is way more fun than coursework\n", + "Posts: I wont be buying your donuts to fundraise for your clubs\n", + "Posts: FUCK JAVA WHYYY\n", + "Posts: YRH what are my odds?\n", + "Posts: Cpsc 210 midterm\n", + "Posts: Course Withdrawal Quick Question\n", + "Posts: When is the best time to go to ARC / Birdcoop?\n", + "Posts: Is there UBC BAND club\n", + "Posts: Math 101 hope??\n", + "Posts: MATH 101 last minute review booklet/outline of material\n", + "Posts: There's something profound and nostalgic about this hall when it's quiet at night\n", + "Posts: I had to put my dog down today and I don’t know what to do\n", + "Posts: Math 101 Midterm 1 (Rant)\n", + "Posts: Opinion: Of course UBC is full of Redditors\n", + "Posts: how crowded is arc in the evenings\n", + "Posts: Winter housing wait list numbers\n", + "Posts: [Repost][Academic] [Research Study]: Eating Habits and Social Behaviours (Canadian Residents 18+)\n", + "Posts: Residence with In-Unit Laundry?\n", + "Posts: Who else stressed AF about the math 101 midterm?\n", + "Posts: I found this in my drawer, can't believe it is almost 10 years.\n", + "Posts: Anyone else casually losing their minds rn?\n", + "Posts: Help with employee form\n", + "Posts: Drink water, take care of yourself, celebrate yourself 😌💜 I. Believe. In. 
Yoouuuuuuu 😤💜\n", + "Posts: Question about Residence\n", + "Posts: Idk how a concession works\n", + "Posts: Profs who don’t record lectures\n", + "Posts: psa to ubc redditors\n", + "Posts: Frat house accommodation for summer\n", + "Posts: UBC DMD: Questions\n", + "Posts: Transfer from lfs to science\n", + "Posts: LING 100 Midterm\n", + "Posts: how to get to ARC main level studio\n", + "Posts: How bad does 3 W’s look on a second year transfer transcript?\n", + "Posts: Is tuition cheaper than UBCV?\n", + "Posts: Tutor for APSC 160\n", + "Posts: Cpsc 103 rant!!\n", + "Posts: JUST LOVE THE WEATHER OUT HERE\n", + "Posts: Feeling lost and stuck with co-op search term\n", + "Posts: Library website not working for anyone else?\n", + "Posts: UBC Dap program acceptance rate\n", + "Posts: trying to make a $400 a month profit off another student is shameful\n", + "Posts: language requirement\n", + "Posts: Question about go global\n", + "Posts: Graduation Photos\n", + "Posts: Does anyone else hate AI?\n", + "Posts: Pamphlet man summary?\n", + "Posts: UBC HOUSING thunderbird\n", + "Posts: UBC Financial Aid\n", + "Posts: How many credits do I need for a minor?\n", + "Posts: Is a Master's in Chemistry in UBC Worth It for Career Growth?\n", + "Posts: SuperBowl SUNDDAAYYY!!\n", + "Posts: Peace Over Drama! 🙏🏻\n", + "Posts: Single, broke, no friends, bad grades\n", + "Posts: what's it like running a student seminar?\n", + "Posts: W on Transcript\n", + "Posts: Anatomy and Physiology requirements\n", + "Posts: Tonight’s Wreck Beach sunset is amazing with the snow on the ground\n", + "Posts: Not sure where else to share this\n", + "Posts: Late night drives?\n", + "Posts: For studying room do I need more than 3 people I really need a quiet place to do my zoom on campus\n", + "Posts: They taking away our r4 screens?\n", + "Posts: linkedin post sigma\n", + "Posts: We are now students of the University of British Car\n", + "Posts: Has anyone taken BIOL 413: Zoogeography? 
Thanks!\n", + "Posts: I’m sure this will make the Workday experience even better than it already is! 😃\n", + "Posts: Dance group spotted on campus\n", + "Posts: Doorless toilet stalls in the new BMEG building\n", + "Posts: Great Canadians: UBC graduate Rick Hansen\n", + "Posts: Professor z (a ubc professor) is overwhelmed right now!\n", + "Posts: Warm places on campus\n", + "Posts: Am I doing it wrong? Was university supposed to be easier?\n", + "Posts: ABOUT TO GRAB A LARGE COFFEE AND LOCK IN AT THE LIBRARY\n", + "Posts: Sooo many missed connections at the HKSA event\n", + "Posts: Who loves dancing\n", + "Posts: GOOD MORNING! MORNING RUN NEVER FELT SO GOOD 🏃 💨\n", + "Posts: Just one more week before the reading break! Share your energy everyone!!\n", + "Posts: UNI IS MAKING ME CRY A LITTLE AND I DONT KNOW WHAT TO DO\n", + "Posts: Them waters be tasting really damn good while high\n", + "Posts: Combined math and physics + minor comp sci or combined math and physics + minor math?\n", + "Posts: Should I accept this housing offer?\n", + "Posts: What are the chances of being able to accept a grad offer past the deadline?\n", + "Posts: Ikb late night sessions\n", + "Posts: What's the rules around non-degree students and (sublet) housing?\n", + "Posts: Literally me scrolling reddit before my midterm today\n", + "Posts: feeling horrible\n", + "Posts: University Blvd Parking\n", + "Posts: Is there a marching band playing in or near IKB right now?\n", + "Posts: POEM WHICH GOT ME INTO UNI\n", + "Posts: cmd-f 2025 hacker applications open now\n", + "Posts: Frats and sorority wesbrook\n", + "Posts: MICB 211, flopped Figure Facts quiz\n", + "Posts: Can I get a room in the next semester? 
It seems hard but if I cannot I will have to rent a new house.\n", + "Posts: psyc online courses\n", + "Posts: Do I really need to buy the SOCI 250 203 textbook???\n", + "Posts: How hard is it to find a prof you've never talked to for a reference letter\n", + "Posts: I'm finishing with a 4.0 GPA\n", + "Posts: Opinion: It's really annoying when a prof/TA's only recommendation for getting a good grade on a particular assignment is 'being detailed' (which is some advice with not a lot of DETAIL)\n", + "Posts: Can we not 🫠🫠🫠🫠🫠\n", + "Posts: IKB toilets spray water\n", + "Posts: Beep boop bee bop\n", + "Posts: Gateman 102 Midterm\n", + "Posts: Someone, fix these outlets!!!!\n", + "Posts: Library spot secret\n", + "Posts: I need 1099999999 redbulls and beers together\n", + "Posts: Why has the 99 become so irregular?\n", + "Posts: UNI REALLY GOT ME QUESTIONING IF IM LEARNING OR JUST SURVIVING 😦\n", + "Posts: Free Floopy Discs?\n", + "Posts: Is this CPEN 212 review accurate?\n", + "Posts: Always love seeing out of province cars visiting our beautiful campus!What do you think brought them here?\n", + "Posts: The fountain is frozen, can we skate on it?\n", + "Posts: ubc subreddit found collectively crashing out\n", + "Posts: Did anyone try Dabao at Irving? 
It is terrible\n", + "Posts: Psyc101 is so hard!!!!!\n", + "Posts: Is it bad to ask a prof for feedback?\n", + "Posts: Will a makeup midterm ever be conducted during reading break?\n", + "Posts: Is blue chip overrated\n", + "Posts: Winter Session Housing Offers for 2025-2026 are out\n", + "Posts: I WANT AN AIR FRYER WAHHHHHHH\n", + "Posts: How it feels to study in Fred Kaiser entrance hall\n", + "Posts: Is this place turning into an Asylum\n", + "Posts: Booking Rooms in the Nest?\n", + "Posts: JUST WHEN LIFE COULD’NT GET ANY WORSE YO BITCH LEAVES YOU 😔 (on top of that midterms)\n", + "Posts: Daily MacGregor phamphlet\n", + "Posts: Who wants to join me to the levels nightclub tn drinks on me 😩🫱🏻‍🫲🏼\n", + "Posts: I WANT MONEYYYY\n", + "Posts: Question for those who took BIOC 302 before\n", + "Posts: I WANT NOTHINGGG\n", + "Posts: I WANT SLEEP AND NO STRESS\n", + "Posts: Queer spaces on campus?\n", + "Posts: Roomate/Unit Mate Complaints - Fairview Crescent\n", + "Posts: I WANT A LOBOTOMY\n", + "Posts: I WANT A JOB!!!!\n", + "Posts: I HAVE A GIRLFRIEND\n", + "Posts: u/Winter_Fix3187 Jake Is not coming back.\n", + "Posts: I WANT A DUCKKK\n", + "Posts: Does anyone want to practice mock interviews for product manager roles? 
(for tech companies)\n", + "Posts: I WANT A GIRLFRIEND\n", + "Posts: gym needs my course schedule on letterhead to cancel\n", + "Posts: Washing machine\n", + "Posts: Bioc202 Midterm\n", + "Posts: I love Place Vanier Dining Hall!\n", + "Posts: Anyone else in Geog 121??\n", + "Posts: Suggestions for Stuff to do in Vancouver over Reading Break\n", + "Posts: Have you heard of “Launch Your Career in Canada” program at the Career Centre?\n", + "Posts: what do i do with my life\n", + "Posts: My hunch for the faculty that creates ikb toilet situation\n", + "Posts: Feeling like a sag of chemicals\n", + "Posts: Best seat on the 49?\n", + "Posts: lost pink hydro flask\n", + "Posts: Looking for ski buddies\n", + "Posts: Why do tuition refunds take so long to process?\n", + "Posts: Lost scarf near Geography building parking lot\n", + "Posts: PSYC 301 Midterm\n", + "Posts: Lost wallet on campus\n", + "Posts: Econ101 Clive midterm\n", + "Posts: Avatar: The Last Airbender Orchestra\n", + "Posts: hyperpigmentation in SCARFE\n", + "Posts: Directed studies\n", + "Posts: how do u study for cens 307\n", + "Posts: weird poster in the nest?\n", + "Posts: Bus loop guy was very topical today lol\n", + "Posts: UBC student takes down Fortune 500 one co-op at a time\n", + "Posts: Final Exam schedule of term 2\n", + "Posts: clubs to join to make friends\n", + "Posts: Game On at the UBC Gaming Expo!\n", + "Posts: 🚨 A Valentine's Party at UBC You Don’t Want to Miss! 💘🎟️\n", + "Posts: Pre-Med and Undergraduate Advice\n", + "Posts: Reviews on the food at the UBC Hospital?\n", + "Posts: How to find info on psychedelics club?\n", + "Posts: Confused about how graduating works\n", + "Posts: ‘They're treating people poorly’: AMS Food and Beverage employees seek unionization over poor pay, lack of accountability\n", + "Posts: Can a science student do research in Econ at VSE?\n", + "Posts: Help! 
Transfer question\n", + "Posts: Coffee chats are scary — here’s where to have them on UBC campus\n", + "Posts: math 101 advice\n", + "Posts: [Repost][Academic] [Research Study]: Eating Habits and Social Behaviours (Canadian Residents 18+)\n", + "Posts: MD5 Ambulance rn\n", + "Posts: reference letters\n", + "Posts: How to pay attention in class?\n", + "Posts: Fnis 454? How is it\n", + "Posts: Any suggestions for memorization for math 101 !? Midterm 1\n", + "Posts: What to do with Freshprep recipe meal kit\n", + "Posts: the art of seDUCKtion <3\n", + "Posts: poli 240 readings help pls\n", + "Posts: for all the aot fans 🤭\n", + "Posts: How have you guys been doing for Poli 100: Hope and Happiness?\n", + "Posts: Withdrawal ASAP or Wait out?\n", + "Posts: Something's up with the AMS\n", + "Posts: Valentine’s Day Deliveries\n", + "Posts: How often do you guys skip class?\n", + "Posts: Bigass puffer jackets on the bus\n", + "Posts: squirrel vs. donut 🐿️🍩\n", + "Posts: Go Global Credit Transfer\n", + "Posts: AMS VP external and VP finance resign ahead of general election\n", + "Posts: Anyone know why the 33 was yellow?\n", + "Posts: CHEM 123 Midterm 1\n", + "Posts: Bioc 203 how to study for midterm?\n", + "Posts: Was walking through the woods on campus and saw this\n", + "Posts: TOP FIVE HACKS TO GET 100% ON EVERY EXAM THIS GLORIOUS MIDTERM SEASON!!!\n", + "Posts: IKB toilet smell\n", + "Posts: Is there any chance there's no classes tomorrow?\n", + "Posts: Psyc 322 midterm......\n", + "Posts: In your experience, do employers give you interview tips when you get an interview or does it depend on the job and industry?\n", + "Posts: how the hell do you study for Poli110😭\n", + "Posts: Major in physics\n", + "Posts: Celebrating UBC resident doctors and their transformative impact\n", + "Posts: Campus food horror stories?\n", + "Posts: How hard is it to get a 100 on term project cpsc 210\n", + "Posts: This is so cute what’s going on?\n", + "Posts: MATH 221 201 Zoom Lecture\n", 
+ "Posts: Parkade top floors closed\n", + "Posts: Are there places that sell cheap roast beef sandwiches on campus\n", + "Posts: girls: how do you deal with pms\n", + "Posts: Does the sun appear at rose garden?\n", + "Posts: ubc student email down?\n", + "Posts: Can I ask TA for reference letters?\n", + "Posts: Fairview Crescent weird smell…\n", + "Posts: February UPass?\n", + "Posts: Bubble Tea and Cigarettes Concert\n", + "Posts: UBCSECURE!!!!!!\n", + "Posts: why why did i let myself get to this point\n", + "Posts: CPSC Coop: Are we cooked?\n", + "Posts: I've been one of the cooks for almost a decade. Ask me anything.\n", + "Posts: Entering class statistics document\n", + "Posts: what's the snow like on campus rn?\n", + "Posts: Reminiscent of the big E\n", + "Posts: The moon is absolutely gorgeous tonight\n", + "Posts: how cooked will buses be tomorrow\n", + "Posts: Meanwhile in Calgary\n", + "Posts: Super thai hot pot\n", + "Posts: UBC hosting its first ever Girls & Women In Sports game on Friday, February 7\n", + "Posts: Harrison hotsprings\n", + "Posts: School tomorrow\n", + "Posts: Cracking sound from roof in hemlesem\n", + "Posts: How to apply for CS at UBC?\n", + "Posts: ams tutoring positions\n", + "Posts: Third year psych classes\n", + "Posts: Should I head back to UBC tonight?\n", + "Posts: Gaming At The Great Hall!\n", + "Posts: What is a good Minor for a Psych major? (education? health & society? 
sociology?...)\n", + "Posts: Chan Centre Ticket Reselling Policy\n", + "Posts: To answer the question \"what is going onnnn\": a campus (ski) tour (📷@evanwongphoto)\n", + "Posts: Which mode of learning do you prefer at UBC?\n", + "Posts: Any Fall 2025 undergrad faculty of LFS admits?\n", + "Posts: Quack Quack Quack Quack Quack Quack Quack Quack\n", + "Posts: frst 302- quiz!\n", + "Posts: Now that the snow is over, when will the roads be cleared?\n", + "Posts: Shoutout to the instructors\n", + "Posts: go global results\n", + "Posts: UBC dating website\n", + "Posts: Are university departments and instructors reconsidering field trips and conferences to the US due to the current situation?\n", + "Posts: Looking for someone to play tennis together\n", + "Posts: Some photos I took around UBC during the snow!\n", + "Posts: UBC Sauder Application and References\n", + "Posts: NEST IS OPEN Feb 4, and other updates\n", + "Posts: good luck to folks commuting to work from ubc today🥲\n", + "Posts: 25 stuck on Dunbar hill\n", + "Posts: Closure at downtown campus?\n", + "Posts: anyone been in the BUDR program?\n", + "Posts: Advise for choosing a major\n", + "Posts: Main Mall Snow Fight!\n", + "Posts: I failed one class last term and withdrawed one class this term that leaves a “w”\n", + "Posts: Prof refused to make a Piazza\n", + "Posts: How likely is it for Wednesday to be cancelled?\n", + "Posts: Exams tomorrow still on?\n", + "Posts: will psyc 314 midterm be cancelled tmrrw since campus isnt technically closed 😭😭😭\n", + "Posts: # input maniacal laughter that classes are cancelled again\n", + "Posts: Classes moved online for tomorrow!!!❄️☃️🌨️\n", + "Posts: Snow day, let’s go!!\n", + "Posts: Class Cancelled Tuesday\n", + "Posts: In class cancelled tomorrow\n", + "Posts: REJOICE FELLOW PROCRASTINATORS REJOICE🗣️🗣️😭😭\n", + "Posts: No Classes Tuesday\n", + "Posts: cancel Hahahahahahhaha\n", + "Posts: cancel Hahahahahahhaha\n", + "Posts: UBC has a fight song???\n", + "Posts: Snow Day 
again!!\n", + "Posts: Tuesday is officially Snow Day!!\n", + "Posts: classes cancelled tomorrow\n", + "Posts: What happens to labs if it they call it a snow day?\n", + "Posts: Graduation home location\n", + "Posts: UPDATE: the emails are working to cancel class\n", + "Posts: Pablo Dicasso strikes leaving his newest work of art on campus\n", + "Posts: Whoever stole my chipotle at walter gage\n", + "Posts: Reasons why UBC should cancel class tomorrow\n", + "Posts: No Snow Day Tomorrow\n", + "Posts: Hey UBC would you please let us know maybe more than 10 hours in advance pls... Maybe...\n", + "Posts: Snow Day Update - Dec. 4\n", + "Posts: They are thinking about it\n", + "Posts: Help me pick between UBC and UBCO\n", + "Posts: will UBC cancel class tommorow? Dec 4? How bad does it have to be in order for classes to be cancelled?\n", + "Posts: Best flower shop around campus?\n", + "Posts: Cancel Tmmrw Pleaseee\n", + "Posts: Why doesn’t UBC do a HoCo like Western and other Ontario schools?\n", + "Posts: People who took Math 221 before\n", + "Posts: Snowinggggggg❄️🌨️🌨️\n", + "Posts: Anyone wants their photo taken for free on campus today?\n", + "Posts: UBC MDS Admission\n", + "Posts: Is this Acurate for tmmr?\n", + "Posts: what are the chances of being rejected from go global partner after being nominated\n", + "Posts: PSYC 307 Cultural Psychology tomorrow\n", + "Posts: How to Escalate - Felt Misled by Advising\n", + "Posts: what is going onnnn\n", + "Posts: IR requirement application\n", + "Posts: Disabilities United Collective Advocacy Survey\n", + "Posts: UBC Fight? 
No Snow\n", + "Posts: Is the SRC/BirdCoop still open today?\n", + "Posts: Classes cancelled tomorrow?\n", + "Posts: CPSC 210 1st yr summer vs 2nd yr winter term 1\n", + "Posts: Accommodated Exams- Cancelled or Not?\n", + "Posts: school cancelled on Tues?\n", + "Posts: Examlet cancelled!!!\n", + "Posts: Does any one received master school offer ?\n", + "Posts: BIOC 202 midterm tips?\n", + "Posts: Arts Co-op Summer Search Term Details?\n", + "Posts: Photos from last year\n", + "Posts: AMS Winter Weather Update\n", + "Posts: How much snow is on campus anyway?\n", + "Posts: is there any valentine events\n", + "Posts: Best Snow Day Creation!\n", + "Posts: Will Libraries Be Open Tomorrow?\n", + "Posts: Link to ubc cancellation?\n", + "Posts: Possibility of Tuesday snow day ….\n", + "Posts: UBC Campus Notifications | Weather Advisory\n", + "Feb. 02, 2025 – 5:54 p.m. PST\n", + "Please note, the campus is NOT closed, but in-person learning activities are cancelled out of an abundance of caution.\n", + "Posts: Snow Creations!\n", + "Posts: Snow day… kinda\n", + "Posts: Is my exam cancelled for tomorrow?\n", + "Posts: In-Person Learning Activities are Cancelled Tomorrow, February 3rd.\n", + "Posts: International PhD admissions?\n", + "Posts: 3Mth research visit to UBC, looking for Esim?\n", + "Posts: Just made my second creation.. with lots of friends. 
Enjoy\n", + "Posts: Summer session?\n", + "Posts: Told to not deadlift on the deadlift platforms at bird coop\n", + "Posts: Reminder to wear your sensible footwear!\n", + "Posts: Feedback on Visiting International Research Students (VIRS) at UBC?\n", + "Posts: anyone excited for the snow tmr?\n", + "Posts: Who wants to study at Allard with me and my friends?\n", + "Posts: You love to see it\n", + "Posts: Classes have began moving online!\n", + "Posts: University yaoi 👍\n", + "Posts: Will you graduate with or without distinction?\n", + "Posts: Happy snow day everyone!\n", + "Posts: People living on or close to campus, how's it looking?\n", + "Posts: Send us your snow creations on campus ⛄️\n", + "Posts: What we’ve all been waiting for ❄️\n", + "Posts: BSc Specialization Averages\n", + "Posts: Guys I’m looking for people to join me for the snowball fight tomorrow (I don’t have friends lol).\n", + "Posts: Its sunny outside\n", + "Posts: UBC frat house!\n", + "Posts: Has anyone received their Master’s admission decision yet?\n", + "Posts: I put together a tool that lets you check where an item is from, and then search for Canadian alternatives. 
Looking for feedback!\n", + "Posts: UBC Food Services Gift Cards\n", + "Posts: Pacific Spirit Park this morning ❄️\n", + "Posts: who else js single on valentines?\n", + "Posts: Snow closure Monday\n", + "Posts: Looking for bus advice today\n", + "Posts: UBC Students - Your Opportunity Right Now, Is More Significant Than You Realize\n", + "Posts: X/Twitter and UBC services that are non-Canadian\n", + "Posts: A snowy UBC, Good luck with the snow today!\n", + "Posts: Hope ur happy to the guy saying where's my snow\n", + "Posts: ITS SNOWINGGGG ❄️\n", + "Posts: Praying for Snow Day to recover from Luka Doncic Trade\n", + "Posts: Snow Day on Monday?!\n", + "Posts: Shouting to nowhere at 4am is not ok\n", + "Posts: has anyone here ever visited the Library PARC on wesbrook?\n", + "Posts: Grinding a paper\n", + "Posts: Anyone else tired of friendship drama?\n", + "Posts: What to do after Kinesiology degree?\n", + "Posts: Should I graduate in 6 years for another Coop\n", + "Posts: How much will Cost of Living be affected after the Tarrifs?\n", + "Posts: Why do people like Dr. Wickenden???\n", + "Posts: Worklearn hours\n", + "Posts: UBC Summer Storage?\n", + "Posts: Buy Canadian Made! 🇨🇦\n", + "Posts: Confused about taking a psyc minor\n", + "Posts: SOCI 250 TB PLS!!!\n", + "Posts: KVL & KCL Explained | Master Kirchhoff’s Laws for Circuit Analysis #kirc...\n", + "Posts: EvoEstimator - An App I Created!\n", + "Posts: Go Global matching process\n", + "Posts: Reservation for ARC\n", + "Posts: Stop talking in Walter Gage study rooms!\n", + "Posts: psa, law library has best ocean view!\n", + "Posts: What should i do when i haven’t received a cheque from ubc after 2-3 months?\n", + "Posts: Yall is it gonna snow enough to cancel classes?\n", + "Posts: I feel so sick and i have a midterm in 5 days\n", + "Posts: Why did they change the names of some building’s?\n", + "Posts: I.K.B. 
Confession\n", + "Posts: Found LeCanvas and it looks funny ash\n", + "Posts: Grad School Application 2 or 3 LOR\n", + "Posts: Feel sad after OL interview...\n", + "Posts: Should we even apply for co-ops, internships or jobs in the US, and at what point should someone draw the line?\n", + "Posts: Should I get a bus pass during my 4 nights at Gage Suites?\n", + "Posts: Snow in Vancouver\n", + "Posts: Dropping Courses\n", + "Posts: WHERE IS THE SNOW\n", + "Posts: Filling out Award Section of NSERC Form 202 - Part 1\n", + "Posts: How much will grocery prices rise after Trump's tariffs today?\n", + "Posts: Ontario students: Register to vote!\n", + "Posts: Standard Snow Procedure???\n", + "Posts: People.. learn how to park PLEASE\n", + "Posts: I'm being cyber bullied by other students for joining a club and I feel victimized\n", + "Posts: Tariff Issues impacting Canada\n", + "Posts: Did you get your Go Global Results?\n", + "Posts: Just curious - absolute cinema\n", + "Posts: Go Global Summer\n", + "Posts: ADHD assessment\n", + "Posts: Where to find dining hall cookies??\n", + "Posts: Go global exchange to UTokyo\n", + "Posts: Stupid chem 123 lab mistake is HAUNTING ME\n", + "Posts: LOST: Black film camera\n", + "Posts: Sexiest snowman competition!\n", + "Posts: Go Global Results\n", + "Posts: WHO’S READY TO BUILD SOME SNOWMEN WITH 20cm SNOW\n", + "Posts: RCMP, Ambulance, Fire Truck on 16th by Pacific Spirit Park\n", + "Posts: go global Australia\n", + "Posts: Microbiology/Immunology Program Inquiry\n", + "Posts: Prospective Transfer Student Housing Super Confused!\n", + "Posts: Advice Needed for RCC\n", + "Posts: THANK YOU AMS ADVOCACY!!\n", + "Posts: What’s the difference between the ois the IMES and the int scholars?\n", + "Posts: ISO late-night study buddies ^_^\n", + "Posts: ❄️ UBC Snowy Weather Megathread – Updates & Resources ❄️\n", + "Posts: Heads up: Naloxone kit expiry and where to get a new one on campus\n", + "Posts: How to Get a Prof to Know You? 
RANT\n", + "Posts: Sad and snowy soon\n", + "Posts: if you get accepted for your second choice in go global, can you still be evaluated for the other 2?\n", + "Posts: Submitted program completion application and graduation application but Workday is telling me I haven't completed my program application in Workday?\n", + "Posts: B.C. universities and colleges brace for financial shortfalls as Ottawa reveals new international student caps\n", + "Posts: Go global results are out!!\n", + "Posts: what are popular courses that are extremely hard to get in?\n", + "Posts: Dating events or matchup forms?\n", + "Posts: Econ 317 friends\n" + ] + } + ], + "source": [ + "# Recursively write a comment and all of its replies to CSV\n", + "def process_comment(comment, post_id, csv_writer):\n", + " csv_writer.writerow([\n", + " post_id,\n", + " comment.id,\n", + " comment.author,\n", + " comment.body,\n", + " comment.score,\n", + " comment.created_utc\n", + " ])\n", + " for reply in comment.replies:\n", + " process_comment(reply, post_id, csv_writer)\n", + "\n", + "# Collect posts and their comment trees\n", + "try:\n", + " for submission in subreddit.new(limit=10000):\n", + " print(f\"Posts: {submission.title}\")\n", + " posts_writer.writerow([\n", + " submission.id,\n", + " submission.title,\n", + " submission.author,\n", + " submission.selftext,\n", + " submission.score,\n", + " submission.num_comments,\n", + " submission.created_utc,\n", + " submission.url\n", + " ])\n", + "\n", + " # Expand up to 10 \"load more comments\" placeholders, then walk the\n", + " # top-level comments; process_comment recurses into the replies.\n", + " # (Iterating submission.comments.list() here would write each reply\n", + " # more than once, since list() already flattens the whole tree.)\n", + " submission.comments.replace_more(limit=10)\n", + " for comment in submission.comments:\n", + " process_comment(comment, submission.id, comments_writer)\n", + "\n", + " time.sleep(1) # Rate-limit requests so the Reddit API does not block us\n", + "finally:\n", + " posts_file.close()\n", + " comments_file.close()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + },
"language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}