178 changes: 178 additions & 0 deletions your-code/.ipynb_checkpoints/Q1-checkpoint.ipynb
@@ -0,0 +1,178 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell below, create a Python function that wraps your previous solution for the Bag of Words lab.\n",
"\n",
"Requirements:\n",
"\n",
"1. Your function should accept the following parameters:\n",
" * `docs` [REQUIRED] - array of document paths.\n",
" * `stop_words` [OPTIONAL] - array of stop words. The default value is an empty array.\n",
"\n",
"1. Your function should return a Python object that contains the following:\n",
" * `bag_of_words` - array of strings of normalized unique words in the corpus.\n",
" * `term_freq` - array of the term-frequency vectors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"\n",
"# Define function\n",
"def get_bow_from_docs(docs, stop_words=[]):\n",
" \n",
    "    # In the function, first define the variables you will use, such as `corpus`, `bag_of_words`, and `term_freq`.\n",
    "    corpus = []\n",
    "    bag_of_words = []\n",
    "    term_freq = []\n",
    "    \n",
" \"\"\"\n",
" Loop `docs` and read the content of each doc into a string in `corpus`.\n",
" Remember to convert the doc content to lowercases and remove punctuation.\n",
" \"\"\"\n",
"\n",
" \n",
" \n",
" \"\"\"\n",
" Loop `corpus`. Append the terms in each doc into the `bag_of_words` array. The terms in `bag_of_words` \n",
" should be unique which means before adding each term you need to check if it's already added to the array.\n",
" In addition, check if each term is in the `stop_words` array. Only append the term to `bag_of_words`\n",
" if it is not a stop word.\n",
" \"\"\"\n",
"\n",
" \n",
" \n",
" \n",
" \"\"\"\n",
" Loop `corpus` again. For each doc string, count the number of occurrences of each term in `bag_of_words`. \n",
" Create an array for each doc's term frequency and append it to `term_freq`.\n",
" \"\"\"\n",
"\n",
" \n",
" \n",
" # Now return your output as an object\n",
" return {\n",
" \"bag_of_words\": bag_of_words,\n",
" \"term_freq\": term_freq\n",
" }\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Test your function without stop words. You should see output like the example below:\n",
"\n",
"```{'bag_of_words': ['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at'], 'term_freq': [[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]}```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define doc paths array\n",
"docs = []\n",
"\n",
"# Obtain BoW from your function\n",
"bow = get_bow_from_docs(docs)\n",
"\n",
"# Print BoW\n",
"print(bow)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "If your attempt above succeeded, nice work!\n",
    "\n",
    "Now test your function again, this time with stop words. In the previous lab we defined the stop words in a large array; in this lab, we'll import them from Scikit-Learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction import stop_words\n",
"print(stop_words.ENGLISH_STOP_WORDS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "You should see a large list of words that looks like:\n",
    "\n",
    "```frozenset({'across', 'mine', 'cannot', ...})```\n",
    "\n",
    "`frozenset` is an immutable Python set. In this lab you can iterate over it and check membership just as you would with an array, without converting it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Next, test your function by supplying `stop_words.ENGLISH_STOP_WORDS` as the second argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "bow = get_bow_from_docs(docs, stop_words.ENGLISH_STOP_WORDS)\n",
"\n",
"print(bow)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "You should see:\n",
"\n",
"```{'bag_of_words': ['ironhack', 'cool', 'love', 'student'], 'term_freq': [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]}```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
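The Q1 notebook above deliberately leaves the function body blank for the student. For reference, here is one minimal sketch of `get_bow_from_docs` that satisfies the stated requirements — it is not the official solution, and the regex-based punctuation stripping is just one of several reasonable choices:

```python
import re

def get_bow_from_docs(docs, stop_words=[]):
    """docs: list of file paths; stop_words: iterable of words to exclude.

    Note: a list default argument is shared across calls in Python;
    it is harmless here only because we never mutate `stop_words`.
    """
    corpus = []
    bag_of_words = []
    term_freq = []

    # Read each doc, lowercase it, and strip punctuation.
    for path in docs:
        with open(path) as f:
            text = f.read().lower()
        corpus.append(re.sub(r'[^\w\s]', '', text))

    # Collect unique terms in order of first appearance, skipping stop words.
    for doc in corpus:
        for term in doc.split():
            if term not in bag_of_words and term not in stop_words:
                bag_of_words.append(term)

    # Count occurrences of each bag-of-words term per document.
    for doc in corpus:
        terms = doc.split()
        term_freq.append([terms.count(term) for term in bag_of_words])

    return {"bag_of_words": bag_of_words, "term_freq": term_freq}
```

Run against three files containing "Ironhack is cool.", "I love Ironhack.", and "I am a student at Ironhack.", this reproduces the expected output shown in the notebook.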
119 changes: 119 additions & 0 deletions your-code/.ipynb_checkpoints/Q2-checkpoint.ipynb
@@ -0,0 +1,119 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Now we want to enhance the `get_bow_from_docs` function so that it works with HTML webpages. HTML contains a lot of messy content such as tags, JavaScript, and [unicode characters](https://www.w3schools.com/charsets/ref_utf_misc_symbols.asp) that will pollute your bag of words. We need to clean up that junk before generating the BoW.\n",
    "\n",
    "Next, you will define several new functions, each specialized to clean up one aspect of the HTML. For instance, you can have a `strip_html_tags` function to remove all HTML tags, a `remove_punctuation` function to remove all punctuation, a `to_lower_case` function to convert strings to lowercase, and a `remove_unicode` function to remove all unicode characters.\n",
    "\n",
    "Then, in your `get_bow_from_docs` function, call each of the functions you created to clean up the HTML before generating the corpus.\n",
    "\n",
    "Note: Please use only Python string operations and regular expressions in this lab. Do not use extra libraries such as `beautifulsoup`; otherwise you lose the purpose of the practice."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# Define your string handling functions below\n",
    "# Define at least 3 functions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, paste your previously written `get_bow_from_docs` function below. Call your functions above at the appropriate place."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"def get_bow_from_docs(docs, stop_words=[]):\n",
" # In the function, first define the variables you will use such as `corpus`, `bag_of_words`, and `term_freq`.\n",
" corpus = []\n",
" bag_of_words = []\n",
" term_freq = []\n",
" \n",
    "    # write your code here\n",
" \n",
" return {\n",
" \"bag_of_words\": bag_of_words,\n",
" \"term_freq\": term_freq\n",
" }\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Next, read the content from the three HTML webpages in the `your-code` directory to test your function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction import stop_words\n",
"bow = get_bow_from_docs([\n",
" 'www.coursereport.com_ironhack.html',\n",
" 'en.wikipedia.org_Data_analysis.html',\n",
" 'www.lipsum.com.html'\n",
" ],\n",
" stop_words.ENGLISH_STOP_WORDS\n",
")\n",
"\n",
"print(bow)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Do you see any problems in the output? How can you improve it?\n",
    "\n",
    "A good way to improve your code is to look into the HTML data sources and understand where the messy output came from. A good data analyst always learns about the data in depth in order to do the job well.\n",
    "\n",
    "Spend 20-30 minutes improving your functions, or stop when you feel comfortable with string operations. This lab is just practice, so don't stress yourself out. When you've practiced enough, move on to the next challenge question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
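The Q2 notebook asks for at least three string-handling helpers built from Python string operations and regular expressions only. One possible set, matching the function names suggested in the prompt (the exact regexes are illustrative choices, not the official solution):

```python
import re

def strip_html_tags(text):
    # Drop <script> and <style> blocks entirely (their contents are not
    # prose), then remove any remaining tags, replacing each with a space.
    text = re.sub(r'(?is)<(script|style)\b.*?</\1>', ' ', text)
    return re.sub(r'<[^>]+>', ' ', text)

def remove_unicode(text):
    # Keep ASCII characters only; non-ASCII characters are dropped.
    return text.encode('ascii', errors='ignore').decode()

def to_lower_case(text):
    return text.lower()

def remove_punctuation(text):
    # Replace anything that is not a word character or whitespace.
    return re.sub(r'[^\w\s]', ' ', text)
```

These compose naturally inside `get_bow_from_docs`, e.g. `remove_punctuation(to_lower_case(remove_unicode(strip_html_tags(html))))`, before splitting the cleaned string into terms.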