ta-data-mex · diegoperezo97 · Jul 5, 2021 · Jul 22, 2021
diff --git a/.DS_Store b/.DS_Store
diff --git a/your-code/.ipynb_checkpoints/Q1-checkpoint.ipynb b/your-code/.ipynb_checkpoints/Q1-checkpoint.ipynb
@@ -0,0 +1,226 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the cell below, create a Python function that wraps your previous solution for the Bag of Words lab.\n",
+    "\n",
+    "Requirements:\n",
+    "\n",
+    "1. Your function should accept the following parameters:\n",
+    "    * `docs` [REQUIRED] - array of document paths.\n",
+    "    * `stop_words` [OPTIONAL] - array of stop words. The default value is an empty array.\n",
+    "\n",
+    "1. Your function should return a Python object that contains the following:\n",
+    "    * `bag_of_words` - array of strings of normalized unique words in the corpus.\n",
+    "    * `term_freq` - array of the term-frequency vectors."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import required libraries\n",
+    "import os\n",
+    "import glob\n",
+    "import re\n",
+    "from sklearn.feature_extraction import _stop_words\n",
+    "\n",
+    "# Define function\n",
+    "def get_bow_from_docs(docs, stop_words=[]):\n",
+    "    \n",
+    "    # In the function, first define the variables you will use such as `corpus`, `bag_of_words`, and `term_freq`.\n",
+    "    corpus = []\n",
+    "    bag_of_words = []\n",
+    "    term_freq = []\n",
+    "    corpus_l = []\n",
+    "    term_freq_b = []\n",
+    "    \n",
+    "    \"\"\"\n",
+    "    Loop `docs` and read the content of each doc into a string in `corpus`.\n",
+    "    Remember to convert the doc content to lowercase and remove punctuation.\n",
+    "    \"\"\"\n",
+    "    for doc in docs:\n",
+    "        with open(doc, \"r\") as file:\n",
+    "            # Read each file whole so one file yields exactly one document\n",
+    "            text = file.read().strip().lower()\n",
+    "        # Strip all punctuation, not just periods; keep word characters and whitespace\n",
+    "        corpus.append(re.sub(r'[^\\w\\s]', '', text))\n",
+    "                \n",
+    "    \"\"\"\n",
+    "    Loop `corpus`. Append the terms in each doc into the `bag_of_words` array. The terms in `bag_of_words` \n",
+    "    should be unique which means before adding each term you need to check if it's already added to the array.\n",
+    "    In addition, check if each term is in the `stop_words` array. Only append the term to `bag_of_words`\n",
+    "    if it is not a stop word.\n",
+    "    \"\"\"\n",
+    "    for doc in corpus:\n",
+    "        # split() with no arguments also collapses repeated whitespace,\n",
+    "        # so no empty-string terms end up in the bag\n",
+    "        for word in doc.split():\n",
+    "            # Keep a term only if it is new and not a stop word\n",
+    "            if word not in bag_of_words and word not in stop_words:\n",
+    "                bag_of_words.append(word)\n",
+    "\n",
+    "    \"\"\"\n",
+    "    Loop `corpus` again. For each doc string, count the number of occurrences of each term in `bag_of_words`. \n",
+    "    Create an array for each doc's term frequency and append it to `term_freq`.\n",
+    "    \"\"\"\n",
+    "    for doc in corpus:\n",
+    "        # Tokenize once per doc so that only whole words are counted\n",
+    "        words = doc.split()\n",
+    "        # Start a fresh vector for every doc; reusing one list would make\n",
+    "        # every row of `term_freq` point at the same ever-growing object\n",
+    "        term_freq_b = []\n",
+    "        for word in bag_of_words:\n",
+    "            # Count occurrences rather than recording mere presence (1/0)\n",
+    "            term_freq_b.append(words.count(word))\n",
+    "        term_freq.append(term_freq_b)\n",
+    "    \n",
+    "    \n",
+    "    # Now return your output as an object\n",
+    "    return {\n",
+    "        \"bag_of_words\": bag_of_words,\n",
+    "        \"term_freq\": term_freq\n",
+    "    }\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Test your function without stop words. You should see output like the following:\n",
+    "\n",
+    "```{'bag_of_words': ['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at'], 'term_freq': [[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]}```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'bag_of_words': [], 'term_freq': []}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Define doc paths array\n",
+    "\n",
+    "pwd = '/Users/diegoperezo97/Documents/Ironhack – Data Analytics Bootcamp/Module 1/Week 1/Day 4/lab-functional-programming/your-code'\n",
+    "os.chdir(pwd)\n",
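+    "file_extension = '.txt'\n",
+    "# glob.glob already returns a list, so no list comprehension is needed;\n",
+    "# sorting keeps the document order stable across platforms\n",
+    "file_names = sorted(glob.glob(f'*{file_extension}'))\n",
+    "# Copy the list in one step instead of appending one path at a time\n",
+    "docs = list(file_names)\n",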
+    "\n",
+    "# Obtain BoW from your function\n",
+    "bow = get_bow_from_docs(docs)\n",
+    "\n",
+    "# Print BoW\n",
+    "print(bow)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If your attempt above is successful, nice work!\n",
+    "\n",
+    "Now test your function again with the stop words. In the previous lab we defined the stop words in a large array. In this lab, we'll import the stop words from Scikit-Learn."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.feature_extraction import _stop_words"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You should have seen a large list of words that looks like:\n",
+    "\n",
+    "```frozenset({'across', 'mine', 'cannot', ...})```\n",
+    "\n",
+    "A `frozenset` is an immutable Python set. In this lab you can pass it directly as the stop words argument: membership tests with `in` work on it without converting it to a list."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, test your function by supplying `_stop_words.ENGLISH_STOP_WORDS` as the second parameter."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "stop_words = _stop_words.ENGLISH_STOP_WORDS  # frozenset: usable as-is, no list() needed\n",
+    "bow = get_bow_from_docs(docs, stop_words)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'bag_of_words': [], 'term_freq': []}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(bow)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You should have seen:\n",
+    "\n",
+    "```{'bag_of_words': ['ironhack', 'cool', 'love', 'student'], 'term_freq': [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]}```"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}