ta-data-bcn · aitorquinza · May 25, 2020
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,133 @@
+
+# Created by https://www.gitignore.io/api/macos,pycharm+all,jupyternotebooks
+# Edit at https://www.gitignore.io/?templates=macos,pycharm+all,jupyternotebooks
+
+### JupyterNotebooks ###
+# gitignore template for Jupyter Notebooks
+# website: http://jupyter.org/
+
+.ipynb_checkpoints
+*/.ipynb_checkpoints/*
+
+# IPython
+profile_default/
+ipython_config.py
+
+# Remove previous ipynb_checkpoints
+#   git rm -r .ipynb_checkpoints/
+
+### macOS ###
+# General
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+### PyCharm+all ###
+# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
+# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
+
+# User-specific stuff
+.idea/**/workspace.xml
+.idea/**/tasks.xml
+.idea/**/usage.statistics.xml
+.idea/**/dictionaries
+.idea/**/shelf
+
+# Generated files
+.idea/**/contentModel.xml
+
+# Sensitive or high-churn files
+.idea/**/dataSources/
+.idea/**/dataSources.ids
+.idea/**/dataSources.local.xml
+.idea/**/sqlDataSources.xml
+.idea/**/dynamic.xml
+.idea/**/uiDesigner.xml
+.idea/**/dbnavigator.xml
+
+# Gradle
+.idea/**/gradle.xml
+.idea/**/libraries
+
+# Gradle and Maven with auto-import
+# When using Gradle or Maven with auto-import, you should exclude module files,
+# since they will be recreated, and may cause churn.  Uncomment if using
+# auto-import.
+# .idea/modules.xml
+# .idea/*.iml
+# .idea/modules
+# *.iml
+# *.ipr
+
+# CMake
+cmake-build-*/
+
+# Mongo Explorer plugin
+.idea/**/mongoSettings.xml
+
+# File-based project format
+*.iws
+
+# IntelliJ
+out/
+
+# mpeltonen/sbt-idea plugin
+.idea_modules/
+
+# JIRA plugin
+atlassian-ide-plugin.xml
+
+# Cursive Clojure plugin
+.idea/replstate.xml
+
+# Crashlytics plugin (for Android Studio and IntelliJ)
+com_crashlytics_export_strings.xml
+crashlytics.properties
+crashlytics-build.properties
+fabric.properties
+
+# Editor-based Rest Client
+.idea/httpRequests
+
+# Android studio 3.1+ serialized cache file
+.idea/caches/build_file_checksums.ser
+
+### PyCharm+all Patch ###
+# Ignores the whole .idea folder and all .iml files
+# See https://github.com/joeblau/gitignore.io/issues/186 and https://github.com/joeblau/gitignore.io/issues/360
+
+.idea/
+
+# Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-249601023
+
+*.iml
+modules.xml
+.idea/misc.xml
+*.ipr
+
+# Sonarlint plugin
+.idea/sonarlint
+
+# End of https://www.gitignore.io/api/macos,pycharm+all,jupyternotebooks
diff --git a/Kick-Off.md b/Kick-Off.md
diff --git a/README.md b/README.md
@@ -0,0 +1,79 @@
+<img src="https://bit.ly/2VnXWr2" alt="Ironhack Logo" width="100"/>
+
+# Predicting Job Salaries
+*Aitor Quinza*
+
+*[Data, Barcelona & March 2020]*
+
+## Content
+- [Project Description](#project-description)
+- [Hypotheses / Questions](#hypotheses--questions)
+- [Dataset](#dataset)
+- [Cleaning](#cleaning)
+- [Analysis](#analysis)
+- [Model Training and Evaluation](#model-training-and-evaluation)
+- [Future Work](#future-work)
+- [Organization](#organization)
+- [Links](#links)
+
+## Project Description
+This project aims to help job candidates name a realistic salary range during interviews.
+
+## Hypotheses / Questions
+* Does the salary depend on the state?
+* Does the company's industry affect the salary?
+* Can we predict the salary based on job title, geography and required skills?
+
+
+## Dataset
+* For this project, I scraped the Glassdoor website because some of its US job offers include an estimated salary range, and I use that estimate as the basis for my own.
+* The scraping script is at scripts/glassdoor.py
+
+## Cleaning
+* Parsed numeric values out of the salary text
+* Removed rows without a salary
+* Added columns flagging employer-provided salaries and hourly wages
+* Added columns indicating whether each of these skills is listed in the job description:
+    * Python
+    * R
+    * Excel
+    * AWS
+    * Spark
+    * SQL
+    * Tableau
+* Parsed the rating out of the company text
+* Added a column for the company state
+* Added a column indicating whether the job is at the company’s headquarters
+* Converted the founding date into company age
+* Added columns for simplified job title and seniority
+* Added a column for description length
+
+
+## Analysis
+Visit my [Tableau Graphs](https://public.tableau.com/profile/aitor2544#!/vizhome/DataScienceJobEEUU/Story1)
+
+## Model Training and Evaluation
+I used three algorithms:
+* Multiple Linear Regression
+* Lasso Regression
+* Random Forest
+
+
+
+## Future Work
+* Test SVM algorithm
+* Improve skills extraction
+* Add more keywords for jobs
+
+
+## Organization
+The repository has three folders:
+* Datasets -> CSV files
+* Notebooks -> Data cleaning, EDA and model building
+* Scripts -> Python scripts for scraping and for saving the model
+
+## Links
+
+[Repository](https://github.com/aitorquinza/Project-Week-8-Final-Project/)  
+[Slides](https://docs.google.com/presentation/d/1lD6bA32RghmyEhmh5p3Ni35dZikB0xhQMjr6udIGW9w/edit?usp=sharing)  
+[Kanban](https://github.com/aitorquinza/Project-Week-8-Final-Project/projects/1)
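
The cleaning steps listed in the README's Cleaning section could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the raw salary format (`$53K-$91K (Glassdoor est.)`), the helper names `parse_salary`, `skill_flags` and `company_age`, and the use of `None` for missing values are all assumptions made for the example.

```python
import re

# Skill keywords taken from the README's Cleaning list.
SKILLS = ("python", "r", "excel", "aws", "spark", "sql", "tableau")


def parse_salary(text):
    """Return (low, high, average) in thousands of dollars, or None.

    Assumes a hypothetical raw format such as "$53K-$91K (Glassdoor est.)".
    """
    amounts = [int(n) for n in re.findall(r"\$(\d+)K", text)]
    if len(amounts) < 2:
        return None  # no usable salary range; such a row would be dropped
    low, high = amounts[0], amounts[1]
    return low, high, (low + high) / 2


def skill_flags(description):
    """Map each skill to True if it appears as a standalone word."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {skill: skill in words for skill in SKILLS}


def company_age(founded_year, reference_year):
    """Return the company's age in years, or None if the year is missing."""
    if founded_year is None or founded_year <= 0:
        return None
    return reference_year - founded_year
```

Matching on standalone words is deliberately crude: a single-letter skill such as "R" will also match text like "R&D", which is the kind of false positive the "Improve skills extraction" item under Future Work would need to address.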