William Ponton: LinkedIn
Email: @gorbulus
REPL: @gorbulus
Github: gorbulus
This notebook will be a collection of Data Science basics, examples, and best practices for use as a reference guide.
This guide has five sections, one for each basic step of the Data Analysis process. The first step is importing a dataset into your environment so that it can be analyzed. The second step defines best practices for cleaning and organizing: how to handle NULL values and how to merge and organize messy data. Some estimates show that Data Scientists can spend up to 80% of their time cleaning and organizing data for analysis and modeling, so this step deserves care. Once the dataset is normalized and cleaned, the Numerical Analysis section details common statistical methods and defines the values needed for the Visualizations and Interpretation sections. Numerical Analysis is the 'magic' of Data Science, as this step can often expose anomalies and patterns in the data that humans alone might not have been able to detect. Its output also powers the Visualizations presented to stakeholders in the final reporting, and is vital for the subsequent step of Interpretation and Reporting. Finally, the guide covers creating a deliverable to be passed off to other departments. The final result must be understandable by every audience it is intended for, so knowing the goals of the project up front is imperative for keeping the results within the scope of the audience's understanding of the analysis.
- 0.0 Importing Data
- 0.1 Cleaning & Organizing
- 0.2 Numerical Analysis
- 0.3 Visualizations
- 0.4 Interpretation & Reporting
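As a rough illustration of how the first three steps fit together, here is a minimal sketch using pandas. The inline CSV text, the column names, and the helper name load_and_clean are all hypothetical and exist only for this example. Median imputation is just one of several reasonable ways to handle NULL values, not necessarily the approach the later sections recommend.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a real CSV file.
RAW_CSV = """city,temp_c,humidity
Austin,31.0,40
Boston,,55
Chicago,18.5,
Austin,31.0,40
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Import CSV text, drop duplicate rows, and fill missing numeric values."""
    # Step 0.0: import the data (pd.read_csv also accepts a file path).
    df = pd.read_csv(io.StringIO(csv_text))

    # Step 0.1: remove exact duplicate rows and renumber the index.
    df = df.drop_duplicates().reset_index(drop=True)

    # Step 0.1 (continued): fill missing numeric values with the column median.
    # This is an assumption made for the sketch; other strategies exist.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df


clean = load_and_clean(RAW_CSV)

# Step 0.2: summary statistics (count, mean, std, quartiles) per numeric column.
summary = clean.describe()

print(clean)
print(summary)
```

The summary table produced in the last step is the kind of output that would feed the Visualizations and Interpretation & Reporting sections.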
Python has a rich Data Science ecosystem, built by teams of scientists and engineers solving real scientific and engineering problems. Python's Object Oriented Design, ease of syntax, and available libraries have made it an industry standard for Data Analysis. A 2016 study by O'Reilly reported that Python had become dominant over R throughout the Data Science community, with users favoring Python 3.6 over the soon to be retired Python 2.7. I also plan to create a Data Science Playbook for R techniques in the future (I am still learning!).
Python was reported to be the fastest growing programming language of 2019, and it remains an industry standard for modeling and analysis in the scientific and engineering industries. The Scientific Python Stack is the collection of libraries and tools that makes Python so powerful for Data Analysis and statistical prediction.
To get everything running in this project, run: pip install -r requirements.txt
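- Python 3.6 (legacy Python 2.7 reached end of life on January 1, 2020)
- Cython (a compiled superset of Python for writing fast C extensions that work alongside NumPy)
- SciPy
- NumPy
- scikit-learn
- Anaconda (Python distribution and environment manager)
- IPython / Jupyter Notebooks
- GitHub (hosting for Git version control)
- RMOTR Notebooks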
- Analysis tools
  - NumPy
  - Pandas
  - Cython
- Visualization tools
  - Matplotlib
  - Seaborn
  - Bokeh
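After installing the requirements, it can help to confirm which parts of the stack are actually present in the active environment. The sketch below uses only the standard library. The STACK list is an assumption based on the tools named above and may not match the project's requirements.txt exactly. It also relies on importlib.metadata, which requires Python 3.8 or later.

```python
from importlib import metadata

# Distribution names as published on PyPI. This list is an assumption drawn
# from the tools named in this guide; adjust it to match requirements.txt.
STACK = [
    "numpy",
    "scipy",
    "pandas",
    "scikit-learn",
    "Cython",
    "matplotlib",
    "seaborn",
    "bokeh",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(STACK)
for name, version in report.items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED suggests that the install step did not complete, or that the notebook is running in a different environment than the one pip installed into.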

