Author: Gaspard-Fauvelle Angel
License: CC BY-NC-SA 2.0
Here is the French version of this document: End-to-End Data Science Workflow
This document is the last of a series:
- SQL_CheatSheet_Eng.ipynb
- Python_Pandas_Eng.ipynb
- Numpy_Pandas_Eng
- Data_Preprocessing_Methods_Eng.ipynb
- 🧭 I. Subject Introduction
- 🗂️ II. Environment Setup
- 🧹 III. Data Cleaning and Transformation - Handling polluted data
- 🧠 IV. Feature Engineering
- 📈 V. Exploratory Data Analysis - EDA
- 🧪 VI. Inferential Analysis
- 🤖 VII. Predictive Modeling
- 📌 VIII. Prescriptive Analysis
Why did you choose this dataset? Who are the stakeholders? What are the deliverables at the end?
- Plan your deadlines / how much time to spend on the project.
- Resources: compute, data quality, and granularity
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Import the secondary libraries (examples; import only what you need)
from sklearn.model_selection import train_test_split as tts  # scikit-learn machine-learning tools
from scipy import stats  # SciPy statistical tools
from datetime import date, datetime
# Import the support libraries
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('your_dataset.csv') # To read CSV file
# Read data from a URL
df_url = pd.read_csv('https://example.com/your_data.csv') # To read data from a URL
# Read Excel file
df_excel = pd.read_excel('your_file.xlsx', sheet_name='Sheet1') # To read Excel file
# Read JSON file
df_json = pd.read_json('your_file.json') # To read JSON file
# Read SQLite File
import sqlite3
con = sqlite3.connect("database.sqlite") # Connect Jupyter to database.sqlite
df = pd.read_sql_query("SELECT * from table", con) # Create a DataFrame from the SQL query
# Read SQL File (Same as last example, with another API)
from sqlalchemy import create_engine # To insert in the secondary libraries
engine = create_engine('sqlite:///your_database.db') # Create a SQLite database engine
df_sql = pd.read_sql('SELECT * FROM your_table', engine) # Read data from a SQL table
# Verify that the result of the SQL query is stored in the DataFrame
print(df.head())
# Read Parquet file
df_parquet = pd.read_parquet('your_file.parquet') # To read Parquet file
# The file types above are the main ones you will encounter in Data Science, but here are some other file types below:
# Read HDF5 file
df_hdf5 = pd.read_hdf('your_file.h5', key='your_key') # To read HDF5 file
# Read CERN ROOT file
# import uproot  # To insert in the secondary libraries (pip install uproot)
# df_root = uproot.open('your_file.root')['your_tree'].arrays(library='pd')  # To read a ROOT tree into a DataFrame
# Read Feather file
df_feather = pd.read_feather('your_file.feather') # To read Feather file
# Read fixed-width file
column_widths = [10, 15, 20] # To define column widths
df_fixed_width = pd.read_fwf('your_file.txt', widths=column_widths) # To read fixed-width file
# Read data from the clipboard
df_clipboard = pd.read_clipboard() # To read data from the clipboard
2.2.1. Inspect dimensions and dtypes (variables dictionary with .info(), and a deep memory_usage diagnostic)
df.info(memory_usage='deep', show_counts=True)
- 2.2.2. Inspect a fixed number of random rows with .sample()
df.sample(5)
- Best method: add a dimension from columns with missing values by creating a boolean indicator column (1 for missing values and 0 for filled values)
- Second method: if less than 30% of the data is missing, use the median or the mode (preferable) to fill missing values
- Last resort: if more than 30% of a row's data is missing, drop those rows with .dropna(); for columns with more than 40% missing, use df.drop('ColumnToDrop', axis=1)
- These thresholds (e.g. 30% for rows, 40% for columns) are empirical and must be adapted based on domain knowledge, sample size, and model sensitivity.
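The three strategies above can be sketched as follows (the frame, column names, and values are illustrative placeholders):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (placeholder data)
df = pd.DataFrame({'age': [25, np.nan, 40, 31], 'city': ['Paris', 'Lyon', None, 'Lyon']})

# 1) Keep the information: boolean indicator column (1 = missing, 0 = filled)
df['age_missing'] = df['age'].isna().astype(int)

# 2) Fill: median for numerical columns, mode for categorical ones
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# 3) Last resort: drop columns beyond an empirical threshold (here 40%)
missing_ratio = df.isna().mean()  # fraction of NaN per column
cols_to_drop = missing_ratio[missing_ratio > 0.4].index
df = df.drop(columns=cols_to_drop)
```

The indicator column is created before filling, otherwise the information about which rows were missing is lost.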
# Detection for normally distributed data (Z-score)
zscore_threshold = 3
z_scores = (df['column1'] - df['column1'].mean()) / df['column1'].std()
z_outliers = np.abs(z_scores) > zscore_threshold
# Detection made for pretty much anything (IQR)
q1, q3 = np.percentile(df['column1'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = df[(df['column1'] < lower_bound) | (df['column1'] > upper_bound)]
# Before thinking about removing outliers, check whether they are errors or a valid signal (rare events)
df_no_z_outliers = df[~z_outliers].copy()
df_no_iqr_outliers = df[(df['column1'] >= lower_bound) & (df['column1'] <= upper_bound)].copy()
# Use a dictionary to store columns as keys and the dtypes you want to set as values
dict_convert = {
'column1': 'dtypes'
}
# Create a loop to apply .astype(values_from_dict) method
common_cols = set(df.columns) & set(dict_convert.keys())
# Check that the columns specified in the 'dict_convert' dictionary actually exist in the DataFrame before converting them
for col in common_cols:
    df[col] = df[col].astype(dict_convert[col])
df.set_index('dateColumn', inplace=True)
# List comprehension to sort data types / Time series (dates) can be used as the index, or converted to ordinal/category
timeseries_dtypes = ['datetime64']
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(exclude=np.number).columns.tolist()
TimeSeries_features = df.select_dtypes(include=timeseries_dtypes).columns.tolist()
# Print the statistical descriptions
# Frequencies, percentages, fractions and/or relative frequencies
# Resampling (.resample()), Moving Average (.rolling()), Exponentially Weighted Functions (.ewm())
# Comparative graphics between 2 or more numerical variables (ScatterPlot, Regression/trend curves, BubbleChart, LmPlot, PairPlot, Histogram, etc.)
# Comparative graphics between 2 or more categorical variables (Grouped/Stacked BarChart, MosaicPlot, Heatmap, PairPlot, etc.)
# Comparative graphics between numerical & categorical variables (BoxPlot, ViolinPlot, Mean/error BarChart, Dot/Strip Plot, CatPlot, PairPlot, Histogram, etc.)
# Comparative graphics for time series (statsmodels.tsa.seasonal_decompose, Histogram, PairPlot, etc.)
Define your Null Hypothesis (H0)
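As a hedged sketch of the resampling and smoothing helpers mentioned in the EDA notes above (the dates and values are made up):

```python
import pandas as pd

# Six daily observations (toy data)
ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

weekly = ts.resample('W').mean()       # downsample to weekly means
moving = ts.rolling(window=3).mean()   # 3-day moving average
smoothed = ts.ewm(span=3).mean()       # exponentially weighted mean
```

All three return a new Series aligned on the datetime index, so they can be plotted directly against the raw data.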
6.2.1. Normality Hypothesis and Homoscedasticity (homogeneity of variance) Tests (Preliminary tests before choosing Parametric or Non-parametric Tests)
# Normality Tests: Shapiro-Wilk's Test, Kolmogorov-Smirnov's Test, Anderson-Darling's Test
# Homoscedasticity Tests: Levene's Test, Bartlett's Test
# Autocorrelation Test: Durbin-Watson's Test, White's Test
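A minimal sketch of the preliminary normality and homoscedasticity tests with scipy.stats, on synthetic data (α = 0.05 is the usual, adjustable threshold):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.5, scale=1.0, size=100)

# Normality: Shapiro-Wilk (H0: the sample is drawn from a normal distribution)
shapiro_stat, shapiro_p = stats.shapiro(group_a)

# Homoscedasticity: Levene (H0: the groups have equal variances)
levene_stat, levene_p = stats.levene(group_a, group_b)

normal = shapiro_p > 0.05      # fail to reject H0 -> treat as normal
equal_var = levene_p > 0.05    # fail to reject H0 -> equal variances
```

If both hold, parametric tests are justified; otherwise fall back on the non-parametric alternatives listed below.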
# Multicollinearity Test: VIF (Variance Inflation Factor)
# For Numerical only data (Correlation Analysis): Pearson's Correlation, Linear Regression T-Test
# For Categorical only data (Confidence Intervals): Logistic Regression, Log-Linear Models
# For Numerical & Categorical data (Variance Analysis): Student's T-Test, ANOVA/MANOVA
# For Numerical only data (Correlation Analysis): Spearman's Correlation, Kendall's (Tau) Correlation
# For Categorical only data (Variance Analysis): Chi², Fisher's Exact Test, McNemar's Test, Cochran-Armitage's Test for Trends (Correlation Analysis)
# For Numerical & Categorical data (Variance Analysis): Mann-Whitney's Test, Wilcoxon's Rank-Sum, Kruskal-Wallis's ANOVA
# From the various tests, which columns will you keep?
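A hedged sketch of a few of the parametric and non-parametric tests listed above, run on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=80)
y = 0.5 * x + rng.normal(size=80)       # numerical, correlated with x
group = rng.integers(0, 2, size=80)     # a binary categorical variable

# Parametric (assumes normality / equal variances)
pearson_r, pearson_p = stats.pearsonr(x, y)
t_stat, t_p = stats.ttest_ind(y[group == 0], y[group == 1])

# Non-parametric alternatives
spearman_r, spearman_p = stats.spearmanr(x, y)
u_stat, u_p = stats.mannwhitneyu(y[group == 0], y[group == 1])

# Categorical vs categorical: Chi² on a 2x2 contingency table
other = rng.integers(0, 2, size=80)
table = np.array([
    [np.sum((group == 0) & (other == 0)), np.sum((group == 0) & (other == 1))],
    [np.sum((group == 1) & (other == 0)), np.sum((group == 1) & (other == 1))],
])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
```

Each test returns a statistic and a p-value; compare the p-value against your chosen α to decide whether to reject H0.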
df = df[['column1', 'column2', '...', 'columnN']]
# If you have more columns to keep than to drop, which columns will you drop?
df = df.drop(['column1', 'column2', '...', 'columnN'], axis=1)
# Features & Target
X = df.drop(['TargetColumn'], axis=1)
y = df['TargetColumn']
# A common partition is 80/20 (80% for Train, 20% for Test); shuffle should be set to True
X_train, X_test, y_train, y_test = tts(X, y, train_size=0.8, random_state=1, shuffle=True, stratify=None)
# Global Hyperparameters: Grid Search (GridSearchCV), Randomized Search (RandomizedSearchCV), Bayesian Optimization (scikit-optimize's BayesSearchCV, Optuna, Hyperopt)
# Regression only Hyperparameter: Cross-validation integrated tuning
# Classifier only Hyperparameter: Cross-validation with stratified sampling (StratifiedKFold)
# Regressors (e.g., Linear Regression, Random Forest Regressor, Gradient Boosting Regressor...)
# Classifiers (e.g., Logistic Regression, Decision Trees, Random Forest Classifier, Support Vector Machines)
# Pipelines for preprocessing and training
# Cross-validation techniques (e.g., Group/K Fold, StratifiedKFold, TimeSeriesSplit for sequential data)
# Regression Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R²)
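A minimal sketch combining a preprocessing pipeline, stratified cross-validation, and a grid search, on a synthetic classification dataset (the grid and estimator are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and y
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1, stratify=y)

# Preprocessing + model chained in one estimator
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Grid search over the regularization strength, with stratified 5-fold CV
param_grid = {'clf__C': [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy')
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
```

Putting the scaler inside the pipeline ensures it is fit only on each CV training fold, avoiding data leakage into the validation folds.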
# Classifier Metrics: Accuracy, Precision, Recall, F1-Score, and visual error analysis like Confusion Matrix, ROC Curve / AUC, Precision-Recall Curves
# Regression: Scatter Plots (actual vs. predicted), Line Plots, Residual Plots, Learning Curves, Box Plots
# Classifier: Heatmaps (for Confusion Matrices), ROC Curves, Precision-Recall Curves
# Inference on unseen/real-world data using the trained model
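The main metrics listed above can be computed like this (the predictions are toy placeholders, not real model output):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)

# Regression metrics on toy predictions
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.5, 5.0, 3.0, 8.0])
mae = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
r2 = r2_score(y_true_reg, y_pred_reg)

# Classification metrics on toy predictions
y_true_clf = [0, 1, 1, 0, 1]
y_pred_clf = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
cm = confusion_matrix(y_true_clf, y_pred_clf)  # rows: true, cols: predicted
```

The confusion matrix can then be passed to seaborn's heatmap for the visual error analysis mentioned above.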
# Utilize preprocessing/feature engineering pipelines on real test data
# Generate predictions via model.predict (batch or streaming)
# Perform final evaluation and error analysis on real test set performance
# Model serialization and export (e.g., pickle, joblib, ONNX, SavedModel)
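A minimal serialization round-trip with joblib, on a throwaway model (the file name is an arbitrary placeholder):

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Train a small model on synthetic data
X, y = make_regression(n_samples=50, n_features=3, random_state=0)
model = LinearRegression().fit(X, y)

joblib.dump(model, 'model.joblib')      # serialize to disk
restored = joblib.load('model.joblib')  # deserialize for inference
preds = restored.predict(X[:5])
```

The restored estimator behaves exactly like the original, which is the contract a deployment API (Flask, FastAPI, etc.) relies on.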
# API deployment frameworks (e.g., Flask, FastAPI, TensorFlow Serving, TorchServe)
# Containerization and orchestration (e.g., Docker, Kubernetes)
# Monitoring, logging, and performance tuning for production
- Translate analytical insights into actionable strategies.
- Leverage data storytelling to inform strategic decision-making.
- Utilize frameworks such as SWOT analysis, KPI definition, and performance monitoring.
- Engage with domain experts to validate recommendations.
- Develop simulation models to explore alternative future scenarios.
- Employ methods like Monte Carlo simulations and sensitivity analysis.
- Use scenario planning to assess the impact of varying assumptions.
- Evaluate risk and uncertainty across different operational or market situations.
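A minimal Monte Carlo / sensitivity sketch; every distribution and number here is an illustrative assumption, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims = 10_000

# Illustrative assumptions: uncertain demand and unit margin
demand = rng.normal(loc=1000, scale=150, size=n_sims)
margin = rng.uniform(low=4.0, high=6.0, size=n_sims)
profit = demand * margin

expected_profit = profit.mean()
p5, p95 = np.percentile(profit, [5, 95])  # downside / upside scenarios
risk_of_loss = (profit < 0).mean()        # probability of negative profit
```

Sensitivity analysis follows by re-running the simulation while varying one input distribution at a time and comparing the resulting percentile bands.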
- Build interactive and informative dashboards for real-time data visualization.
- Use tools such as Tableau, PowerBI, or Plotly Dash for ease-of-use and accessibility.
- Focus on user-centric design principles: clarity, simplicity, and actionable insights.
- Ensure scalability, maintainability, and data security in dashboard deployment.
🎉 Congratulations on completing this end-to-end data science guide!
You're now ready to take the next step and publish your project.
Head over to the Git & GitHub Environment to learn how to upload your work on GitHub and make it shareable.
Edited: 2025-04-22