@@ -0,0 +1,84 @@
# Dataprep

## Description
- Dataprep is an open-source library designed to simplify the data preparation
process for machine learning projects.
- It provides a user-friendly interface to clean, transform, and visualize
datasets, making it accessible for users with varying levels of expertise.
- Key features include automated data cleaning, visualization tools to explore
data distributions, and the ability to handle missing values effectively.
- Dataprep supports integration with popular data science libraries like Pandas
and Scikit-learn, facilitating seamless transitions from data preparation to
model building.
- The tool is particularly useful for exploratory data analysis (EDA), feature
engineering, and data validation, ensuring that datasets are ready for machine
learning tasks.

## Project Objective
The goal of this project is to develop a machine learning model that predicts
housing prices based on various features such as location, size, and amenities.
The project will focus on optimizing the model's accuracy and interpretability
using the Dataprep library for data preparation.

## Dataset Suggestions
1. **Kaggle Housing Prices Dataset**
   - **Source:** Kaggle
   - **URL:**
     [Housing Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
   - **Data Contains:** Features of houses (e.g., size, number of rooms,
     location) and their sale prices.
   - **Access Requirements:** Free account on Kaggle for download.

2. **Zillow Home Value Index**
   - **Source:** Zillow
   - **URL:** [Zillow API](https://www.zillow.com/howto/api/APIOverview.htm)
   - **Data Contains:** Historical home values, property characteristics, and
     market trends.
   - **Access Requirements:** Zillow's API has historically required
     registering for an API key; confirm the current access terms before use.

3. **OpenStreetMap Data**
   - **Source:** OpenStreetMap
   - **URL:** [Overpass API](http://overpass-api.de/)
   - **Data Contains:** Geospatial data related to housing, amenities, and
     infrastructure in specific areas.
   - **Access Requirements:** Publicly accessible API with no authentication
     needed.

4. **UCI Machine Learning Repository - California Housing**
   - **Source:** UCI
   - **URL:**
     [California Housing Data Set](https://archive.ics.uci.edu/ml/datasets/California+Housing+Prices)
   - **Data Contains:** Housing data from California, including features like
     median income, housing age, and house prices.
   - **Access Requirements:** Direct download without authentication.

## Tasks
- **Data Loading:** Use Dataprep to load the selected dataset and explore its
structure and features.
- **Data Cleaning:** Apply automated data cleaning techniques to handle missing
values, outliers, and data type conversions.
- **Feature Engineering:** Utilize Dataprep's visualization tools to identify
important features and create new features that may improve model performance.
- **Model Training:** Split the dataset into training and testing sets, and
train a regression model (e.g., Linear Regression) using Scikit-learn.
- **Model Evaluation:** Evaluate the model's performance using metrics such as
Mean Absolute Error (MAE) and R-squared, and visualize the results.
- **Reporting:** Create a comprehensive report summarizing the data preparation
steps, model performance, and insights gained from the analysis.
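
The model training and evaluation tasks above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the data is synthetic, and the helper names `make_synthetic_housing` and `train_and_evaluate` are hypothetical. Only the Scikit-learn calls (`train_test_split`, `LinearRegression`, `mean_absolute_error`, `r2_score`) are real library functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def make_synthetic_housing(n_rows=500, seed=0):
    """Build hypothetical stand-in data: price is linear in size and rooms, plus noise."""
    rng = np.random.default_rng(seed)
    size = rng.uniform(50, 250, n_rows)   # floor area (assumed unit: square meters)
    rooms = rng.integers(1, 7, n_rows)    # integer room count from 1 to 6
    noise = rng.normal(0, 10, n_rows)
    price = 2.0 * size + 15.0 * rooms + noise
    X = np.column_stack([size, rooms])
    return X, price


def train_and_evaluate(X, y, seed=0):
    """Split 80/20, fit a linear regression, and report MAE and R-squared on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }


X, y = make_synthetic_housing()
metrics = train_and_evaluate(X, y)
print(metrics)
```

With a real dataset, the feature matrix and target would come from the cleaned DataFrame instead of the synthetic generator; the split and metric calls stay the same.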

## Bonus Ideas
- Experiment with different regression algorithms (e.g., Decision Trees, Random
Forests) and compare their performances.
- Implement hyperparameter tuning to optimize model parameters for better
accuracy.
- Create interactive visualizations using libraries like Plotly or Streamlit to
present findings and model predictions.
- Explore the impact of additional features from the OpenStreetMap dataset on
housing prices.
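
The first two bonus ideas (comparing algorithms and tuning hyperparameters) could look roughly like the sketch below. The synthetic nonlinear target, the parameter grid, and the helper `mean_cv_r2` are assumptions chosen for illustration; `GridSearchCV`, `cross_val_score`, and `RandomForestRegressor` are standard Scikit-learn APIs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hypothetical data with a nonlinear relationship that a linear model fits poorly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * np.sin(np.pi * X[:, 0]) + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 300)


def mean_cv_r2(model, X, y):
    """Average R-squared over 3 cross-validation folds."""
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()


# Small illustrative grid; a real search would likely cover more values.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    cv=3,
    scoring="r2",
)
search.fit(X, y)

scores = {
    "linear": mean_cv_r2(LinearRegression(), X, y),
    "random_forest": search.best_score_,  # mean CV R-squared of the best grid point
}
print(scores, search.best_params_)
```

Because `best_score_` is selected from the same folds used for tuning, it is slightly optimistic; a held-out test set gives a fairer final comparison.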

## Useful Resources
- [Dataprep Documentation](https://dataprep.readthedocs.io/en/latest/)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/documentation.html)
- [Kaggle API Documentation](https://www.kaggle.com/docs/api)
- [Zillow API Overview](https://www.zillow.com/howto/api/APIOverview.htm)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
@@ -0,0 +1,30 @@
# Use Python 3.12 slim (already has Python and pip).
FROM python:3.12-slim

# Avoid interactive prompts during apt operations.
ENV DEBIAN_FRONTEND=noninteractive

# Install CA certificates (needed for HTTPS).
RUN apt-get update && apt-get install -y \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install project specific packages.
RUN mkdir -p /install
COPY requirements.txt /install/requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir jupyterlab jupyterlab_vim jupytext -r /install/requirements.txt

# Config.
COPY etc_sudoers /install/
COPY etc_sudoers /etc/sudoers
COPY bashrc /root/.bashrc

# Report package versions.
COPY version.sh /install/
RUN /install/version.sh 2>&1 | tee version.log

# Jupyter.
EXPOSE 8888

CMD ["/bin/bash"]
@@ -0,0 +1,28 @@
# Use Python 3.12 slim (already has Python and pip).
FROM python:3.12-slim

# Avoid interactive prompts during apt operations.
ENV DEBIAN_FRONTEND=noninteractive

# Install CA certificates (needed for HTTPS).
RUN apt-get update && apt-get install -y \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install project specific packages.
RUN mkdir -p /install
COPY requirements.txt /install/requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir jupyterlab jupyterlab_vim jupytext -r /install/requirements.txt

# Config.
COPY etc_sudoers /install/
COPY etc_sudoers /etc/sudoers
COPY bashrc /root/.bashrc

# Report package versions.
COPY version.sh /install/
RUN /install/version.sh 2>&1 | tee version.log

# Jupyter.
EXPOSE 8888
@@ -0,0 +1,40 @@
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive

# Install system utilities and Python in a single layer.
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
        sudo \
        curl \
        git \
        build-essential \
        python3 \
        python3-pip \
        python3-dev \
        python3-venv \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment.
RUN python3 -m venv /opt/venv

# Make the venv the default Python.
ENV PATH="/opt/venv/bin:$PATH"

# Install project specific packages.
RUN mkdir /install
COPY requirements.txt /install/requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir jupyterlab jupyterlab_vim jupytext -r /install/requirements.txt

# Config.
COPY etc_sudoers /install/
COPY etc_sudoers /etc/sudoers
COPY bashrc /root/.bashrc

# Report package versions.
COPY version.sh /install/
RUN /install/version.sh 2>&1 | tee version.log

# Jupyter.
EXPOSE 8888
@@ -0,0 +1,49 @@
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive

# Install system utilities and Python in a single layer.
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
        sudo \
        curl \
        git \
        build-essential \
        python3 \
        python3-pip \
        python3-dev \
        python3-venv \
        libgomp1 \
        g++ \
    && rm -rf /var/lib/apt/lists/*

# Install uv for package management.
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

# Install project specific packages using uv.
COPY pyproject.toml uv.lock /app/
WORKDIR /app
RUN uv sync
ENV PATH="/app/.venv/bin:$PATH"

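# Install Jupyter into the uv-managed environment in /app/.venv.
# uv-created virtual environments do not include pip by default, so a bare
# `pip` would resolve to the system pip, which Ubuntu 24.04 marks as
# externally managed.
RUN uv pip install --no-cache jupyterlab jupyterlab_vim jupytext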

# Copy project files.
COPY . /app

RUN mkdir /install

# Config.
COPY etc_sudoers /install/
COPY etc_sudoers /etc/sudoers
COPY bashrc /root/.bashrc

# Report package versions.
COPY version.sh /install/
RUN /install/version.sh 2>&1 | tee version.log

# Jupyter.
EXPOSE 8888
@@ -0,0 +1,31 @@
# DataPrep Tutorial
This project demonstrates how to use DataPrep (Python) for data analysis, specifically to predict the target variable (`MedHouseVal`) of the California Housing dataset.

# Workflow

From the project directory, build the Docker image:
```bash
> ./docker_build.sh
```
Start JupyterLab:
```bash
> ./docker_jupyter.sh -p 8888
```

After opening Jupyter in a browser, run the following notebooks (`.ipynb`):

1. DataPrep.API
   - Introduces basic DataPrep usage and documentation
   - Shows how to use `create_report()` to generate a data summary
   - Shows how to clean data and plot with DataPrep

2. DataPrep.example
   - Loads the California Housing data
   - Runs exploratory analyses using DataPrep summary visualizations
   - Engineers features to improve model performance
   - Trains and evaluates a LinearRegression model
   - Adds further feature engineering based on the data distributions shown in the DataPrep visualizations
   - Trains a second LinearRegression model (aiming for higher accuracy)
   - Trains and evaluates a RandomForest model
   - Compares the performance of the models
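
One common form of distribution-driven feature engineering is a log transform of a heavily right-skewed column. The sketch below is illustrative only and is not taken from the notebooks: the log-normal sample is a hypothetical stand-in for a skewed feature, and `skewness` is a small helper written here using the Python standard library.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)
# Hypothetical right-skewed feature drawn from a log-normal distribution.
raw = [random.lognormvariate(0, 1) for _ in range(2000)]
# log1p computes log(1 + x), which compresses the long right tail and is defined at 0.
transformed = [math.log1p(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

A lower skewness after the transform suggests the feature is closer to symmetric, which tends to suit linear models better than the raw values.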
@@ -0,0 +1 @@
set -o vi