This function allows users to upload a dataset for preprocessing. It supports CSV and Excel files.
- Users upload a CSV or Excel file through the file uploader.
- The app determines the file type and reads it into a Pandas DataFrame.
- The dataset is stored in session state for further processing.
- A "Continue" button lets users proceed to the next step.
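The file-reading step could be sketched as follows (the function name and error handling are illustrative, not taken from the app's source):

```python
import io
import pandas as pd

def read_uploaded_file(uploaded_file, filename):
    """Read an uploaded CSV or Excel file into a DataFrame based on its extension."""
    name = filename.lower()
    if name.endswith(".csv"):
        return pd.read_csv(uploaded_file)
    if name.endswith((".xls", ".xlsx")):
        return pd.read_excel(uploaded_file)
    raise ValueError("Unsupported file type: expected CSV or Excel")
```

In the Streamlit app, the resulting DataFrame would then be stored via `st.session_state`.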
This function enables users to visualize the dataset by selecting a column.
- Users select a column from the dataset.
- The app determines the column's data type and chooses an appropriate visualization:
  - Categorical columns: a bar chart.
  - Numerical columns: a histogram with a KDE plot.
- The visualization is rendered using Matplotlib and Seaborn.
This feature allows you to manage missing values in your dataset efficiently by providing options to:
- Drop Columns: Remove columns with a high percentage of missing values.
- Drop Null Values: Drop rows containing missing values for specific features.
- Fill Null Values: Fill missing values using various strategies, including:
- Zero
- Mean
- Median
- Mode
- Forward Fill
- Back Fill
- Show Dataset: View the updated dataset after handling nulls.
- The app calculates and displays null information for each feature, showing:
- Count of missing values
- Percentage of missing values
- Suggested action based on threshold values:
- Drop Feature: When null percentage ≥ 80%
- Fill Values: When 30% ≤ null percentage < 80%
- Drop Observations: When null percentage < 30%
- Users can interactively choose the preferred action using buttons and selection tools.
- The dataset is updated in real-time and displayed after each operation.
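The per-feature null report with threshold-based suggestions could look like this (a sketch; the column names in the report are assumptions):

```python
import pandas as pd

def null_report(df):
    """Per-column null count, percentage, and a suggested action by threshold."""
    rows = []
    for col in df.columns:
        pct = df[col].isna().mean() * 100
        if pct >= 80:
            action = "Drop Feature"
        elif pct >= 30:
            action = "Fill Values"
        else:
            action = "Drop Observations"
        rows.append({"feature": col,
                     "null_count": int(df[col].isna().sum()),
                     "null_pct": round(pct, 2),
                     "suggested_action": action})
    return pd.DataFrame(rows)
```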
This feature identifies and removes features with high redundancy, such as columns with:
- A single repeated value in the majority of rows.
- Values exceeding a specified redundancy threshold (default is 80%).
- The app analyzes each feature to determine if a single value dominates the column.
- Displays a list of redundant features with their redundancy percentage.
- Allows users to select a redundant feature from a dropdown and remove it with a button click.
- The updated dataset is displayed after the feature is removed.
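The dominance check could be sketched as (function name and return shape are illustrative):

```python
import pandas as pd

def redundant_features(df, threshold=80.0):
    """Return {column: dominance %} for columns where one value meets the threshold."""
    result = {}
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        if counts.empty:
            continue
        pct = counts.iloc[0] / len(df) * 100  # share of the most frequent value
        if pct >= threshold:
            result[col] = round(pct, 2)
    return result
```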
This function identifies the important features in the dataset, removes irrelevant or redundant ones through column selection, and visualizes correlations.
- Extracts the numerical columns from the dataset; if none exist, the user is informed through a message.
- Displays a correlation heatmap to help identify relationships between numerical columns.
- A dropdown lets users select unneeded columns to drop.
- The refined dataset is updated and displayed.
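A sketch of the heatmap and column-removal helpers (names are assumptions; the Agg backend is only for running this outside Streamlit):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def correlation_heatmap(df):
    """Heatmap of correlations among the numerical columns; None if there are none."""
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] == 0:
        return None  # the app shows an informational message in this case
    fig, ax = plt.subplots()
    sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm", ax=ax)
    return fig

def drop_selected(df, columns):
    """Drop the user-selected columns and return the refined dataset."""
    return df.drop(columns=list(columns))
```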
This function converts the selected categorical data into numerical variables so that ML models can process them, either through Label Encoding or One-Hot Encoding.
- Lists categorical columns and the number of unique values in each.
- Users select the columns to encode.
- Provides two encoding options:
  - Label Encoding: Converts categorical labels to numerical values.
  - One-Hot Encoding: Creates binary columns for each category in a selected column (avoiding multicollinearity via drop_first=True).
- After encoding, the dataset is updated and displayed.
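A sketch of both encoding paths (pandas category codes stand in for label encoding here; the app may instead use scikit-learn's LabelEncoder):

```python
import pandas as pd

def encode_columns(df, columns, method="label"):
    """Label-encode or one-hot-encode the selected categorical columns."""
    df = df.copy()
    if method == "label":
        for col in columns:
            # Category codes assign an integer per label (alphabetical order)
            df[col] = df[col].astype("category").cat.codes
    elif method == "onehot":
        # drop_first=True omits one category per column to avoid multicollinearity
        df = pd.get_dummies(df, columns=list(columns), drop_first=True)
    return df
```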
This function allows users to apply feature scaling techniques to numerical columns.
- Users select between Normalization (MinMax Scaling) and Standardization (Standard Scaler).
- The selected transformation is applied to all numerical columns using scaler.fit_transform().
- The transformed dataset is displayed.
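The scaling step could be sketched as (the function name and method labels are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_numeric(df, method="normalization"):
    """Apply MinMax normalization or standardization to all numerical columns."""
    df = df.copy()
    scaler = MinMaxScaler() if method == "normalization" else StandardScaler()
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = scaler.fit_transform(df[num_cols])
    return df
```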
This function provides automated data preprocessing, including handling missing values, removing duplicates, addressing redundancy, encoding categorical features, and feature scaling.
- Handle Null Values:
- Drops columns with ≥ 80% missing values.
- Drops rows containing missing values in columns with ≤ 3% nulls.
- Fills numerical columns with the mean and categorical columns with the mode.
- Remove Duplicates: Drops duplicate rows.
- Remove Redundant Features:
- Identifies columns where a single value appears ≥ 80% of the time.
- Drops highly redundant columns.
- Encoding:
- Binary categorical columns: Label Encoding.
- Multi-category columns: One-hot Encoding.
- Feature Scaling (Optional, requires encoding):
- Applies MinMax normalization to numerical columns.
- Users select preprocessing options via checkboxes.
- The selected operations are applied sequentially.
- The processed dataset is displayed and can be downloaded.
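The sequential pipeline above could be sketched as a single function (a simplification under the stated thresholds; the function name and exact order of operations within each step are assumptions):

```python
import pandas as pd

def auto_preprocess(df, scale=False):
    """Apply the automated preprocessing steps sequentially."""
    df = df.copy()
    # 1. Null handling: drop columns with >= 80% nulls
    null_pct = df.isna().mean() * 100
    df = df.drop(columns=null_pct[null_pct >= 80].index)
    # ...drop rows with nulls in columns that have <= 3% missing
    low = [c for c in df.columns
           if df[c].isna().any() and df[c].isna().mean() * 100 <= 3]
    df = df.dropna(subset=low)
    # ...fill the rest: mean for numerical, mode for categorical
    for col in df.columns[df.isna().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    # 2. Remove duplicate rows
    df = df.drop_duplicates()
    # 3. Drop columns where one value appears >= 80% of the time
    for col in list(df.columns):
        if df[col].value_counts(normalize=True, dropna=False).iloc[0] >= 0.8:
            df = df.drop(columns=[col])
    # 4. Encoding: label-encode binary columns, one-hot the rest
    for col in list(df.select_dtypes(include=["object", "category"]).columns):
        if df[col].nunique() == 2:
            df[col] = df[col].astype("category").cat.codes
        else:
            df = pd.get_dummies(df, columns=[col], drop_first=True)
    # 5. Optional MinMax scaling of numerical columns
    if scale:
        num = df.select_dtypes(include="number")
        df[num.columns] = (num - num.min()) / (num.max() - num.min())
    return df
```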
This function allows users to download the processed dataset.
- Users enter a filename.
- The dataset is converted to CSV format.
- A download button allows users to save the file.
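The CSV conversion behind the download button is essentially:

```python
import pandas as pd

def to_csv_bytes(df):
    """Encode the DataFrame as UTF-8 CSV bytes, ready for a download button."""
    return df.to_csv(index=False).encode("utf-8")
```

In the app, these bytes would be passed to Streamlit's st.download_button along with the user-supplied filename.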
This function provides an overview of the dataset preprocessing steps.
- Uploading Data: Supports CSV and Excel.
- Data Visualization: Displays statistical summaries and charts.
- Handling Null Values: Various methods for filling or removing missing data.
- Encoding Data: Converts categorical data into numerical form.
- Feature Scaling: Normalization and standardization options.
- Feature Selection: Identifies important features and removes redundant ones.
- Download Preprocessed Data: Saves the cleaned dataset.
- Resetting Data: Restores the dataset to its original state.
Tech stack:
- Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)
- Streamlit (for the interactive UI)