🚀 Advanced Data Preprocessing App

Streamlit App | Python 3.8+ | License: MIT

Transform your raw data into machine learning-ready datasets with professional-grade preprocessing techniques in just a few clicks.

🎯 What This App Does

The Advanced Data Preprocessing App automates the cleaning and preparation of data for machine learning. It is aimed at data scientists, analysts, and ML engineers who want to cut down on manual preprocessing work.

✨ Key Features

  • 🧹 Intelligent Data Cleaning: Remove duplicates, handle missing values, fix data formats
  • ⚙️ Advanced Feature Engineering: Extract datetime features, create interactions, apply transformations
  • 🏷️ Smart Categorical Encoding: Automatic encoding strategies based on data characteristics
  • 📏 Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler options
  • 🎯 Feature Selection: Remove low-variance and highly correlated features
  • 📉 Dimensionality Reduction: PCA and t-SNE implementations with visualization
  • 📊 Interactive Visualizations: Real-time plots and data exploration
  • 📥 ML-Ready Export: Download processed datasets ready for any ML framework

🚀 Live Demo

Try the App Now →

📋 Supported File Formats

  • CSV files (.csv)
  • Excel files (.xlsx, .xls)
  • Maximum file size: 200MB
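
As a rough illustration of the format rules above, the sketch below picks a pandas reader from the file extension and rejects oversized uploads. The helper name `load_table` and the constant `MAX_BYTES` are hypothetical and not taken from the app's source, and reading 200MB as 200 × 1024 × 1024 bytes is an assumption.

```python
import io
from pathlib import Path

import pandas as pd

# Assumption: the 200MB limit is interpreted as 200 * 1024 * 1024 bytes.
MAX_BYTES = 200 * 1024 * 1024


def load_table(name: str, data: bytes) -> pd.DataFrame:
    """Hypothetical helper: load an uploaded file based on its extension."""
    if len(data) > MAX_BYTES:
        raise ValueError("File exceeds the 200MB limit")

    suffix = Path(name).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(io.BytesIO(data))
    if suffix in {".xlsx", ".xls"}:
        # pandas needs an Excel engine installed (e.g. openpyxl for .xlsx)
        return pd.read_excel(io.BytesIO(data))
    raise ValueError(f"Unsupported file format: {suffix or name}")


demo = load_table("demo.csv", b"a,b\n1,2\n")
print(demo.shape)
```

Anything other than the three listed extensions raises a `ValueError` rather than being guessed at.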

🛠️ Preprocessing Capabilities

Data Cleaning

  • Duplicate Removal: Eliminates identical rows
  • Missing Value Handling: Mean, median, KNN imputation strategies
  • Format Standardization: Date parsing, data type optimization
  • Outlier Management: IQR and Z-score methods
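
The cleaning steps listed above can be sketched with pandas and scikit-learn as follows. This is a minimal illustration, not the app's implementation: the function name `clean_numeric`, the choice to clip outliers rather than drop them, and the 1.5 × IQR fence are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def clean_numeric(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Hypothetical helper: drop duplicates, impute numeric gaps, clip IQR outliers."""
    out = df.drop_duplicates().reset_index(drop=True)
    num_cols = out.select_dtypes(include="number").columns

    if strategy == "knn":
        # Fill each gap from the nearest rows (k=2 is an arbitrary choice here)
        out[num_cols] = KNNImputer(n_neighbors=2).fit_transform(out[num_cols])
    elif strategy == "mean":
        out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    else:
        out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # IQR rule: clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    iqr = q3 - q1
    out[num_cols] = out[num_cols].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
    )
    return out


raw = pd.DataFrame({
    "age": [25, 25, 30, np.nan, 35, 200],
    "city": ["A", "A", "B", "C", "B", "A"],
})
cleaned = clean_numeric(raw, strategy="median")
print(cleaned)
```

In this toy frame the duplicate row is removed, the missing age is filled with the median, and the value 200 is pulled back to the upper IQR fence.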

Feature Engineering

  • DateTime Features: Extract year, month, day, weekday, hour, weekend flags
  • Text Features: Character count, word count, text statistics
  • Numeric Interactions: Create multiplication features between variables
  • Mathematical Transformations: Log and square root transformations
  • Binning: Convert continuous variables to categorical ranges
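
A short pandas sketch of the feature types listed above, using a made-up two-row frame. The column names, the use of `log1p` for the log transform, and the bin edges are illustrative assumptions rather than the app's actual settings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-06 09:30", "2024-03-11 18:00"]),
    "review": ["great product", "too slow to load"],
    "price": [10.0, 40.0],
    "qty": [3, 2],
})

# DateTime features (weekday: Monday=0, so 5 and 6 are the weekend)
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_weekday"] = df["signup"].dt.weekday
df["signup_hour"] = df["signup"].dt.hour
df["signup_is_weekend"] = (df["signup"].dt.weekday >= 5).astype(int)

# Text features: character and word counts
df["review_chars"] = df["review"].str.len()
df["review_words"] = df["review"].str.split().str.len()

# Numeric interaction: product of two columns
df["price_x_qty"] = df["price"] * df["qty"]

# Mathematical transformations (log1p = log(1 + x), safe at zero)
df["price_log"] = np.log1p(df["price"])
df["price_sqrt"] = np.sqrt(df["price"])

# Binning: continuous price into labeled ranges (0, 20] and (20, 50]
df["price_bin"] = pd.cut(df["price"], bins=[0, 20, 50], labels=["low", "high"])

print(df.drop(columns=["signup", "review"]))
```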

Machine Learning Preparation

  • Categorical Encoding: One-hot, label, and frequency encoding
  • Feature Scaling: Multiple scaling strategies for optimal performance
  • Feature Selection: Variance-based and correlation-based filtering
  • Dimensionality Reduction: PCA for linear reduction, t-SNE for visualization
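
The preparation stages above could be chained roughly as in the sketch below, which uses synthetic data. The 0.95 correlation cutoff, the zero-variance threshold, the choice of StandardScaler, and the two PCA components are assumptions made for illustration; label encoding and t-SNE are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2.0,      # perfectly correlated with x1
    "x2": rng.normal(size=n),
    "constant": np.ones(n),   # zero variance
    "color": np.tile(["red", "green", "blue"], 17)[:n],
})

# Frequency encoding: replace each category by its relative frequency
color_freq = df["color"].map(df["color"].value_counts(normalize=True))

# One-hot encoding of the same column
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# Variance-based selection: drop columns that never change
selector = VarianceThreshold(threshold=0.0).fit(encoded)
filtered = encoded[encoded.columns[selector.get_support()]]

# Correlation-based selection: drop one column of each highly correlated pair
corr = filtered.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = filtered.drop(columns=to_drop)

# Scale to zero mean and unit variance, then project onto two components
scaled = StandardScaler().fit_transform(reduced)
components = PCA(n_components=2).fit_transform(scaled)

print(to_drop, reduced.columns.tolist(), components.shape)
```

Scaling before PCA matters because PCA is driven by variance, so unscaled columns with large ranges would otherwise dominate the components.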

📖 How to Use

1. 📁 Upload Your Data

  • Click "Choose your data file"
  • Upload CSV or Excel files up to 200MB
  • Preview your data structure and statistics

2. 🎯 Configure Processing

  • Select Target Column (optional): Choose what you want to predict
  • Choose Task Type: Classification, Regression, or Exploration
  • Select Preprocessing Steps: Enable the techniques you need

3. 🛠️ Customize Settings

  • Data Cleaning: Configure missing value strategies and outlier handling
  • Feature Engineering: Enable datetime, text, and interaction features
  • Encoding & Scaling: Choose optimal encoding and scaling methods
  • Advanced Options: Set up feature selection and dimensionality reduction

4. 🚀 Process & Download

  • Click "Start Preprocessing"
  • Monitor progress and view processing logs
  • Download your ML-ready dataset

💻 Local Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

# Clone the repository
git clone https://github.com/yourusername/data-preprocessing-app.git
cd data-preprocessing-app

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py

The app will open in your browser at http://localhost:8501

📊 Example Use Cases

🏠 Real Estate Price Prediction

  • Input: Property data with mixed types (dates, categories, numbers)
  • Processing: Handle missing values, encode neighborhoods, scale prices
  • Output: Clean dataset ready for regression models

📧 Email Spam Classification

  • Input: Email data with text content and metadata
  • Processing: Extract text features, encode categorical variables
  • Output: Encoded dataset ready for classification algorithms

🛒 Customer Segmentation

  • Input: Customer transaction data with timestamps
  • Processing: Extract datetime features, create interaction terms, apply PCA
  • Output: Reduced-dimension dataset suited to clustering

📈 Stock Market Analysis

  • Input: Financial time series with multiple indicators
  • Processing: Handle outliers, create technical indicators, scale features
  • Output: Normalized dataset ready for time series modeling

🔧 Technical Specifications

Built With

  • Frontend: Streamlit with custom CSS styling
  • Backend: Python with pandas, scikit-learn, scipy
  • Visualization: Plotly, Matplotlib, Seaborn
  • Machine Learning: scikit-learn preprocessing and decomposition modules

Performance

  • Memory Efficient: Optimized for large datasets
  • Fast Processing: Vectorized operations with NumPy and pandas
  • Scalable: Handles datasets from small samples up to the 200MB upload limit

Browser Compatibility

  • ✅ Chrome (recommended)
  • ✅ Firefox
  • ✅ Safari
  • ✅ Edge

📚 API Reference

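The app is built around a ComprehensiveDataPreprocessor class. The calls below assume that preprocessor is an instance of this class and that config_dict holds the selected preprocessing choices: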

# Main preprocessing pipeline
processed_df, report = preprocessor.process_data(
    file_path="data.csv",
    preprocessing_choices=config_dict,
    target_column="target_variable"
)

# Individual preprocessing steps
preprocessor.handle_missing_values_advanced(df)
preprocessor.encode_categorical_advanced(df)
preprocessor.apply_scaling(df)
preprocessor.dimensionality_reduction(df)

🤝 Contributing

We welcome contributions! Here's how you can help:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch: git checkout -b feature/amazing-feature
  3. 💾 Commit changes: git commit -m 'Add amazing feature'
  4. 📤 Push to branch: git push origin feature/amazing-feature
  5. 🔄 Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add docstrings for new functions
  • Test with various dataset types
  • Update README if adding new features

🐛 Bug Reports

Found a bug? Please create an issue with:

  • Description: Clear description of the problem
  • Steps to Reproduce: Detailed steps to recreate the issue
  • Expected vs Actual: What should happen vs what actually happens
  • Dataset Info: File type, size, and structure (anonymized)
  • Browser/Environment: Your system details

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🌟 Acknowledgments

  • Streamlit Team: For the amazing framework
  • Scikit-learn Contributors: For comprehensive ML preprocessing tools
  • Plotly Team: For interactive visualization capabilities
  • Open Source Community: For the countless libraries that make this possible

Made with ❤️ for the Data Science Community by Arnab