Transform your raw data into machine learning-ready datasets with professional-grade preprocessing techniques in just a few clicks.
The Advanced Data Preprocessing App is a comprehensive tool that automates the tedious and time-consuming process of cleaning and preparing data for machine learning. Whether you're a data scientist, analyst, or ML engineer, this app saves you hours of manual preprocessing work.
- 🧹 Intelligent Data Cleaning: Remove duplicates, handle missing values, fix data formats
- ⚙️ Advanced Feature Engineering: Extract datetime features, create interactions, apply transformations
- 🏷️ Smart Categorical Encoding: Automatic encoding strategies based on data characteristics
- 📏 Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler options
- 🎯 Feature Selection: Remove low-variance and highly correlated features
- 📉 Dimensionality Reduction: PCA and t-SNE implementations with visualization
- 📊 Interactive Visualizations: Real-time plots and data exploration
- 📥 ML-Ready Export: Download processed datasets ready for any ML framework
- CSV files (`.csv`)
- Excel files (`.xlsx`, `.xls`)
- Maximum file size: 200MB
- ✅ Duplicate Removal: Eliminates identical rows
- ✅ Missing Value Handling: Mean, median, KNN imputation strategies
- ✅ Format Standardization: Date parsing, data type optimization
- ✅ Outlier Management: IQR and Z-score methods
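
To give a feel for what these cleaning steps involve, here is a minimal pandas sketch of duplicate removal, median imputation, and IQR-based outlier clipping. It is an illustration only, not the app's actual implementation: the `clean_frame` helper, its `iqr_factor` parameter, and the sample columns are hypothetical. KNN imputation and Z-score filtering are not shown.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: drop duplicates, impute medians, clip IQR outliers."""
    # Remove rows that are identical across every column.
    out = df.drop_duplicates().reset_index(drop=True)

    for col in out.select_dtypes(include="number").columns:
        # Fill missing numeric values with the column median.
        out[col] = out[col].fillna(out[col].median())

        # Clip values that fall outside the IQR fences.
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(
            lower=q1 - iqr_factor * iqr,
            upper=q3 + iqr_factor * iqr,
        )
    return out


# Made-up sample data: one duplicate row, one missing age, one extreme age.
raw = pd.DataFrame({
    "age": [25, 25, 31, np.nan, 40, 38, 29, 300],
    "city": ["A", "A", "B", "B", "C", "C", "A", "B"],
})
cleaned = clean_frame(raw)
print(cleaned)
```

Clipping keeps every row and only caps extreme values. Dropping outlier rows instead is an equally valid choice when the extremes are known to be data-entry errors.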
- ✅ DateTime Features: Extract year, month, day, weekday, hour, weekend flags
- ✅ Text Features: Character count, word count, text statistics
- ✅ Numeric Interactions: Create multiplication features between variables
- ✅ Mathematical Transformations: Log and square root transformations
- ✅ Binning: Convert continuous variables to categorical ranges
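
The feature engineering steps listed above map onto a handful of pandas and NumPy operations. The sketch below shows one plausible way to derive them; the column names (`signup`, `review`, `price`, `units`) and the bin edges are made up for illustration, and the app's own naming and defaults may differ.

```python
import numpy as np
import pandas as pd

# Made-up sample data with a datetime, a text, and two numeric columns.
df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-06 09:30", "2024-03-11 18:05"]),
    "review": ["great product", "too slow to ship"],
    "price": [0.0, 99.0],
    "units": [2, 3],
})

# Datetime features: calendar parts plus a weekend flag (Monday is 0).
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_day"] = df["signup"].dt.day
df["signup_weekday"] = df["signup"].dt.dayofweek
df["signup_hour"] = df["signup"].dt.hour
df["signup_is_weekend"] = df["signup"].dt.dayofweek >= 5

# Text features: character and word counts.
df["review_char_count"] = df["review"].str.len()
df["review_word_count"] = df["review"].str.split().str.len()

# Numeric interaction: product of two columns.
df["price_x_units"] = df["price"] * df["units"]

# Transformations: log1p handles zeros safely; sqrt needs non-negative input.
df["price_log"] = np.log1p(df["price"])
df["price_sqrt"] = np.sqrt(df["price"])

# Binning: turn a continuous column into labeled ranges.
df["price_bin"] = pd.cut(
    df["price"],
    bins=[-np.inf, 10, 50, np.inf],
    labels=["low", "mid", "high"],
)
print(df.T)
```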
- ✅ Categorical Encoding: One-hot, label, and frequency encoding
- ✅ Feature Scaling: Multiple scaling strategies for optimal performance
- ✅ Feature Selection: Variance-based and correlation-based filtering
- ✅ Dimensionality Reduction: PCA for linear reduction, t-SNE for visualization
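
As a rough illustration of how encoding, feature selection, scaling, and PCA fit together, the following sketch chains them with pandas and scikit-learn. The data, the 0.95 correlation cutoff, and the variable names are assumptions chosen for the example, not settings taken from the app.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red"] * 10,
    "x1": rng.normal(size=50),
    "x2": rng.normal(size=50),
    "constant": np.ones(50),  # zero variance, should be removed
})
df["x1_copy"] = df["x1"] * 2.0  # perfectly correlated with x1

# One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# Frequency encoding is an alternative for high-cardinality columns.
color_freq = df["color"].map(df["color"].value_counts(normalize=True))

# Variance-based selection: drop columns whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(encoded)
kept = encoded.loc[:, selector.get_support()]

# Correlation-based selection: for each highly correlated pair,
# drop the column that appears later in the frame.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = kept.drop(columns=to_drop)

# Standardize, then project onto the first two principal components.
scaled = StandardScaler().fit_transform(reduced)
components = PCA(n_components=2).fit_transform(scaled)

print("dropped for correlation:", to_drop)
print("PCA output shape:", components.shape)
```

Scaling before PCA matters because PCA is sensitive to feature variance; without it, columns with large units dominate the components.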
- Click "Choose your data file"
- Upload CSV or Excel files up to 200MB
- Preview your data structure and statistics
- Select Target Column (optional): Choose what you want to predict
- Choose Task Type: Classification, Regression, or Exploration
- Select Preprocessing Steps: Enable the techniques you need
- Data Cleaning: Configure missing value strategies and outlier handling
- Feature Engineering: Enable datetime, text, and interaction features
- Encoding & Scaling: Choose optimal encoding and scaling methods
- Advanced Options: Set up feature selection and dimensionality reduction
- Click "Start Preprocessing"
- Monitor progress and view processing logs
- Download your ML-ready dataset
- Python 3.8 or higher
- pip package manager
# Clone the repository
git clone https://github.com/yourusername/data-preprocessing-app.git
cd data-preprocessing-app
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py

The app will open in your browser at http://localhost:8501
- Input: Property data with mixed types (dates, categories, numbers)
- Processing: Handle missing values, encode neighborhoods, scale prices
- Output: Clean dataset ready for regression models
- Input: Email data with text content and metadata
- Processing: Extract text features, encode categorical variables
- Output: Encoded dataset ready for classification algorithms
- Input: Customer transaction data with timestamps
- Processing: Extract datetime features, create interaction terms, apply PCA
- Output: Reduced-dimension dataset suited to clustering
- Input: Financial time series with multiple indicators
- Processing: Handle outliers, create technical indicators, scale features
- Output: Normalized dataset ready for time series modeling
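
To make the first scenario above concrete (property data with missing values, a neighborhood column, and numeric fields), here is a hedged sketch of an equivalent pipeline written directly against scikit-learn. The `homes` frame and its column names are invented for the example, and the app may order or configure these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented sample of property data with one missing numeric value.
homes = pd.DataFrame({
    "sqft": [850.0, 1200.0, np.nan, 2100.0],
    "bedrooms": [2, 3, 3, 4],
    "neighborhood": ["north", "south", "south", "north"],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group through its own transformer.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["sqft", "bedrooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(homes)
print(features.shape)
```

Wrapping the steps in a `ColumnTransformer` means the same fitted object can later transform unseen rows, which avoids leaking test-set statistics into training.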
- Frontend: Streamlit with custom CSS styling
- Backend: Python with pandas, scikit-learn, scipy
- Visualization: Plotly, Matplotlib, Seaborn
- Machine Learning: scikit-learn preprocessing and decomposition modules
- Memory Efficient: Optimized for large datasets
- Fast Processing: Vectorized operations with NumPy and pandas
- Scalable: Works with datasets from small samples to 200MB files
- ✅ Chrome (recommended)
- ✅ Firefox
- ✅ Safari
- ✅ Edge
The app is built around a ComprehensiveDataPreprocessor class that exposes both a full pipeline and individual preprocessing steps:
# Main preprocessing pipeline
processed_df, report = preprocessor.process_data(
    file_path="data.csv",
    preprocessing_choices=config_dict,
    target_column="target_variable",
)
# Individual preprocessing steps
preprocessor.handle_missing_values_advanced(df)
preprocessor.encode_categorical_advanced(df)
preprocessor.apply_scaling(df)
preprocessor.dimensionality_reduction(df)

We welcome contributions! Here's how you can help:
- 🍴 Fork the repository
- 🌿 Create a feature branch: `git checkout -b feature/amazing-feature`
- 💾 Commit changes: `git commit -m 'Add amazing feature'`
- 📤 Push to branch: `git push origin feature/amazing-feature`
- 🔄 Open a Pull Request
- Follow PEP 8 style guidelines
- Add docstrings for new functions
- Test with various dataset types
- Update README if adding new features
Found a bug? Please create an issue with:
- Description: Clear description of the problem
- Steps to Reproduce: Detailed steps to recreate the issue
- Expected vs Actual: What should happen vs what actually happens
- Dataset Info: File type, size, and structure (anonymized)
- Browser/Environment: Your system details
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit Team: For the amazing framework
- Scikit-learn Contributors: For comprehensive ML preprocessing tools
- Plotly Team: For interactive visualization capabilities
- Open Source Community: For the countless libraries that make this possible
Made with ❤️ for the Data Science Community by Arnab