
AshayK003/XadaptiveEDA


X-Adaptive EDA

Explore. Adapt. Understand.

An adaptive data analysis tool that learns your priorities and recommends the most relevant analyses.


Quick Start • Features • Demo • Architecture • Contributing


What is X-Adaptive EDA?

X-Adaptive EDA is a Streamlit-based exploratory data analysis tool that goes beyond static reporting. It adapts to how you work: it learns from your feedback, prioritizes what matters to you, and explains why each recommendation scored the way it did.

Upload a dataset → Get intelligent recommendations → Explore with interactive charts → Chat with your data → Your preferences evolve as you go.


Demo

Screenshots coming soon. Run locally to see the full experience.


Features

Core Analytics

  • 8 Analysis Types: Distribution, Correlation, Missing Values, Categorical, Outliers, Time Series, Clustering, Feature Importance
  • Adaptive Scoring: Recommendations learn from your feedback and adjust in real time
  • Explainable Recommendations: Every score decomposes into its components with confidence intervals
  • Interactive Visualizations: Plotly charts with zoom, pan, hover, and download

Intelligence

  • AI-Powered Insights: LLM-generated observations for each analysis (Ollama, OpenRouter, Groq, or Custom API)
  • Chat with Your Data: Ask natural language questions about your dataset
  • Smart Column Naming: AI suggests names for unnamed columns
  • NLQ Classifier: Understands queries like "show me outliers in revenue"
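
As a rough illustration of how a dependency-free query classifier of this kind could work, the sketch below tokenizes a question, applies a naive suffix-stripping stemmer, and counts keyword hits per analysis type. The keyword table and the names `stem` and `classify_query` are hypothetical; the project's actual logic lives in nlq_engine.py and may differ.

```python
import re

# Naive suffix list, checked in order (an assumption, not the project's stemmer).
SUFFIXES = ("ies", "ing", "ion", "es", "s", "y")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical keyword table; keywords are stemmed the same way as queries.
RAW_KEYWORDS = {
    "outliers": ["outlier", "anomalies", "extreme"],
    "correlation": ["correlation", "relationship", "related"],
    "distribution": ["distribution", "histogram", "spread"],
    "missing_values": ["missing", "null", "empty"],
}
KEYWORDS = {label: {stem(w) for w in words} for label, words in RAW_KEYWORDS.items()}


def classify_query(query):
    """Return the analysis type with the most keyword hits, or None."""
    stems = [stem(token) for token in re.findall(r"[a-z]+", query.lower())]
    hits = {
        label: sum(1 for s in stems if s in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(classify_query("show me outliers in revenue"))
```

A real classifier would also need synonym handling and term-frequency weighting, which the test suite description mentions but this sketch leaves out.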

Adaptation

  • Preference Tracking: 👍/👎 feedback adjusts analysis priorities
  • Temporal Decay: Older preferences fade over time
  • Novelty Dampening: Avoids repeating the same analyses
  • Column Affinity: Boosts analyses involving columns you frequently explore
  • ε-Greedy Exploration: Occasionally shows unexpected analyses to discover new insights
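
The ε-greedy idea can be sketched in a few lines: with a small probability the picker ignores the scores and returns a random analysis, otherwise it returns the top-scored one. The function name `pick_recommendation` and the default `epsilon=0.1` are assumptions for illustration, not values taken from the project.

```python
import random


def pick_recommendation(scores, epsilon=0.1, rng=random):
    """Pick one analysis name from a {name: score} mapping.

    With probability `epsilon` a random analysis is returned (explore);
    otherwise the highest-scoring one is returned (exploit).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if rng.random() < epsilon:
        return rng.choice(list(scores))
    return max(scores, key=scores.get)


scores = {"distribution": 0.72, "correlation": 0.55, "outliers": 0.31}
print(pick_recommendation(scores, epsilon=0.0))  # always the top-scored entry
```

Passing a seeded `random.Random` instance as `rng` makes the exploration step reproducible in tests.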

Data Quality

  • 10-Step Quality Pipeline: Normalizes, deduplicates, infers types, and scores your data
  • Per-Row Outlier Explainability: See which column triggered each outlier and why
  • Progressive Sampling: For large datasets (>50k rows), the app offers stratified sampling down to ~10k rows
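
To make the sampling step concrete, here is a minimal standard-library sketch of stratified sampling: rows are grouped by one column and each group is sampled in proportion to its share of the data. The function name, the `key` parameter, and the choice of stratification column are assumptions; the app itself works on pandas DataFrames and may stratify differently.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, target=10_000, seed=0):
    """Return roughly `target` rows, preserving each group's proportion.

    `rows` is a list of dicts and `key` names the column to stratify on.
    """
    if len(rows) <= target:
        return list(rows)

    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)
    fraction = target / len(rows)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so rare categories survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rows = [{"segment": "a"}] * 45_000 + [{"segment": "b"}] * 15_000
print(len(stratified_sample(rows, key="segment")))
```

Because of rounding and the one-row minimum per group, the result can differ slightly from `target` when there are many small groups.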

Developer Experience

  • Session Persistence: Save/load via SQLite
  • 68 Tests: Comprehensive test suite
  • Rate Limiting: Remote API calls capped at 10/minute
  • GPU Acceleration: Ollama auto-uses GPU with CPU fallback

Why This Project Exists

Most EDA tools give you static reports. X-Adaptive EDA does three things differently:

  1. It learns: Every 👍/👎 shifts future recommendations toward what you care about
  2. It explains: No black boxes. Every score shows its formula. Counterfactual sliders let you ask "what if?"
  3. It adapts in real time: No waiting for retraining. Feedback takes effect immediately.

This makes it ideal for:

  • Data scientists doing exploratory analysis
  • Analysts who need quick, relevant insights
  • Students learning data analysis
  • Teams exploring unfamiliar datasets

Tech Stack

| Layer | Technology |
|---|---|
| UI | Streamlit (≥1.36) |
| Data | pandas, NumPy |
| Visualization | Plotly |
| LLM | Ollama (local), OpenRouter, Groq, Custom API |
| NLP | Custom tokenizer + stemmer (no external deps) |
| Persistence | SQLite, JSON |
| Testing | pytest-compatible test files |

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Streamlit UI (app.py)                │
│  Sidebar: Dataset • Priorities • AI • Sessions          │
│  Main: Recommendations • Visualizations • Chat          │
└─────────┬───────────────────────────────────┬───────────┘
          │                                   │
    ┌─────▼─────┐                     ┌───────▼───────┐
    │   Data    │                     │ Recommendation│
    │ Processor │                     │    Engine     │
    │ + Quality │                     │  (scoring,    │
    │ Pipeline  │                     │   ranking)    │
    └─────┬─────┘                     └───────┬───────┘
          │                                   │
    ┌─────▼─────┐                     ┌───────▼───────┐
    │    LLM    │                     │  Preference   │
    │  Adapter  │                     │    Tracker    │
    │ (insights,│                     │ (adaptation)  │
    │   chat)   │                     └───────────────┘
    └───────────┘

Data Flow

Upload → Cleanse → Profile → Score → Rank → Visualize → Feedback → Adapt
                ↓                              ↑
          Quality Report              Counterfactual Slider

Quick Start

Prerequisites

  • Python 3.10+
  • (Optional) Ollama for local LLM features

Installation

# Clone the repository
git clone https://github.com/AshayK003/XadaptiveEDA.git
cd XadaptiveEDA

# Create virtual environment
python -m venv venv
.\venv\Scripts\activate      # Windows
# source venv/bin/activate   # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py

Open http://localhost:8501 in your browser.

Environment Setup (Optional)

For LLM features, copy .env.example to .env and add your API keys:

cp .env.example .env
# Edit .env with your keys

No API key is needed for local Ollama; just install it and run.


Usage

Basic Workflow

  1. Upload a CSV, Excel, or JSON file
  2. Rename unnamed columns (AI suggestions or manual)
  3. Finalize to generate the full analysis
  4. Explore recommended analyses ranked by relevance
  5. Give feedback (👍/👎) to refine future recommendations
  6. Chat with your data in natural language

Example Session

# The app runs via Streamlit, so no Python code is needed
# Just run:
streamlit run app.py

# Then in the browser:
# 1. Upload sales_data.csv
# 2. Click "Finalize Dataset"
# 3. Click 👍 on "Distribution Analysis"
# 4. Ask: "What's the correlation between price and quantity?"

Expert Mode

Toggle Dev Mode in the sidebar to reveal:

  • Raw DataFrame viewer
  • CSV download button
  • Full recommendation JSON with all scoring components

Configuration

Analysis Goals

Choose a preset goal to automatically weight analysis types:

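| Goal | Distribution | Correlation | Missing | Categorical | Outliers | Time Series | Clustering | Feature Imp |
|---|---|---|---|---|---|---|---|---|
| General | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Distributions | 0.9 | 0.3 | 0.3 | 0.3 | 0.8 | 0.3 | 0.3 | 0.4 |
| Relationships | 0.3 | 0.9 | 0.3 | 0.5 | 0.3 | 0.8 | 0.6 | 0.7 |
| Data Quality | 0.3 | 0.3 | 0.9 | 0.7 | 0.5 | 0.3 | 0.3 | 0.5 |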

Scoring Formula

final_score = base_score × data_relevance × user_pref × quality_adj
            × diversity_penalty × novelty_penalty × avoidance_penalty × affinity_boost

All multipliers are documented in recommendation_engine.py.
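
The formula above is a plain product of eight factors. The sketch below shows one way to compute it; the component names come from the formula, while the function name, the neutral default of 1.0 for a missing component, and the example values are assumptions rather than the behavior of recommendation_engine.py.

```python
from math import prod

# Component names taken from the scoring formula.
MULTIPLIERS = (
    "base_score",
    "data_relevance",
    "user_pref",
    "quality_adj",
    "diversity_penalty",
    "novelty_penalty",
    "avoidance_penalty",
    "affinity_boost",
)


def final_score(components):
    """Multiply all components; a missing one is treated as neutral (1.0)."""
    return prod(components.get(name, 1.0) for name in MULTIPLIERS)


# Hypothetical component values for two candidate analyses.
candidates = {
    "correlation": {"base_score": 0.8, "data_relevance": 0.9, "user_pref": 0.5},
    "outliers": {"base_score": 0.8, "data_relevance": 0.5, "user_pref": 0.5},
}
ranked = sorted(candidates, key=lambda name: final_score(candidates[name]), reverse=True)
print(ranked)
```

Because the score is a product, any single factor near zero suppresses the recommendation regardless of the others, which is what makes the per-component breakdown useful for explanation.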

LLM Providers

| Provider | Key | Default Model | Rate Limit |
|---|---|---|---|
| Local (Ollama) | None | qwen2.5-coder:7b | Unlimited |
| OpenRouter | OPENROUTER_API_KEY | qwen/qwen2.5-7b-instruct | 10/60s |
| Groq | GROQ_API_KEY | llama-3.3-70b-versatile | 10/60s |
| Custom | CUSTOM_API_KEY + endpoint | Configurable | 10/60s |
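
A 10-calls-per-60-seconds cap like the one in the table can be enforced with a sliding window of timestamps. The class below is a minimal sketch under that assumption; the name `SlidingWindowLimiter` is hypothetical, and llm_adapter.py may use a different algorithm.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `window_seconds` span."""

    def __init__(self, max_calls=10, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.calls = deque()

    def allow(self):
        """Record and permit the call if under the cap, else return False."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


# One limiter per remote provider keeps their quotas separate (an assumption).
limiters = {name: SlidingWindowLimiter() for name in ("openrouter", "groq", "custom")}
print(limiters["groq"].allow())
```

Local Ollama calls would simply bypass the limiter, matching the "Unlimited" entry in the table.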

Project Structure

x-adaptive-eda/
├── app.py                      # Streamlit UI (orchestration, ~970 lines)
├── data_processor.py           # File loading, cleansing, profiling
├── data_quality.py             # 10-step quality pipeline, QualityReport
├── recommendation_engine.py    # Scoring, ranking, penalties, bootstrap CI
├── preference_learner.py       # Fixed-delta adaptation, goals, decay
├── insight_generator.py        # Explainable recommendations, comparisons
├── visualization_generator.py  # Plotly charts (8 types, k-means, MI)
├── constants.py                # Analysis types, preferences, goals
├── llm_adapter.py              # LLM integration, rate limiting, chat
├── nlq_engine.py               # NLP query classifier (no external deps)
├── session_persistence.py      # SQLite save/load for sessions
├── requirements.txt            # 6 dependencies
├── .env.example                # Environment variable template
├── LICENSE                     # MIT License
├── README.md                   # This file
├── test_phase1.py              # Core tests (diversity, tracking, explanations)
├── test_phase2.py              # Column interestingness, sampling, summary
├── test_phase3.py              # Goals, decay, save/load
├── test_phase4.py              # NLQ classifier (stemming, synonyms, TF)
├── test_data_quality.py        # 12 quality pipeline tests
├── test_session_persistence.py # SQLite persistence tests
└── test_rate_limit.py          # Rate limiting tests

Development Setup

# Install dependencies
pip install -r requirements.txt

# Run tests
python test_phase1.py && python test_phase2.py && python test_phase3.py && python test_phase4.py && python test_data_quality.py && python test_session_persistence.py && python test_rate_limit.py

# Run the app
streamlit run app.py

Code Style

  • snake_case for functions/variables
  • PascalCase for classes
  • Docstrings on all public functions
  • Structured logging via logging.getLogger(__name__)
  • No print() in source files (only in tests)

Testing

68 tests across 7 test files:

| File | Tests | Coverage |
|---|---|---|
| test_phase1.py | 5 | Diversity, tracking, explanations, regression |
| test_phase2.py | 5 | Column interestingness, sampling, summary |
| test_phase3.py | 7 | Goals, decay, save/load round-trip |
| test_phase4.py | 12 | NLQ classifier (stemming, synonyms, TF scoring) |
| test_data_quality.py | 12 | 10-step quality pipeline |
| test_session_persistence.py | 7 | SQLite save/load/list/delete |
| test_rate_limit.py | 5 | Rate limiting (local, remote, separate providers) |

# Run all tests
python test_phase1.py && python test_phase2.py && python test_phase3.py && python test_phase4.py && python test_data_quality.py && python test_session_persistence.py && python test_rate_limit.py

Roadmap

Completed

  • 8 analysis types with adaptive scoring
  • Explainable recommendations with score decomposition
  • Interactive Plotly visualizations
  • LLM integration (Ollama, OpenRouter, Groq, Custom)
  • Chat with your data
  • Session persistence (SQLite)
  • ε-greedy exploration
  • Rate limiting for remote APIs
  • 68 tests passing
  • MIT License

Planned

  • Spearman/Kendall correlation options
  • Custom k-means cluster count slider
  • Export analysis report as PDF/HTML
  • Multi-dataset comparison
  • Dashboard mode (persistent charts)
  • Plugin system for custom analysis types
  • Collaborative sessions (multi-user)

Contributing

Contributions welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Guidelines

  • Follow existing code style
  • Add tests for new features
  • Update README if needed
  • Keep PRs focused (one feature per PR)

License

This project is licensed under the MIT License; see LICENSE for details.


Acknowledgements

  • Streamlit: Web framework
  • Plotly: Interactive visualizations
  • Ollama: Local LLM hosting
  • pandas: Data manipulation

FAQ

Q: Do I need an API key to use this?
A: No. Local Ollama works without any API keys. API keys are only needed for the OpenRouter, Groq, or Custom API providers.

Q: What file formats are supported?
A: CSV, XLSX, XLS, and JSON files up to ~50 MB.

Q: How does the adaptation work?
A: It uses fixed-delta adjustments, not machine learning. A 👍 adds 0.10, a 👎 subtracts 0.10, and selecting a column adds 0.03. All weights stay in [0.1, 1.0].

Q: Can I save my session?
A: Yes. Click "Save Session" in the sidebar. Sessions persist in SQLite at ~/.eda_assistant_sessions.db.

Q: How accurate are the AI insights?
A: Insights are generated from your actual data values, not from pre-written templates. Quality depends on the LLM provider and model used.

Q: Is my data sent to external servers?
A: Only if you use OpenRouter, Groq, or Custom API. Local Ollama keeps everything on your machine.
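
The fixed-delta adaptation described in the FAQ amounts to adding a constant and clamping the result. The sketch below uses the deltas and bounds quoted there; the function name `apply_feedback` and the default starting weight of 0.5 (the "General" goal value) are assumptions, and preference_learner.py may organize this differently.

```python
# Deltas and bounds as quoted in the FAQ.
LIKE_DELTA = 0.10
DISLIKE_DELTA = -0.10
COLUMN_DELTA = 0.03
MIN_WEIGHT, MAX_WEIGHT = 0.1, 1.0


def apply_feedback(weights, analysis, delta):
    """Return a new weights dict with `analysis` shifted by `delta` and clamped."""
    updated = dict(weights)
    new_value = updated.get(analysis, 0.5) + delta  # 0.5 default is an assumption
    updated[analysis] = min(MAX_WEIGHT, max(MIN_WEIGHT, new_value))
    return updated


weights = {"outliers": 0.5, "correlation": 0.95}
weights = apply_feedback(weights, "outliers", LIKE_DELTA)
weights = apply_feedback(weights, "correlation", LIKE_DELTA)  # clamped at 1.0
print(weights)
```

Returning a new dict rather than mutating in place keeps the previous weights available, which is convenient for undo or for showing a before/after comparison.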


Troubleshooting

| Issue | Solution |
|---|---|
| App won't start | Check Python version (3.10+), run pip install -r requirements.txt |
| Ollama not reachable | Run ollama serve in a terminal |
| GPU not detected | Install NVIDIA drivers, restart Ollama |
| Slow LLM responses | Use CPU mode: set OLLAMA_NUM_GPU=0 before starting Ollama |
| Large file warning | Files >50 MB may be slow; use sampling for datasets >50k rows |
| Import errors | Ensure virtual environment is activated |

Security

  • API keys stored in .env (gitignored)
  • No hardcoded secrets in source code
  • Parameterized SQL queries (no injection risk)
  • Local Ollama keeps data on your machine
  • Remote API calls rate-limited to 10/60s
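
To show what parameterized queries look like in practice, here is a minimal sketch of saving and loading a session with Python's built-in sqlite3 module. The table layout and the function names `open_store`, `save_session`, and `load_session` are hypothetical and are not taken from session_persistence.py.

```python
import json
import sqlite3


def open_store(path):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "name TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_session(conn, name, state):
    """Store `state` as JSON; `?` placeholders keep user input out of the SQL."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (name, state) VALUES (?, ?)",
        (name, json.dumps(state)),
    )
    conn.commit()


def load_session(conn, name):
    """Return the saved state for `name`, or None if it does not exist."""
    row = conn.execute(
        "SELECT state FROM sessions WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store(":memory:")
save_session(conn, "demo", {"goal": "General", "weights": {"outliers": 0.6}})
print(load_session(conn, "demo"))
```

Because the session name is bound as a parameter, a name containing quotes or SQL keywords is stored as plain text rather than executed.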

Built with ❤️ for the data community

⭐ Star this repo • 🐛 Report Bug • 💡 Request Feature • ☕ Support the developer
