An adaptive data analysis tool that learns your priorities and recommends the most relevant analyses.
X-Adaptive EDA is a Streamlit-based exploratory data analysis tool that goes beyond static reporting. It adapts to how you work: learning from your feedback, prioritizing what matters to you, and explaining why each recommendation scored the way it did.
Upload a dataset → Get intelligent recommendations → Explore with interactive charts → Chat with your data → Your preferences evolve as you go.
Screenshots coming soon. Run locally to see the full experience.
- 8 Analysis Types: Distribution, Correlation, Missing Values, Categorical, Outliers, Time Series, Clustering, Feature Importance
- Adaptive Scoring: Recommendations learn from your feedback and adjust in real time
- Explainable Recommendations: Every score decomposes into its components with confidence intervals
- Interactive Visualizations: Plotly charts with zoom, pan, hover, and download
- AI-Powered Insights: LLM-generated observations for each analysis (Ollama, OpenRouter, Groq, or Custom API)
- Chat with Your Data: Ask natural language questions about your dataset
- Smart Column Naming: AI suggests names for unnamed columns
- NLQ Classifier: Understands queries like "show me outliers in revenue"
- Preference Tracking: 👍/👎 feedback adjusts analysis priorities
- Temporal Decay: Older preferences fade over time
- Novelty Dampening: Avoids repeating the same analyses
- Column Affinity: Boosts analyses involving columns you frequently explore
- ε-Greedy Exploration: Occasionally shows unexpected analyses to discover new insights
- 10-Step Quality Pipeline: Normalizes, deduplicates, infers types, and scores your data
- Per-Row Outlier Explainability: See which column triggered each outlier and why
- Progressive Sampling: For large datasets (>50k rows), offers stratified sampling down to ~10k rows
- Session Persistence: Save/load via SQLite
- 68 Tests: Comprehensive test suite
- Rate Limiting: Remote API calls capped at 10/minute
- GPU Acceleration: Ollama auto-uses GPU with CPU fallback
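The ε-greedy exploration item in the list above can be illustrated with a minimal sketch. The function name `pick_analysis`, the `epsilon` default of 0.1, and the score values are assumptions made for illustration; the project's own selection logic may differ.

```python
import random

def pick_analysis(scores, epsilon=0.1, rng=None):
    """Return one analysis name using epsilon-greedy selection.

    With probability `epsilon` a random analysis is chosen (exploration);
    otherwise the highest-scoring one is returned (exploitation).
    `epsilon=0.1` is an illustrative default, not the project's setting.
    """
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return max(scores, key=scores.get)

# Hypothetical scores for three analysis types.
scores = {"distribution": 0.82, "correlation": 0.64, "outliers": 0.41}
print(pick_analysis(scores, epsilon=0.0))  # epsilon=0 always exploits: distribution
```

A small `epsilon` keeps the ranking mostly stable while still surfacing an occasional low-ranked analysis.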
Most EDA tools give you static reports. X-Adaptive EDA does three things differently:
- It learns: Every 👍/👎 shifts future recommendations toward what you care about
- It explains: No black boxes. Every score shows its formula, and counterfactual sliders let you ask "what if?"
- It adapts in real time: No waiting for retraining. Feedback takes effect immediately.
This makes it ideal for:
- Data scientists doing exploratory analysis
- Analysts who need quick, relevant insights
- Students learning data analysis
- Teams exploring unfamiliar datasets
| Layer | Technology |
|---|---|
| UI | Streamlit (≥1.36) |
| Data | pandas, NumPy |
| Visualization | Plotly |
| LLM | Ollama (local), OpenRouter, Groq, Custom API |
| NLP | Custom tokenizer + stemmer (no external deps) |
| Persistence | SQLite, JSON |
| Testing | pytest-compatible test files |
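To make the "custom tokenizer + stemmer" row concrete, here is a rough, dependency-free sketch of a keyword-count query classifier. The keyword lists, the suffix rules, and the names `stem` and `classify` are all hypothetical; the project's real classifier lives in nlq_engine.py and is not reproduced here.

```python
import re

# Hypothetical keyword lists; the project's synonym tables are not shown in this README.
KEYWORDS = {
    "outliers": {"outlier", "anomaly", "extreme"},
    "correlation": {"correlation", "correlate", "relationship"},
    "missing_values": {"missing", "null", "nan"},
}

def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Stem the keywords once so queries and keywords are compared in the same form.
STEMMED = {label: {stem(w) for w in words} for label, words in KEYWORDS.items()}

def classify(query):
    """Return the label whose keywords occur most often in the query, or None."""
    stems = [stem(t) for t in re.findall(r"[a-z]+", query.lower())]
    counts = {label: sum(s in vocab for s in stems) for label, vocab in STEMMED.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

print(classify("show me outliers in revenue"))  # outliers
```

Counting keyword hits is a simple form of term-frequency scoring; it needs no external packages, which matches the "no external deps" constraint in the table.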
┌─────────────────────────────────────────────────────────┐
│                  Streamlit UI (app.py)                  │
│  Sidebar: Dataset • Priorities • AI • Sessions          │
│  Main: Recommendations • Visualizations • Chat          │
└─────────┬───────────────────────────────────┬───────────┘
          │                                   │
    ┌─────▼─────┐                     ┌───────▼───────┐
    │   Data    │                     │ Recommendation│
    │ Processor │                     │    Engine     │
    │ + Quality │                     │  (scoring,    │
    │ Pipeline  │                     │   ranking)    │
    └─────┬─────┘                     └───────┬───────┘
          │                                   │
    ┌─────▼─────┐                     ┌───────▼───────┐
    │    LLM    │                     │  Preference   │
    │  Adapter  │                     │    Tracker    │
    │ (insights,│                     │ (adaptation)  │
    │   chat)   │                     └───────────────┘
    └───────────┘
Upload → Cleanse → Profile → Score → Rank → Visualize → Feedback → Adapt
            ↓                  ↓
     Quality Report  Counterfactual Slider
- Python 3.10+
- (Optional) Ollama for local LLM features
# Clone the repository
git clone https://github.com/AshayK003/XadaptiveEDA.git
cd XadaptiveEDA
# Create virtual environment
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py
Open http://localhost:8501 in your browser.
For LLM features, copy .env.example to .env and add your API keys:
cp .env.example .env
# Edit .env with your keys
No API key is needed for local Ollama; just install and run it.
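As a rough illustration of how a provider's key might be looked up once the variables from .env are in the process environment, consider the sketch below. The variable names come from the provider table in this README; the function `resolve_api_key`, the provider labels, and the error message are hypothetical, and how the app actually loads .env is not shown here.

```python
import os

# Variable names are taken from this README; the provider labels are hypothetical.
KEY_VARS = {
    "openrouter": "OPENROUTER_API_KEY",
    "groq": "GROQ_API_KEY",
    "custom": "CUSTOM_API_KEY",
}

def resolve_api_key(provider, env=os.environ):
    """Return the API key for a remote provider, or None for local Ollama."""
    if provider == "ollama":
        return None  # the local provider needs no key
    var = KEY_VARS[provider]
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; add it to .env")
    return key

print(resolve_api_key("ollama"))  # None
```

Failing early with the variable name in the message makes a missing key easier to diagnose than a rejected request later on.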
- Upload a CSV, Excel, or JSON file
- Rename unnamed columns (AI suggestions or manual)
- Finalize to generate the full analysis
- Explore recommended analyses ranked by relevance
- Give feedback (👍/👎) to refine future recommendations
- Chat with your data in natural language
# The app runs via Streamlit; no Python code needed
# Just run:
streamlit run app.py
# Then in the browser:
# 1. Upload sales_data.csv
# 2. Click "Finalize Dataset"
# 3. Click 👍 on "Distribution Analysis"
# 4. Ask: "What's the correlation between price and quantity?"
Toggle Dev Mode in the sidebar to reveal:
- Raw DataFrame viewer
- CSV download button
- Full recommendation JSON with all scoring components
Choose a preset goal to automatically weight analysis types:
| Goal | Distribution | Correlation | Missing | Categorical | Outliers | Time Series | Clustering | Feature Imp |
|---|---|---|---|---|---|---|---|---|
| General | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Distributions | 0.9 | 0.3 | 0.3 | 0.3 | 0.8 | 0.3 | 0.3 | 0.4 |
| Relationships | 0.3 | 0.9 | 0.3 | 0.5 | 0.3 | 0.8 | 0.6 | 0.7 |
| Data Quality | 0.3 | 0.3 | 0.9 | 0.7 | 0.5 | 0.3 | 0.3 | 0.5 |
final_score = base_score × data_relevance × user_pref × quality_adj
            × diversity_penalty × novelty_penalty × avoidance_penalty × affinity_boost
All multipliers are documented in recommendation_engine.py.
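The multiplicative formula above can be sketched in a few lines. The factor names follow the formula; the function `final_score` and the example values are made up for illustration and are not taken from recommendation_engine.py.

```python
from math import prod

# Factor names follow the formula above.
FACTORS = (
    "base_score", "data_relevance", "user_pref", "quality_adj",
    "diversity_penalty", "novelty_penalty", "avoidance_penalty", "affinity_boost",
)

def final_score(components):
    """Multiply all eight components; a missing factor raises KeyError."""
    return prod(components[name] for name in FACTORS)

# Hypothetical component values for one recommendation.
example = {
    "base_score": 0.8, "data_relevance": 0.9, "user_pref": 0.5, "quality_adj": 1.0,
    "diversity_penalty": 1.0, "novelty_penalty": 0.9, "avoidance_penalty": 1.0,
    "affinity_boost": 1.1,
}
print(round(final_score(example), 4))  # 0.3564
```

Because the score is a product, a factor of 1.0 leaves it unchanged, a factor below 1.0 acts as a penalty, and a factor above 1.0 acts as a boost.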
| Provider | Key | Default Model | Rate Limit |
|---|---|---|---|
| Local (Ollama) | None | qwen2.5-coder:7b | Unlimited |
| OpenRouter | OPENROUTER_API_KEY | qwen/qwen2.5-7b-instruct | 10/60s |
| Groq | GROQ_API_KEY | llama-3.3-70b-versatile | 10/60s |
| Custom | CUSTOM_API_KEY + endpoint | Configurable | 10/60s |
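One way to enforce a limit of 10 calls per 60 seconds is a sliding-window counter, sketched below. This README only states the limit itself, so the sliding-window approach, the class name `SlidingWindowLimiter`, and the per-provider dictionary are assumptions rather than a description of llm_adapter.py.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `window_s`-second window."""

    def __init__(self, max_calls=10, window_s=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock          # injectable so tests can control time
        self._stamps = deque()      # timestamps of recently permitted calls

    def allow(self):
        """Record and permit the call if under the limit; otherwise refuse."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False

# One limiter per remote provider; the local provider is left unlimited.
limiters = {"openrouter": SlidingWindowLimiter(), "groq": SlidingWindowLimiter()}
print(all(limiters["groq"].allow() for _ in range(10)))  # True: first 10 calls pass
print(limiters["groq"].allow())                          # False: 11th call is refused
```

Keeping a separate limiter per provider means that exhausting one provider's budget does not block calls to another.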
x-adaptive-eda/
├── app.py                      # Streamlit UI (orchestration, ~970 lines)
├── data_processor.py           # File loading, cleansing, profiling
├── data_quality.py             # 10-step quality pipeline, QualityReport
├── recommendation_engine.py    # Scoring, ranking, penalties, bootstrap CI
├── preference_learner.py       # Fixed-delta adaptation, goals, decay
├── insight_generator.py        # Explainable recommendations, comparisons
├── visualization_generator.py  # Plotly charts (8 types, k-means, MI)
├── constants.py                # Analysis types, preferences, goals
├── llm_adapter.py              # LLM integration, rate limiting, chat
├── nlq_engine.py               # NLP query classifier (no external deps)
├── session_persistence.py      # SQLite save/load for sessions
├── requirements.txt            # 6 dependencies
├── .env.example                # Environment variable template
├── LICENSE                     # MIT License
├── README.md                   # This file
├── test_phase1.py              # Core tests (diversity, tracking, explanations)
├── test_phase2.py              # Column interestingness, sampling, summary
├── test_phase3.py              # Goals, decay, save/load
├── test_phase4.py              # NLQ classifier (stemming, synonyms, TF)
├── test_data_quality.py        # 12 quality pipeline tests
├── test_session_persistence.py # SQLite persistence tests
└── test_rate_limit.py          # Rate limiting tests
# Install dependencies
pip install -r requirements.txt
# Run tests
python test_phase1.py && python test_phase2.py && python test_phase3.py && python test_phase4.py && python test_data_quality.py && python test_session_persistence.py && python test_rate_limit.py
# Run the app
streamlit run app.py
- snake_case for functions/variables
- PascalCase for classes
- Docstrings on all public functions
- Structured logging via logging.getLogger(__name__)
- No print() in source files (only in tests)
68 tests across 7 test files:
| File | Tests | Coverage |
|---|---|---|
| test_phase1.py | 5 | Diversity, tracking, explanations, regression |
| test_phase2.py | 5 | Column interestingness, sampling, summary |
| test_phase3.py | 7 | Goals, decay, save/load round-trip |
| test_phase4.py | 12 | NLQ classifier (stemming, synonyms, TF scoring) |
| test_data_quality.py | 12 | 10-step quality pipeline |
| test_session_persistence.py | 7 | SQLite save/load/list/delete |
| test_rate_limit.py | 5 | Rate limiting (local, remote, separate providers) |
# Run all tests
python test_phase1.py && python test_phase2.py && python test_phase3.py && python test_phase4.py && python test_data_quality.py && python test_session_persistence.py && python test_rate_limit.py
- 8 analysis types with adaptive scoring
- Explainable recommendations with score decomposition
- Interactive Plotly visualizations
- LLM integration (Ollama, OpenRouter, Groq, Custom)
- Chat with your data
- Session persistence (SQLite)
- ε-greedy exploration
- Rate limiting for remote APIs
- 68 tests passing
- MIT License
- Spearman/Kendall correlation options
- Custom k-means cluster count slider
- Export analysis report as PDF/HTML
- Multi-dataset comparison
- Dashboard mode (persistent charts)
- Plugin system for custom analysis types
- Collaborative sessions (multi-user)
Contributions welcome! Here's how:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow existing code style
- Add tests for new features
- Update README if needed
- Keep PRs focused (one feature per PR)
This project is licensed under the MIT License; see LICENSE for details.
- Streamlit: Web framework
- Plotly: Interactive visualizations
- Ollama: Local LLM hosting
- pandas: Data manipulation
Q: Do I need an API key to use this?
A: No. Local Ollama works without any API keys. API keys are only needed for the OpenRouter, Groq, or Custom API providers.
Q: What file formats are supported?
A: CSV, XLSX, XLS, and JSON files up to ~50 MB.
Q: How does the adaptation work?
A: Fixed-delta adjustments (not ML). 👍 adds 0.10, 👎 subtracts 0.10, and column selection adds 0.03. All weights are kept within [0.1, 1.0].
Q: Can I save my session?
A: Yes. Click "Save Session" in the sidebar. Sessions persist in SQLite at ~/.eda_assistant_sessions.db.
Q: How accurate are the AI insights?
A: Insights are generated from your actual data values, not from pre-written templates. Quality depends on the LLM provider and model used.
Q: Is my data sent to external servers?
A: Only if you use OpenRouter, Groq, or Custom API. Local Ollama keeps everything on your machine.
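The fixed-delta adaptation described in the FAQ can be sketched as follows. The deltas and the [0.1, 1.0] bounds come from the FAQ; the function name `apply_feedback`, the event labels, and the rounding to two decimals are assumptions added for illustration, as is the choice to apply the column-selection delta to the same weight.

```python
LIKE_DELTA = 0.10      # thumbs up (value from the FAQ)
DISLIKE_DELTA = -0.10  # thumbs down (value from the FAQ)
SELECT_DELTA = 0.03    # column selection (value from the FAQ)
MIN_W, MAX_W = 0.1, 1.0

def apply_feedback(weights, analysis, event):
    """Return a new weights dict with one fixed-delta update applied and clamped."""
    delta = {"like": LIKE_DELTA, "dislike": DISLIKE_DELTA, "select": SELECT_DELTA}[event]
    updated = dict(weights)
    # Rounding is an illustrative choice to avoid floating-point drift.
    updated[analysis] = min(MAX_W, max(MIN_W, round(updated[analysis] + delta, 2)))
    return updated

weights = {"distribution": 0.5, "outliers": 0.15}
weights = apply_feedback(weights, "distribution", "like")  # 0.5 -> 0.6
weights = apply_feedback(weights, "outliers", "dislike")   # 0.05 is clamped up to 0.1
print(weights)  # {'distribution': 0.6, 'outliers': 0.1}
```

Because each update is a constant step rather than a trained model, feedback takes effect on the very next ranking.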
| Issue | Solution |
|---|---|
| App won't start | Check Python version (3.10+), run pip install -r requirements.txt |
| Ollama not reachable | Run ollama serve in a terminal |
| GPU not detected | Install NVIDIA drivers, restart Ollama |
| Slow LLM responses | Use CPU mode: set OLLAMA_NUM_GPU=0 before starting Ollama |
| Large file warning | Files >50 MB may be slow; use sampling for datasets >50k rows |
| Import errors | Ensure virtual environment is activated |
- API keys stored in .env (gitignored)
- No hardcoded secrets in source code
- Parameterized SQL queries (no injection risk)
- Local Ollama keeps data on your machine
- Remote API calls rate-limited to 10/60s
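To show what parameterized queries look like in practice, here is a small sketch using Python's built-in sqlite3 module. The table name, column names, and the functions `open_store`, `save_session`, and `load_session` are hypothetical and do not describe the schema used by session_persistence.py.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open a database and make sure the (hypothetical) sessions table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (name TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn

def save_session(conn, name, state):
    """Store a session as JSON using a parameterized statement."""
    # The `?` placeholders keep user-supplied values out of the SQL text.
    conn.execute(
        "INSERT OR REPLACE INTO sessions (name, state) VALUES (?, ?)",
        (name, json.dumps(state)),
    )
    conn.commit()

def load_session(conn, name):
    """Return the stored state dict, or None if no such session exists."""
    row = conn.execute("SELECT state FROM sessions WHERE name = ?", (name,)).fetchone()
    return json.loads(row[0]) if row else None

conn = open_store()  # in-memory database for the example
tricky_name = "demo'; DROP TABLE sessions;--"
save_session(conn, tricky_name, {"weights": {"outliers": 0.6}})
print(load_session(conn, tricky_name))  # {'weights': {'outliers': 0.6}}
```

The session name containing SQL fragments is stored and retrieved as plain data, because placeholder values are never interpreted as part of the statement.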
Built with ❤️ for the data community