An intelligent data cleaning assistant that combines deterministic statistical profiling (Pandas) with probabilistic reasoning (Llama-3). This tool analyzes CSV datasets, identifies quality issues (missing values, duplicates, type mismatches), and automatically generates Python code to fix them.
- Hybrid Architecture: Uses Pandas for accurate calculations (counting rows, nulls) and LLM for qualitative interpretation.
- Automated EDA: Instantly profiles datasets to find outliers and anomalies.
- Code Generation: Instead of just pointing out errors, it writes the exact
pandascode to fix them. - Scalable: Since only the metadata (statistics) is sent to the LLM, it can handle large datasets without hitting token limits.
- Analysis Engine: Python (Pandas)
- Reasoning Engine: Llama-3.3-70b (via Groq API)
- UI: Streamlit
- Architecture: Hybrid (Deterministic + Generative)
-
Clone the repository:
git clone [https://github.com/Mervecaliskann/AI-Data-Analyst.git](https://github.com/Mervecaliskann/AI-Data-Analyst.git) cd AI-Data-Analyst -
Install dependencies:
pip install -r requirements.txt
-
Set up environment variables: Create a
.envfile:GROQ_API_KEY=your_groq_api_key_here
-
Run the application:
streamlit run app.py
- Upload a CSV file (e.g., raw sales data, customer logs).
- View the automated statistical profile (missing values, duplicates).
- Click "Analyze with AI" to get a detailed cleaning report and copy-pasteable Python code fixes.
Developed by Merve ΓalΔ±Εkan