@@ -0,0 +1,53 @@
# Python artifacts
__pycache__/
*.pyc
*.pyo
*.pyd
*.egg-info/
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Virtual environments
.venv/
venv/
env/

# Jupyter
.ipynb_checkpoints/

# Git / tooling
.git/
.gitignore
.gitattributes
.claude/

# Editor / OS
.DS_Store
.idea/
.vscode/
*.swp

# Secrets and local env
.env
.env.local
*.pem
*.key

# Generated / regenerable artifacts
data/
logs/
*.log
collection_stats.json
storage_status.json
docker_build.version.log
tmp.pytest.log

# Docker build scripts (not needed inside the image)
docker_*.sh
docker_name.sh
Dockerfile.python_slim
Dockerfile.uv

# Docs (the runtime doesn't read these)
*.md
@@ -0,0 +1,178 @@
# =============================================================================
# Python / General
# =============================================================================

# Python virtual environments
.venv/
venv/
env/
ENV/
.conda/

# Python cache and bytecode
__pycache__/
*.py[cod]
*$py.class
*.pyc
*.pyo
.Python

# Python build artifacts
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
*.egg
*.manifest
*.spec

# =============================================================================
# Testing and Coverage
# =============================================================================
.pytest_cache/
.coverage
.coverage.*
htmlcov/
.tox/
.nox/
coverage.xml
*.cover
*.py,cover

# =============================================================================
# Type Checkers and Linters
# =============================================================================
.mypy_cache/
.dmypy.json
dmypy.json
.pyrightcache/
.ruff_cache/
.lint_cache/

# =============================================================================
# IDE and Editor Settings
# =============================================================================

# JetBrains (PyCharm, IntelliJ, etc.)
.idea/
*.iml
*.ipr
*.iws

# VS Code
.vscode/
*.code-workspace

# Vim
*.swp
*.swo
*~
*.vim

# Emacs
*~
\#*\#
.\#*

# macOS
.DS_Store
.AppleDouble
.LSOverride

# Windows
Thumbs.db
ehthumbs.db
Desktop.ini

# =============================================================================
# Environment Variables and Secrets
# =============================================================================
.env
.env.local
.env.*.local
.env.development
.env.production
.env.test
*.env

# Secrets and credentials
*.pem
*.key
secrets.json
credentials.json
service_account.json

# =============================================================================
# Claude and AI Tools
# =============================================================================
.claude/
.claude-code/
*.claude.local.*

# =============================================================================
# Jupyter Notebooks
# =============================================================================
.ipynb_checkpoints/
.jupyter/
*.ipynb_checkpoints

# =============================================================================
# Logs and Temp Files
# =============================================================================
*.log
logs/
tmp/
temp/
scratch/
.scratch/

# =============================================================================
# Local pipeline artifacts (regenerated by collector / status scripts)
# =============================================================================
collection_stats.json
storage_status.json

# =============================================================================
# Data and Model Artifacts
# =============================================================================
data/*
!data/.gitkeep
models/*
!models/.gitkeep
checkpoints/
*.h5
*.pth
*.pt
*.onnx

# =============================================================================
# Docker
# =============================================================================
*.dockerfile.local
docker-compose.override.yml

# =============================================================================
# Local Configuration (Keep Templates)
# =============================================================================
*.local.yaml
*.local.json
*.local.yml
config.local.*

# =============================================================================
# Documentation Build
# =============================================================================
docs/_build/
site/
__pypackages__/
@@ -0,0 +1,128 @@
# CLAUDE.md

- Guidance for Claude Code when working in this project
- The repo-wide `CLAUDE.md` at `/Users/gprakash/src/umd_classes1/CLAUDE.md` still
applies (coding style, testing, notebook, markdown rules) — this file only
adds project-specific context

# What This Project Is

- A txtai-based market research platform for the DATA605 Spring 2026 course
(`UmdTask430`)
- End-to-end it does three things:
  - **Collect**: SEC EDGAR filings + news (NewsAPI, Alpha Vantage) into a
    four-tier store
  - **Search**: an agentic pipeline routes a query to sub-agents (`sec`,
    `news`), retrieves top-k chunks, and synthesizes a cited answer
  - **Serve**: a FastAPI service with a Streamlit UI
- See `RUN_INSTRUCTIONS.md` for the full quickstart, env vars, and
troubleshooting — do not duplicate that here
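- A toy sketch of the route → retrieve → synthesize shape described above, for
orientation only. Every name here (`route`, `retrieve`, `synthesize`,
`answer`, `CORPORA`) and the placeholder corpus are hypothetical, and keyword
overlap stands in for the real txtai retrieval; the actual implementation
lives in `app/agents/research_agent.py`
```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (source_id, text)

# Placeholder corpora standing in for the real stores (illustrative only).
CORPORA: Dict[str, List[Chunk]] = {
    "sec": [
        ("sec-1", "annual report discusses revenue growth"),
        ("sec-2", "risk factors include supply constraints"),
    ],
    "news": [
        ("news-1", "company announces new product"),
        ("news-2", "analysts discuss revenue outlook"),
    ],
}


def route(query: str) -> List[str]:
    """Pick sub-agents by keyword; fall back to all of them."""
    q = query.lower()
    agents = []
    if any(word in q for word in ("10-k", "filing", "annual")):
        agents.append("sec")
    if any(word in q for word in ("news", "announce", "headline")):
        agents.append("news")
    return agents or ["sec", "news"]


def retrieve(agent: str, query: str, k: int = 1) -> List[Chunk]:
    """Return the k chunks sharing the most words with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPORA[agent],
        key=lambda chunk: -len(terms & set(chunk[1].split())),
    )
    return ranked[:k]


def synthesize(query: str, chunks: List[Chunk]) -> str:
    """Join the retrieved chunks, tagging each with its source id."""
    cited = "; ".join(f"{text} [{source_id}]" for source_id, text in chunks)
    return f"{query}: {cited}"


def answer(query: str, k: int = 1) -> str:
    """Run the three stages in order for one query."""
    chunks: List[Chunk] = []
    for agent in route(query):
        chunks.extend(retrieve(agent, query, k))
    return synthesize(query, chunks)
```
- In this sketch a query that matches no routing keyword is sent to both
sub-agents, so each contributes its top-k chunks to the cited answer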

# Repository Layout

- `app/`: application code (importable as `app.*`)
  - `agents/research_agent.py`: core agentic pipeline; `run_research(query)`
    streams events, `run_research_sync(query)` returns a single dict
  - `agents/{diligence,earnings,regulatory,sentiment,web_research,orchestrator}.py`:
    domain agents used by the dashboard/chat UI
  - `api/server.py`: FastAPI app — `GET /`, `POST /research`,
    `POST /research/stream` (SSE)
  - `collectors/`: `base_collector.py`, `sec_collector.py`,
    `news_collector.py`
  - `pipeline/`: `ingest.py` (chunking/normalization), `embeddings.py` (txtai
    index)
  - `storage/`: four-tier storage clients
    - `hot_storage/`: KeyDB (live prices, semantic cache, sessions)
    - `warm_storage/`: PostgreSQL + pgvector (filings, chunks, XBRL facts)
    - `cold_storage/`: MinIO (raw filings archive)
    - `cache_manager.py`: thin wrapper over KeyDB
  - `ui/`: Streamlit pages — `research.py` (agent chat), `dashboard.py`,
    `chat.py`; entrypoint is `app/main.py`
- `scripts/`: one-shot CLIs (`run_sec_collector`, `run_sec_bulk`,
  `run_all_collectors`, `backfill_txtai_from_chunks`, `eval_research`,
  `check_storage_status`)
- `sql/init.sql`: PostgreSQL schema (mounted into the postgres container)
- `data/`: persisted txtai index (`documents`, `embeddings`, `index.db`,
  `config.json`) — large binary files, do not commit
- `notebooks/`: jupytext-paired notebooks
  - `txtai.API.{ipynb,py}`: txtai library primitives in isolation
  - `txtai.example.{ipynb,py}`: full ingest → search → agent demo
- `docker-compose.yml`: brings up KeyDB, PostgreSQL+pgvector, MinIO

# Common Commands

- Bring up infra (KeyDB, Postgres+pgvector, MinIO):
```bash
> docker-compose up -d
> docker-compose ps
```
- Install deps (Python 3.11+):
```bash
> python -m venv .venv && source .venv/bin/activate
> pip install -r requirements.txt
```
- Collect data (one-time):
```bash
> python -m scripts.run_sec_bulk --group all --skip-existing --limit 10
> python -m scripts.run_all_collectors --tickers AAPL,MSFT,NVDA --skip-sec --no-search
> python -m scripts.backfill_txtai_from_chunks --from-scratch
```
- Run API + UI:
```bash
> uvicorn app.api.server:app --host 127.0.0.1 --port 8000 &
> streamlit run app/ui/research.py --server.port 8501
```
- Evaluate the pipeline:
```bash
> python -m scripts.eval_research --warmup
> python -m scripts.eval_research --repeats 5 --json logs/eval.json
```
- Inspect storage state:
```bash
> python -m scripts.check_storage_status
```

# Configuration

- Secrets and connection strings live in `.env` (template: `.env.example`)
- Required keys for a full run:
  - `SEC_USER_AGENT`: SEC EDGAR requires a real contact email
  - `NEWSAPI_KEY`, `ALPHAVANTAGE_API_KEY`: news collectors
  - `OPENAI_API_KEY`: embeddings (txtai default backend)
- Optional for LLM-backed answer synthesis:
  - `LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL` (any OpenAI-compatible endpoint,
    including local Ollama) — without these the synthesizer falls back to an
    extractive template
- Never hardcode secrets in code; always read them from environment variables
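- A minimal sketch of reading the keys above from the environment. `Settings`
and `load_settings` are hypothetical names, not existing helpers in this
repo, and treating the LLM settings as all-or-nothing is an assumption
```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Keys listed above as required for a full run.
REQUIRED_KEYS = (
    "SEC_USER_AGENT",
    "NEWSAPI_KEY",
    "ALPHAVANTAGE_API_KEY",
    "OPENAI_API_KEY",
)
# Optional keys for LLM-backed answer synthesis.
LLM_KEYS = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for values pulled from the environment."""

    required: Dict[str, str]
    llm: Optional[Dict[str, str]]  # None means: use the extractive template


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast on missing required keys; LLM keys stay optional."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    required = {key: env[key] for key in REQUIRED_KEYS}
    # Assumption: synthesis is only enabled when all three LLM keys are set.
    llm_values = {key: env.get(key, "") for key in LLM_KEYS}
    llm = llm_values if all(llm_values.values()) else None
    return Settings(required=required, llm=llm)
```
- Passing a plain dict instead of `os.environ` keeps the loader easy to test
without touching the real environment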

# Conventions Specific to This Project

- Python imports use module-qualified names (`from app.storage import ...`,
`from app.agents.research_agent import ...`) — keep it that way
- Storage clients are accessed via factory helpers (`get_keydb_client`,
`get_postgres_client`, `get_minio_client`, `get_cache_manager`,
`get_embeddings`) — do not instantiate clients directly in new code
- Collectors inherit from `app/collectors/base_collector.py`; new sources
should follow the same `fetch → normalize → store` flow
- Long-running collection scripts must support `--skip-existing` and
`--no-search` style toggles so partial runs are cheap
- Logs go to `logs/` (gitignored); persistent artifacts go to `data/`
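- A minimal sketch of the `fetch → normalize → store` flow with a
skip-existing toggle. `SketchCollector` and `InMemoryCollector` are
hypothetical stand-ins; the real base class in
`app/collectors/base_collector.py` may expose a different interface
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class SketchCollector(ABC):
    """Hypothetical stand-in for the project's base collector."""

    @abstractmethod
    def fetch(self) -> Iterable[Dict[str, Any]]:
        """Return raw records from the upstream source."""

    @abstractmethod
    def normalize(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """Map one raw record to the shape the store expects."""

    @abstractmethod
    def store(self, record: Dict[str, Any]) -> None:
        """Persist one normalized record."""

    def exists(self, record: Dict[str, Any]) -> bool:
        """Report whether the record is already stored (default: never)."""
        return False

    def run(self, skip_existing: bool = False) -> int:
        """Run the three stages in order; return how many records were stored."""
        stored = 0
        for raw in self.fetch():
            record = self.normalize(raw)
            if skip_existing and self.exists(record):
                continue  # cheap partial run: nothing is rewritten
            self.store(record)
            stored += 1
        return stored


class InMemoryCollector(SketchCollector):
    """Toy source that keeps everything in a dict, for illustration only."""

    def __init__(self, raw_items: List[Dict[str, Any]]) -> None:
        self.raw_items = list(raw_items)
        self.saved: Dict[str, Dict[str, Any]] = {}

    def fetch(self) -> Iterable[Dict[str, Any]]:
        return iter(self.raw_items)

    def normalize(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        return {"id": str(raw["id"]), "text": raw["text"].strip()}

    def exists(self, record: Dict[str, Any]) -> bool:
        return record["id"] in self.saved

    def store(self, record: Dict[str, Any]) -> None:
        self.saved[record["id"]] = record
```
- In this sketch a second `run(skip_existing=True)` over the same input stores
nothing, which is the property that makes partial runs cheap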

# Things to Avoid

- Do not commit `.env`, `data/`, `logs/`, `.venv/`, or anything in
`**/__pycache__/`
- Do not edit `data/{documents,embeddings,index.db}` by hand — rebuild via
`scripts.backfill_txtai_from_chunks`
- Do not bypass the storage tier abstraction (e.g. talking directly to psycopg
or boto from agent code) — go through `app.storage`
- Do not add new top-level docs files for one-off notes; extend
`RUN_INSTRUCTIONS.md` or this file

# When in Doubt

- Architecture / data flow: `RUN_INSTRUCTIONS.md` and `app/storage/README.md`
- Schema: `sql/init.sql`
- Agent behavior: `app/agents/research_agent.py` (router → retrievers →
synthesizer)
- Eval / benchmarks: `scripts/eval_research.py`