diff --git a/README.md b/README.md
index cc2f0cf..6dce38c 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,25 @@ DataHelm is a data engineering framework focused on the following:
 - Source ingestion and orchestration
 - dbt transformation workflows
 - Notebook-based dashboard execution
+- Reusable provider connectors (SharePoint, GCS, S3, BigQuery)
+- Optional local LLM analytics query scaffolding
+
+## Table of Contents
+
+
+- [Core Capabilities](#core-capabilities)
+- [High-Level Architecture](#high-level-architecture)
+- [Repository Structure](#repository-structure)
+- [Local Setup](#local-setup)
+- [Configuration Model](#configuration-model)
+- [Reusable Connectors](#reusable-connectors)
+- [Local LLM Analytics Module](#local-llm-analytics-module)
+- [Testing](#testing)
+- [CI/CD and Branching](#cicd-and-branching)
+- [Containerization](#containerization)
+- [Deployment](#deployment)
+- [Contributing and Governance](#contributing-and-governance)
+- [Detailed Technical Documentation](#detailed-technical-documentation)
 - Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
 - Optional local LLM analytics query scaffolding
 
@@ -12,6 +31,7 @@ DataHelm is a data engineering framework focused on the following:
 
 ## Core Capabilities
 
+
 - **Config-driven ingestion** using YAML in `config/api/`
 - **Dagster orchestration** for managing jobs, schedules, and sensors
 - **dbt project execution** through `analytics/dbt_runner.py` and dbt configuration files
@@ -21,18 +41,28 @@ DataHelm is a data engineering framework focused on the following:
 
 ## High-Level Architecture
 
+
+The repository follows layered responsibilities:
 The repository follows a layered responsibility structure:
 
 - `handlers/`: provider-specific source connectors and API handlers
-- `ingestion/`: ingestion factory + native ingestion implementations
+- `ingestion/`: ingestion factory and native ingestion implementations
 - `analytics/`: dbt, dashboard, and optional NL-query modules
 - `dagster_op/`: orchestration objects (jobs, schedules, repository)
-- `config/`: all runtime configuration (api, dbt, dashboard, analytics metadata)
+- `config/`: all runtime configuration (API, dbt, dashboard, analytics metadata)
 - `tests/`: unit tests for handlers, ingestion, analytics, and scripts
 
+![alt text](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)
+
 ## Repository Structure
 
+
 ```text
+dagster_op/
+ingestion/
+tests/
+scripts/
+docs/
 config/
   api/
   dbt/
@@ -57,6 +87,7 @@ docs/
 
 ## Local Setup
 
+
 ### Prerequisites
 
 Python 3.12+
@@ -76,6 +107,7 @@ pip install -e .
 
 ### Environment Variables
 
+Create a `.env` file in the repository root with required values, for example:
 Create a file named `.env` in the root of the repository with the required values, for example:
 
 ```text
@@ -103,6 +135,8 @@ python scripts/run_dagster_dev.py --print-only
 
 ## Configuration Model
 
+
+### Ingestion Config (`config/api/*.yaml`)
 ### Ingestion Config (config/api/*.yaml)
 
 Defines source-level extraction, publish targets, schedules, and column mapping.
@@ -110,7 +144,7 @@ Example included: CLASHOFCLANS_PLAYER_STATS
 
 ### dbt Config (config/dbt/projects.yaml)
 
-Defines dbt units, selection/exclusion rules, vars, and schedules.
+Defines dbt units, selection/exclusion rules, variables, and schedules.
 
 ### Dashboard Config (config/dashboard/projects.yaml)
 
@@ -122,6 +156,32 @@ Defines dataset metadata for the isolated NL-to-SQL module.
 
 ## Reusable Connectors
 
+
+The repository includes reusable connector classes under `handlers/`:
+
+- `handlers/sharepoint/sharepoint.py`
+  - Microsoft Graph authentication and site/file access helpers
+- `handlers/gcs/gcs.py`
+  - Upload/download/list/delete/signed URL helpers
+- `handlers/s3/s3.py`
+  - Upload/download/list/delete/presigned URL helpers
+- `handlers/bigquery/bigquery.py`
+  - Query, row fetch, dataframe load, schema helpers
+
+## Local LLM Analytics Module
+
+
+`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
+
+- Semantic catalog loader
+- SQL read-only safety guard
+- Ollama client wrapper
+- Orchestration service
+
+## Testing
+
+
+Run all tests:
 The repository includes reusable connector classes under handlers/:
 
 handlers/sharepoint/sharepoint.py – Microsoft Graph auth + site/file access helpers
@@ -148,6 +208,17 @@ Run all tests with the following command:
 
 The current test suite includes coverage for:
 
+- Ingestion and handler behavior
+- Analytics factory and runner logic
+- Connector modules (SharePoint, GCS, S3, BigQuery)
+- Script behavior
+- NL-query safety and service paths
+
+## CI/CD and Branching
+
+
+- `dev`: integration branch
+- `master`: release/production branch
 * Ingestion and handler behavior
 * Analytics factory and runner logic
 * Connector modules (SharePoint, GCS, S3, BigQuery)
@@ -156,8 +227,8 @@ The current test suite includes coverage for:
 
 ## CI/CD and Branching
 
-* dev: integration branch
-* master: release/production branch
+* `dev`: integration branch
+* `master`: release/production branch
 
 Workflows:
 
@@ -167,6 +238,10 @@ Workflows:
 
 ## Containerization
 
+
+The container image is defined via `Dockerfile`.
+
+The default runtime command starts Dagster gRPC:
 Container image is defined via Dockerfile.
 
 Default runtime command starts the Dagster gRPC server:
@@ -177,13 +252,25 @@ python -m dagster api grpc -m dagster_op.repository
 
 ## Deployment
 
+
 Deployment flow is workflow-based:
 
+- Production auto-path after successful Docker release
+- Manual staging/production dispatch path
+
+## Contributing and Governance
+
+
+- Contribution guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
+- Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)
+- Security reporting: [`SECURITY.md`](SECURITY.md)
 * Production auto-path after successful Docker release
 * Manual staging/production dispatch path
 
 ## Detailed Technical Documentation
 
+
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 
+- [`docs/document.md`](docs/document.md)
 docs/document.md