From 9e4b0a916efca6aae236babc7aa2172949177bc9 Mon Sep 17 00:00:00 2001
From: MURALIKUMAR J
Date: Sun, 15 Mar 2026 13:54:19 +0530
Subject: [PATCH 1/4] docs: README improvements

---
 README.md | 105 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 79 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index f9440bd..5b485b4 100644
--- a/README.md
+++ b/README.md
@@ -1,15 +1,37 @@
 # DataHelm
 
+============
+
 DataHelm is a data engineering framework focused on:
 
-- source ingestion and orchestration
+- Source ingestion and orchestration
 - dbt transformation workflows
-- notebook-based dashboard execution
-- reusable provider connectors (SharePoint, GCS, S3, BigQuery)
-- optional local LLM analytics query scaffolding
+- Notebook-based dashboard execution
+- Reusable provider connectors (SharePoint, GCS, S3, BigQuery)
+- Optional local LLM analytics query scaffolding
+
+## Table of Contents
+
+---
+
+- [Core Capabilities](#core-capabilities)
+- [High-Level Architecture](#high-level-architecture)
+- [Repository Structure](#repository-structure)
+- [Local Setup](#local-setup)
+- [Configuration Model](#configuration-model)
+- [Reusable Connectors](#reusable-connectors)
+- [Local LLM Analytics Module](#local-llm-analytics-module)
+- [Testing](#testing)
+- [CI/CD and Branching](#cicd-and-branching)
+- [Containerization](#containerization)
+- [Deployment](#deployment)
+- [Contributing and Governance](#contributing-and-governance)
+- [Detailed Technical Documentation](#detailed-technical-documentation)
 
 ## Core Capabilities
 
+---
+
 - **Config-driven ingestion** using YAML in `config/api/`
 - **Dagster orchestration** for jobs, schedules, and sensors
 - **dbt project execution** through `analytics/dbt_runner.py` and dbt configs
@@ -19,18 +41,29 @@ DataHelm is a data engineering framework focused on:
 
 ## High-Level Architecture
 
+---
+
 The repository follows layered responsibilities:
 
 - `handlers/`: provider-specific source connectors and API handlers
-- `ingestion/`: ingestion factory + native ingestion implementations
+- `ingestion/`: ingestion factory and native ingestion implementations
 - `analytics/`: dbt, dashboard, and optional NL-query modules
 - `dagster_op/`: orchestration objects (jobs, schedules, repository)
-- `config/`: all runtime configuration (api, dbt, dashboard, analytics metadata)
+- `config/`: all runtime configuration (API, dbt, dashboard, analytics metadata)
 - `tests/`: unit tests for handlers, ingestion, analytics, and scripts
 
+![alt text](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)
+
 ## Repository Structure
 
+---
+
 ```text
+dagster_op/
+ingestion/
+tests/
+scripts/
+docs/
 config/
   api/
   dbt/
@@ -55,6 +88,8 @@ docs/
 
 ## Local Setup
 
+---
+
 ### Prerequisites
 
 - Python 3.12+
@@ -72,7 +107,7 @@ pip install -e .
 
 ### Environment Variables
 
-Create a `.env` file in repository root with required values, for example:
+Create a `.env` file in the repository root with required values, for example:
 
 ```env
 DB_HOST=${DB_HOST}
@@ -97,6 +132,8 @@ python scripts/run_dagster_dev.py --print-only
 
 ## Configuration Model
 
+---
+
 ### Ingestion Config (`config/api/*.yaml`)
 
 Defines source-level extraction, publish targets, schedules, and column mapping.
@@ -107,7 +144,7 @@ Example currently included:
 
 ### dbt Config (`config/dbt/projects.yaml`)
 
-Defines dbt units, selection/exclusion rules, vars, and schedules.
+Defines dbt units, selection/exclusion rules, variables, and schedules.
 
 ### Dashboard Config (`config/dashboard/projects.yaml`)
 
@@ -119,28 +156,34 @@ Defines dataset metadata for the isolated NL-to-SQL module.
 
 ## Reusable Connectors
 
+---
+
 The repository includes reusable connector classes under `handlers/`:
 
 - `handlers/sharepoint/sharepoint.py`
-  - Microsoft Graph auth + site/file access helpers
+  - Microsoft Graph authentication and site/file access helpers
 - `handlers/gcs/gcs.py`
-  - upload/download/list/delete/signed URL helpers
+  - Upload/download/list/delete/signed URL helpers
 - `handlers/s3/s3.py`
-  - upload/download/list/delete/presigned URL helpers
+  - Upload/download/list/delete/presigned URL helpers
 - `handlers/bigquery/bigquery.py`
-  - query, row fetch, dataframe load, schema helpers
+  - Query, row fetch, dataframe load, schema helpers
 
 ## Local LLM Analytics Module
 
+---
+
 `analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
 
-- semantic catalog loader
+- Semantic catalog loader
 - SQL read-only safety guard
 - Ollama client wrapper
-- orchestration service
+- Orchestration service
 
 ## Testing
 
+---
+
 Run all tests:
 
 ```bash
@@ -149,14 +192,16 @@ Run all tests:
 
 The current test suite covers:
 
-- ingestion and handler behavior
-- analytics factory and runner logic
-- connector modules (SharePoint, GCS, S3, BigQuery)
-- script behavior
+- Ingestion and handler behavior
+- Analytics factory and runner logic
+- Connector modules (SharePoint, GCS, S3, BigQuery)
+- Script behavior
 - NL-query safety and service paths
 
 ## CI/CD and Branching
 
+---
+
 - `dev`: integration branch
 - `master`: release/production branch
 
@@ -168,9 +213,11 @@ Workflows:
 
 ## Containerization
 
-Container image is defined via `Dockerfile`.
+---
+
+The container image is defined via `Dockerfile`.
 
-Default runtime command starts Dagster gRPC:
+The default runtime command starts Dagster gRPC:
 
 ```bash
 python -m dagster api grpc -m dagster_op.repository
@@ -178,19 +225,25 @@ python -m dagster api grpc -m dagster_op.repository
 
 ## Deployment
 
+---
+
 Deployment flow is workflow-based:
 
-- production auto-path after successful Docker release
-- manual staging/production dispatch path
+- Production auto-path after successful Docker release
+- Manual staging/production dispatch path
 
 ## Contributing and Governance
 
-- Contribution guide: `CONTRIBUTING.md`
-- Code of conduct: `CODE_OF_CONDUCT.md`
-- Security reporting: `SECURITY.md`
+---
+
+- Contribution guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
+- Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)
+- Security reporting: [`SECURITY.md`](SECURITY.md)
 
 ## Detailed Technical Documentation
 
+---
+
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 
-- `docs/document.md`
+- [`docs/document.md`](docs/document.md)

From 171fc1e824a4a436b8064283a35b7119cf999ba7 Mon Sep 17 00:00:00 2001
From: MURALIKUMAR J
Date: Sun, 15 Mar 2026 13:57:48 +0530
Subject: [PATCH 2/4] docs: README improvements

---
 README.md | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/README.md b/README.md
index 5b485b4..4083b51 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,5 @@
 # DataHelm
 
-============
-
 DataHelm is a data engineering framework focused on:
 
 - Source ingestion and orchestration
@@ -12,7 +10,6 @@ DataHelm is a data engineering framework focused on:
 
 ## Table of Contents
 
----
 
 - [Core Capabilities](#core-capabilities)
 - [High-Level Architecture](#high-level-architecture)
@@ -30,7 +27,6 @@ DataHelm is a data engineering framework focused on:
 
 ## Core Capabilities
 
----
 
 - **Config-driven ingestion** using YAML in `config/api/`
 - **Dagster orchestration** for jobs, schedules, and sensors
@@ -41,7 +37,6 @@ DataHelm is a data engineering framework focused on:
 
 ## High-Level Architecture
 
----
 
 The repository follows layered responsibilities:
 
@@ -56,7 +51,6 @@ The repository follows layered responsibilities:
 
 ## Repository Structure
 
----
 
 ```text
 dagster_op/
@@ -88,7 +82,6 @@ docs/
 
 ## Local Setup
 
----
 
 ### Prerequisites
 
@@ -132,7 +125,6 @@ python scripts/run_dagster_dev.py --print-only
 
 ## Configuration Model
 
----
 
 ### Ingestion Config (`config/api/*.yaml`)
 
@@ -156,7 +148,6 @@ Defines dataset metadata for the isolated NL-to-SQL module.
 
 ## Reusable Connectors
 
----
 
 The repository includes reusable connector classes under `handlers/`:
 
@@ -171,7 +162,6 @@ The repository includes reusable connector classes under `handlers/`:
 
 ## Local LLM Analytics Module
 
----
 
 `analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
 
@@ -182,7 +172,6 @@ The repository includes reusable connector classes under `handlers/`:
 
 ## Testing
 
----
 
 Run all tests:
 
@@ -200,7 +189,6 @@ The current test suite covers:
 
 ## CI/CD and Branching
 
----
 
 - `dev`: integration branch
 - `master`: release/production branch
@@ -213,7 +201,6 @@ Workflows:
 
 ## Containerization
 
----
 
 The container image is defined via `Dockerfile`.
 
@@ -225,7 +212,6 @@ python -m dagster api grpc -m dagster_op.repository
 
 ## Deployment
 
----
 
 Deployment flow is workflow-based:
 
@@ -234,7 +220,6 @@ Deployment flow is workflow-based:
 
 ## Contributing and Governance
 
----
 
 - Contribution guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
 - Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)
@@ -242,7 +227,6 @@ Deployment flow is workflow-based:
 
 ## Detailed Technical Documentation
 
----
 
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 

From dddb34bd4cf4c9dbf8a0cbe26ebf2fd80fe11d7f Mon Sep 17 00:00:00 2001
From: rahuldkjain
Date: Sun, 15 Mar 2026 16:28:36 +0530
Subject: [PATCH 3/4] docs: small README improvements

---
 README.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index cc2f0cf..4f95f65 100644
--- a/README.md
+++ b/README.md
@@ -103,35 +103,35 @@ python scripts/run_dagster_dev.py --print-only
 
 ## Configuration Model
 
-### Ingestion Config (config/api/*.yaml)
+### Ingestion Config (`config/api/*.yaml`)
 
 Defines source-level extraction, publish targets, schedules, and column mapping.
 
-Example included: CLASHOFCLANS_PLAYER_STATS
+Example included: `CLASHOFCLANS_PLAYER_STATS`
 
-### dbt Config (config/dbt/projects.yaml)
+### dbt Config (`config/dbt/projects.yaml`)
 
 Defines dbt units, selection/exclusion rules, vars, and schedules.
 
-### Dashboard Config (config/dashboard/projects.yaml)
+### Dashboard Config (`config/dashboard/projects.yaml`)
 
-Defines notebook path, source table mapping, chart columns, and cadence.
+Defines notebook paths, source table mapping, chart columns, and cadence.
 
-### Analytics Semantic Config (config/analytics/semantic_catalog.yaml)
+### Analytics Semantic Config (`config/analytics/semantic_catalog.yaml`)
 
 Defines dataset metadata for the isolated NL-to-SQL module.
 
 ## Reusable Connectors
 
-The repository includes reusable connector classes under handlers/:
+The repository includes reusable connector classes under `handlers/`:
 
-handlers/sharepoint/sharepoint.py – Microsoft Graph auth + site/file access helpers
-handlers/gcs/gcs.py – Upload/download/list/delete/signed URL helpers
-handlers/s3/s3.py – Upload/download/list/delete/presigned URL helpers
-handlers/bigquery/bigquery.py – Query, row fetch, dataframe load, schema helpers
+- `handlers/sharepoint/sharepoint.py` – Microsoft Graph auth and site/file access helpers
+- `handlers/gcs/gcs.py` – Upload, download, list, delete, and signed URL helpers
+- `handlers/s3/s3.py` – Upload, download, list, delete, and presigned URL helpers
+- `handlers/bigquery/bigquery.py` – Query, row fetch, dataframe load, and schema helpers
 ## Local LLM Analytics Module
 
-analytics/nl_query/ is an isolated module for natural-language-to-SQL generation using local Ollama:
+`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:
 
 * Semantic catalog loader
 * SQL read-only safety guard
@@ -186,4 +186,4 @@ Deployment flow is workflow-based:
 
 For complete, long-form project documentation (operations, architecture, and runbook-style details), see:
 
-docs/document.md
+[`docs/document.md`](docs/document.md)

From ee2466107348445ae2aaf10e45b032514ff62d78 Mon Sep 17 00:00:00 2001
From: Emah Khujaemah
Date: Sun, 15 Mar 2026 23:51:50 +0900
Subject: [PATCH 4/4] docs: small README improvements for branching

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cc2f0cf..04b0d29 100644
--- a/README.md
+++ b/README.md
@@ -156,8 +156,8 @@ The current test suite includes coverage for:
 
 ## CI/CD and Branching
 
-* dev: integration branch
-* master: release/production branch
+* `dev`: integration branch
+* `master`: release/production branch
 
 Workflows:
 