97 changes: 92 additions & 5 deletions README.md
@@ -5,13 +5,33 @@ DataHelm is a data engineering framework focused on the following:
- Source ingestion and orchestration
- dbt transformation workflows
- Notebook-based dashboard execution
- Reusable provider connectors (SharePoint, GCS, S3, BigQuery)
- Optional local LLM analytics query scaffolding

## Table of Contents


- [Core Capabilities](#core-capabilities)
- [High-Level Architecture](#high-level-architecture)
- [Repository Structure](#repository-structure)
- [Local Setup](#local-setup)
- [Configuration Model](#configuration-model)
- [Reusable Connectors](#reusable-connectors)
- [Local LLM Analytics Module](#local-llm-analytics-module)
- [Testing](#testing)
- [CI/CD and Branching](#cicd-and-branching)
- [Containerization](#containerization)
- [Deployment](#deployment)
- [Contributing and Governance](#contributing-and-governance)
- [Detailed Technical Documentation](#detailed-technical-documentation)

![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)

## Core Capabilities


- **Config-driven ingestion** using YAML in `config/api/`
- **Dagster orchestration** for managing jobs, schedules, and sensors
- **dbt project execution** through `analytics/dbt_runner.py` and dbt configuration files
@@ -21,18 +41,28 @@ DataHelm is a data engineering framework focused on the following:

## High-Level Architecture


The repository follows a layered responsibility structure:

- `handlers/`: provider-specific source connectors and API handlers
- `ingestion/`: ingestion factory and native ingestion implementations
- `analytics/`: dbt, dashboard, and optional NL-query modules
- `dagster_op/`: orchestration objects (jobs, schedules, repository)
- `config/`: all runtime configuration (API, dbt, dashboard, analytics metadata)
- `tests/`: unit tests for handlers, ingestion, analytics, and scripts
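
To make the "ingestion factory" layer above more concrete, here is a minimal sketch of a registry-based factory. Every name in it (`register`, `create_ingestion`, `ApiIngestion`, the `"api"` source type, the `endpoint` key) is hypothetical and chosen for illustration; the real classes in `ingestion/` may be organized differently.

```python
from typing import Callable, Dict

# Registry mapping a source-type name to an ingestion class.
# All names in this sketch are hypothetical, not DataHelm's actual API.
_REGISTRY: Dict[str, type] = {}


def register(source_type: str) -> Callable[[type], type]:
    """Class decorator that records an ingestion class under a source type."""
    def decorator(cls: type) -> type:
        _REGISTRY[source_type] = cls
        return cls
    return decorator


@register("api")
class ApiIngestion:
    """Stand-in for a native ingestion implementation."""

    def __init__(self, config: dict) -> None:
        self.config = config

    def run(self) -> str:
        return f"ingesting from {self.config['endpoint']}"


def create_ingestion(source_type: str, config: dict):
    """Look up the class for a source type and instantiate it with its config."""
    try:
        cls = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"unknown source type: {source_type}") from None
    return cls(config)


job = create_ingestion("api", {"endpoint": "/players"})
print(job.run())
```

With this layout, adding a new source type only requires a new decorated class; the factory function itself stays unchanged.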

## Repository Structure


```text
dagster_op/
ingestion/
tests/
scripts/
docs/
config/
  api/
  dbt/
@@ -57,6 +87,7 @@ docs/

## Local Setup


### Prerequisites

Python 3.12+
@@ -76,6 +107,7 @@ pip install -e .

### Environment Variables

Create a `.env` file in the repository root with the required values, for example:

```text
@@ -103,14 +135,16 @@ python scripts/run_dagster_dev.py --print-only

## Configuration Model


### Ingestion Config (`config/api/*.yaml`)

Defines source-level extraction, publish targets, schedules, and column mapping.
The included example is `CLASHOFCLANS_PLAYER_STATS`.
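
As a rough illustration of what such a config covers, the sketch below validates a parsed config and applies a column mapping. The key names (`source`, `publish`, `schedule`, `column_mapping`) and all sample values are assumptions made for this example, not the actual DataHelm schema. A plain dict stands in for the parsed YAML.

```python
# Assumed top-level keys of one parsed ingestion config (hypothetical schema).
REQUIRED_KEYS = ("source", "publish", "schedule", "column_mapping")


def validate_config(name, config):
    """Raise ValueError if any assumed top-level key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{name}: missing keys {missing}")
    return config


def apply_column_mapping(record, column_mapping):
    """Rename source fields to target columns, dropping unmapped fields."""
    return {
        target: record[source]
        for source, target in column_mapping.items()
        if source in record
    }


# Illustrative values only; a real config would be loaded from config/api/*.yaml.
example = {
    "source": {"endpoint": "/players"},
    "publish": {"table": "player_stats"},
    "schedule": "0 * * * *",
    "column_mapping": {"expLevel": "exp_level", "trophies": "trophies"},
}

validate_config("CLASHOFCLANS_PLAYER_STATS", example)
row = apply_column_mapping(
    {"expLevel": 120, "trophies": 4100, "clan": "x"},
    example["column_mapping"],
)
print(row)  # {'exp_level': 120, 'trophies': 4100}
```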

### dbt Config (`config/dbt/projects.yaml`)

Defines dbt units, selection/exclusion rules, variables, and schedules.
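
One way to picture how a dbt unit could translate into a dbt invocation is sketched below. The unit keys (`project_dir`, `select`, `exclude`, `vars`) and the `build_dbt_command` helper are hypothetical; only the dbt CLI flags `--project-dir`, `--select`, `--exclude`, and `--vars` are real. How `analytics/dbt_runner.py` actually builds its commands is not shown here.

```python
import json


def build_dbt_command(unit, command="run"):
    """Translate one hypothetical dbt unit definition into a dbt CLI argv list."""
    argv = ["dbt", command]
    if unit.get("project_dir"):
        argv += ["--project-dir", unit["project_dir"]]
    if unit.get("select"):
        # --select accepts several space-separated selectors.
        argv += ["--select", *unit["select"]]
    if unit.get("exclude"):
        argv += ["--exclude", *unit["exclude"]]
    if unit.get("vars"):
        # --vars takes a YAML string; JSON is valid YAML.
        argv += ["--vars", json.dumps(unit["vars"], sort_keys=True)]
    return argv


# Illustrative unit; the selectors and variable names are made up.
unit = {
    "select": ["tag:daily"],
    "exclude": ["tag:deprecated"],
    "vars": {"run_date": "2024-01-01"},
}
print(build_dbt_command(unit))
```

Returning an argv list rather than a shell string means the result can be passed to `subprocess.run` without shell quoting concerns.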

### Dashboard Config (`config/dashboard/projects.yaml`)

@@ -122,6 +156,32 @@ Defines dataset metadata for the isolated NL-to-SQL module.

## Reusable Connectors


The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py`
  - Microsoft Graph authentication and site/file access helpers
- `handlers/gcs/gcs.py`
  - Upload/download/list/delete/signed URL helpers
- `handlers/s3/s3.py`
  - Upload/download/list/delete/presigned URL helpers
- `handlers/bigquery/bigquery.py`
  - Query, row fetch, dataframe load, schema helpers

## Local LLM Analytics Module


`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using a local Ollama instance:

- Semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- Orchestration service
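
To illustrate the idea behind a read-only SQL safety guard, here is a minimal, deliberately conservative sketch. The function name `is_read_only` and the keyword list are assumptions for this example and do not describe the module's actual implementation. It is a keyword filter, not a SQL parser, so it may reject harmless queries (for instance, one whose string literal contains a semicolon or the word "delete").

```python
import re

# Keywords that indicate a write or DDL statement (illustrative list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql):
    """Return True only for a single SELECT/WITH statement with no write keywords."""
    # Strip line and block comments so keywords cannot hide inside them.
    stripped = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.DOTALL).strip()
    # Drop trailing semicolons, then reject anything that still contains one
    # (which would mean more than one statement).
    stripped = stripped.rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False
    if not re.match(r"(select|with)\b", stripped, re.IGNORECASE):
        return False
    return FORBIDDEN.search(stripped) is None


print(is_read_only("SELECT id, updated_at FROM players;"))  # True
print(is_read_only("SELECT 1; DROP TABLE players"))         # False
```

Running generated SQL under a database role that only has read privileges remains the stronger safeguard; a text-level check like this is best treated as an additional layer.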

## Testing


Run all tests:

@@ -148,6 +208,17 @@ Run all tests with the following command:

The current test suite includes coverage for:

- Ingestion and handler behavior
- Analytics factory and runner logic
- Connector modules (SharePoint, GCS, S3, BigQuery)
- Script behavior
- NL-query safety and service paths

## CI/CD and Branching


- `dev`: integration branch
- `master`: release/production branch

Workflows:

@@ -167,6 +238,10 @@ Workflows:

## Containerization


The container image is defined in `Dockerfile`.

The default runtime command starts the Dagster gRPC server:
@@ -177,13 +252,25 @@ python -m dagster api grpc -m dagster_op.repository

## Deployment


Deployment flow is workflow-based:

- Automatic production deployment after a successful Docker release
- Manual dispatch for staging or production deployments

## Contributing and Governance


- Contribution guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
- Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)
- Security reporting: [`SECURITY.md`](SECURITY.md)

## Detailed Technical Documentation


For complete, long-form project documentation (operations, architecture, and runbook-style details), see:

- [`docs/document.md`](docs/document.md)