DataHelm is a data engineering framework focused on the following:
- Source ingestion and orchestration
- dbt transformation workflows
- Notebook-based dashboard execution
- Reusable provider connectors (SharePoint, GCS, S3, and BigQuery)
- Optional local LLM analytics query scaffolding

## Table of Contents

- [Core Capabilities](#core-capabilities)
- [High-Level Architecture](#high-level-architecture)
- [Repository Structure](#repository-structure)
- [Local Setup](#local-setup)
- [Configuration Model](#configuration-model)
- [Reusable Connectors](#reusable-connectors)
- [Local LLM Analytics Module](#local-llm-analytics-module)
- [Testing](#testing)
- [CI/CD and Branching](#cicd-and-branching)
- [Containerization](#containerization)
- [Deployment](#deployment)
- [Contributing and Governance](#contributing-and-governance)
- [Detailed Technical Documentation](#detailed-technical-documentation)

![DataHelm Architecture](https://github.com/DevStrikerTech/datahelm/blob/master/docs/architecture.png?raw=true)

## Core Capabilities

- **Config-driven ingestion** using YAML in `config/api/`
- **Dagster orchestration** for managing jobs, schedules, and sensors
- **dbt project execution** through `analytics/dbt_runner.py` and dbt configuration files

## High-Level Architecture

The repository follows a layered responsibility structure:

- `handlers/`: provider-specific source connectors and API handlers
- `ingestion/`: ingestion factory and native ingestion implementations
- `analytics/`: dbt, dashboard, and optional NL-query modules
- `dagster_op/`: orchestration objects (jobs, schedules, repository)
- `config/`: all runtime configuration (API, dbt, dashboard, analytics metadata)
- `tests/`: unit tests for handlers, ingestion, analytics, and scripts


## Repository Structure

```text
dagster_op/
ingestion/
tests/
scripts/
docs/
config/
  api/
  dbt/
```

## Local Setup

### Prerequisites

- Python 3.12+
### Installation

Install the package in editable mode:

```shell
pip install -e .
```

### Environment Variables

Create a `.env` file in the repository root with the required values.

### Running Dagster

Preview the local Dagster launch command:

```shell
python scripts/run_dagster_dev.py --print-only
```

## Configuration Model

### Ingestion Config (`config/api/*.yaml`)

Defines source-level extraction, publish targets, schedules, and column mapping.
An example is included: `CLASHOFCLANS_PLAYER_STATS`.
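As a hypothetical sketch of what such a source definition could contain (field names here are illustrative, not the actual schema):

```yaml
CLASHOFCLANS_PLAYER_STATS:
  source:
    type: api
    endpoint: https://api.clashofclans.com/v1/players/{player_tag}
  publish:
    target: bigquery
    dataset: raw
    table: player_stats
  schedule: "0 6 * * *"       # cron expression for the Dagster schedule
  column_mapping:
    tag: player_tag           # source field -> published column
    trophies: trophy_count
```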

### dbt Config (`config/dbt/projects.yaml`)

Defines dbt units, selection/exclusion rules, variables, and schedules.
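To illustrate how such a unit definition might translate into a dbt invocation (the config keys below are hypothetical, not the actual schema; the dbt CLI flags themselves are standard):

```python
import json


def build_dbt_args(unit: dict) -> list[str]:
    """Translate a hypothetical dbt unit config into dbt CLI arguments."""
    args = ["dbt", "run", "--project-dir", unit["project_dir"]]
    if unit.get("select"):
        args += ["--select", " ".join(unit["select"])]
    if unit.get("exclude"):
        args += ["--exclude", " ".join(unit["exclude"])]
    if unit.get("vars"):
        # dbt expects --vars as a YAML/JSON string
        args += ["--vars", json.dumps(unit["vars"])]
    return args


unit = {
    "project_dir": "analytics/dbt",
    "select": ["staging+"],
    "vars": {"run_date": "2024-01-01"},
}
print(build_dbt_args(unit))
```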

### Dashboard Config (`config/dashboard/projects.yaml`)

Defines notebook-based dashboard execution settings.

### Analytics Metadata Config

Defines dataset metadata for the isolated NL-to-SQL module.

## Reusable Connectors

The repository includes reusable connector classes under `handlers/`:

- `handlers/sharepoint/sharepoint.py`
  - Microsoft Graph authentication and site/file access helpers
- `handlers/gcs/gcs.py`
  - Upload, download, list, delete, and signed URL helpers
- `handlers/s3/s3.py`
  - Upload, download, list, delete, and presigned URL helpers
- `handlers/bigquery/bigquery.py`
  - Query, row fetch, dataframe load, and schema helpers
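As an illustration of the pattern such connectors typically follow (the real class names and signatures in `handlers/s3/s3.py` may differ), a thin wrapper around an injected boto3-style client keeps the helpers easy to fake in tests:

```python
class S3Handler:
    """Sketch of a thin S3 helper; the client is injected so tests can stub it."""

    def __init__(self, client, bucket: str):
        self.client = client
        self.bucket = bucket

    def upload(self, local_path: str, key: str) -> None:
        # boto3's managed transfer handles multipart uploads transparently
        self.client.upload_file(local_path, self.bucket, key)

    def list_keys(self, prefix: str = "") -> list[str]:
        resp = self.client.list_objects_v2(Bucket=self.bucket, Prefix=prefix)
        return [obj["Key"] for obj in resp.get("Contents", [])]

    def presigned_url(self, key: str, expires: int = 3600) -> str:
        return self.client.generate_presigned_url(
            "get_object",
            Params={"Bucket": self.bucket, "Key": key},
            ExpiresIn=expires,
        )
```

In production the `client` would be `boto3.client("s3")`; in unit tests a small fake object with the same method names suffices.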

## Local LLM Analytics Module

`analytics/nl_query/` is an isolated module for natural-language-to-SQL generation using local Ollama:

- Semantic catalog loader
- SQL read-only safety guard
- Ollama client wrapper
- Orchestration service
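The safety guard's job is to reject anything but read-only statements before execution. A minimal sketch of the idea (the module's real implementation may differ):

```python
import re

# Keywords that mutate state; \b boundaries avoid matching column names
# like "created_at" or "updates".
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|merge)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept only a single SELECT/WITH statement with no mutating keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input
        return False
    if not re.match(r"^(select|with)\b", stripped, re.IGNORECASE):
        return False
    return not FORBIDDEN.search(stripped)


print(is_read_only("SELECT * FROM stats"))   # True
print(is_read_only("DROP TABLE stats"))      # False
```

A keyword blocklist like this is a coarse first line of defense; a production guard would typically also rely on database-level read-only credentials.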

## Testing

Run the unit tests from the repository root (for example, with `pytest`).

The current test suite includes coverage for:

- Ingestion and handler behavior
- Analytics factory and runner logic
- Connector modules (SharePoint, GCS, S3, BigQuery)
- Script behavior
- NL-query safety and service paths

## CI/CD and Branching

- `dev`: integration branch
- `master`: release/production branch

## Containerization

The container image is defined via `Dockerfile`.

The default runtime command starts Dagster gRPC:
```shell
python -m dagster api grpc -m dagster_op.repository
```
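For reference, a minimal Dockerfile sketch consistent with that runtime command (the base image, working directory, and port are assumptions, not the repository's actual Dockerfile):

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
# Expose the gRPC port for the Dagster webserver/daemon to connect to
EXPOSE 4000
CMD ["python", "-m", "dagster", "api", "grpc", \
     "-m", "dagster_op.repository", "-h", "0.0.0.0", "-p", "4000"]
```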

## Deployment

Deployment flow is workflow-based:

- Production auto-path after successful Docker release
- Manual staging/production dispatch path
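The manual dispatch path could be sketched roughly as follows (workflow name, inputs, and the deploy script are assumptions for illustration, not the repository's actual workflow files):

```yaml
name: deploy-dispatch
on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options: [staging, production]
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      # hypothetical deploy entry point
      - run: ./scripts/deploy.sh ${{ inputs.environment }}
```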

## Contributing and Governance

- Contribution guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
- Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)
- Security reporting: [`SECURITY.md`](SECURITY.md)

## Detailed Technical Documentation

For complete, long-form project documentation (operations, architecture, and runbook-style details), see:

- [`docs/document.md`](docs/document.md)