Data-Forge

Data-Forge is an asynchronous service for generating reference files (starting with Kerchunk) and publishing searchable catalogs for large climate datasets. It lets data publishers expose NetCDF (and similar) data in cloud-friendly formats without copying the underlying bytes, and catalog the results, streamlining data access and discovery for the scientific community.
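Under the hood, a Kerchunk reference file indexes the byte ranges of the chunks inside an existing NetCDF/HDF5 file so that Zarr readers can address the original data in place. The sketch below shows the core operation the service automates for a single file, using the kerchunk and fsspec libraries directly; the file paths are illustrative.

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

# Scan one NetCDF/HDF5 file and build a Zarr-compatible reference set
url = "s3://my-bucket/dataset/file_001.nc"  # illustrative input
with fsspec.open(url, "rb") as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Persist the references as JSON; no copy of the data itself is made
with open("file_001.json", "w") as out:
    json.dump(refs, out)

Multi-file datasets are combined along shared dimensions such as time (kerchunk's MultiZarrToZarr), which is presumably what the --concat-dims flag in the CLI example below selects.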

Features

  • Asynchronous job handling and monitoring
  • Kerchunk reference file generation (NetCDF to Zarr references)
  • User-specified output paths (S3, HTTPS, local filesystem)
  • Automatic STAC catalog publishing (ESGF-NG integration)
  • REST API and CLI for job submission and tracking
  • Robust support for remote/cloud data sources
  • Secure authentication with Globus Auth
  • Scalable Dask-powered parallel processing
  • No internal storage: the service writes directly to user-managed destinations

Typical Workflow

  1. Authenticate using Globus Auth via the CLI or API.
  2. Submit a job with NetCDF file URLs and parameters (e.g., chunking, output path); a script-level sketch follows this list.
  3. Monitor job status and progress asynchronously via the API or CLI.
  4. Download or access the generated Kerchunk reference files at your storage endpoint.
  5. Optionally, view the published entries in a STAC catalog.
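The same workflow can be scripted against the REST API. The base URL, endpoint paths, payload fields, and response shape below are illustrative assumptions, not the service's documented contract; consult the OpenAPI docs for the real schema.

import requests

API = "https://data-forge.example.org"  # hypothetical deployment URL
TOKEN = "..."  # access token obtained via Globus Auth

headers = {"Authorization": f"Bearer {TOKEN}"}

# Submit a job (endpoint and payload shape are assumptions)
resp = requests.post(
    f"{API}/jobs",
    headers=headers,
    json={
        "input": "s3://my-bucket/dataset/*.nc",
        "dataset_id": "CMIP6.Project.Inst.Model.Experiment.Variable",
        "concat_dims": ["time"],
        "output_path": "s3://my-refs-bucket/output/",
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]  # assumed response field

# Poll for completion
status = requests.get(f"{API}/jobs/{job_id}", headers=headers).json()
print(status)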

Example CLI Usage

# Authenticate via Globus
$ data-forge login

# Submit a NetCDF-to-Kerchunk job
$ data-forge submit \
  --input "s3://my-bucket/dataset/*.nc" \
  --dataset-id "CMIP6.Project.Inst.Model.Experiment.Variable" \
  --concat-dims time \
  --output-path "s3://my-refs-bucket/output/" \
  --metadata '{"project": "CMIP6"}'

# Monitor job progress
$ data-forge status <job-id> --watch

# Get reference file URL or download
$ data-forge get-url <job-id>
$ data-forge download <job-id> --output ./local_refs/
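Once a job finishes, the generated reference file can be opened directly with xarray through fsspec's reference filesystem; the reference path below is illustrative. Only the chunks you actually read are fetched from the original NetCDF files.

import xarray as xr

# Open the dataset lazily via the Kerchunk references
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://my-refs-bucket/output/reference.json",  # illustrative
            "remote_protocol": "s3",
        },
    },
)
print(ds)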

High-Level Architecture

  • API: FastAPI (REST endpoints), job monitoring/status, OpenAPI docs
  • Job Queue: Dramatiq + Redis (asynchronous processing)
  • Workers: Run the Kerchunk conversion with Dask parallelization and write outputs directly to the user's location (see the sketch after this list)
  • STAC Integration: Optional, for automatic catalog publishing (ESGF-NG, server-authenticated)
  • Authentication: Globus Auth (OAuth2), user-scoped job management
  • No Internal Storage: Reference files always stored in user-specified outputs (e.g., S3 or HTTPS endpoints)
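The queueing pattern is standard Dramatiq: the API process enqueues a message and a worker picks it up and runs the conversion. A minimal sketch of that shape follows; the actor name, broker URL, and module layout are illustrative, not the project's actual code.

import dramatiq
from dramatiq.brokers.redis import RedisBroker

# Point Dramatiq at the Redis broker the deployment provides
dramatiq.set_broker(RedisBroker(url="redis://localhost:6379/0"))

@dramatiq.actor(max_retries=3)
def build_references(input_glob: str, output_path: str) -> None:
    # In the real worker this is where Kerchunk scanning, Dask-parallel
    # combining, and the direct write to the user's output would happen.
    ...

# Called from the API process; returns immediately, work runs on a worker
build_references.send("s3://my-bucket/dataset/*.nc", "s3://my-refs-bucket/output/")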

Deployment

  • Docker Compose for local/single-node deployment
  • Helm chart for Kubernetes (production, scalable)
  • Minimal required services: API, worker(s), Redis (a Compose sketch follows this list)
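A single-node Compose file for the three required services might look like the sketch below; the image name, module paths, and port are assumptions, not the project's actual configuration.

services:
  redis:
    image: redis:7
  api:
    image: data-forge:latest  # hypothetical image
    command: uvicorn data_forge.api:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    depends_on:
      - redis
  worker:
    image: data-forge:latest  # hypothetical image
    command: dramatiq data_forge.worker
    depends_on:
      - redis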

Documentation

See the docs/ directory for:

  • Full user guide and CLI reference
  • API specification (OpenAPI/Swagger)
  • Deployment guides (Docker, Kubernetes)
  • Architecture/design docs
  • Contribution instructions

Data-Forge aims to make FAIR, cloud-optimized data publishing simple and scalable for the global climate data community.
