Data-Forge is an asynchronous service for generating reference files (starting with Kerchunk) and publishing searchable catalogs for large climate datasets. It enables data publishers to convert NetCDF (and similar) data into cloud-friendly formats and catalog them, streamlining data access and discovery for the scientific community.
- Asynchronous job handling and monitoring
- Kerchunk reference file generation (NetCDF to Zarr references); see the conversion sketch after this list
- User-specified output paths (S3, HTTPS, local filesystem)
- Automatic STAC catalog publishing (ESGF-NG integration)
- REST API and CLI for job submission and tracking
- Robust support for remote/cloud data sources
- Secure authentication with Globus Auth
- Scalable Dask-powered parallel processing
- No internal storage—the service writes directly to user-managed destinations
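For context, the sketch below shows roughly what a conversion job involves, using the `kerchunk` and `fsspec` libraries directly. All paths and options are illustrative assumptions; the service runs the per-file step in parallel with Dask and writes the result to your chosen output location.

```python
# Rough sketch of the NetCDF-to-Kerchunk conversion a Data-Forge job performs.
# Assumes kerchunk, fsspec, and s3fs are installed; all paths are illustrative.
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3")
input_paths = fs.glob("my-bucket/dataset/*.nc")  # hypothetical input location

# Build one reference set per NetCDF file. Data-Forge parallelizes this step
# with Dask; it is shown serially here for clarity.
single_refs = []
for path in input_paths:
    url = f"s3://{path}"
    with fsspec.open(url, "rb") as f:
        single_refs.append(SingleHdf5ToZarr(f, url, inline_threshold=300).translate())

# Combine the per-file references along the time dimension (cf. --concat-dims time).
combined = MultiZarrToZarr(single_refs, concat_dims=["time"]).translate()

# Write the combined reference file straight to the user-managed output location.
with fs.open("my-refs-bucket/output/dataset.json", "wb") as f:
    f.write(json.dumps(combined).encode())
```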
- Authenticate using Globus Auth via CLI or API.
- Submit a job with NetCDF file URLs and parameters (e.g., chunking, output path).
- Monitor job status and progress asynchronously via the API or CLI.
- Download or access the generated Kerchunk reference files at your storage endpoint (see the example after this list).
- Optionally, view the published entries in a STAC catalog.
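Once a job finishes, the generated reference file can be opened lazily with `xarray` through fsspec's reference filesystem. The reference path and storage options below are illustrative assumptions (requires `xarray`, `zarr`, `fsspec`, and `s3fs`):

```python
# Sketch: open a generated Kerchunk reference file lazily with xarray.
# The reference path and storage options are illustrative assumptions.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "s3://my-refs-bucket/output/dataset.json",  # Kerchunk reference file
            "remote_protocol": "s3",           # protocol of the original NetCDF files
            "remote_options": {"anon": True},  # credentials/options for that storage
        },
    },
)
print(ds)  # lazy, chunked view of the full dataset; no bulk NetCDF download
```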
# Authenticate via Globus
$ data-forge login
# Submit a NetCDF-to-Kerchunk job
$ data-forge submit \
--input "s3://my-bucket/dataset/*.nc" \
--dataset-id "CMIP6.Project.Inst.Model.Experiment.Variable" \
--concat-dims time \
--output-path "s3://my-refs-bucket/output/" \
--metadata '{"project": "CMIP6"}'
# Monitor job progress
$ data-forge status <job-id> --watch
# Get reference file URL or download
$ data-forge get-url <job-id>
$ data-forge download <job-id> --output ./local_refs/

- API: FastAPI (REST endpoints), job monitoring/status, OpenAPI docs
- Job Queue: Dramatiq + Redis (asynchronous processing)
- Workers: run the Kerchunk conversion with Dask parallelization and write outputs directly to the user-specified location (see the sketch after this list)
- STAC Integration: Optional, for automatic catalog publishing (ESGF-NG, server-authenticated)
- Authentication: Globus Auth (OAuth2), user-scoped job management
- No Internal Storage: Reference files always stored in user-specified outputs (e.g., S3 or HTTPS endpoints)
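To make the queueing model concrete, here is a minimal sketch of the Dramatiq-on-Redis pattern described above. The actor signature and helper functions are hypothetical placeholders, not Data-Forge's actual internals:

```python
# Illustrative Dramatiq-on-Redis wiring for the worker role described above.
# The actor signature and helper functions are hypothetical placeholders.
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(url="redis://redis:6379/0"))


def build_references(input_urls: list[str]) -> dict:
    """Hypothetical stand-in for the Kerchunk + Dask conversion step."""
    raise NotImplementedError


def write_to_user_storage(refs: dict, output_path: str) -> None:
    """Hypothetical stand-in for writing directly to the user's S3/HTTPS/local target."""
    raise NotImplementedError


@dramatiq.actor(max_retries=3, time_limit=3_600_000)  # time limit in milliseconds
def generate_references(job_id: str, input_urls: list[str], output_path: str) -> None:
    """Convert the input NetCDF files to Kerchunk references and write them to output_path."""
    refs = build_references(input_urls)
    write_to_user_storage(refs, output_path)


# The API process enqueues work instead of running it inline, e.g.:
# generate_references.send("job-1234", ["s3://my-bucket/dataset/file.nc"], "s3://my-refs-bucket/output/")
```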
- Docker Compose for local/single-node deployment
- Helm chart for Kubernetes (production, scalable)
- Minimal required services: API, worker(s), Redis
See the docs/ directory for:
- Full user guide and CLI reference
- API specification (OpenAPI/Swagger)
- Deployment guides (Docker, Kubernetes)
- Architecture/design docs
- Contribution instructions
Data-Forge aims to make FAIR, cloud-optimized data publishing simple and scalable for the global climate data community.