A lightweight data lakehouse for open research information, based on DuckLake.
You need to have the package manager uv installed. Then you can clone the repo and set up the required dependencies:
git clone https://github.com/surf-ori/sprouts.git
cd sprouts
uv sync
You can either use marimo to run the build pipeline in an interactive notebook environment:
uv run marimo edit ingest-pipeline.py
Or simply run the pipeline as a script:
uv run ingest-pipeline.py -dataset openapc
Based on your config.json, your data lakehouse will be set up in the build directory (default) or wherever else you wish. You can supply an S3 URL as the data path (for example s3://my-bucket/data), as long as you also provide the required key and secret (this only works with the SURF object store at the moment).
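To illustrate the rule that an S3 data path needs a key and a secret, here is a minimal Python sketch. It is not taken from ingest-pipeline.py: the function name `resolve_data_path` and the parameter names `data_path`, `key` and `secret` are hypothetical and may not match the actual config.json keys.

```python
from urllib.parse import urlparse


def resolve_data_path(data_path="build", key=None, secret=None):
    """Classify a data path as local or S3 and check that S3 credentials exist.

    All names here are hypothetical; they are not the project's real config keys.
    """
    parsed = urlparse(data_path)
    if parsed.scheme == "s3":
        # The bucket name is the "host" part of an s3:// URL.
        if not parsed.netloc:
            raise ValueError(f"S3 URL has no bucket name: {data_path!r}")
        # An S3 target is only usable when both credentials are supplied.
        if not key or not secret:
            raise ValueError("an S3 data path requires both a key and a secret")
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "prefix": parsed.path.lstrip("/"),
        }
    # Anything else is treated as a local directory (default: build).
    return {"kind": "local", "path": data_path}
```

Called without arguments, the sketch falls back to the local build directory. Called with s3://my-bucket/data plus a key and secret, it returns the bucket and prefix. Called with an S3 URL but no credentials, it raises a ValueError.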