Sprouts

A lightweight data lakehouse for open research information, based on DuckLake.

Prerequisites

You need the uv package manager installed. Then clone the repo and install the required dependencies:

git clone https://github.com/surf-ori/sprouts.git
cd sprouts
uv sync

Usage

You can either use marimo to run the build pipeline in an interactive notebook environment:

uv run marimo edit ingest-pipeline.py

Or run the pipeline as a script:

uv run ingest-pipeline.py -dataset openapc

Depending on your config.json, the data lakehouse is set up in the build directory (the default) or in any other location you choose. You can also supply an S3 URL as the data path (for example s3://my-bucket/data), provided you also supply the required key and secret. At the moment this only works with the SURF object store.
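To illustrate the rule above (local build directory by default, S3 URL only together with a key and secret), here is a minimal sketch of how such settings could be validated. It is not the code used by ingest-pipeline.py: the config key names data_path, s3_key and s3_secret, and the helper names resolve_storage and load_config, are assumptions made for this example. Check config.json in the repo for the actual field names.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

# The README names the build directory as the default data location.
DEFAULT_DATA_PATH = "build"


def resolve_storage(config: dict) -> dict:
    """Classify the data path as local or S3 and check S3 credentials.

    The keys "data_path", "s3_key" and "s3_secret" are hypothetical;
    the real config.json may use different names.
    """
    data_path = config.get("data_path", DEFAULT_DATA_PATH)
    parsed = urlparse(data_path)

    if parsed.scheme == "s3":
        # An S3 data path is only usable when both credentials are present.
        missing = [k for k in ("s3_key", "s3_secret") if not config.get(k)]
        if missing:
            raise ValueError(f"S3 data path requires: {', '.join(missing)}")
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "prefix": parsed.path.lstrip("/"),
        }

    # Anything else is treated as a local directory.
    return {"kind": "local", "path": str(Path(data_path))}


def load_config(path: str = "config.json") -> dict:
    """Read a JSON config file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With these assumed names, resolve_storage(load_config()) would report a local target when no data path is set, and raise a ValueError when an s3:// path is given without both credentials.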
