diff --git a/.readthedocs.yaml b/.readthedocs.yaml index 0ebc37b..df03fed 100644 --- a/.readthedocs.yaml +++ b/.readthedocs.yaml @@ -5,7 +5,7 @@ # Required version: 2 -# Set the version of Python and other tools you might need +# Set the version of Python and other tools that might be needed build: os: ubuntu-lts-latest tools: @@ -18,6 +18,6 @@ build: sphinx: configuration: docs/source/conf.py -# Optionally declare the Python requirements required to build your docs +# Optionally declare the Python requirements required to build the docs conda: environment: environment.yml diff --git a/README.md b/README.md index 41023c9..e79e846 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,92 @@ -# The CADET-Research Data Management toolbox +# CADET-RDM -Welcome to CADET-RDM, a project by the Forschungszentrum Jülich. +[![CI](https://github.com/cadet/CADET-RDM/actions/workflows/CI.yml/badge.svg)](https://github.com/cadet/CADET-RDM/actions/workflows/CI.yml) +[![Documentation](https://readthedocs.org/projects/cadet-rdm/badge/?version=latest)](https://cadet-rdm.readthedocs.io) +[![License](https://img.shields.io/github/license/cadet/cadet-rdm)](LICENSE) +[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/) + +CADET-RDM is a Research Data Management toolbox developed at Forschungszentrum Jülich. +It supports computational research projects by tracking code, data, environments, and generated results in a reproducible and shareable way. + +The toolbox is domain-agnostic and can be applied to any computational project with a structured workflow. 
+ + +## Scope and purpose + +CADET-RDM helps manage and version -This toolbox aims to help track and version control: - input data -- code -- software versions -- output data +- source code +- configurations and metadata +- software and environment versions +- generated output data + +The primary goal is to ensure reproducibility, traceability, and reuse of computational results by explicitly linking them to the project state that produced them. + + +## Repository structure + +A CADET-RDM project consists of two independent but coupled Git repositories: + +1. **Project repository** + Contains source code, configuration files, documentation, and metadata required to execute the computations. + +2. **Output repository** + Contains the results generated by running the project code, including data products, models, figures, and run-specific metadata. + +Both repositories have separate Git histories and remotes. CADET-RDM provides workflows that operate on both repositories to maintain a consistent link between code and results. + +## Using CADET-RDM + +### Result tracking and reproducibility + +Each execution of project code creates a new output branch that contains only the files generated by that run. + +In addition, a central run history records + +- the project repository commit used for the run +- software and environment information +- metadata required to reproduce the result + +This commit structure allows results to be reproduced and inspected without manual bookkeeping. + +### Interfaces + +CADET-RDM can be used through + +* a **command line interface (CLI)**, e.g. for scripted or automated bash workflows +* a **Python interface**, e.g. for direct context tracking of code within existing Python workflows + +Additionally, CADET-RDM can be used within Jupyter Lab with some limitations. + +Detailed descriptions of commands and APIs are provided in the dedicated interface documentation. 
+ +* [Command line interface](command-line-interface.md) +* [Python interface](python-interface.md) +* [Jupyter interface](jupyter-interface.md) + +### Typical workflow + +1. Initialize or clone a CADET-RDM project +2. Develop and commit project code +3. Execute computations with CADET-RDM result tracking +4. Generate versioned output branches automatically +5. Push project and output repositories to their remotes +6. Reuse or reference results via their output branches + + +Results are referenced by unique output branch names that encode the timestamp, active project branch, and project commit hash. CADET-RDM provides a local cache mechanism that allows results from previous runs or from other CADET-RDM projects to be reused as input data while preserving provenance information. + + +## Getting started + +The full documentation is available at +https://cadet-rdm.readthedocs.io + +It includes installation instructions, usage guides for the different interfaces, and detailed descriptions of repository and result management workflows. -and allow for easy sharing, integration, and reproduction of generated results. -## Documentation +## Project information -The documentation contains a user guide with helpful information on how to install CADET-RDM, how to quickly start working with it and a more detailed explaination of its tools. -The documentation can be found [here](https://cadet-rdm.readthedocs.io). \ No newline at end of file +- **License:** see [LICENSE](LICENSE) +- **Authors and contributors:** see [AUTHORS](AUTHORS) \ No newline at end of file diff --git a/docs/Makefile b/docs/Makefile index f18b212..d568ce7 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -1,7 +1,7 @@ # Minimal makefile for Sphinx documentation # -# You can set these variables from the command line. +# These variables can be set from the command line. 
SPHINXOPTS = SPHINXBUILD = sphinx-build SPHINXPROJ = CADET-RDM diff --git a/docs/source/bibliography.md b/docs/source/bibliography.md index 814c151..dce5e8c 100644 --- a/docs/source/bibliography.md +++ b/docs/source/bibliography.md @@ -7,6 +7,6 @@ ``` ```{bibliography} ./references.bib +:all: :style: unsrt -``` - +``` \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index 8e5a1e4..6249797 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -30,7 +30,7 @@ # -- General configuration --------------------------------------------------- # Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# extensions coming with Sphinx (named 'sphinx.ext.*') or custom # ones. # Extensions @@ -43,7 +43,7 @@ '.rst': 'restructuredtext', '.ipynb': 'myst-nb', '.myst': 'myst-nb', - '.md': 'myst-nb', + '.md': 'myst-nb' } ## Numpydoc @@ -74,12 +74,15 @@ ## Viewcode extensions.append("sphinx.ext.viewcode") +## View figures +extensions.append("sphinx_subfigure") + ## Copy Button extensions.append("sphinx_copybutton") ## BibTeX extensions.append("sphinxcontrib.bibtex") -bibtex_bibfiles = ['references.bib'] +bibtex_bibfiles = ["references.bib"] # -- Internationalization ------------------------------------------------ # specifying the natural language populates some key tags @@ -94,6 +97,9 @@ sitemap_locales = [None] sitemap_url_scheme = "{link}" +### Figure +extensions.append("sphinx_subfigure") + # Add any paths that contain templates here, relative to this directory. 
templates_path = ["_templates"] @@ -107,7 +113,7 @@ myst_enable_extensions = [ "dollarmath", "amsmath", - "colon_fence", + "colon_fence" ] # -- Options for HTML output ------------------------------------------------- diff --git a/docs/source/index.md b/docs/source/index.md index be0fd70..e5d9ae9 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,4 +1,4 @@ -```{include} ../../README.md +```{include} ./user_guide/introduction.md ``` ```{toctree} @@ -6,9 +6,10 @@ :caption: User guide :hidden: +user_guide/introduction user_guide/installation user_guide/getting-started -user_guide/CLI-interface +user_guide/command-line-interface user_guide/python-interface user_guide/jupyter-interface ``` diff --git a/docs/source/user_guide/CLI-interface.md b/docs/source/user_guide/CLI-interface.md deleted file mode 100644 index e27a1b8..0000000 --- a/docs/source/user_guide/CLI-interface.md +++ /dev/null @@ -1,115 +0,0 @@ - -# CLI Interface - -## Initialize Project Repository - -Create a new project repository or convert an existing repository into a CADET-RDM repo: - -```bash -rdm init -``` - - -The `output_folder_name` can be given optionally. It defaults to `output`. - - -## Executing scripts - -You can execute python files or arbitray commands using the CLI: - -```bash -cd path/to/your/project -rdm run_yml python "commit message for the results" -rdm run_yml command "command as it would be run" "commit message for the results" -``` - -For the run-command option, the command must be given in quotes, so: - -```bash -rdm run_yml command "python example_file.py" "commit message for the results" -``` - -## Re-using results from previous iterations - -Each result stored with CADET-RDM is given a unique branch name, formatted as: -`__"from"__` - -With this branch name, previously generated data can be loaded in as input data for -further calculations. 
The following command will copy the contents of the `branch_name` branch to the -cache folder at `project_root/output_cached/branch_name`. - -```bash -rdm data cache branch_name -``` - - -## Using results from another repository - -You can load in results from another repository to use in your project using the CLI: - -```bash -cd path/to/your/project -rdm data import -rdm data import --target_repo_location -``` - -This will store the URL, branch_name and location in the .cadet-rdm-cache.json file, like this: - -```json -{ - "__example/path/to/repo__": { - "source_repo_location": "git@jugit.fz-juelich.de:IBG-1/ModSim/cadet/agile_cadet_rdm_presentation_output.git", - "branch_name": "output_from_master_3910c84_2023-10-25_00-17-23", - "commit_hash": "6e3c26527999036e9490d2d86251258fe81d46dc" - } -} -``` - -You can use this file to load the remote repositories based on the cache.json with - -```bash -rdm data fetch -``` - -## Cloning from remote - -You should use `cadet-rdm clone` instead of `git clone` to clone the repo to a new location. - -```bash -rdm clone -``` - - -## Sharing Results - -To share your project code and results with others, you need to create remote repositories on e.g. -[GitHub](https://github.com/) or GitLab. You need to create a remote for both the _project_ repo and the -_results_ repo. - -Once created, the remotes need to be added to the local repositories. - -```bash -rdm remote add git@:.git -cd output -rdm remote add git@:_output.git -``` - -Once remotes are configured, you can push all changes to the project repo and the results repos with the -command - -```bash -rdm push -``` - -## Migrating a repository - -If you want to migrate a repository to another remote, the easiest way to do that at the moment is to create the remote -repositories on GitHub or GitLab and change the `origin` URL for the project and output repositories with: - -```bash -rdm remote set-url origin git@:.git -cd output -rdm remote set-url origin git@:_output.git -cd .. 
-rdm push -``` diff --git a/docs/source/user_guide/command-line-interface.md b/docs/source/user_guide/command-line-interface.md new file mode 100644 index 0000000..1b645ee --- /dev/null +++ b/docs/source/user_guide/command-line-interface.md @@ -0,0 +1,143 @@ +# Command line interface (CLI) + +The command line interface provides access to all CADET-RDM functionality via the `rdm` command. It is suited for scripted workflows, batch execution, and automation. + +## Repository initialization + +Create a new project repository or convert an existing directory into a CADET-RDM repository: + +```bash +rdm init [output_directory_name] +``` + +Options: + +- If no `` is provided, the repository is initialized in the root directory without creating a new directory. +- If `` is given as a relative path (e.g. "repository_name"), a new directory with that name is created inside the root directory. +- If `` is given as an absolute path (e.g. C:\Users\me\projects\myrepo), a new directory is created at the specified location. + +Optionally, an `[output_directory_name]` can be given. Otherwise, it defaults to `output`. + + +### Cookiecutter support + +Initialize a repository from a Cookiecutter template: + +```bash +rdm init --cookiecutter +``` + +If `` is provided, it overrides any directory name chosen in the Cookiecutter prompt. +If omitted, initialization happens in the current working directory. + +## Handling results with CADET-RDM + +### Running code and tracking results + +Each execution creates a new output branch containing the generated results and associated metadata. + +Run a Python script and track all generated results: + +```bash +rdm run python "commit message for the results" +``` + +Run an arbitrary command, for example a bash script: + +```bash +rdm run command "bash run_simulation.sh" "commit message for the results" +``` + +The command must be enclosed in quotes. 
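A minimal sketch of why the quotes matter, using Python's standard-library `shlex` module to mimic how a POSIX shell splits a command line into arguments. It only illustrates shell word splitting. How `rdm` parses its own arguments is not shown here, and the command strings are simply the example from the text above.

```python
import shlex

# The example command line, once with the inner command quoted and once without.
quoted = 'rdm run command "bash run_simulation.sh" "commit message for the results"'
unquoted = 'rdm run command bash run_simulation.sh "commit message for the results"'

# shlex.split follows POSIX shell quoting rules by default.
quoted_args = shlex.split(quoted)
unquoted_args = shlex.split(unquoted)

# Quoted: the command to execute stays together as a single argument.
print(quoted_args)
# Unquoted: "bash" and "run_simulation.sh" become two separate arguments.
print(unquoted_args)
```

With quotes, the command to run arrives as one argument. Without them it is split into several, so the positions of all following arguments shift.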
+ +### Staging, committing, and pushing changes + +Check repository consistency and stage changes: + +```bash +rdm check +``` + +Commit staged changes: + +```bash +rdm commit -m +``` + +Push both project and output repositories: + +```bash +rdm push +``` + +### Reusing results from earlier runs + +Each run is stored in an output branch named: + +``` +__ +``` + +Cache results locally: + +```bash +rdm data cache +``` + +### Using results from another repository + +Fetch repositories listed in `.cadet-rdm-cache.json`: + +```bash +rdm data fetch +``` + +## Remote repositories + +### Cloning repositories + +Clone an existing CADET-RDM repository: + +```bash +rdm clone +``` + +The destination directory must be empty. + +### Adding existing remotes + +Add remotes manually in both repositories: + +```bash +rdm remote add git@:.git +cd output +rdm remote add git@:_output.git +``` + +### Creating remotes automatically + +Create project and output remotes using the GitHub or GitLab APIs: + +```bash +rdm remote create +``` + +Example: + +```bash +rdm remote create https://github.com/ githubusers_workproject Workproject githubuser +``` + +The output repository name is derived automatically by appending `_output` to the project repository name. + +### Migrating repositories + +Update the `origin` remote for both repositories and push: + +```bash +rdm remote set-url origin git@:.git +cd output +rdm remote set-url origin git@:_output.git +cd .. 
+rdm push +``` diff --git a/docs/source/user_guide/figures/RDM-output-commits.png b/docs/source/user_guide/figures/RDM-output-commits.png new file mode 100644 index 0000000..a8b8c1d Binary files /dev/null and b/docs/source/user_guide/figures/RDM-output-commits.png differ diff --git a/docs/source/user_guide/figures/RDM-project-commits.png b/docs/source/user_guide/figures/RDM-project-commits.png new file mode 100644 index 0000000..69074d7 Binary files /dev/null and b/docs/source/user_guide/figures/RDM-project-commits.png differ diff --git a/docs/source/user_guide/figures/RDM_wide.png b/docs/source/user_guide/figures/RDM_wide.png new file mode 100644 index 0000000..7902ed4 Binary files /dev/null and b/docs/source/user_guide/figures/RDM_wide.png differ diff --git a/docs/source/user_guide/getting-started.md b/docs/source/user_guide/getting-started.md index 5be80c9..201ac91 100644 --- a/docs/source/user_guide/getting-started.md +++ b/docs/source/user_guide/getting-started.md @@ -1,103 +1,330 @@ - # Getting started -## Initialize Project Repository +CADET-RDM manages computational research projects by separating **project code** and **generated results** into two coupled Git repositories: + +* the **project repository**, which contains source code, configuration files, documentation, and metadata +* the **output repository**, which contains all results generated by executing the project code + +Both repositories are created and managed automatically. They are independent Git repositories with separate histories and remotes, but CADET-RDM provides workflows that operate on both to ensure reproducibility and traceability of results. + +CADET-RDM can be used through two interfaces: + +* a **command line interface (CLI)**, e.g. for scripted or automated bash workflows +* a **Python interface**, e.g. for direct context tracking of code within existing Python workflows + +Additionally, CADET-RDM can be used within Jupyter Lab with some limitations. 
+ +Detailed descriptions of commands and APIs are provided in the dedicated interface documentation. + +* [Command line interface](command-line-interface.md) +* [Python interface](python-interface.md) +* [Jupyter interface](jupyter-interface.md) -Create a new project repository or convert an existing repository into a CADET-RDM repo: +## Initializing a project repository + +Create a new project repository or convert an existing directory into a CADET-RDM repository. + +CLI: ```bash -rdm init +rdm init [output_directory_name] ``` -or from python +Python: ```python from cadetrdm import initialize_repo - -initialize_repo(path_to_repo) +initialize_repo(path_to_repo, [output_directory_name]) ``` -The `output_folder_name` can be given optionally. It defaults to `output`. +If `output_directory_name` is not provided, it defaults to `output`. + +During initialization, the project repository is created or updated and an output repository is created inside the project repository. The two repositories are independent Git repositories. -## Cookiecutter support +### Cookiecutter support -[Cookiecutter](https://github.com/cookiecutter/cookiecutter) can be used to set a template as a starting position for the repository initialization. +CADET-RDM supports initializing repositories from Cookiecutter templates. +CLI: ```bash -rdm init --cookiecutter template_url +rdm init --cookiecutter ``` -or from python +Python: ```python from cadetrdm import initialize_repo - initialize_repo(path_to_repo, cookiecutter_template="template_url") ``` +Options: -## Creating and adding remotes +* If `` is provided as an absolute or relative path, it overrides any directory name specified by the Cookiecutter template. +* If `` is omitted, the repository is initialized in the current working directory. No additional directory is created, even if the Cookiecutter template would normally create one. 
-You can create remotes for both the project and the output repository with one command, using the GitLab or GitHub API. +For state-of-the-art Python package development, we recommend using the [CADET Cookiecutter Template](https://github.com/cadet/CADET-Cookiecutter-Template). -You need to create a -[GitLab Personal Access Token (PAT)](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html) or [GitHub PAT](https://github.com/settings/tokens?type=beta) with api access rights -and store it in the Python `keyring` using an interactive Python session: +This template provides a standardized starting point for development projects, including project metadata, development tooling, and continuous integration configuration. -```python -import keyring +This Cookiecutter template automatically creates a repository with the following content: + +``` +README.md +LICENCE.md +AUTHORS.md +SECURITY.md +CODE_OF_CONDUCT.md +CONTRIBUTING.md +CITATION.bib +.zenodo.json +pyproject.toml +environment.yml +.pre-commit-config.yml +.github/dependabot.yml +.github/workflows/ruff.yml +.gitignore +``` +## Handling results with CADET-RDM + +### Running code and tracking results + +Each execution of project code creates a new output branch that contains the generated results and associated metadata. -keyring.set_password("e.g. https://jugit.fz-juelich.de/", username, token) +CLI (run a Python script): + +```bash +cd +rdm run python "commit message for the results" ``` -or in a command line +CLI (run an arbitrary command): -````commandline -keyring set "e.g. https://jugit.fz-juelich.de/" username -```` +```bash +rdm run command "bash run_simulation.sh" "commit message for the results" +``` -Then you can run: +Python (track results in code): ```python from cadetrdm import ProjectRepo repo = ProjectRepo() +repo.commit("Commit code changes") -repo.create_remotes( - name="e.g. API_test_project", - namespace="e.g. r.jaepel", - url="e.g. https://jugit.fz-juelich.de/", - username="e.g. 
r.jaepel" +with repo.track_results(results_commit_message="Generate results"): + data = generate_data() + write_data_to_file(data, output_directory=repo.output_directory) +``` + +### Pushing changes + +CLI (check consistency and stage changes): + +```bash +rdm check +``` + +CLI (commit staged changes): + +```bash +rdm commit -m +``` + +CLI (push project and output repositories): + +```bash +rdm push +``` + +Python (push both repositories): + +```python +repo.push() +``` + +### Reusing results from earlier runs + +Results are referenced by a unique output branch name: + +``` +__ +``` + +CLI: + +```bash +rdm data cache +``` + +Python: + +```python +cached_folder_path = repo.input_data(branch_name="") +``` + + +### Using results from another repository + +Results from other CADET-RDM projects can be reused via the local cache mechanism. + +CLI (fetch remote repositories defined in the cache file): + +```bash +rdm data fetch +``` + +Python (import a remote repository and register it in the cache): + +```python +repo.import_remote_repo( + source_repo_location="", + source_repo_branch="" +) +``` + +Optionally specify a destination directory: + +```python +repo.import_remote_repo( + source_repo_location="", + source_repo_branch="", + target_repo_location="" ) ``` -or in a command line +Python (load repositories listed in `.cadet-rdm-cache.json`): + +```python +repo.fill_data_from_cadet_rdm_json() +``` + +### How result metadata is cached + +CADET-RDM uses the `.cadet-rdm-cache.json` file to keep track of external or previously generated results that can be reused as input data. + +The file is located in the **project repository root** and is managed automatically by CADET-RDM. It should not be edited manually. + +Conceptually, the cache file stores **references to results**, not the results themselves. 
It records: + +* the location of an output repository +* the output branch containing the results +* the exact commit hash used for reproducibility + +A typical entry looks like this: + +```json +{ + "__example/path/to/repo__": { + "source_repo_location": "git@github.com:cadet/example_output.git", + "branch_name": "2024-10-25_main_3910c84", + "commit_hash": "6e3c26527999036e9490d2d86251258fe81d46dc" + } +} +``` + + +## Remote repositories + +To share both project code and results, remotes must be configured for both repositories. + +### Cloning an existing CADET-RDM project + +Use CADET-RDM cloning rather than `git clone`. + +CLI: + +```bash +rdm clone +``` + +Python: + +```python +from cadetrdm import ProjectRepo +ProjectRepo.clone("", "") +``` + +The destination directory must be empty. + + +### Adding existing remotes manually + +CLI: ```bash -rdm remote create url namespace name username -rdm remote create https://jugit.fz-juelich.de/ r.jaepel API_test_project r.jaepel +rdm remote add git@:.git +cd output +rdm remote add git@:_output.git +``` + +Python: + +```python +from cadetrdm import ProjectRepo + +repo = ProjectRepo() +repo.add_remote("git@:.git") +repo.output_repo.add_remote("git@:_output.git") ``` +### Creating remotes automatically via GitHub or GitLab APIs + +CADET-RDM can create both remotes automatically if a Personal Access Token is available in the Python keyring. + +The URL under which the token is stored must match the GitHub or GitLab instance used for remote creation. + +Store a token (Python): + +```python +import keyring +keyring.set_password("https://jugit.fz-juelich.de/", "username", "token") +``` -## Extending GIT-LFS scope +Store a token (CLI): -Several common datatypes are included in GIT-LFS by default. 
These currently are -`"*.jpg", "*.png", "*.xlsx", "*.h5", "*.ipynb", "*.pdf", "*.docx", "*.zip", "*.html"` +```bash +keyring set "https://jugit.fz-juelich.de/" +``` -You can add datatypes you require by running: +Create remotes (Python): -````python +```python from cadetrdm import ProjectRepo repo = ProjectRepo() +repo.create_remotes( + name="Workproject", + namespace="githubusers_workproject", + url="https://github.com/", + username="githubuser" +) +``` -repo.output_repo.add_filetype_to_lfs("*.npy") -```` +Create remotes (CLI): +```bash +rdm remote create +``` -or from within the output folder in a command line: +Example: ```bash -rdm lfs add *.npy +rdm remote create https://github.com/ githubusers_workproject Workproject githubuser ``` + +The output repository name is derived automatically by appending `_output` to the project repository name. + +### Migrating a repository to another remote + +To migrate to a different remote, update the `origin` URL for both repositories and push. + +CLI: + +```bash +rdm remote set-url origin git@:.git +cd output +rdm remote set-url origin git@:_output.git +cd .. +rdm push +``` \ No newline at end of file diff --git a/docs/source/user_guide/installation.md b/docs/source/user_guide/installation.md index 010ad5f..fdc1ea7 100644 --- a/docs/source/user_guide/installation.md +++ b/docs/source/user_guide/installation.md @@ -1,16 +1,19 @@ - # Installation -CADET-RDM can be installed using +CADET-RDM can be installed using: + +```bash +pip install cadet-rdm +``` -```pip install cadet-rdm``` +We strongly recommend using a dedicated environment to install CADET-RDM. See +[A guide to reproducible Python environments and CADET installations](https://forum.cadet-web.de/t/a-guide-to-reproducible-python-environments-and-cadet-installations/766) +for general background. -We *highly* recommend using an -[environment file](https://forum.cadet-web.de/t/a-guide-to-reproducible-python-environments-and-cadet-installations/766) -to install CADET-RDM. 
+## Installation using conda or mamba For use with [mamba](https://github.com/conda-forge/miniforge#mambaforge) or -[conda](https://docs.conda.io/projects/miniconda/en/latest/), create a rdm_environment.yml like: +[conda](https://docs.conda.io/projects/miniconda/en/latest/), create an environment file `rdm_environment.yml`: ```yaml name: rdm_example @@ -24,19 +27,89 @@ dependencies: - cadet-rdm ``` -and then run +Create the environment with: -```commandline -mamba env create -f rdm_environment.yml +```bash +conda env create -f rdm_environment.yml ``` -For use with [pip](https://pypi.org/project/pip/), create a rdm_requirements.txt file like: +## Installation using pip + +For use with [pip](https://pypi.org/project/pip/), create a `rdm_requirements.txt` file: ``` python==3.11 -cadet-rdm>=0.0.15 +cadet-rdm>=1.0.1 ``` -```commandline +Install the dependencies with: + +```bash pip install -r rdm_requirements.txt ``` + +## Developer installation + +To install a development version of CADET-RDM from source, clone the repository and install it in editable mode. + +```bash +git clone git@github.com:cadet/CADET-RDM.git +cd CADET-RDM +pip install -e . +``` + +Cloning via SSH is recommended. Alternatively, HTTPS can be used with +`https://github.com/cadet/CADET-RDM.git`. + + +This installs CADET-RDM in editable mode, so local code changes take effect immediately without reinstalling the package. This setup is recommended for development, debugging, or contributing to CADET-RDM. + + +## Git LFS + +Running CADET-RDM requires [Git LFS](https://git-lfs.com/), which must be installed separately. 
+ +* **Ubuntu/Debian**: + + ```bash + sudo apt-get install git-lfs + git lfs install + ``` + +* **macOS** (with Homebrew): + + ```bash + brew install git-lfs + git lfs install + ``` + +* **Windows**: + + Download and install Git LFS from + [https://git-lfs.com](https://git-lfs.com) + + +## Extending Git LFS scope + +Several common data types are tracked with Git LFS by default: + +``` +*.jpg, *.png, *.xlsx, *.h5, *.ipynb, *.pdf, *.docx, *.zip, *.html +``` + +Additional file types can be added if required. + +From Python: + +```python +from cadetrdm import ProjectRepo + +repo = ProjectRepo() +repo.output_repo.add_filetype_to_lfs("*.npy") +``` + +From the command line, run the following command inside the output repository: + +```bash +rdm lfs add *.npy +``` diff --git a/docs/source/user_guide/introduction.md b/docs/source/user_guide/introduction.md new file mode 100644 index 0000000..6a5862c --- /dev/null +++ b/docs/source/user_guide/introduction.md @@ -0,0 +1,69 @@ +# Introduction + +Welcome to CADET-Research Data Management, a project by the Forschungszentrum Jülich. + +This toolbox aims to help track and version control: + +* input data + +* code + +* software versions + +* configurations + +* metadata + +* output data + +and allow for easy sharing, integration, and reproduction of the generated results. + + +The tools of CADET-RDM can be applied to any project with the structure of an RDM project. + + +## RDM repository architecture + +CADET-RDM projects are structured into two distinct repositories. + +1. The **project repository** that contains the input data, code, software and configurations to execute the computations. The output repository is a directory within the project repository. +2. The **output repository** that contains the results of these computations, including all calculations, models and figures created by running the project code. Also stored in the output directory is the metadata used to create the specific result. This includes e.g. 
the software versions and requirements. + +:::{figure} figures/RDM_wide.png +:width: 700 +:alt: RDM structure + +CADET-RDM repository architecture +::: +::: + +Both the **project** and the **output** repository are their own Git repositories. The commit architecture of CADET-RDM allows for easy tracking and reproducing of results and their respective project code. + +## RDM commit architecture + +Every run of the project code creates a new output branch (*result branch*) in the **output directory**. This new branch contains only the files created by that execution of the project code.
At the same time, for every run of the project code the `run_history` directory on the master branch of the output repository is updated. This directory is unique to the master branch and contains the metadata and software specifications for every branch in the output repository. This directory also links the results in the output branch to the corresponding commit in the project repository used to create them. For transparency and easy accessibility, the most important specifications for every result branch are also documented in the `log.tsv` on the master branch of the output repository. + +```{eval-rst} +.. subfigure:: AB + :gap: 8px + :subcaptions: below + + .. image:: figures/RDM-project-commits.png + :alt: Project Repository + :width: 300px + + .. image:: figures/RDM-output-commits.png + :alt: Output Repository + :width: 420px + + CADET-RDM commit architecture. +``` + + +Because the metadata and the environment used to create a specific output are logged simultaneously, results can be reproduced easily. + +## User function + +The tools of CADET-RDM can be used through the command line interface (CLI) or by executing scripts in Python or in [Jupyter Lab](https://jupyterlab.readthedocs.io/en/latest/). + +The following documentation contains an installation guide, a user guide to quickly start using CADET-RDM and more detailed descriptions of the command line interface, Python interface and Jupyter interface. \ No newline at end of file diff --git a/docs/source/user_guide/jupyter-interface.md b/docs/source/user_guide/jupyter-interface.md index 4ff80d4..3783b5e 100644 --- a/docs/source/user_guide/jupyter-interface.md +++ b/docs/source/user_guide/jupyter-interface.md @@ -1,38 +1,43 @@ - # Jupyter interface -The CADET-RDM Jupyter interface **only works** with [Jupyter Lab](https://jupyterlab.readthedocs.io/en/latest/), -and not with the old [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html) interface -at the moment. 
+The CADET-RDM Jupyter interface provides integration with JupyterLab for tracking code and results generated from notebooks. + +At the moment, the Jupyter interface **only works** with [Jupyter Lab](https://jupyterlab.readthedocs.io/en/latest/), +and not with the classic [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html) interface. -## General concepts +## Overview and concepts + +The Jupyter interface builds on the Python interface and applies additional constraints to ensure reproducibility when working with notebooks. ### Jupytext -Jupyter Notebooks are not well suited for version control with git, as the metadata and cell outputs are stored besides -the input code. This overwhelms the inspection of differences within commits and the comparisons between branches. +Jupyter notebooks are not well suited for version control with Git, as metadata and cell outputs are stored alongside the input code. This makes inspecting changes and comparing branches difficult. + +Therefore, CADET-RDM uses the [jupytext](https://github.com/mwouts/jupytext) extension by default. Notebooks are converted from `.ipynb` files into `.py` files, with Markdown cells stored as block comments. -Therefore, the [jupytext](https://github.com/mwouts/jupytext) extension is used by default to convert `.ipynb` files -into a `.py` files, with the markdown cells included as block comments. All `.ipynb` files are removed from git's -version control through the `.gitignore` file and only changes in the `.py` files are tracked. The `.py` files are -automatically created and updated whenever a `.ipynb` file is saved. 
+* `.ipynb` files are excluded from version control via `.gitignore` +* only the generated `.py` files are tracked in Git +* the `.py` file is automatically created and updated whenever the notebook is saved -Please ensure, that `juyptext` is working for you and that a `.py` file is created after saving your notebook, otherwise -your code will not be version-controlled. +Please ensure that `jupytext` is working correctly and that a `.py` file is generated when saving the notebook. Otherwise, code changes will not be version controlled. ### Reproducibility -To ensure results from `.ipynb` files are perfectly reproducible, `CADET-RDM` does not allow for the tracking of -results generated during live-coding usage. Therefore, before committing results, -all previous outputs are cleared and all cells -are executed sequentially from top to bottom and then committed to the output repository. +To ensure that results generated from notebooks are reproducible, CADET-RDM does not allow tracking results produced during interactive execution. + +Before committing results: + +* all existing outputs are cleared +* all cells are executed sequentially from top to bottom +* the executed notebook is committed to the output repository + -To maintain the link between Markdown annotation, code, and inline graphs, the final notebook is also saved as -a `.html` webpage into the output folder for future inspection. -## Tracking Results +## Handling results with CADET-RDM -To use `CADET-RDM` from within an `.ipynb` file, please include this at the top of your file. 
+### Tracking results from notebooks + +To use CADET-RDM inside a Jupyter notebook, initialize the repository interface at the top of the notebook: ```python from cadetrdm.repositories import JupyterInterfaceRepo @@ -40,7 +45,8 @@ from cadetrdm.repositories import JupyterInterfaceRepo repo = JupyterInterfaceRepo() ``` -Then, at the end of your file, run: +At the end of the notebook, trigger result tracking and committing: + ```python repo.commit_nb_output( "path-to-the-current-notebook.ipynb", @@ -48,21 +54,32 @@ repo.commit_nb_output( ) ``` -This will re-run the `.ipynb` file from the start, save a html version of the completed notebook into the output repo -and commit all changes to the output repo. +This will: -## Committing changes to your code +* re-run the notebook from the beginning +* commit all generated results to a new output branch +* save an HTML and an `.ipynb` version of the current notebook inside the output branch. The `conversion_formats` parameter specifies the formats in which the notebook is saved and defaults to `["html", "ipynb"]`. -You can commit all current changes to your code directly from Jupyter by running +### Committing code changes + +Code changes can be committed directly from within Jupyter: ```python from cadetrdm.repositories import JupyterInterfaceRepo repo = JupyterInterfaceRepo() - repo.commit("Commit message") ``` + ## Other workflows -All other workflows function identically as described in the {ref}`python_interface` section. \ No newline at end of file +All other workflows, including: + +* reusing results from earlier runs +* importing results from other repositories +* configuring remotes +* pushing results +* cloning repositories + +behave identically to the Python interface and are described in the {ref}`python_interface` section. 
\ No newline at end of file diff --git a/docs/source/user_guide/python-interface.md b/docs/source/user_guide/python-interface.md index 48860af..f3e47ee 100644 --- a/docs/source/user_guide/python-interface.md +++ b/docs/source/user_guide/python-interface.md @@ -1,113 +1,158 @@ (python_interface)= # Python interface -## Tracking Results +The Python interface exposes all CADET-RDM functionality for direct use within Python scripts, libraries, and interactive environments. It is suited for programmatic control, direct context tracking of code execution, and integration into existing Python workflows. + +## Repository initialization + +Create a new project repository or convert an existing directory into a CADET-RDM repository. ```python -from cadetrdm import ProjectRepo +from cadetrdm import initialize_repo + +initialize_repo(path_to_repo, [output_directory_name]) +``` + +Options: -""" -Your imports and function declarations -e.g. generate_data(), write_data_to_file(), analyse_data() and plot_analysis_results() -""" +- If no `path_to_repo` is provided, the repository is initialized in the root directory without creating a new directory. +- If `path_to_repo` is given as a relative path (e.g. "repository_name"), a new directory with that name is created inside the root directory. +- If `path_to_repo` is given as an absolute path (e.g. `C:\Users\me\projects\myrepo`), a new directory is created at the specified location. -if __name__ == '__main__': - # Instantiate CADET-RDM ProjectRepo handler - repo = ProjectRepo() +Optionally, an `output_directory_name` can be given. Otherwise, it defaults to `output`. 
- # If you've made changes to the code, commit the changes - repo.commit("Add code to generate and analyse example data") - # Everything written to the output_folder within this context manager gets tracked - # The method repo.output_data() generates full paths to within your output_folder - with repo.track_results(results_commit_message="Generate and analyse example data"): - data = generate_data() - write_data_to_file(data, output_folder=repo.output_folder) +### Cookiecutter support - analysis_results = analyse_data(data) - plot_analysis_results(analysis_results, figure_path=repo.output_folder / "analysis" / "regression.png") +Repositories can be initialized from Cookiecutter templates. +```python +from cadetrdm import initialize_repo + +initialize_repo(path_to_repo, cookiecutter_template="template_url") ``` -## Sharing Results +If `path_to_repo` is provided, it overrides any directory name specified by the Cookiecutter template. +If omitted, initialization happens in the current working directory. + +## Handling results with CADET-RDM -To share your project code and results with others, you need to create remote repositories on e.g. -[GitHub](https://github.com/) or GitLab. You need to create a remote for both the _project_ repo and the -_results_ repo. +### Tracking, committing and pushing results -Once created, the remotes need to be added to the local repositories. +Results are tracked using the `ProjectRepo` interface. All files written inside the tracking context are stored in a new output branch together with execution metadata. 
```python +from cadetrdm import ProjectRepo + repo = ProjectRepo() -repo.add_remote("git@:.git") -repo.output_repo.add_remote("git@:_output.git") +repo.commit("Commit code changes") + +with repo.track_results(results_commit_message="Generate results"): + data = generate_data() + write_data_to_file(data, output_directory=repo.output_directory) + + analysis_results = analyse_data(data) + plot_analysis_results( + analysis_results, + figure_path=repo.output_directory / "analysis" / "regression.png" + ) ``` -Once remotes are configured, you can push all changes to the project repo and the results repos with the -command +Each execution creates a new output branch containing the generated results and associated metadata. + +Project and output repositories can be pushed together using a single command. ```python -# push all changes to the Project and Output repositories with one command: repo.push() ``` -## Re-using results from previous iterations +Consistency checks and staging are handled automatically by the Python interface before pushing. -Each result stored with CADET-RDM is given a unique branch name, formatted as: -`__"from"__` +### Reusing results from earlier runs -With this branch name, previously generated data can be loaded in as input data for -further calculations. 
+Each run is stored in an output branch named: -```python -cached_folder_path = repo.input_data(branch_name=branch_name) +``` +__ ``` +Reuse results from a previous run by loading them into the local cache: -```json -{ - "__example/path/to/repo__": { - "source_repo_location": "git@jugit.fz-juelich.de:IBG-1/ModSim/cadet/agile_cadet_rdm_presentation_output.git", - "branch_name": "output_from_master_3910c84_2023-10-25_00-17-23", - "commit_hash": "6e3c26527999036e9490d2d86251258fe81d46dc" - } -} +```python +cached_folder_path = repo.input_data(branch_name="") ``` -## Using results from another repository +### Using results from another repository -You can load in results from another repository to use in your project using the CLI: +Results from other CADET-RDM repositories can be imported and registered in the local cache. ```python -repo.import_remote_repo(source_repo_location="", source_repo_branch="") -repo.import_remote_repo(source_repo_location="", source_repo_branch="", - target_repo_location="") +repo.import_remote_repo( + source_repo_location="", + source_repo_branch="" +) ``` -This will store the URL, branch_name and location in the .cadet-rdm-cache.json file, like this: +Optionally, a destination directory can be specified: -```json -{ - "__example/path/to/repo__": { - "source_repo_location": "git@jugit.fz-juelich.de:IBG-1/ModSim/cadet/agile_cadet_rdm_presentation_output.git", - "branch_name": "output_from_master_3910c84_2023-10-25_00-17-23", - "commit_hash": "6e3c26527999036e9490d2d86251258fe81d46dc" - } -} +```python +repo.import_remote_repo( + source_repo_location="", + source_repo_branch="", + target_repo_location="" +) ``` -You can use this file to load the remote repositories based on the cache.json with +Repositories listed in `.cadet-rdm-cache.json` can be loaded with: ```python repo.fill_data_from_cadet_rdm_json() ``` -## Cloning from remote +## Remote repositories -You should use `cadetrdm.ProjectRepo.clone()` instead of `git clone` to clone the 
repo to a new location. +### Cloning repositories + +Clone an existing CADET-RDM repository. This method must be used instead of `git clone` to ensure that both project and output repositories are initialized correctly. ```python from cadetrdm import ProjectRepo -ProjectRepo.clone("") +ProjectRepo.clone("", "") ``` + +The destination directory must be empty. + +### Adding existing remotes + +Add remotes manually for both the project and output repositories. + +```python +from cadetrdm import ProjectRepo + +repo = ProjectRepo() +repo.add_remote("git@:.git") +repo.output_repo.add_remote("git@:_output.git") +``` + +### Creating remotes automatically + +Remote repositories can be created automatically using the GitHub or GitLab APIs if a Personal Access Token is available in the Python keyring. + +```python +from cadetrdm import ProjectRepo + +repo = ProjectRepo() +repo.create_remotes( + name="Workproject", + namespace="githubusers_workproject", + url="https://github.com/", + username="githubuser" +) +``` + +The output repository name is derived automatically by appending `_output` to the project repository name. + +### Migrating repositories + +Migration to a different remote is performed by updating the `origin` URLs for both repositories and pushing the changes. This follows the same workflow as the command line interface.
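
The migration step described above (update the `origin` URL of both repositories, then push) can be sketched with plain Git commands driven from Python. This is a minimal sketch under stated assumptions and is not the CADET-RDM API: the helper names `build_migration_commands` and `run_commands`, the example URLs, and the assumption that the output repository lives in an `output` subdirectory of the project are all hypothetical. Only `git remote set-url` and `git push` are standard Git.

```python
import subprocess
from pathlib import Path


def build_migration_commands(project_dir, project_url, output_url,
                             output_dir_name="output"):
    """Return (working_directory, argv) pairs that repoint `origin` and push.

    Hypothetical helper, not part of CADET-RDM. Assumes the output
    repository is a subdirectory of the project repository.
    """
    project_dir = Path(project_dir)
    output_dir = project_dir / output_dir_name
    commands = []
    for repo_dir, url in ((project_dir, project_url), (output_dir, output_url)):
        # Point the existing `origin` remote at the new location.
        commands.append((repo_dir, ["git", "remote", "set-url", "origin", url]))
        # Push every local branch, since each run has its own output branch.
        commands.append((repo_dir, ["git", "push", "--all", "origin"]))
    return commands


def run_commands(commands):
    """Execute each command in its working directory; stop on the first failure."""
    for cwd, argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URLs (assumptions); print the plan instead of executing it.
    planned = build_migration_commands(
        ".",
        "git@example.com:group/project.git",
        "git@example.com:group/project_output.git",
    )
    for cwd, argv in planned:
        print(cwd, " ".join(argv))
```

Separating command construction from execution lets the planned Git calls be inspected before anything is changed. Passing the result to `run_commands` would then apply them to both repositories.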