From 1695dfb47da04320f674064ac3b367d86c648db7 Mon Sep 17 00:00:00 2001 From: Mossa Date: Thu, 23 Apr 2026 23:53:00 +0200 Subject: [PATCH 1/5] sync spec to current implementation plus added missing items --- specs.md | 101 ++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 89 insertions(+), 12 deletions(-) diff --git a/specs.md b/specs.md index aa1c053..d1a0dfd 100644 --- a/specs.md +++ b/specs.md @@ -48,6 +48,15 @@ A file in a `dvs` project can be in 3 states: ## In-depth spec +### errors / failures + +- [ ] describe --fail-fast +- [ ] do we stop processing when an error occurred? + +#### CLI + +#### R package + ### init init will always error if a `dvs.toml` already exists in the target directory. @@ -56,6 +65,10 @@ This check is local: a `dvs.toml` in a parent directory does not prevent initial On partial failure (e.g., metadata folder or storage creation fails after `dvs.toml` is written), init attempts best-effort cleanup of local artifacts (`dvs.toml` and, if it didn't exist beforehand, the metadata folder) so that a retry is possible. +A storage directory or backend is bound to one dvs repository. + +`metadata_folder_name` is omitted from `dvs.toml` if not specified. + #### CLI ```shell @@ -86,12 +99,21 @@ Options: #### Rust library +- [ ] figure out the actual spec for the library, other than being the engine + behind both CLI and R package. + Library takes a project directory and the config to save. #### R package ```r -dvs_init(storage_path, root_dir = ".", group = NULL, metadata_folder_name = NULL, no_compression = FALSE) +dvs_init( + storage_path, + root_dir = NULL, + group = NULL, + metadata_folder_name = NULL, + compression = c("zstd", "none") +) ``` - `storage_path`: where the data will be stored (required, same as CLI's ``) @@ -100,10 +122,15 @@ dvs_init(storage_path, root_dir = ".", group = NULL, metadata_folder_name = NULL - `metadata_folder_name`: custom name for the metadata folder (default `.dvs`) - `no_compression`: disable zstd compression of stored files -Returns a list with `status = "initialized"`. +Returns a list with `status = "initialized"`, invisibly. ### add +- [ ] does not support folder as an argument +- [ ] does not support overriding compression of the storage data, e.g. if repo + is not set to use compression, we cannot add one file that selectively is + then compressed. + It only takes files as input, directories will not work unless combined with a glob. It can also take an optional message that will be recorded in the metadata file. @@ -121,11 +148,13 @@ Symlinks are resolved before adding. If a symlink target resolves to a path outs Each `add` operation is atomic: the storage write and metadata update either both succeed or both roll back. A failure writing to storage will not leave behind a partial metadata file, and vice versa. -You can also do a dry run from the CLI or the library that will return the outcome that would have happened for each file but +You can also do a dry run from the CLI,the library, or R package that will return the outcome that would have happened for each file but without actually doing them. #### CLI +- [ ] missing `-r, --recursive` options that `dvs status` has. + ```shell ❯ dvs add --help Adds the given files to dvs. You can use a glob or paths. If you pass a directory and a glob, the glob will be ran from that directory. At least one path or --glob must be provided @@ -147,7 +176,7 @@ Options: You can run `dvs add *.csv` and it will be expanded by your shell before calling `dvs`. To ensure globs are consistent with the R package, you can use the `--glob` parameter which will be expanded by the library. -This will exit with `1` if one or more files could not be added to the storage (file does not exist, no permissions etc). +This will exit with `1` if one or more files could not be added to the storage (file does not exist, no permissions, etc). #### Rust library @@ -157,17 +186,20 @@ It otherwise returns a list of results sorted alphabetically by path, letting us #### R package ```r -dvs_add(files = character(0), message, glob = NULL, dry_run = FALSE) +dvs_add(paths = character(0), message = NULL, glob = NULL, dry_run = NULL) ``` - `files`: character vector of file paths to add (can be empty if `glob` is provided) -- `message`: optional message recorded in the metadata file. Omit or pass `NULL` to skip +- `message`: optional message recorded in the metadata file. - `glob`: pattern to match files (same resolution rules as CLI `--glob`) - `dry_run`: if `TRUE`, returns what would be added without making changes Returns a data frame with one row per file. Errors if no files match. Glob resolution uses the same rules as the CLI — see the Globbing section. +- [ ] Partial completion results in a list containing two elements named `result` +and `error`, each containing data-frames describing the failures and successes. + ### get `get` retrieves files from the storage into the project. @@ -182,11 +214,13 @@ is deleted and the operation fails for that file. Users can use `get` with specific paths or globs. In practice those will be ran on the metadata folder rather than the actual project, to know what to pull but the resolution works the same way as `add`. -You can also do a dry run from the CLI or the library that will return the outcome that would have happened for each file but +You can also do a dry run from the CLI, R package or the library that will return the outcome that would have happened for each file but without actually doing them. #### CLI +- [ ] missing `-r, --recursive` options that `dvs status` has. + ```shell ❯ dvs get --help Retrieves the given files from dvs storage. You can use a glob or paths. If you pass a directory and a glob, the glob will be ran from that directory. At least one path or --glob must be provided @@ -207,13 +241,14 @@ Options: This will exit with `1` if one or more files could not be retrieved. #### Rust library + The library automatically sets up parallelism and the only error it can return is if it couldn't set up the threadpool. It otherwise returns a list of results sorted alphabetically by path, letting users decide what to do with each. #### R package ```r -dvs_get(files = character(0), glob = NULL, dry_run = FALSE) +dvs_get(paths = character(0), glob = NULL, dry_run = NULL) ``` - `files`: character vector of file paths to retrieve (can be empty if `glob` is provided) @@ -222,8 +257,13 @@ dvs_get(files = character(0), glob = NULL, dry_run = FALSE) Returns a data frame with one row per file. Errors if no files match. +- [ ] Partial completion results in a list containing two elements named `result` +and `error`, each containing data-frames describing the failures and successes. + ### status +- [ ] does not show "untracked" files even when a folder is provided + This returns the status (mentioned in the high level overview above) of the tracked files in the project. #### CLI @@ -260,10 +300,16 @@ as per-file errors in the result list. #### R package ```r -dvs_status(current = FALSE, absent = FALSE, unsynced = FALSE) +dvs_status( + paths = character(0), + recursive = NULL, + status = c("current", "absent", "unsynced") +) ``` -- `current`, `absent`, `unsynced`: filter flags. When all are `FALSE` (default), all tracked files are returned. When one or more are `TRUE`, only files matching those states are returned. Errors are always included. +- `paths` optional vector of files to retrieve status of +- `status` represent the filter flags `current`, `absent`, `unsynced`. + When all are `FALSE` (default), all tracked files are returned. When one or more are `TRUE`, only files matching those states are returned. Errors are always included. Returns a data frame with one row per tracked file. @@ -317,7 +363,6 @@ prevents partial blobs from appearing in storage mapping to an expected file. Stored files are set read-only after writing. - ### Audit trail Every `add` operation is logged to an append-only audit file (`audit.log.jsonl`) in the storage directory. Each @@ -326,12 +371,37 @@ entry records: - `operation_id`: a UUID grouping all files from one `add` invocation - `timestamp`: unix seconds - `user`: the system username of whoever ran the command -- `file`: path and hashes of the added file - `action`: currently only `add` +`add` is an object that contains + +- `file`: path and hashes of the added file +- `compression`: `"zstd"` or `"none"`, see add section above + The audit log is protected by a mutex so a single `dvs` process cannot corrupt it but there is no protection against multiple processes appending logs. +Example: + +```json +{ + "operation_id": "93ca9155-b767-47e8-b433-a110f5943212", + "timestamp": 1776959650, + "user": "elea", + "action": { + "add": { + "file": { + "path": "data/derived/many34.csv", + "hashes": { + "blake3": "55539acdd1e0617c9180f8a593aeda480b7fab415faef16596f25098908b8487" + } + }, + "compression": "zstd" + } + } +} +``` + ### Hash cache `dvs` maintains a SQLite cache at `{metadata_folder}/.cache/dvs.db` to avoid re-hashing files that haven't @@ -342,11 +412,18 @@ and operations proceed without it if that also fails. ### Parallelism +- [ ] update the threads strategy here to reflect that of current `dvs2`. + `add`, `get`, and `status` run file operations in parallel. You can set the `DVS_NUM_THREADS` environment variable to control the thread count. If it is unset, the default is `min(available_parallelism * 4, 16)`. If it is set to a positive integer, the override is capped at 32. In both cases the final thread count is clamped to the number of files being processed. +#### R package + +The environment variable `DVS_NUM_THREADS` takes precedence over the threadpool +size set by the R package. + ### Gitignore After a successful `add`, each added file is appended to a `.gitignore` in the file's parent directory From 20d8501faa2d637b5e53541a92ab29dfec1dccde Mon Sep 17 00:00:00 2001 From: Mossa Date: Fri, 24 Apr 2026 13:12:30 +0200 Subject: [PATCH 2/5] updated spec based on slack feedback --- specs.md | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/specs.md b/specs.md index d1a0dfd..6771aeb 100644 --- a/specs.md +++ b/specs.md @@ -48,10 +48,10 @@ A file in a `dvs` project can be in 3 states: ## In-depth spec -### errors / failures +### Errors / failures -- [ ] describe --fail-fast -- [ ] do we stop processing when an error occurred? +All input must be evaluated, and errors as well as successes must be collated +are reported back to the user. #### CLI @@ -126,13 +126,8 @@ Returns a list with `status = "initialized"`, invisibly. ### add -- [ ] does not support folder as an argument -- [ ] does not support overriding compression of the storage data, e.g. if repo - is not set to use compression, we cannot add one file that selectively is - then compressed. - It only takes files as input, directories will not work unless combined with a glob. It can also take an optional -message that will be recorded in the metadata file. +message that will be recorded in the metadata file. Similarly, `add` must not have a recursive option; The glob mechanism is sufficient, and intentional when used. This method follows a best-effort approach: even if some files failed to be added, it will still try to add everything and not stop. @@ -153,8 +148,6 @@ without actually doing them. #### CLI -- [ ] missing `-r, --recursive` options that `dvs status` has. - ```shell ❯ dvs add --help Adds the given files to dvs. You can use a glob or paths. If you pass a directory and a glob, the glob will be ran from that directory. At least one path or --glob must be provided @@ -262,9 +255,8 @@ and `error`, each containing data-frames describing the failures and successes. ### status -- [ ] does not show "untracked" files even when a folder is provided - -This returns the status (mentioned in the high level overview above) of the tracked files in the project. +This returns the status (mentioned in the high level overview above) + of the tracked files in the project. #### CLI @@ -309,7 +301,7 @@ dvs_status( - `paths` optional vector of files to retrieve status of - `status` represent the filter flags `current`, `absent`, `unsynced`. - When all are `FALSE` (default), all tracked files are returned. When one or more are `TRUE`, only files matching those states are returned. Errors are always included. + When omitted, all three states are present in the returned data-frame, otherwise serval filtering status can be provided. Returns a data frame with one row per tracked file. @@ -433,7 +425,11 @@ the `add` operation to fail. ### Globbing +<<<<<<< HEAD `add`, `status`, `get` accept a `--glob` flag. The resolution works the following way: +======= +`add`, `status`, `get` both accept a `--glob` flag. The resolution works the following way: +>>>>>>> 6f99880 (updated spec based on slack feedback) - Explicit files: added/retrieved directly (glob ignored) - Explicit directories with a glob: walked and filtered by glob From fc20b6c3702c82a95b3ea8c141b156fe57ab80d4 Mon Sep 17 00:00:00 2001 From: Mossa Date: Thu, 30 Apr 2026 16:19:47 +0200 Subject: [PATCH 3/5] finished the last edits; dvs-rpkg on how to handle the override of threads is next --- specs.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/specs.md b/specs.md index 6771aeb..86977da 100644 --- a/specs.md +++ b/specs.md @@ -190,7 +190,7 @@ dvs_add(paths = character(0), message = NULL, glob = NULL, dry_run = NULL) Returns a data frame with one row per file. Errors if no files match. Glob resolution uses the same rules as the CLI — see the Globbing section. -- [ ] Partial completion results in a list containing two elements named `result` +Partial completion leads to a return value as a list containing two elements named `result` and `error`, each containing data-frames describing the failures and successes. ### get @@ -250,7 +250,7 @@ dvs_get(paths = character(0), glob = NULL, dry_run = NULL) Returns a data frame with one row per file. Errors if no files match. -- [ ] Partial completion results in a list containing two elements named `result` +Partial completion leads to a return value as a list containing two elements named `result` and `error`, each containing data-frames describing the failures and successes. ### status @@ -407,7 +407,8 @@ and operations proceed without it if that also fails. - [ ] update the threads strategy here to reflect that of current `dvs2`. `add`, `get`, and `status` run file operations in parallel. You can set the `DVS_NUM_THREADS` environment -variable to control the thread count. If it is unset, the default is `min(available_parallelism * 4, 16)`. +variable to control the thread count, if the CLI and R package have not set the number +of threads directly. If it is unset, the default is `min(available_parallelism * 4, 16)`. If it is set to a positive integer, the override is capped at 32. In both cases the final thread count is clamped to the number of files being processed. From ed498bacab716d44184aaf56007bbb098be170be Mon Sep 17 00:00:00 2001 From: Mossa Date: Mon, 4 May 2026 14:10:23 +0200 Subject: [PATCH 4/5] follow up on claude review --- specs.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/specs.md b/specs.md index 86977da..9133e20 100644 --- a/specs.md +++ b/specs.md @@ -50,13 +50,20 @@ A file in a `dvs` project can be in 3 states: ### Errors / failures -All input must be evaluated, and errors as well as successes must be collated -are reported back to the user. +All input must be evaluated, and errors as well as successes must be collected, +and then reported back to the user. #### CLI +Error handling in CLI must be designed around the use case of being a part +of pipelines, or non-interactive workflows. Returning error codes is necessary. + #### R package +Error handling in the R package assumes interactive, long-lived sessions. Thus, +the R package must provide comprehensive information, that the end-user can +explore. + ### init init will always error if a `dvs.toml` already exists in the target directory. @@ -99,10 +106,7 @@ Options: #### Rust library -- [ ] figure out the actual spec for the library, other than being the engine - behind both CLI and R package. - -Library takes a project directory and the config to save. +Library takes a project directory and the configuration of the backend to initiate repository. #### R package From 535784b3ba35c85f8fed4ccffa966a0d7084fd30 Mon Sep 17 00:00:00 2001 From: Mossa Date: Mon, 4 May 2026 16:26:23 +0200 Subject: [PATCH 5/5] final corrections. --- specs.md | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/specs.md b/specs.md index 9133e20..b028cec 100644 --- a/specs.md +++ b/specs.md @@ -50,8 +50,8 @@ A file in a `dvs` project can be in 3 states: ### Errors / failures -All input must be evaluated, and errors as well as successes must be collected, -and then reported back to the user. +All input must be evaluated, and errors as well as successes must be collected +are then reported back to the user. #### CLI @@ -124,7 +124,7 @@ dvs_init( - `root_dir`: project root where `dvs.toml` is created (defaults to working directory) - `group`: Unix group to set on storage directory and files - `metadata_folder_name`: custom name for the metadata folder (default `.dvs`) -- `no_compression`: disable zstd compression of stored files +- `compression`: desired compression for the stored date (default `zstd`) Returns a list with `status = "initialized"`, invisibly. @@ -186,7 +186,7 @@ It otherwise returns a list of results sorted alphabetically by path, letting us dvs_add(paths = character(0), message = NULL, glob = NULL, dry_run = NULL) ``` -- `files`: character vector of file paths to add (can be empty if `glob` is provided) +- `paths`: character vector of file paths to add (can be empty if `glob` is provided) - `message`: optional message recorded in the metadata file. - `glob`: pattern to match files (same resolution rules as CLI `--glob`) - `dry_run`: if `TRUE`, returns what would be added without making changes @@ -216,8 +216,6 @@ without actually doing them. #### CLI -- [ ] missing `-r, --recursive` options that `dvs status` has. - ```shell ❯ dvs get --help Retrieves the given files from dvs storage. You can use a glob or paths. If you pass a directory and a glob, the glob will be ran from that directory. At least one path or --glob must be provided @@ -303,9 +301,9 @@ dvs_status( ) ``` -- `paths` optional vector of files to retrieve status of -- `status` represent the filter flags `current`, `absent`, `unsynced`. - When omitted, all three states are present in the returned data-frame, otherwise serval filtering status can be provided. +- `paths` optional vector of paths files to retrieve status of +- `status` represent the filter flags `current`, `absent`, `unsynced` + When omitted, all three states are present in the returned data-frame, otherwise serval filtering status can be provided Returns a data frame with one row per tracked file. @@ -408,8 +406,6 @@ and operations proceed without it if that also fails. ### Parallelism -- [ ] update the threads strategy here to reflect that of current `dvs2`. - `add`, `get`, and `status` run file operations in parallel. You can set the `DVS_NUM_THREADS` environment variable to control the thread count, if the CLI and R package have not set the number of threads directly. If it is unset, the default is `min(available_parallelism * 4, 16)`. @@ -430,11 +426,7 @@ the `add` operation to fail. ### Globbing -<<<<<<< HEAD `add`, `status`, `get` accept a `--glob` flag. The resolution works the following way: -======= -`add`, `status`, `get` both accept a `--glob` flag. The resolution works the following way: ->>>>>>> 6f99880 (updated spec based on slack feedback) - Explicit files: added/retrieved directly (glob ignored) - Explicit directories with a glob: walked and filtered by glob