106 changes: 88 additions & 18 deletions specs.md
@@ -48,6 +48,22 @@ A file in a `dvs` project can be in 3 states:

## In-depth spec

### Errors / failures

All input must be evaluated, and errors as well as successes must be collected
and then reported back to the user.
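
A minimal sketch of the collect-everything rule, assuming a hypothetical `Report` type and `process_all` helper that are not part of `dvs`. It only illustrates that every input is evaluated and that both outcomes are kept:

```rust
/// Outcome of processing every input: nothing is dropped and nothing aborts early.
/// `Report` is a hypothetical type used for illustration only.
#[derive(Debug, Default)]
pub struct Report {
    /// Inputs that were processed successfully.
    pub successes: Vec<String>,
    /// Inputs that failed, paired with a human-readable reason.
    pub failures: Vec<(String, String)>,
}

/// Runs `op` on every input and records each outcome instead of
/// returning on the first error.
pub fn process_all<F>(inputs: &[&str], op: F) -> Report
where
    F: Fn(&str) -> Result<(), String>,
{
    let mut report = Report::default();
    for &input in inputs {
        match op(input) {
            Ok(()) => report.successes.push(input.to_string()),
            Err(reason) => report.failures.push((input.to_string(), reason)),
        }
    }
    report
}
```

A caller can then decide how to present `successes` and `failures`, which is where the CLI and the R package differ.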

#### CLI

Error handling in the CLI must be designed for use in pipelines and other
non-interactive workflows. Returning error codes is necessary.

#### R package

Error handling in the R package assumes interactive, long-lived sessions. Thus,
the R package must provide comprehensive information that the end user can
explore.

### init

init will always error if a `dvs.toml` already exists in the target directory.
@@ -56,6 +72,10 @@ This check is local: a `dvs.toml` in a parent directory does not prevent initial
On partial failure (e.g., metadata folder or storage creation fails after `dvs.toml` is written),
init attempts best-effort cleanup of local artifacts (`dvs.toml` and, if it didn't exist beforehand, the metadata folder) so that a retry is possible.
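
The local check and the best-effort cleanup could be sketched as follows. The function names are hypothetical and the real implementation may differ. The sketch assumes the caller recorded whether the metadata folder existed before init started:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Refuses to initialise when `dvs.toml` already exists directly in `root`.
/// Parent directories are deliberately not inspected.
pub fn ensure_not_initialized(root: &Path) -> io::Result<()> {
    if root.join("dvs.toml").exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "dvs.toml already exists in the target directory",
        ));
    }
    Ok(())
}

/// Best-effort cleanup after a failed init. Errors are ignored on purpose:
/// the only goal is to make a retry possible.
pub fn cleanup_failed_init(root: &Path, metadata_dir: &Path, metadata_existed_before: bool) {
    let _ = fs::remove_file(root.join("dvs.toml"));
    // Only remove the metadata folder if this init attempt created it.
    if !metadata_existed_before {
        let _ = fs::remove_dir_all(metadata_dir);
    }
}
```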

A storage directory or backend is bound to one dvs repository.
Collaborator

I don't think we have a check for that? Or a way to actually detect it


`metadata_folder_name` is omitted from `dvs.toml` if not specified.
Collaborator

that seems oddly specific and not something that should be in the specs


#### CLI

```shell
@@ -86,26 +106,32 @@ Options:

#### Rust library

Library takes a project directory and the config to save.
The library takes a project directory and the backend configuration used to initialize the repository.

#### R package

```r
dvs_init(storage_path, root_dir = ".", group = NULL, metadata_folder_name = NULL, no_compression = FALSE)
dvs_init(
  storage_path,
  root_dir = NULL,
  group = NULL,
  metadata_folder_name = NULL,
  compression = c("zstd", "none")
)
```

- `storage_path`: where the data will be stored (required, same as CLI's `<PATH>`)
- `root_dir`: project root where `dvs.toml` is created (defaults to working directory)
- `group`: Unix group to set on storage directory and files
- `metadata_folder_name`: custom name for the metadata folder (default `.dvs`)
- `no_compression`: disable zstd compression of stored files
- `compression`: desired compression for the stored date (default `zstd`)
Collaborator

date -> data


Returns a list with `status = "initialized"`.
Returns a list with `status = "initialized"`, invisibly.
Collaborator

what does invisibly mean?


### add

It only takes files as input, directories will not work unless combined with a glob. It can also take an optional
message that will be recorded in the metadata file.
message that will be recorded in the metadata file. Similarly, `add` must not have a recursive option; the glob mechanism is sufficient, and its use is intentional.

This method follows a best-effort approach: even if some files failed to be added, it will still try to add everything
and not stop.
@@ -121,7 +147,7 @@ Symlinks are resolved before adding. If a symlink target resolves to a path outs
Each `add` operation is atomic: the storage write and metadata update either both succeed or both roll back. A
failure writing to storage will not leave behind a partial metadata file, and vice versa.
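
One way to picture the both-or-neither guarantee is a write followed by a rollback. The sketch below uses a hypothetical `add_atomically` helper and plain `std::fs` calls; it assumes the blob did not exist in storage before the call and does not claim to mirror how `dvs` actually implements atomicity:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes the blob, then the metadata. If either write fails, whatever was
/// written so far is removed so that neither half of the operation remains.
pub fn add_atomically(
    blob_path: &Path,
    blob: &[u8],
    metadata_path: &Path,
    metadata: &str,
) -> io::Result<()> {
    if let Err(err) = fs::write(blob_path, blob) {
        // A failed write may leave a partial blob behind; remove it if so.
        let _ = fs::remove_file(blob_path);
        return Err(err);
    }
    if let Err(err) = fs::write(metadata_path, metadata) {
        // Roll back both halves; cleanup failures are ignored.
        let _ = fs::remove_file(metadata_path);
        let _ = fs::remove_file(blob_path);
        return Err(err);
    }
    Ok(())
}
```

A production version would also need to avoid deleting a blob that was already present before the call, for example when two files share the same content.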

You can also do a dry run from the CLI or the library that will return the outcome that would have happened for each file but
You can also do a dry run from the CLI, the library, or the R package that will return the outcome that would have happened for each file but
without actually doing them.

#### CLI
@@ -147,7 +173,7 @@ Options:
You can run `dvs add *.csv` and it will be expanded by your shell before calling `dvs`.
To ensure globs are consistent with the R package, you can use the `--glob` parameter which will be expanded by the library.

This will exit with `1` if one or more files could not be added to the storage (file does not exist, no permissions etc).
This will exit with `1` if one or more files could not be added to the storage (file does not exist, no permissions, etc).
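
The exit-code rule can be sketched as a function over per-file outcomes. `exit_code` is a hypothetical helper, and returning `0` when every file succeeds is an assumption based on the usual convention; the spec only states the `1` case:

```rust
/// Maps per-file outcomes to a process exit code: 1 when at least one file
/// failed, 0 otherwise (the 0 case is assumed, not stated in the spec).
pub fn exit_code<T, E>(outcomes: &[Result<T, E>]) -> i32 {
    if outcomes.iter().any(|outcome| outcome.is_err()) {
        1
    } else {
        0
    }
}
```

A binary would typically finish with `std::process::exit(exit_code(&outcomes))` after printing the per-file results.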

#### Rust library

@@ -157,17 +183,20 @@ It otherwise returns a list of results sorted alphabetically by path, letting us
#### R package

```r
dvs_add(files = character(0), message, glob = NULL, dry_run = FALSE)
dvs_add(paths = character(0), message = NULL, glob = NULL, dry_run = NULL)
```

- `files`: character vector of file paths to add (can be empty if `glob` is provided)
- `message`: optional message recorded in the metadata file. Omit or pass `NULL` to skip
- `paths`: character vector of file paths to add (can be empty if `glob` is provided)
- `message`: optional message recorded in the metadata file.
- `glob`: pattern to match files (same resolution rules as CLI `--glob`)
- `dry_run`: if `TRUE`, returns what would be added without making changes

Returns a data frame with one row per file. Errors if no files match.
Glob resolution uses the same rules as the CLI — see the Globbing section.

On partial completion, the return value is a list with two elements named `result`
and `error`, each containing a data frame: `result` describes the successes and `error` the failures.

### get

`get` retrieves files from the storage into the project.
@@ -182,7 +211,7 @@ is deleted and the operation fails for that file.
Users can use `get` with specific paths or globs. In practice, those will be run against the metadata folder rather
than the actual project to know what to pull, but the resolution works the same way as for `add`.

You can also do a dry run from the CLI or the library that will return the outcome that would have happened for each file but
You can also do a dry run from the CLI, R package or the library that will return the outcome that would have happened for each file but
without actually doing them.

#### CLI
@@ -207,13 +236,14 @@ Options:
This will exit with `1` if one or more files could not be retrieved.

#### Rust library

The library automatically sets up parallelism and the only error it can return is if it couldn't set up the threadpool.
It otherwise returns a list of results sorted alphabetically by path, letting users decide what to do with each.

#### R package

```r
dvs_get(files = character(0), glob = NULL, dry_run = FALSE)
dvs_get(paths = character(0), glob = NULL, dry_run = NULL)
```

- `files`: character vector of file paths to retrieve (can be empty if `glob` is provided)
@@ -222,9 +252,13 @@ dvs_get(files = character(0), glob = NULL, dry_run = FALSE)

Returns a data frame with one row per file. Errors if no files match.

On partial completion, the return value is a list with two elements named `result`
and `error`, each containing a data frame: `result` describes the successes and `error` the failures.

### status

This returns the status (mentioned in the high level overview above) of the tracked files in the project.
This returns the status (mentioned in the high level overview above)
of the tracked files in the project.

#### CLI

@@ -260,10 +294,16 @@ as per-file errors in the result list.
#### R package

```r
dvs_status(current = FALSE, absent = FALSE, unsynced = FALSE)
dvs_status(
  paths = character(0),
  recursive = NULL,
  status = c("current", "absent", "unsynced")
)
```

- `current`, `absent`, `unsynced`: filter flags. When all are `FALSE` (default), all tracked files are returned. When one or more are `TRUE`, only files matching those states are returned. Errors are always included.
- `paths`: optional character vector of file paths whose status should be retrieved
- `status`: filter on the states `current`, `absent`, `unsynced`.
  When omitted, all three states are present in the returned data frame; otherwise one or several states can be provided as a filter

Returns a data frame with one row per tracked file.
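
The filtering rule could look like this on the library side. The `Status` enum and `filter_by_status` are hypothetical names, and treating an empty filter as "omitted" is an assumption made for this sketch:

```rust
/// The three states a tracked file can be in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Current,
    Absent,
    Unsynced,
}

/// Keeps only the files whose status is in `wanted`.
/// An empty `wanted` slice stands for "no filter" and returns every file.
pub fn filter_by_status<'a>(
    files: &'a [(String, Status)],
    wanted: &[Status],
) -> Vec<&'a (String, Status)> {
    files
        .iter()
        .filter(|entry| wanted.is_empty() || wanted.contains(&entry.1))
        .collect()
}
```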

@@ -317,7 +357,6 @@ prevents partial blobs from appearing in storage mapping to an expected file.

Stored files are set read-only after writing.


### Audit trail

Every `add` operation is logged to an append-only audit file (`audit.log.jsonl`) in the storage directory. Each
@@ -326,12 +365,37 @@ entry records:
- `operation_id`: a UUID grouping all files from one `add` invocation
- `timestamp`: unix seconds
- `user`: the system username of whoever ran the command
- `file`: path and hashes of the added file
- `action`: currently only `add`

`add` is an object that contains

- `file`: path and hashes of the added file
- `compression`: `"zstd"` or `"none"`, see add section above
Collaborator

not clear which section we are talking about. We should have a compression section with the available choices


The audit log is protected by a mutex so a single `dvs` process cannot corrupt it but there is no protection against
multiple processes appending logs.

Example:

```json
{
  "operation_id": "93ca9155-b767-47e8-b433-a110f5943212",
  "timestamp": 1776959650,
  "user": "elea",
  "action": {
    "add": {
      "file": {
        "path": "data/derived/many34.csv",
        "hashes": {
          "blake3": "55539acdd1e0617c9180f8a593aeda480b7fab415faef16596f25098908b8487"
        }
      },
      "compression": "zstd"
    }
  }
}
```
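
Since the file is JSON Lines, each entry occupies a single line on disk. The sketch below shows one way to produce and append such a line using only the standard library. `AuditEntry`, `escape_json`, and `append_line` are hypothetical names, the serialization is hand-rolled for illustration (a real implementation would more likely use a serialization library), and it adds no protection against concurrent processes:

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical in-memory form of one audit entry for an `add` action.
pub struct AuditEntry<'a> {
    pub operation_id: &'a str,
    pub timestamp: u64,
    pub user: &'a str,
    pub path: &'a str,
    pub blake3: &'a str,
    pub compression: &'a str,
}

/// Escapes quotes, backslashes and control characters for a JSON string.
pub fn escape_json(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl AuditEntry<'_> {
    /// Serializes the entry as compact JSON on a single line.
    pub fn to_json_line(&self) -> String {
        format!(
            r#"{{"operation_id":"{}","timestamp":{},"user":"{}","action":{{"add":{{"file":{{"path":"{}","hashes":{{"blake3":"{}"}}}},"compression":"{}"}}}}}}"#,
            escape_json(self.operation_id),
            self.timestamp,
            escape_json(self.user),
            escape_json(self.path),
            escape_json(self.blake3),
            escape_json(self.compression),
        )
    }
}

/// Appends one line to the log, creating the file if needed.
pub fn append_line(log: &Path, line: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(log)?;
    writeln!(file, "{}", line)
}
```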

### Hash cache

`dvs` maintains a SQLite cache at `{metadata_folder}/.cache/dvs.db` to avoid re-hashing files that haven't
@@ -343,10 +407,16 @@ and operations proceed without it if that also fails.
### Parallelism

`add`, `get`, and `status` run file operations in parallel. You can set the `DVS_NUM_THREADS` environment
variable to control the thread count. If it is unset, the default is `min(available_parallelism * 4, 16)`.
variable to control the thread count, if the CLI and R package have not set the number
of threads directly. If it is unset, the default is `min(available_parallelism * 4, 16)`.
If it is set to a positive integer, the override is capped at 32. In both cases the final thread count is
clamped to the number of files being processed.
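
The arithmetic above can be sketched as a pure function. `resolve_thread_count` is a hypothetical name. The sketch covers only the environment variable and the default, not a thread count set directly by the CLI or the R package. Falling back to the default for a value that is not a positive integer, and never returning fewer than one thread, are assumptions:

```rust
/// Computes the thread count from the raw `DVS_NUM_THREADS` value (if any),
/// the available parallelism, and the number of files to process.
pub fn resolve_thread_count(
    env_value: Option<&str>,
    available_parallelism: usize,
    num_files: usize,
) -> usize {
    const DEFAULT_CAP: usize = 16;
    const OVERRIDE_CAP: usize = 32;

    // Only a positive integer counts as an override (assumption: anything
    // else falls back to the default).
    let requested = env_value
        .and_then(|raw| raw.trim().parse::<usize>().ok())
        .filter(|&n| n > 0);

    let base = match requested {
        Some(n) => n.min(OVERRIDE_CAP),
        None => available_parallelism.saturating_mul(4).min(DEFAULT_CAP),
    };

    // Clamp to the number of files, keeping at least one thread (assumption).
    base.min(num_files).max(1)
}
```

In a real program the inputs could come from `std::env::var("DVS_NUM_THREADS")` and `std::thread::available_parallelism()`.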

#### R package

The environment variable `DVS_NUM_THREADS` takes precedence over the threadpool
size set by the R package.
Collaborator

i thought we said the opposite?


### Gitignore

After a successful `add`, each added file is appended to a `.gitignore` in the file's parent directory