
Conversation

@philipmw commented Nov 9, 2025

Add support for batch warmup

This implements the idea in rustic-rs/rustic#1430

To use this feature, I wrote a proof-of-concept warmup-s3-archives program:
https://gitlab.com/philipmw/warmup-s3-archives

Changes:

  • add --warm-up-batch <N> parameter
  • add --warm-up-pack-id-input <mode> parameter
  • add --warm-up-input-type <type> parameter
  • add a function to the backend interface that provides the S3 key (or other
    key usable by the warmup command) instead of the pack ID.

Tested:

  • unit tests;
  • restoring data from Glacier Deep Archive spanning 3 packs, specifying a
    batch size of >=3 and argv mode
  • killing and restarting restore; this works as long as the warmup program
    is idempotent (which works for S3)
  • having the warmup command exit with an error code; rustic aborts the restore
    and prints a correct error message
  • running multiple restore operations for different packs in parallel. The
    warmup program ignores notifications for packs that it does not recognize,
    leaving them in the queue and letting another warmup instance process them.

Known limitations and my thoughts for improvement opportunities:

  • setup is non-trivial, between the AWS infrastructure and the warmup program configuration.
    It requires some AWS experience and the motivation to use cold storage. But IMO
    all the complexity is specific to the domain; none is incidental.
  • rustic does not pass the backend credentials to the warmup program. The warmup
    program is responsible for finding credentials on its own.
    Probably the best solution, at least for AWS, is for both rustic and the
    warmup program to use a common AWS credential provider.
  • rustic's progress bar does not reflect warmup progress within a batch; only
    progress of entire batches. There is no protocol for communicating progress
    from a single invocation of the warmup command.
  • rustic's warmup parameters are growing in complexity and could use a refactor
    as we discover and clarify cold storage backup scenarios.
    The distinction between --warm-up-command and --warm-up-wait-command
    seems too subtle. --warm-up-wait is too inflexible (since cold storage
    backends' estimates are measured in hours) and can be avoided entirely.

philipmw force-pushed the batch-warm-up branch 2 times, most recently from 4115d63 to 086e581 on November 15, 2025
@philipmw (Author)

@aawsome , I have been looking forward to your feedback on this. This addresses an issue that I've been thinking about for over a year, and I hope that it helps gain new customers for rustic.

@aawsome (Member) commented Dec 21, 2025

Hi @philipmw!

First, thanks a lot for your proposal and deep apologies for letting you wait so long. I wanted to have a look very soon, but then this PR somehow went below my radar - sorry for that!

I took a look at the code changes (though more of a general one) and the implementation looks fine from my point of view. There are, however, some general points:

  • First I must say that the design to warm up only pack files was wrong (my fault) - as there are use cases (when repairing a hot/cold repo) where we may want to also warm up other files. I think we should extend the warmup. This can go in a future PR; however, I think we must keep an eye on ensuring this can be easily extended. For backend_path this should be the case, but the name pack_id should be id and then we'll need a way to transfer the type to the command....
  • As a batch is now added for "anchor" mode as well, I think this should not be implemented by waiting for each command to finish, but instead by spawning all commands of the batch and then waiting for all of them - making this also a parallel warm-up.
  • For "argv" mode, I find it irritating that the args given for the command are completely ignored. What about keeping them and just appending the args of what to warm up there? (Actually, that could also be the solution for how to give the type of the file: in the existing args there could be a %type or something like this to be replaced, as in "id" mode; just some thoughts I'm having while writing this...)
  • I must say I was personally a bit confused about the name "anchor" and would maybe call it "variable" or something like this. Do you have another suggestion? Also "argv" is quite technical, maybe "args" is a better name?
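The spawn-then-wait idea from the batch point above could be sketched as follows. This is a minimal illustration, not rustic_core code: `warm_up_batch` and the `program` parameter standing in for the configured warm-up command are hypothetical names.

```rust
use std::process::{Child, Command};

/// Sketch of a parallel batch warm-up: spawn one command per id, then wait
/// for all of them, instead of running each command to completion in turn.
/// `program` stands in for the configured warm-up command.
fn warm_up_batch(program: &str, ids: &[&str]) -> std::io::Result<()> {
    // Spawn all commands of the batch first...
    let children: Vec<Child> = ids
        .iter()
        .map(|id| Command::new(program).arg(id).spawn())
        .collect::<std::io::Result<Vec<Child>>>()?;
    // ...then wait for each one, failing if any command failed.
    for mut child in children {
        if !child.wait()?.success() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "warm-up command failed",
            ));
        }
    }
    Ok(())
}
```

Because every child is spawned before the first `wait`, the batch warms up concurrently, while a non-zero exit status from any command still aborts the whole batch.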

What do you think about these points?

@aawsome (Member) commented Dec 21, 2025

Also note that there are rustfmt and clippy checks failing - we are quite strict about formatting and clippy-compliance, so these findings must also be fixed (but after discussing the general points..)

@aawsome (Member) commented Jan 20, 2026

Hi @philipmw!
Are you still interested in working on this? If not, I would try to adapt this in order to get it into rustic!
When thinking about it, I wondered if it would be easier to work with if we did not use argv vs. anchor mode and packid vs. path type, but just different variables to substitute for all cases. Something like id (anchor+packid), ids (argv+packid), path (anchor+path) and paths (argv+path) (+ additionally type, solving the mentioned problem about not only warming up packfiles). What do you think about it?
Sorry again for the long delay after your PR!

@philipmw (Author) commented Jan 20, 2026

Hi @aawsome, now it's my turn to apologize for taking so long to reply to your feedback. Now that the holidays are over, I will try to respond and act faster on this, as I am still motivated to get it built. I was sitting at my computer analyzing your feedback when you wrote the most recent update. Here are my responses to each point:

First I must say that the design to only warm-up pack files was wrong (from my side) - as there are use cases (when repairing a hot/cold repo) where we may want to also warm-up other files. I think we should extend the warmup. This can go in a future PR; however, I think we must keep an eye to ensure this can be easily extended. For backend_path this should be the case, but the name pack_id should be id and then we'll need a way to transfer the type to the command....

The suggested task is to rename pack_id to id. I have no concerns with this and am happy to implement it.

As a batch is now added for "anchor" mode as well, I think this should not be implemented by waiting for each command to finish, but instead spawn all commands of the batch and then wait for all of them - making this also a parallel warm-up.

Change sequential warmup to parallel. I think this is a good idea, although I am not certain that it would be backward compatible for all current users. Do you have any concerns here?

For "argv" mode, I find it irritating that the args given for the command are completely ignored. What about keeping them and just append the args of what to warm-up there? (actually that could be also the solution to how to give the type of the file: In the existing args there could be a %type or something like this to be replaced like in "id" mode; just some thoughts I'm having while writing this...)

Can you clarify this feedback? What "args given for the command" are you referring to?

I must say I was personally a bit confused about the name "anchor" and would maybe call it "variable" or something like this. Do you have another suggestion? Also "argv" is quite technical, maybe "args" is a better name?

I agree, "anchor" is not intuitive. With this name, I was alluding to the HTML anchor element.

Another parallel is Firefox's dynamic bookmarks feature: "Wherever the string %s appears in the bookmark's URL, it will be replaced with any words typed in the address bar after the bookmark's keyword and a space, properly URL-encoded, so they can be used as query string parameters to a search engine, for example."

Perhaps another good name for this could be "substitute", "replace", or "dynamic". "Variable" is also fine. Given this list, which name do you like most? (Naming is hard.)

I agree that "argv" could be better renamed to "args". Happy to rename it.

When thinking about it, I wondered if it was easier to work if we not use argv vs anchor mode and packid vs path type, but just different variables to substitute for all cases. Something like id (anchor+packid), ids (argv+packid), path (anchor+path) and paths (argv+packid) (+ additionally type solving the mentioned problem about not only warming-up packfiles). What do you think about it?

You are suggesting that, instead of --warm-up-pack-id-input and --warm-up-input-type parameters, each with two possible values, we create a more generic single parameter with four possible values, such as --warmup-mode. I don't have a strong opinion here; both ways make sense to me. Happy to implement it either way.

Also note that there are rustfmt and clippy checks failing - we are quite strict about formatting and clippy-compliance, so these findings must also be fixed (but after discussing the general points..)

Thanks. I didn't realize because the CI step didn't run automatically. Now that it ran for this PR, I will fix the issues.

@aawsome (Member) commented Jan 20, 2026

Suggested task is to rename pack_id to id. I have no concerns with this and happy to implement.

I agree that id is better suited.

Change sequential warmup to parallel. I think this is a good idea, although I am not certain that it would be backward compatible for all current users. Do you have any concerns here?

If warm-up-batch is not set, the behavior won't change. So it's just a new feature.

Can you clarify this feedback? What "args given for the command" are you referring to?

Sorry, for me it looked like you were omitting the given args, but https://github.com/rustic-rs/rustic_core/pull/438/files#diff-de508849190b7987f41c9d008e5e4bd90aad3c464a04bf5a73f661cf952b4a62R188 does include them. My fault.

You are suggesting that, instead of --warm-up-pack-id-input and --warm-up-input-type parameters, each with two possible values, we create a more generic single parameter with four possible values, such as --warmup-mode. I don't have a strong opinion here; both ways make sense to me. Happy to implement it either way.

Actually, I suggest not using --warm-up-pack-id-input and --warm-up-input-type at all, but only --warm-up-batch, and deciding based on the variables given in the command. Some examples:

  • warm-up-command = "echo %id" would spawn [BATCH_SIZE] echo commands, each with a single id
  • warm-up-command = "echo %ids" would run a single echo command with [BATCH_SIZE] ids given as args
  • warm-up-command = "echo %path" would spawn [BATCH_SIZE] echo commands, each with a single path
  • warm-up-command = "echo %paths" would run a single echo command with [BATCH_SIZE] paths given as args

There is some validation needed so that

  • warm-up-command = "echo %id %path" would spawn [BATCH_SIZE] echo commands, each with a single id and path
  • warm-up-command = "echo %ids %paths" would run a single echo command with [BATCH_SIZE] ids and [BATCH_SIZE] paths given as args (no idea if this makes sense...), but
  • warm-up-command = "echo %ids %path" would error out - not clear what to do here!
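The validation described in this list could be sketched like this. It is illustrative only (not the PR's actual code), implements just the mixing rule from this comment, and simplifies by also rejecting commands with no variable at all:

```rust
/// Sketch of the proposed validation: plural variables (%ids, %paths) expand
/// the whole batch into a single command, singular ones (%id, %path) spawn
/// one command per item, and mixing the two kinds is rejected.
#[derive(Debug, PartialEq)]
enum BatchMode {
    PerItem,    // e.g. "echo %id %path"
    WholeBatch, // e.g. "echo %ids %paths"
}

fn classify(command: &str) -> Result<BatchMode, String> {
    let plural = command.contains("%ids") || command.contains("%paths");
    // Strip plural variables first so "%ids" is not also counted as "%id".
    let stripped = command.replace("%ids", "").replace("%paths", "");
    let singular = stripped.contains("%id") || stripped.contains("%path");
    match (singular, plural) {
        (true, true) => Err(format!(
            "cannot mix singular and plural variables in {command:?}"
        )),
        (true, false) => Ok(BatchMode::PerItem),
        (false, true) => Ok(BatchMode::WholeBatch),
        (false, false) => Err(format!("no warm-up variable found in {command:?}")),
    }
}
```

For example, `classify("echo %id %path")` yields per-item mode, while `classify("echo %ids %path")` errors out, matching the "not clear what to do here" case above.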

@philipmw (Author)

  • warm-up-command = "echo %id %path" would spawn [BATCH_SIZE] echo commands, each with a single id and path
  • warm-up-command = "echo %ids %paths" would run a single echo command with [BATCH_SIZE] ids and [BATCH_SIZE] paths given as args (no idea if this makes sense...)

The variable "%id" has made sense because it works so much like shell substitution, which CLI users are already familiar with. Pack IDs don't have spaces, so it works well. But once we're adding "%ids" and the "%path" / "%paths", it no longer resembles shell substitution. Now one percent-string is a variable, while another is actually just a directive, not a variable.

Is your motivation to keep the number of command-line parameters small / keep commands short?

Perhaps as an alternative, we could do this: if %id is provided, then we infer "variable" mode. If %id is not provided, then we infer "args" mode.

To eliminate the type parameter, we could consider breaking backward compatibility and making "path" the default. What was the original motivation to make it just the pack ID? Do you think anyone would mind if we change the default? If we do, we could eliminate that parameter.

@aawsome (Member) commented Jan 23, 2026

The variable "%id" has made sense because it works so much like shell substitution, which CLI users are already familiar with. Pack IDs don't have spaces, so it works well. But once we're adding "%ids" and the "%path" / "%paths", it no longer resembles shell substitution. Now one percent-string is a variable, while another is actually just a directive, not a variable.

From a user's perspective, I think that having "%ids" replaced by space-separated ids "1234.. 3423... 23423.." is what they'd expect from variable substitution.
Yes, we transform this into multiple argv entries for the command called, but this is also what a shell would do if the ids were not surrounded by quotes.
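That expansion could look like this. `expand_args` is a hypothetical helper name, shown only to illustrate the shell-like word splitting of an unquoted variable:

```rust
/// Hypothetical helper showing how "%ids" could expand into one argv entry
/// per id, mimicking shell word splitting of an unquoted variable.
fn expand_args(template: &[&str], ids: &[&str]) -> Vec<String> {
    let mut argv = Vec::new();
    for arg in template {
        if *arg == "%ids" {
            // One argv entry per id, like an unquoted $ids in a shell.
            argv.extend(ids.iter().map(|id| id.to_string()));
        } else {
            argv.push(arg.to_string());
        }
    }
    argv
}
```

So a template like `["my_script.sh", "%ids"]` with two pack ids becomes a three-element argv, just as the shell would split the unquoted substitution.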

Is your motivation to keep the number of command-line parameters small / keep commands short?

This, but also to save users from needing to specify a lot of things via CLI parameters (we already have tons of them.. ;-) ). In the mount/webdav command we also work with variable substitution, and there some combinations would likewise not make sense, but IMO it is more powerful and explicit to be able to express everything in a single argument like
warm-up-command = "my_script.sh %type %ids".

Perhaps as an alternative, we could do this: if %id is provided, then we infer "variable" mode. If %id is not provided, then we infer "args" mode.

I must say I'd prefer using %ids, as it is explicit rather than implicit.

To eliminate the type parameter, we could consider breaking backward compatibility and making "path" the default. What was the original motivation to make it just the pack ID? Do you think anyone would mind if we change the default? If we do, we could eliminate that parameter.

We have the problem of the directory tree used to store packs: on a local dir the path looks like data/1b/1b4234..... However, when accessing this via URL it is often data/1b4234.... So, if we decide to provide only path or only type/id, I'd vote for type/id. But IMO having the choice is even better.
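The two layouts mentioned here can be illustrated side by side. Function names are made up for this sketch, not rustic_core API:

```rust
/// Local backends shard data files into a two-hex-digit subdirectory...
fn local_pack_path(id: &str) -> String {
    // Assumes `id` is a hex string of at least two characters.
    format!("data/{}/{}", &id[..2], id)
}

/// ...while URL/object-store access often uses a flat key.
fn flat_pack_key(id: &str) -> String {
    format!("data/{id}")
}
```

This is why a single "path" default is ambiguous: the same pack id maps to `data/1b/1b4234...` on a local backend but to `data/1b4234...` as an S3-style key.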

@philipmw (Author)

On rereading your proposal, I realize I misunderstood what you were suggesting. Now I understand and have no concerns beyond what you already outlined with edge cases like %ids %path.

Let me take a crack at implementing it over this weekend.

philipmw added a commit to philipmw/rustic_core that referenced this pull request Jan 26, 2026
This implements the idea in rustic-rs/rustic#1430
and the subsequent feedback in rustic-rs#438

To use this feature, I wrote a proof-of-concept *warmup-s3-archives* program:
https://gitlab.com/philipmw/warmup-s3-archives

Changes:

* add `--warm-up-batch <N>` parameter
* add variables to `--warm-up-command` parameter to support singular and plural IDs and paths
* add a function to the backend interface that provides the S3 key (or other
    key usable by the warmup command) instead of pack ID.

Tested:

* unit tests;
* restoring data from Glacier Deep Archive spanning 3 packs, specifying a
    batch size of >=3 and argv mode
* killing and restarting restore; this works as long as the warmup program
    is idempotent (which works for S3)
* having the warmup command exit with an error code; rustic aborts the restore
    and prints a correct error message
* running multiple restore operations for different packs in parallel. The
    warmup program ignores notifications for packs that it does not recognize,
    leaving them in the queue and letting another warmup instance process them.

Known limitations and my thoughts for improvement opportunities:

* setup is non-trivial, between the AWS infrastructure and the warmup program configuration.
    It requires some AWS experience and cold storage motivation. But IMO all the
    complexity is specific to the domain; none is incidental.
* rustic does not pass the backend credentials to the warmup program. The warmup
    program is responsible for finding credentials on its own.
    Probably the best solution, at least for AWS, is for both rustic and the
    warmup program to use a common AWS credential provider.
* rustic's progress bar does not reflect warmup progress within a batch; only
    progress of entire batches. There is no protocol for communicating progress
    from a single invocation of the warmup command.
* rustic's warmup parameters could use a refactor as we discover and clarify
    cold storage backup scenarios.
    The distinction between `--warm-up-command` and `--warm-up-wait-command`
    seems too subtle. `--warm-up-wait` is too inflexible (since cold storage
    backends' estimates are measured in hours) and can be avoided entirely.
philipmw added a commit to philipmw/rustic_core that referenced this pull request Jan 26, 2026
philipmw added a commit to philipmw/rustic_core that referenced this pull request Jan 26, 2026
@philipmw (Author)

@aawsome , it is ready for your review. I implemented all the suggestions.

@aawsome (Member) left a comment

Looks great already! I found some places where I think we can simplify the code without changing functionality, so don't be shocked by the number of comments ;-)

This implements the idea in rustic-rs/rustic#1430
and the subsequent feedback in rustic-rs#438

To use this feature, I wrote a proof-of-concept *warmup-s3-archives* program:
https://gitlab.com/philipmw/warmup-s3-archives

Changes:

* add `--warm-up-batch <N>` parameter
* add variables to `--warm-up-command` parameter to support singular and plural IDs and paths
* add a function to the backend interface that provides the S3 key (or other
    key usable by the warmup command) instead of pack ID.

Tested:

* unit tests;
* invoking the warmup program with "%id", "%ids", "%pack", "%packs", and batch
    size of 2 for a total restore size of 3 packs, verifying that the warmup
    command is invoked either separately per ID/pack or with two IDs/packs for
    the first invocation and with one ID/pack for the second invocation.
* killing and restarting restore; this works as long as the warmup program
    is idempotent (which works for S3)
* having the warmup command exit with an error code; rustic aborts the restore
    and prints a correct error message
* running multiple restore operations for different packs in parallel. The
    warmup program ignores notifications for packs that it does not recognize,
    leaving them in the queue and letting another warmup instance process them.

Known limitations and my thoughts for improvement opportunities:

* setup is non-trivial, between the AWS infrastructure and the warmup program configuration.
    It requires some AWS experience and cold storage motivation. But IMO all the
    complexity is specific to the domain; none is incidental.
* rustic does not pass the backend credentials to the warmup program. The warmup
    program is responsible for finding credentials on its own.
    Probably the best solution, at least for AWS, is for both rustic and the
    warmup program to use a common AWS credential provider.
* rustic's progress bar does not reflect warmup progress within a batch; only
    progress of entire batches. There is no protocol for communicating progress
    from a single invocation of the warmup command.
* rustic's warmup parameters could use a refactor as we discover and clarify
    cold storage backup scenarios.
    The distinction between `--warm-up-command` and `--warm-up-wait-command`
    seems too subtle. `--warm-up-wait` is too inflexible (since cold storage
    backends' estimates are measured in hours) and can be avoided entirely.
@philipmw (Author) commented Feb 2, 2026

@aawsome , I implemented all the changes.

I also noticed that the CI tests are failing for macOS. For one of them I made a fix, but for the other I am not certain. It may already be fixed with the latest changes, so let's see how the latest CI build runs.

It seems that the CI won't run automatically until you approve it; is there any way to make it run automatically so that I get faster feedback from it?
