Proposal: Binary Store Building Block #88

Open
WhitWaldo wants to merge 9 commits into dapr:main from WhitWaldo:filestore

@WhitWaldo
Contributor

While writing applications that use Dapr, I increasingly run into the need to persist data that's too large to reasonably store using Dapr: often because it would exhaust the memory resources of the sidecar, and frequently because it's simply too large to store in a key/value store.

It doesn't make a ton of sense to rely exclusively on bindings for this, since a binding really just provides a Dapr-hosted alternative to the provider's SDK for something that we should increasingly have broad provider support for. "Object store" and "blob store" are heavily overloaded terms that mean all manner of things depending on the provider. I think there's a fine opportunity to tackle those in the future, but this proposal isn't that.

Here, I propose an API devoid of List and even Metadata operations so that it can accommodate the broadest possible range of storage providers. Instead, I suggest that we increasingly lean on the SDKs to provide the state management rather than putting all that weight on the runtime and the components. It's a slim implementation that should be pretty easy to add, and it would provide immediate benefits for popular Dapr features: Workflows and the new Agentic operations come to mind, but it would be beneficial for Actor and Cryptographic operations as well.
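
To make the proposed surface concrete, here is a minimal Python sketch of a store limited to writing, reading and deleting. The names `BinaryStore`, `InMemoryBinaryStore`, `put`, `get` and `delete` are illustrative assumptions rather than part of the proposal or of any Dapr SDK, and the in-memory backing exists only so the example runs.

```python
from typing import Dict, Protocol


class BinaryStore(Protocol):
    """Hypothetical three-operation surface: no list, no metadata."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...

    def delete(self, key: str) -> None: ...


class InMemoryBinaryStore:
    """Toy backing store used only to exercise the interface."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Overwrites any payload already stored under the same key.
        self._blobs[key] = bytes(data)

    def get(self, key: str) -> bytes:
        # Raises KeyError when the key is absent.
        return self._blobs[key]

    def delete(self, key: str) -> None:
        # Deleting a missing key is treated as a no-op in this sketch.
        self._blobs.pop(key, None)
```

Because nothing here enumerates keys or inspects metadata, any provider that can write, read and remove a payload by key could in principle sit behind such an interface.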

I look forward to your feedback!

Signed-off-by: Whit Waldo <whit.waldo@innovian.net>
@WhitWaldo WhitWaldo self-assigned this Aug 23, 2025
@WhitWaldo WhitWaldo added the enhancement New feature or request label Aug 23, 2025
…a few details, removed an extraneous bullet and generally cleaned it up some
@WhitWaldo WhitWaldo changed the title Proposal: File Store Building Block Proposal: Binary Store Building Block Aug 24, 2025
@olitomlinson

olitomlinson commented Sep 1, 2025

I'm massively in support, but how does this differ from the Object Store proposal? (Other than no support for metadata, anything else?)

@WhitWaldo
Contributor Author

> I'm massively in support, but how does this differ from the Object Store proposal? (Other than no support for metadata, anything else?)

There are a few differences:

  1. This proposal does not anticipate ever supporting a list operation, so that it can be more readily and broadly supported by providers without that capability. Most dedicated object and blob stores that come to mind do offer such a feature. Leaving it as a possible differentiator for a future object/blob store API, which would be limited to a smaller set of matching providers, is a fine trade-off for doing without it altogether here. This is meant to be little more than a way to store large files that the current state store cannot handle, without all the other current state management add-ons.
  2. This does not purport to offer behaviors that are more specific to object and blob stores, such as performing operations on data through signed URLs. Again, that might be a fine feature for a future store that's more narrowly tailored to that sort of operation. This isn't that.
  3. As you indicated, object and blob stores often persist and maintain a lot of metadata. In my experience, blob stores mostly just store it, but object stores will often act on it (e.g. checksum validation). No need to deal with any of that here, including several of the points brought up in your linked discussion (e.g. Content-Length, Content-Hash, ETag, and other metadata being used for other extraneous purposes).
  4. We talked about my goal of keeping the SDKs from having to deal with serialization. An object or blob store often handles unstructured data in some format or another, and I think we should absolutely create more specialized data stores that support operations suited to one type or another (that could certainly be useful from an agentic tooling and pluggable component perspective). Here, though, in the name of simplification and starting with a low threshold, I would like to make the developer responsible for ensuring that their data can be serialized and encoded, and have the API exclusively persist, retrieve and delete that data with no room for any other possibilities.
  5. Object and blob stores often support hierarchical or operational permission structures such as append-only writes, write-only permissions (e.g. no deletion via API), etc. That's also intentionally excluded from consideration here.

Put more simply - those other stores anticipate the developer wanting to do both simple and far more advanced operations with their data. I'd certainly like to build more specialized data stores to accommodate such requirements, but this proposal seeks to do away with any complexities and do one thing really well: manage the reading, writing and deletion of large files in a resource-limiting and highly performant manner which is not possible in today's Dapr state management.
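
As one illustration of the "resource-limiting" goal, a large payload can be moved in fixed-size chunks so that only one chunk is held in memory at a time. The sketch below is an assumption about how that could look, not how the runtime would do it: the function name `copy_in_chunks` and the 64 KiB default are invented for the example.

```python
import io
from typing import BinaryIO

DEFAULT_CHUNK_SIZE = 64 * 1024  # assumed value, not taken from the proposal


def copy_in_chunks(src: BinaryIO, dst: BinaryIO,
                   chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Copy src to dst while holding at most chunk_size bytes at a time.

    Returns the total number of bytes copied.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Usage: in-memory streams stand in for a file and a provider upload.
source = io.BytesIO(b"x" * 200_000)
sink = io.BytesIO()
copied = copy_in_chunks(source, sink, chunk_size=4096)
```

Peak memory then depends on the chunk size rather than on the size of the file.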

@olitomlinson

Adding this to show how we might use this in Workflows for storing large activity inputs / outputs:

image

@lindner

lindner commented Feb 16, 2026

Seems like the existing s3 and other existing bindings could be mapped. Have you tried a PoC?

@WhitWaldo
Contributor Author

> Seems like the existing s3 and other existing bindings could be mapped. Have you tried a PoC?

@lindner The first step is proposing the shape of the block (as I've done here), soliciting public feedback on the API shape, and trying to discern whether anything else seems necessary within the described purpose of the API.

Out of the box, I'd certainly like to target support for Azure Blob Storage and provide an S3-compatible component (as this would facilitate connectivity with S3 itself, but also the many providers that offer S3-compatible APIs).

Next steps are getting tentative maintainer sign-off (no point building a POC if it's not going to be accepted) and then starting development of it - as I indicated in Discord, I intend to build this out as part of the next Dapr release (1.18).

@olitomlinson

olitomlinson commented Apr 8, 2026

In the context of its usage for storing large activity inputs and outputs in Workflows, I would strongly recommend that this design allows a workflow author to programmatically choose the path/directory to the binary file.

This is to support multi-tenant use cases where each tenant's data MUST be stored in a different location.

/store/tenant-a/

/store/tenant-b/

/store/tenant-c/

Having this location set at the time of scheduling the workflow (not registering the workflow) gives a good level of flexibility.


    builder.Services.AddDaprWorkflow(options =>
    {
        // BinaryStoreName is the option proposed in this comment.
        options.RegisterWorkflow<MyWorkflow>(BinaryStoreName: "my-binary-store");
    });

    ...
    var tenantId = "tenant-a";
    var workflowId = "2c0882d7";

    // InputOutputBinaryStorePath is likewise proposed; it is set per scheduled instance.
    await workflowClient.ScheduleNewWorkflowAsync(
        name: nameof(MyWorkflow),
        instanceId: workflowId,
        input: orderInfo,
        InputOutputBinaryStorePath: $"/store/{tenantId}/wf/{workflowId}");

In the example above, assuming we're using an S3 Binary Store, the Activity input / output blobs would be stored in the following locations:

/store/tenant-a/wf/2c0882d7/activity/{activity-id}/output/
/store/tenant-a/wf/2c0882d7/activity/{activity-id}/input/

This assumes that workflows have an implicit Activity Id which uniquely identifies each activity call. We use that Activity Id in the paths above.
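
The path layout described above can be sketched as a pair of small helpers. This is only an illustration in Python: the function names are hypothetical, and the segment check is an added assumption meant to keep one tenant's identifier from reaching into another tenant's directory.

```python
def _check_segment(name: str, value: str) -> str:
    # Reject values that would change the directory depth or escape upward.
    if not value or "/" in value or value in (".", ".."):
        raise ValueError(f"invalid {name}: {value!r}")
    return value


def workflow_base_path(tenant_id: str, workflow_id: str) -> str:
    """Per-instance base path, chosen when the workflow is scheduled."""
    tenant = _check_segment("tenant_id", tenant_id)
    workflow = _check_segment("workflow_id", workflow_id)
    return f"/store/{tenant}/wf/{workflow}"


def activity_blob_path(base_path: str, activity_id: str, direction: str) -> str:
    """Directory for one activity call's input or output blobs."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    activity = _check_segment("activity_id", activity_id)
    return f"{base_path.rstrip('/')}/activity/{activity}/{direction}/"


base = workflow_base_path("tenant-a", "2c0882d7")
output_dir = activity_blob_path(base, "123", "output")
```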


Building on the above example, the Reference to the blob becomes {app-id}||{binary-store-name}||{location}||{file-id}

myApp||my-binary-store||/store/tenant-a/wf/2c0882d7/activity/123/input/xyz

The Reference is what is encoded in the Workflow History, rather than the blob contents.

The SDK can then dereference the data whenever the user demands it throughout the workflow. It may even be the case that the data is never dereferenced until the end of the Workflow, when someone requests the output of the completed workflow, which may be one (or more) large blobs!
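
A minimal sketch of encoding and decoding such a Reference, in Python. It follows the four-field template `{app-id}||{binary-store-name}||{location}||{file-id}` literally; note that the sample string above joins the location and file id with a slash instead, so the exact wire format remains an open detail. The type and function names are hypothetical.

```python
from typing import NamedTuple

SEPARATOR = "||"


class BlobReference(NamedTuple):
    app_id: str
    store_name: str
    location: str
    file_id: str


def encode_reference(ref: BlobReference) -> str:
    """Build the string that would be recorded in the Workflow History."""
    for field in ref:
        if SEPARATOR in field:
            raise ValueError(f"field must not contain {SEPARATOR!r}: {field!r}")
    return SEPARATOR.join(ref)


def decode_reference(text: str) -> BlobReference:
    """Split a recorded Reference back into its four fields."""
    parts = text.split(SEPARATOR)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    return BlobReference(*parts)


ref = BlobReference(
    app_id="myApp",
    store_name="my-binary-store",
    location="/store/tenant-a/wf/2c0882d7/activity/123/input",
    file_id="xyz",
)
encoded = encode_reference(ref)
```

Only the short encoded string would need to live in the history; the blob itself is fetched from the store when, and if, something asks for it.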

@WhitWaldo
Contributor Author

Might this instead be done more like how actors currently store state in KVs? Set a path on the component at registration time that's used as the root, and defer to the workflow to pick an appropriate path, relative to that root, under which to save the referenced data. Presumably the runtime would pick a path referencing the workflow ID and any namespace values itself, and then the user needn't figure out how to specify their own paths?
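
A rough Python sketch of that alternative, where the root comes from the component registration and the rest is derived rather than supplied by the user. The function name and the `{root}/{namespace}/wf/{workflow-id}` layout are assumptions made for illustration; nothing in this thread fixes them.

```python
def derive_workflow_path(component_root: str, namespace: str,
                         workflow_id: str) -> str:
    """Combine a registration-time root with runtime-derived segments."""
    segments = [namespace, "wf", workflow_id]
    for segment in segments:
        # Derived segments must not alter the directory structure.
        if not segment or "/" in segment or segment in (".", ".."):
            raise ValueError(f"invalid path segment: {segment!r}")
    return "/".join([component_root.rstrip("/")] + segments)


# The root would be configured once on the component; "default" is a
# placeholder namespace value.
path = derive_workflow_path("/store/", "default", "2c0882d7")
```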

3 participants