Summary
Implement Time-To-Live (TTL) support for state stores to automatically expire and clean up stale state entries, reducing memory usage and maintaining data freshness.
Problem Statement
Currently, Cortex.States stores data indefinitely:
- Memory growth: State stores accumulate data forever, leading to OOM errors
- Stale data: Old aggregations remain even when no longer relevant
- Manual cleanup required: Developers must implement their own expiration logic
- Window state accumulation: Closed windows remain in state stores
- Session state never expires: Inactive sessions stay in memory
Current Behavior
// State grows unbounded
var stateStore = new InMemoryStateStore<string, decimal>("customer-totals");

// Over time, state accumulates for ALL customers ever seen
stream
    .Aggregate(
        keySelector: o => o.CustomerId,
        aggregateFunction: (total, order) => total + order.Amount,
        stateStore: stateStore) // Never cleaned up!
    .Build();

// After 1 year: millions of customer entries, most inactive
// Memory usage: constantly growing
// No automatic cleanup mechanism
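To make the "manual cleanup" burden concrete, here is a minimal sketch of the kind of expiration logic a developer has to hand-write today. `ExpiringStore` is a hypothetical helper, not part of Cortex.States. For brevity it uses wall-clock time through an injectable clock and targets C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Cortex.States: a dictionary whose entries
// expire a fixed time after their last write.
public sealed class ExpiringStore<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, (TValue Value, DateTimeOffset ExpiresAt)> _entries = new();
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _now; // injectable clock, handy for tests

    public ExpiringStore(TimeSpan ttl) : this(ttl, () => DateTimeOffset.UtcNow) { }

    public ExpiringStore(TimeSpan ttl, Func<DateTimeOffset> now)
    {
        _ttl = ttl;
        _now = now;
    }

    public int Count => _entries.Count;

    // Writing a key resets its expiration time.
    public void Put(TKey key, TValue value) => _entries[key] = (value, _now() + _ttl);

    // Reading an expired key removes it and reports a miss.
    public bool TryGet(TKey key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            if (_now() < entry.ExpiresAt)
            {
                value = entry.Value;
                return true;
            }
            _entries.Remove(key);
        }
        value = default!;
        return false;
    }
}
```

Because expiration here only happens on read, keys that are never read again still occupy memory. That gap is why built-in TTL with its own cleanup pass is being requested rather than left to each application.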
Impact
Without TTL:
- Production systems eventually run out of memory
- State stores contain irrelevant historical data
- Cannot implement "last N minutes" type aggregations efficiently
- Window operators need manual state cleanup
- Compliance risk: honoring GDPR right-to-be-forgotten requests is harder when state never expires
Technical Considerations
- Clock Skew: Use monotonic clocks where possible; document behavior with system time changes.
- Cleanup Performance: Background cleanup should not block reads/writes.
- Checkpoint Integration: TTL metadata should be included in checkpoints.
- Distributed TTL: In distributed mode, TTL should be consistent across workers.
- Large Value Cleanup: For large values, consider async cleanup to avoid blocking.
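Several of these considerations can be illustrated together. The sketch below is one possible shape and not the Cortex.States API. `TtlSweeper`, `RemoveExpired`, and `Snapshot` are hypothetical names. It assumes .NET 5 or later, which provides the `ConcurrentDictionary.TryRemove(KeyValuePair)` overload. Deadlines use `Stopwatch.GetTimestamp()` as the monotonic clock.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical sketch, not the Cortex.States API.
public sealed class TtlSweeper<TKey, TValue> where TKey : notnull
{
    // Deadline is a monotonic timestamp in Stopwatch ticks, so wall-clock
    // adjustments (NTP, manual changes) cannot expire or revive entries.
    private readonly ConcurrentDictionary<TKey, (TValue Value, long Deadline)> _entries = new();
    private readonly long _ttlTicks;
    private readonly Func<long> _clock;

    public TtlSweeper(TimeSpan ttl) : this(ttl, Stopwatch.GetTimestamp) { }

    public TtlSweeper(TimeSpan ttl, Func<long> clock)
    {
        _ttlTicks = (long)(ttl.TotalSeconds * Stopwatch.Frequency);
        _clock = clock;
    }

    // Count is for inspection only; it is not needed on the hot path.
    public int Count => _entries.Count;

    public void Put(TKey key, TValue value) => _entries[key] = (value, _clock() + _ttlTicks);

    public bool TryGet(TKey key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry) && _clock() < entry.Deadline)
        {
            value = entry.Value;
            return true;
        }
        value = default!;
        return false;
    }

    // Intended to run on a timer or background task. Enumerating a
    // ConcurrentDictionary is safe alongside concurrent reads and writes.
    public int RemoveExpired()
    {
        long now = _clock();
        int removed = 0;
        foreach (var pair in _entries)
        {
            // The pair overload removes only if the value is unchanged, so an
            // entry refreshed by a concurrent Put is left alone.
            if (pair.Value.Deadline <= now && _entries.TryRemove(pair))
                removed++;
        }
        return removed;
    }

    // Checkpoints store the remaining lifetime, because monotonic timestamps
    // are meaningless after a restart or on another machine.
    public List<(TKey Key, TValue Value, TimeSpan Remaining)> Snapshot()
    {
        long now = _clock();
        var result = new List<(TKey Key, TValue Value, TimeSpan Remaining)>();
        foreach (var pair in _entries)
        {
            long left = pair.Value.Deadline - now;
            if (left > 0)
                result.Add((pair.Key, pair.Value.Value, TimeSpan.FromSeconds((double)left / Stopwatch.Frequency)));
        }
        return result;
    }
}
```

In this sketch, `RemoveExpired` would be called periodically from a background task. Persisting the remaining lifetime instead of an absolute deadline is one way to let a restored or remote worker re-anchor each entry against its own clock. Whether that gives enough consistency for distributed mode remains an open design question.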