The organization of local storage object descriptors needs to be updated. Currently, local storage is simply lists of object descriptors, with one list per version mod max_versions. This doesn't scale in various scenarios that are becoming increasingly common.
This organization needs to be able to support several operations efficiently:
- Insert puts
- Find a given object descriptor (for garbage collection, etc.)
- Find objects that intersect with a given object descriptor in order to satisfy gets
Currently, 1 is O(1) time (insert into linked list) ; 2 and 3 are O(n) where n is the number of previously written objects.
I propose the following hierarchical organization:
- Per variable (implies a dimensionality via global dimensions)
- Hash on lb
- Hash on version (mod max_versions)
- Linked-list of objects, sorted in descending version order
Lookups for 1-3 can be done in constant time (1 could be done with by its own hash, but it would be better for there to be a map of name -> array index that is shared by piggybacking onto put/get responses).
This keeps puts constant time. Finding a given object descriptor is constant time, except in the case where this is no user-set max_versions, and so the linked list in tier 5 can grow arbitrarily large. Finding a list of objects that intersect a given object is still O(n), but the search set is potentially reduced by filtering on variable).
We should try to get overlap searches down to constant time (although the results will not be constant size).
Ideally, there would be a list of overlapping objects maintained per object descriptor, so we should try and figure out how to do this without harming put performance.
The organization of local storage object descriptors needs to be updated. Currently, local storage is simply lists of object descriptors, with one list per version mod max_versions. This doesn't scale in various scenarios that are becoming increasingly common.
This organization needs to be able to support several operations efficiently:
Currently, 1 is O(1) time (insert into linked list) ; 2 and 3 are O(n) where n is the number of previously written objects.
I propose the following hierarchical organization:
Lookups for 1-3 can be done in constant time (1 could be done with by its own hash, but it would be better for there to be a map of name -> array index that is shared by piggybacking onto put/get responses).
This keeps puts constant time. Finding a given object descriptor is constant time, except in the case where this is no user-set max_versions, and so the linked list in tier 5 can grow arbitrarily large. Finding a list of objects that intersect a given object is still O(n), but the search set is potentially reduced by filtering on variable).
We should try to get overlap searches down to constant time (although the results will not be constant size).
Ideally, there would be a list of overlapping objects maintained per object descriptor, so we should try and figure out how to do this without harming put performance.