feat: add v3 typed float storage and schema locking #25
Open
Summary
This PR introduces `record_format = 3` for typed float storage in Atompack while keeping each database schema homogeneous.
The main goal is to preserve `float32` or `float64` for builtin float-bearing fields, support `float64` positions cleanly, and reject dtype drift once a database schema has been established.
The PR also fixes two regressions found during review:
- `FILE_FORMAT_VERSION = 2` databases written by `origin/main` remain readable
- `Database.add_arrays_batch()` can no longer bypass v2 builtin dtype restrictions through the raw SOA path
It also includes cleanup passes that remove duplicated SOA/schema code on the Rust side and split the Python SOA parser/view logic into its own module.
Schema Lock
`SchemaLock` is the per-database schema contract used during append validation.
It tracks:
- `positions` dtype for v3 (`f32` or `f64`)
Operationally:
- `positions` dtype and any sections present in that record
This preserves typed float storage while still keeping one homogeneous schema per database.
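The locking behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the PR's implementation: the real `SchemaLock` lives in the Rust core, and the names `SchemaLockSketch`, `SchemaMismatch`, and `validate_append` are hypothetical. The sketch also assumes that a section not seen before is merged into the lock rather than rejected, which the description does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class SchemaMismatch(ValueError):
    """Raised when a record disagrees with the locked database schema."""


@dataclass
class SchemaLockSketch:
    # Hypothetical Python stand-in; the PR's SchemaLock is part of the Rust core.
    positions_dtype: Optional[str] = None  # "f32" or "f64" once locked
    sections: Dict[str, str] = field(default_factory=dict)  # section name -> dtype

    def validate_append(self, positions_dtype: str, sections: Dict[str, str]) -> None:
        if positions_dtype not in ("f32", "f64"):
            raise SchemaMismatch(f"unsupported positions dtype: {positions_dtype!r}")

        if self.positions_dtype is None:
            # First record: lock the positions dtype and the sections it carries.
            self.positions_dtype = positions_dtype
            self.sections = dict(sections)
            return

        if positions_dtype != self.positions_dtype:
            raise SchemaMismatch(
                f"positions dtype drift: locked {self.positions_dtype}, got {positions_dtype}"
            )

        # Check every section before mutating the lock, so a rejected
        # record leaves the lock unchanged.
        for name, dtype in sections.items():
            locked = self.sections.get(name)
            if locked is not None and locked != dtype:
                raise SchemaMismatch(
                    f"dtype drift in section {name!r}: locked {locked}, got {dtype}"
                )

        # Assumption: sections not seen before are merged into the lock.
        for name, dtype in sections.items():
            self.sections.setdefault(name, dtype)
```

Validating before mutating keeps a failed append from partially updating the lock, which matters if the caller catches the error and continues writing.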
What This Achieves
- `record_format = 3` for typed-float SOA records.
- `float32` or `float64` storage for:
  - `positions`
  - `energy`
  - `forces`
  - `charges`
  - `velocities`
  - `cell`
  - `stress`
- `atomic_numbers` fixed as `u8`.
- `pbc` fixed as `bool[3]`.
Format / Compatibility
New on-disk behavior
- `record_format = 3` stores `positions` dtype once per database in persisted schema metadata.
- `[n_atoms:u32][positions][atomic_numbers][n_sections:u16][sections...]`
- `type_tag`
- `positions_type` byte
This is the intended split:
- `positions` needs database-level schema because it lives in the raw record prefix and determines whether the reader advances by `n * 12` or `n * 24` bytes
- `forces`, `charges`, `velocities`, `cell`, `stress`, `energy`, and custom properties are already self-describing through section tags
Compatibility guarantees
- `record_format = 2` and `record_format = 3`.
- `FILE_FORMAT_VERSION = 2` files written by `origin/main` remain readable.
Header compatibility fix
The initial implementation repurposed legacy header bytes for schema metadata without bumping `FILE_FORMAT_VERSION`. That was incorrect for a version-2 compatibility claim. This PR fixes that by:
- `index_offset`, `index_len`, `num_molecules`, `compression`, and `record_format`
- `schema_offset` / `schema_len` in previously unused header bytes instead
That keeps schema metadata in the header-managed file state without invalidating old v2 files.
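To illustrate why placing `schema_offset` / `schema_len` in previously unused bytes keeps old files readable, here is a minimal sketch using Python's `struct` module. The field widths, byte offsets, 64-byte header size, and little-endian order are all assumptions made for the example; the description does not give the actual header layout. The sketch also assumes the unused bytes of old files are zero-filled, so a zero `schema_len` can signal "no persisted schema".

```python
import struct

# Hypothetical 64-byte header. Field widths and positions are illustrative only.
# Legacy view: five fields followed by 38 unused bytes.
LEGACY_HEADER = struct.Struct("<QQQBB38x")
# Extended view: the same five fields at the same offsets, with
# schema_offset and schema_len carved out of the unused region.
EXTENDED_HEADER = struct.Struct("<QQQBB6xQQ16x")
assert LEGACY_HEADER.size == EXTENDED_HEADER.size == 64


def read_header(raw: bytes) -> dict:
    """Decode a header with the extended view; works for legacy bytes too."""
    (index_offset, index_len, num_molecules, compression,
     record_format, schema_offset, schema_len) = EXTENDED_HEADER.unpack(raw[:64])
    return {
        "index_offset": index_offset,
        "index_len": index_len,
        "num_molecules": num_molecules,
        "compression": compression,
        "record_format": record_format,
        # Assumption: zero-filled unused bytes mean no schema blob is present.
        "schema": (schema_offset, schema_len) if schema_len > 0 else None,
    }


# A header written with the legacy view decodes with unchanged legacy fields.
old = LEGACY_HEADER.pack(4096, 128, 10, 0, 2)
print(read_header(old)["record_format"], read_header(old)["schema"])
```

Because the legacy fields keep their offsets, a reader that only knows the legacy view still decodes them correctly from a header that carries schema metadata.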
V2 raw-write compatibility fix
The initial implementation enforced v2 builtin dtype restrictions in the normal serializer path, but the Python batch raw-record path could still construct invalid v2 records.
This PR fixes that in the storage core by validating builtin section type tags during raw SOA schema parsing as well. That means both Python write paths now enforce the same v2 rules.
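As a rough sketch of what validating type tags during raw SOA parsing could look like, the function below walks the record prefix given under "New on-disk behavior" (`[n_atoms:u32][positions][atomic_numbers][n_sections:u16][sections...]`), skipping `n * 12` or `n * 24` bytes of positions, and then checks each section's tag. Everything beyond that prefix is an assumption for illustration: the per-section encoding (`[name_len:u8][name][type_tag:u8][payload_len:u32][payload]`), the tag values, little-endian byte order, the `V2_BUILTIN_TAGS` table, and the rule that v2 positions are always `f32`. None of these are taken from Atompack itself.

```python
import struct

# Hypothetical tag values and v2 rules; the real ones are not given in this PR text.
TAG_F32, TAG_F64 = 1, 2
V2_BUILTIN_TAGS = {"forces": TAG_F32, "energy": TAG_F64}  # placeholder table
POSITION_BYTES_PER_ATOM = {"f32": 12, "f64": 24}  # 3 components * 4 or 8 bytes


def validate_raw_record(record: bytes, record_format: int, positions_dtype: str) -> list:
    """Return [(section_name, type_tag)] or raise ValueError on a v2 violation."""
    if record_format == 2 and positions_dtype != "f32":
        # Assumption: v2 stores positions as f32 only.
        raise ValueError("v2 records require f32 positions")

    (n_atoms,) = struct.unpack_from("<I", record, 0)
    offset = 4 + n_atoms * POSITION_BYTES_PER_ATOM[positions_dtype]  # skip positions
    offset += n_atoms  # skip atomic_numbers, one u8 per atom
    (n_sections,) = struct.unpack_from("<H", record, offset)
    offset += 2

    seen = []
    for _ in range(n_sections):
        # Hypothetical section encoding, see the lead-in.
        (name_len,) = struct.unpack_from("<B", record, offset)
        offset += 1
        name = record[offset:offset + name_len].decode("utf-8")
        offset += name_len
        type_tag, payload_len = struct.unpack_from("<BI", record, offset)
        offset += 5
        required = V2_BUILTIN_TAGS.get(name)
        if record_format == 2 and required is not None and type_tag != required:
            raise ValueError(f"v2 builtin section {name!r} has disallowed type tag {type_tag}")
        seen.append((name, type_tag))
        offset += payload_len  # skip the payload itself
    return seen


def build_record(n_atoms: int, positions_dtype: str, sections: list) -> bytes:
    """Assemble a record in the same hypothetical encoding, for experimentation."""
    code = "f" if positions_dtype == "f32" else "d"
    out = struct.pack("<I", n_atoms)
    out += struct.pack(f"<{3 * n_atoms}{code}", *([0.0] * (3 * n_atoms)))
    out += bytes([1] * n_atoms)
    out += struct.pack("<H", len(sections))
    for name, tag, payload in sections:
        encoded = name.encode("utf-8")
        out += struct.pack("<B", len(encoded)) + encoded
        out += struct.pack("<BI", tag, len(payload)) + payload
    return out
```

Putting the check in the parser rather than in each caller is what lets every write path share one set of v2 rules, which is the point the description makes about the storage core.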
Main Changes
Rust core and storage
- `positions` dtype.
- `f32`/`f64` variants.
SOA / schema refactor in Rust
The initial implementation had duplicated v2/v3 serializers, deserializers, and schema parsing logic.
This PR collapses that into shared implementations and splits schema-specific logic out of the storage SOA module:
- `atompack/src/storage/soa.rs` now owns SOA encoding/decoding and typed payload helpers
- `atompack/src/storage/schema.rs` owns schema blobs, per-record schema extraction, and schema merging
This keeps the actual v2/v3 difference focused on how `positions` dtype is resolved and reduces drift risk between code paths.
Python bindings
- `Molecule(...)` and `Molecule.from_arrays(...)` preserve `float32`/`float64` for builtin float arrays where supported.
- `Database.add_arrays_batch(...)` passes typed SOA data through while relying on the core storage layer for v2 compatibility enforcement.
- `get_molecule()` and `get_molecules_flat()` use the database-level `positions` dtype for v3 reads.
- `positions_type` byte.
Python SOA parser/view split
The Python-side SOA parser and lazy view logic had started to accumulate in `atompack-py/src/lib.rs`. This PR moves that code into:
- `atompack-py/src/soa.rs`
with `lib.rs` reduced to shared constants/helpers, module wiring, and re-exports. This mirrors the Rust-side cleanup and makes the crate root much easier to read.
Refactoring Done In This PR
- `f32`/`f64` dispatch in multiple write paths.
- `Vec<Molecule>` instances first.
- `positions_type`
- `positions`
- `schema.rs` and Python SOA parser/view code into `soa.rs`.
What Is Not Changing
- `atomic_numbers` stays fixed as `u8`.
- `pbc` stays fixed as boolean triplets.
- `float16`/`bfloat16` storage.
Validation
Ran:
- `rtk make ci-rust`
- `rtk make ci-py`
These include formatting, linting, Rust tests, editable Python build, and the Python test suite.
Follow-ups Worth Considering