feat: add v3 typed float storage and schema locking by Ramlaoui · Pull Request #25 · LeMaterial/atompack

Ramlaoui · 2026-05-09T19:59:54Z

Summary

This PR introduces record_format = 3 for typed float storage in Atompack while keeping each database schema homogeneous.

The main goal is to preserve float32 or float64 for builtin float-bearing fields, support float64 positions cleanly, and reject dtype drift once a database schema has been established.

The PR also fixes two regressions found during review:

- legacy FILE_FORMAT_VERSION = 2 databases written by origin/main remain readable
- Database.add_arrays_batch() can no longer bypass v2 builtin dtype restrictions through the raw SOA path

It also includes cleanup passes that remove duplicated SOA/schema code on the Rust side and split the Python SOA parser/view logic into its own module.

Schema Lock

SchemaLock is the per-database schema contract used during append validation.

It tracks:

- the database-wide positions dtype for v3 (f32 or f64)
- the dtype and shape contract of each builtin/custom section that has appeared so far

Operationally:

- an empty database starts unlocked
- the first write locks positions dtype and any sections present in that record
- later writes may omit optional fields
- later writes may introduce new optional fields; those fields are added to the lock
- later writes may not change a locked dtype or section shape contract; if they do, append fails

This preserves typed float storage while still keeping one homogeneous schema per database.
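The locking rules above can be illustrated with a small Python model. This is only a sketch of the described behavior, not the Rust SchemaLock implementation: the class name SchemaLockSketch, the method check_and_merge, the SchemaMismatch exception, and the string form of the shape contract are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical contract for one section: (dtype name, shape rule).
SectionContract = Tuple[str, str]


class SchemaMismatch(ValueError):
    """Raised when a record contradicts the locked schema."""


@dataclass
class SchemaLockSketch:
    # None means the database is still unlocked (no record written yet).
    positions_dtype: Optional[str] = None
    sections: Dict[str, SectionContract] = field(default_factory=dict)

    def check_and_merge(self, positions_dtype: str,
                        sections: Dict[str, SectionContract]) -> None:
        # Validate everything first so a rejected append leaves the lock untouched.
        if self.positions_dtype not in (None, positions_dtype):
            raise SchemaMismatch(
                f"positions locked as {self.positions_dtype}, got {positions_dtype}")
        for name, contract in sections.items():
            locked = self.sections.get(name)
            if locked is not None and locked != contract:
                raise SchemaMismatch(
                    f"section {name!r} locked as {locked}, got {contract}")
        # Commit: lock positions dtype and any sections seen for the first time.
        self.positions_dtype = positions_dtype
        self.sections.update(sections)


lock = SchemaLockSketch()
lock.check_and_merge("f64", {"forces": ("f64", "n_atoms x 3")})  # first write locks
lock.check_and_merge("f64", {})                                  # omitting fields is fine
lock.check_and_merge("f64", {"energy": ("f64", "scalar")})       # new field joins the lock
try:
    lock.check_and_merge("f64", {"forces": ("f32", "n_atoms x 3")})
except SchemaMismatch as exc:
    print("append rejected:", exc)
```

Validating before mutating means a failed append cannot partially extend the lock, which seems the natural reading of "append fails" above.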

What This Achieves

- Adds record_format = 3 for typed-float SOA records.
- Preserves float32 or float64 storage for:
  - positions
  - energy
  - forces
  - charges
  - velocities
  - cell
  - stress
- Keeps atomic_numbers fixed as u8.
- Keeps pbc fixed as bool[3].
- Locks database schema on first write per field and rejects later dtype/shape mismatches.
- Keeps new code able to read existing v2 files while allowing new files to opt into v3.

Format / Compatibility

New on-disk behavior

record_format = 3 stores positions dtype once per database in persisted schema metadata.

- v3 record prefixes stay compact:
  - [n_atoms:u32][positions][atomic_numbers][n_sections:u16][sections...]
- the rest of the builtin/custom dtype information continues to live in each section's existing type_tag
- there is no per-record positions_type byte

This is the intended split:

- positions needs database-level schema because it lives in the raw record prefix and determines whether the reader advances by n * 12 or n * 24 bytes
- forces, charges, velocities, cell, stress, energy, and custom properties are already self-describing through section tags
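To make the n * 12 versus n * 24 point concrete, here is a minimal Python sketch that walks the record prefix shown above. It assumes little-endian integers and three position components per atom; the function name split_v3_prefix and the dtype labels "f32" / "f64" are hypothetical and not part of the atompack API.

```python
import struct

# Bytes per position component for each database-level positions dtype.
POSITION_ITEMSIZE = {"f32": 4, "f64": 8}


def split_v3_prefix(record: bytes, positions_dtype: str):
    """Split [n_atoms:u32][positions][atomic_numbers][n_sections:u16] from a record.

    Assumes little-endian encoding (an assumption of this sketch).
    Returns (n_atoms, positions_bytes, atomic_numbers, n_sections, sections_offset).
    """
    itemsize = POSITION_ITEMSIZE[positions_dtype]
    (n_atoms,) = struct.unpack_from("<I", record, 0)
    offset = 4
    pos_len = n_atoms * 3 * itemsize          # n * 12 for f32, n * 24 for f64
    positions = record[offset:offset + pos_len]
    offset += pos_len
    atomic_numbers = record[offset:offset + n_atoms]  # one u8 per atom
    offset += n_atoms
    (n_sections,) = struct.unpack_from("<H", record, offset)
    offset += 2
    return n_atoms, positions, atomic_numbers, n_sections, offset


# A two-atom record with float64 positions and no sections.
record = (struct.pack("<I", 2)
          + struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
          + bytes([1, 8])
          + struct.pack("<H", 0))

n, pos, z, n_sections, off = split_v3_prefix(record, "f64")
assert (n, len(pos), z, n_sections, off) == (2, 48, bytes([1, 8]), 0, len(record))

# Reading the same bytes as f32 misplaces every field after positions.
_, pos32, z32, _, _ = split_v3_prefix(record, "f32")
assert len(pos32) == 24 and z32 != bytes([1, 8])
```

Nothing in the prefix itself tells the reader which stride to use, which is why the dtype has to come from database-level metadata.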

Compatibility guarantees

- New code reads both record_format = 2 and record_format = 3.
- Existing FILE_FORMAT_VERSION = 2 files written by origin/main remain readable.
- Old atompack versions are not expected to read v3 files.
- This PR does not support mixed builtin schemas within a single database.

Header compatibility fix

The initial implementation repurposed legacy header bytes for schema metadata without bumping FILE_FORMAT_VERSION.

That was incorrect for a version-2 compatibility claim. This PR fixes that by:

- restoring the legacy v2 header field offsets for index_offset, index_len, num_molecules, compression, and record_format
- storing schema_offset / schema_len in previously unused header bytes instead

That keeps schema metadata in the header-managed file state without invalidating old v2 files.

V2 raw-write compatibility fix

The initial implementation enforced v2 builtin dtype restrictions in the normal serializer path, but the Python batch raw-record path could still construct invalid v2 records.

This PR fixes that in the storage core by validating builtin section type tags during raw SOA schema parsing as well. That means both Python write paths now enforce the same v2 rules.
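The idea of enforcing the rule where both write paths converge can be sketched as follows. This is a hedged illustration only: the table V2_BUILTIN_TAGS is a placeholder (the actual v2 dtype rules and type_tag values live in the atompack storage core and are not stated here), and the names validate_section_tags, append_record, and V2CompatError are hypothetical.

```python
from typing import Iterable, Tuple

# PLACEHOLDER table: the tag a v2 database accepts for each builtin section.
# These entries are illustrative assumptions, not the real atompack rules.
V2_BUILTIN_TAGS = {"forces": "f32", "energy": "f64"}


class V2CompatError(ValueError):
    """Raised when a record would violate v2 builtin dtype restrictions."""


def validate_section_tags(record_format: int,
                          sections: Iterable[Tuple[str, str]]) -> None:
    """Reject builtin sections whose type tag is not allowed in a v2 database."""
    if record_format != 2:
        return  # this check only covers the v2 restrictions
    for name, tag in sections:
        expected = V2_BUILTIN_TAGS.get(name)  # None: not a restricted builtin
        if expected is not None and tag != expected:
            raise V2CompatError(
                f"v2 database requires {name!r} as {expected}, got {tag}")


def append_record(record_format: int, sections: Iterable[Tuple[str, str]]) -> str:
    """Stand-in for the storage-core append that every write path reaches."""
    sections = list(sections)
    validate_section_tags(record_format, sections)
    return "appended"


# Both the normal serializer path and the raw SOA batch path would call
# append_record, so neither can construct a record the other would reject.
print(append_record(2, [("forces", "f32")]))
try:
    append_record(2, [("forces", "f64")])
except V2CompatError as exc:
    print("rejected:", exc)
```

Placing the check in the shared core, rather than in each binding-level entry point, is what removes the bypass described above.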

Main Changes

Rust core and storage

- Adds persisted schema metadata to the header-managed file state.
- Stores database schema blobs alongside the trailing index on flush.
- Loads schema metadata on open and uses it as the authoritative v3 positions dtype.
- Restores the legacy v2 header layout while keeping schema metadata in unused header bytes.
- Extends builtin float handling for f32/f64 variants.
- Validates append-time schema compatibility by dtype and shape contract.
- Validates raw SOA records against v2 builtin compatibility rules.

SOA / schema refactor in Rust

The initial implementation had duplicated v2/v3 serializers, deserializers, and schema parsing logic.

This PR collapses that into shared implementations and splits schema-specific logic out of the storage SOA module:

- atompack/src/storage/soa.rs now owns SOA encoding/decoding and typed payload helpers
- atompack/src/storage/schema.rs owns schema blobs, per-record schema extraction, and schema merging

This keeps the actual v2/v3 difference focused on how positions dtype is resolved and reduces drift risk between code paths.

Python bindings

- Molecule(...) and Molecule.from_arrays(...) preserve float32 / float64 for builtin float arrays where supported.
- Database.add_arrays_batch(...) passes typed SOA data through while relying on the core storage layer for v2 compatibility enforcement.
- get_molecule() and get_molecules_flat() use the database-level positions dtype for v3 reads.
- The view/parser layer no longer expects a per-record v3 positions_type byte.

Python SOA parser/view split

The Python-side SOA parser and lazy view logic had started to accumulate in atompack-py/src/lib.rs.

This PR moves that code into:

atompack-py/src/soa.rs

with lib.rs reduced to shared constants/helpers, module wiring, and re-exports. This mirrors the Rust-side cleanup and makes the crate root much easier to read.

Refactoring Done In This PR

- Consolidated Python dtype parsing into shared helpers instead of repeating f32/f64 dispatch in multiple write paths.
- Restored the batch write path to direct SOA record construction instead of materializing full Vec<Molecule> instances first.
- Removed the hybrid v3 design:
  - no per-record positions_type
  - database-level schema metadata for positions
  - existing per-section tags for the rest
- Unified Rust SOA encode/decode/schema logic to reduce drift risk between v2 and v3.
- Split Rust schema code into schema.rs and Python SOA parser/view code into soa.rs.

What Is Not Changing

- atomic_numbers stays fixed as u8.
- pbc stays fixed as boolean triplets.
- This PR does not add per-record mixed builtin schemas inside one database.
- This PR does not introduce float16 / bfloat16 storage.
- This PR does not add an explicit user-declared schema API yet; schema is still inferred and locked on first write.

Validation

Ran:

rtk make ci-rust
rtk make ci-py

These include formatting, linting, Rust tests, editable Python build, and the Python test suite.

Follow-ups Worth Considering

- Add an explicit schema declaration option at database creation time instead of relying only on first-write locking.
- Add higher-level fixture-based compatibility tests for old on-disk files instead of only synthetic header/raw-record regressions.
- Consider whether schema blobs should eventually expose more explicit user-facing metadata than the current internal lock representation.

Ramlaoui added 4 commits May 9, 2026 21:58

feat: add v3 typed float storage

4c3f219

fix: restore CI validation expectations

d99a928

refactor: store v3 positions dtype in schema metadata

3b1fdee

fix: restore v2 compatibility and split python soa parsing

c3aa0e9

Ramlaoui force-pushed the codex/feat-v3-typed-floats branch from 52f8c2a to c3aa0e9 May 9, 2026 21:35

Ramlaoui marked this pull request as ready for review May 9, 2026 21:36

Ramlaoui wants to merge 4 commits into main from codex/feat-v3-typed-floats