feat: add v3 typed float storage and schema locking#25

Open
Ramlaoui wants to merge 4 commits into main from codex/feat-v3-typed-floats


@Ramlaoui commented May 9, 2026

Summary

This PR introduces record_format = 3 for typed float storage in Atompack while keeping each database schema homogeneous.

The main goal is to preserve float32 or float64 for builtin float-bearing fields, support float64 positions cleanly, and reject dtype drift once a database schema has been established.

The PR also fixes two regressions found during review:

  • legacy FILE_FORMAT_VERSION = 2 databases written by origin/main remain readable
  • Database.add_arrays_batch() can no longer bypass v2 builtin dtype restrictions through the raw SOA path

It also includes cleanup passes that remove duplicated SOA/schema code on the Rust side and split the Python SOA parser/view logic into its own module.

Schema Lock

SchemaLock is the per-database schema contract used during append validation.

It tracks:

  • the database-wide positions dtype for v3 (f32 or f64)
  • the dtype and shape contract of each builtin/custom section that has appeared so far

Operationally:

  • an empty database starts unlocked
  • the first write locks positions dtype and any sections present in that record
  • later writes may omit optional fields
  • later writes may introduce new optional fields; those fields are added to the lock
  • later writes may not change a locked dtype or section shape contract; if they do, append fails

This preserves typed float storage while still keeping one homogeneous schema per database.
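
The locking rules above can be sketched as a small Python model. This is an illustration only: `SchemaLock.check_and_merge`, `SchemaMismatch`, and the `(dtype, shape)` contract tuples are hypothetical names chosen for this sketch, not the Rust types used in Atompack.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical representation: (dtype name, shape contract), e.g. ("f32", ("n", 3)).
SectionContract = Tuple[str, Tuple[object, ...]]


class SchemaMismatch(ValueError):
    """Raised when an append would change a locked dtype or shape contract."""


@dataclass
class SchemaLock:
    positions_dtype: Optional[str] = None  # None means the database is still unlocked
    sections: Dict[str, SectionContract] = field(default_factory=dict)

    def check_and_merge(self, positions_dtype: str,
                        sections: Dict[str, SectionContract]) -> None:
        # Validate everything first so a rejected append leaves the lock untouched.
        if self.positions_dtype is not None and self.positions_dtype != positions_dtype:
            raise SchemaMismatch(
                f"positions locked as {self.positions_dtype}, got {positions_dtype}")
        for name, contract in sections.items():
            locked = self.sections.get(name)
            if locked is not None and locked != contract:
                raise SchemaMismatch(f"{name} locked as {locked}, got {contract}")
        # Commit: the first write locks positions; new optional fields join the lock.
        self.positions_dtype = positions_dtype
        self.sections.update(sections)


lock = SchemaLock()
lock.check_and_merge("f64", {"forces": ("f32", ("n", 3))})  # first write locks
lock.check_and_merge("f64", {})                             # omitting fields is fine
lock.check_and_merge("f64", {"energy": ("f64", ())})        # new field is added
```

In this model, a later call such as `lock.check_and_merge("f64", {"forces": ("f64", ("n", 3))})` raises `SchemaMismatch`, because `forces` was already locked as `f32`.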

What This Achieves

  • Adds record_format = 3 for typed-float SOA records.
  • Preserves float32 or float64 storage for:
    • positions
    • energy
    • forces
    • charges
    • velocities
    • cell
    • stress
  • Keeps atomic_numbers fixed as u8.
  • Keeps pbc fixed as bool[3].
  • Locks each field's dtype and shape the first time that field is written, and rejects later dtype/shape mismatches.
  • Keeps new code able to read existing v2 files while allowing new files to opt into v3.
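
To illustrate why preserving the original float width matters, the following stdlib-only sketch round-trips one value through 4-byte and 8-byte encodings. It uses Python's `struct` module rather than Atompack itself, and the variable names are made up for the example.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Pack a Python float with the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


energy = 0.1  # CPython floats are C doubles, i.e. float64

as_f64 = roundtrip(energy, "<d")  # 8 bytes per value, value unchanged
as_f32 = roundtrip(energy, "<f")  # 4 bytes per value, rounded to the nearest float32

f32_error = abs(as_f32 - energy)
print(as_f64 == energy, as_f32 == energy, f32_error)
```

Storing float64 input as float32 halves the payload but silently rounds it. Keeping the caller's dtype avoids that rounding when the input is float64, and avoids doubling the storage when the input is already float32.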

Format / Compatibility

New on-disk behavior

record_format = 3 stores positions dtype once per database in persisted schema metadata.

  • v3 record prefixes stay compact:
    • [n_atoms:u32][positions][atomic_numbers][n_sections:u16][sections...]
  • the rest of the builtin/custom dtype information continues to live in each section's existing type_tag
  • there is no per-record positions_type byte

This is the intended split:

  • positions needs database-level schema because it lives in the raw record prefix and determines whether the reader advances by n * 12 bytes (three f32 components per atom) or n * 24 bytes (three f64 components per atom)
  • forces, charges, velocities, cell, stress, energy, and custom properties are already self-describing through section tags
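
A minimal sketch of how a reader might walk the v3 record prefix, given the database-level positions dtype. The function name `split_v3_prefix` is hypothetical, and little-endian integers are an assumption; the PR does not state the byte order.

```python
import struct

POSITION_ITEMSIZE = {"f32": 4, "f64": 8}


def split_v3_prefix(record: bytes, positions_dtype: str):
    """Split [n_atoms:u32][positions][atomic_numbers][n_sections:u16] from a record.

    Assumes little-endian integers (not confirmed by the PR) and performs no
    validation beyond what struct.unpack_from does.
    """
    itemsize = POSITION_ITEMSIZE[positions_dtype]
    (n_atoms,) = struct.unpack_from("<I", record, 0)
    offset = 4
    pos_len = n_atoms * 3 * itemsize  # n * 12 for f32, n * 24 for f64
    positions = record[offset:offset + pos_len]
    offset += pos_len
    atomic_numbers = record[offset:offset + n_atoms]  # one u8 per atom
    offset += n_atoms
    (n_sections,) = struct.unpack_from("<H", record, offset)
    offset += 2  # sections start here
    return n_atoms, positions, atomic_numbers, n_sections, offset


# Two atoms with f64 positions, atomic numbers 1 and 8, and three sections.
record = (
    struct.pack("<I", 2)
    + struct.pack("<6d", 0.0, 0.0, 0.0, 0.0, 0.0, 0.96)
    + bytes([1, 8])
    + struct.pack("<H", 3)
)
n_atoms, positions, atomic_numbers, n_sections, sections_start = split_v3_prefix(
    record, "f64"
)
```

Nothing in the prefix itself records the positions width, so parsing the same bytes with `"f32"` would read the atomic numbers and section count from the middle of the positions payload. That is why the dtype has to come from the database-level schema.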

Compatibility guarantees

  • New code reads both record_format = 2 and record_format = 3.
  • Existing FILE_FORMAT_VERSION = 2 files written by origin/main remain readable.
  • Old atompack versions are not expected to read v3 files.
  • This PR does not support mixed builtin schemas within a single database.

Header compatibility fix

The initial implementation repurposed legacy header bytes for schema metadata without bumping FILE_FORMAT_VERSION.

That was inconsistent with the claim of version-2 compatibility. This PR fixes that by:

  • restoring the legacy v2 header field offsets for index_offset, index_len, num_molecules, compression, and record_format
  • storing schema_offset / schema_len in previously unused header bytes instead

That keeps schema metadata in the header-managed file state without invalidating old v2 files.

V2 raw-write compatibility fix

The initial implementation enforced v2 builtin dtype restrictions in the normal serializer path, but the Python batch raw-record path could still construct invalid v2 records.

This PR fixes that in the storage core by validating builtin section type tags during raw SOA schema parsing as well. That means both Python write paths now enforce the same v2 rules.
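
The idea of enforcing the rule once in the storage core, rather than in each write path, can be sketched as follows. Everything here is hypothetical: `ToyStore`, its methods, and especially the `V2_BUILTIN_TAGS` table are placeholders, since the PR does not list the actual v2 type tags or which dtype each builtin requires.

```python
from typing import Dict, Iterable, List, Tuple

Section = Tuple[str, str]  # (section name, type tag)

# Placeholder rule table: the real v2 restrictions are not spelled out in the PR.
V2_BUILTIN_TAGS: Dict[str, str] = {"forces": "f32", "energy": "f64"}


def validate_v2_sections(sections: Iterable[Section],
                         rules: Dict[str, str] = V2_BUILTIN_TAGS) -> None:
    """Reject builtin sections whose type tag differs from the v2 rule table."""
    for name, tag in sections:
        expected = rules.get(name)  # custom sections have no entry and pass through
        if expected is not None and tag != expected:
            raise ValueError(f"v2 requires {name} as {expected}, got {tag}")


class ToyStore:
    """Stand-in for the storage core; both write paths share one validator."""

    def __init__(self) -> None:
        self.records: List[List[Section]] = []

    def add_molecule(self, sections: Iterable[Section]) -> None:
        # Normal serializer path.
        record = list(sections)
        validate_v2_sections(record)
        self.records.append(record)

    def add_raw_batch(self, batch: Iterable[Iterable[Section]]) -> None:
        # Raw SOA path: validate the whole batch before appending anything.
        records = [list(sections) for sections in batch]
        for record in records:
            validate_v2_sections(record)
        self.records.extend(records)


store = ToyStore()
store.add_molecule([("forces", "f32"), ("my_custom", "f64")])
```

Because the check lives below both entry points, a batch containing an invalid builtin tag fails in the same way as a single invalid molecule.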

Main Changes

Rust core and storage

  • Adds persisted schema metadata to the header-managed file state.
  • Stores database schema blobs alongside the trailing index on flush.
  • Loads schema metadata on open and uses it as the authoritative v3 positions dtype.
  • Restores the legacy v2 header layout while keeping schema metadata in unused header bytes.
  • Extends builtin float handling for f32/f64 variants.
  • Validates append-time schema compatibility by dtype and shape contract.
  • Validates raw SOA records against v2 builtin compatibility rules.

SOA / schema refactor in Rust

The initial implementation had duplicated v2/v3 serializers, deserializers, and schema parsing logic.

This PR collapses that into shared implementations and splits schema-specific logic out of the storage SOA module:

  • atompack/src/storage/soa.rs now owns SOA encoding/decoding and typed payload helpers
  • atompack/src/storage/schema.rs owns schema blobs, per-record schema extraction, and schema merging

This keeps the actual v2/v3 difference focused on how positions dtype is resolved and reduces drift risk between code paths.

Python bindings

  • Molecule(...) and Molecule.from_arrays(...) preserve float32 / float64 for builtin float arrays where supported.
  • Database.add_arrays_batch(...) passes typed SOA data through while relying on the core storage layer for v2 compatibility enforcement.
  • get_molecule() and get_molecules_flat() use the database-level positions dtype for v3 reads.
  • The view/parser layer no longer expects a per-record v3 positions_type byte.

Python SOA parser/view split

The Python-side SOA parser and lazy view logic had started to accumulate in atompack-py/src/lib.rs.

This PR moves that code into:

  • atompack-py/src/soa.rs

with lib.rs reduced to shared constants/helpers, module wiring, and re-exports. This mirrors the Rust-side cleanup and makes the crate root much easier to read.

Refactoring Done In This PR

  • Consolidated Python dtype parsing into shared helpers instead of repeating f32/f64 dispatch in multiple write paths.
  • Restored the batch write path to direct SOA record construction instead of materializing full Vec<Molecule> instances first.
  • Removed the hybrid v3 design:
    • no per-record positions_type
    • database-level schema metadata for positions
    • existing per-section tags for the rest
  • Unified Rust SOA encode/decode/schema logic to reduce drift risk between v2 and v3.
  • Split Rust schema code into schema.rs and Python SOA parser/view code into soa.rs.

What Is Not Changing

  • atomic_numbers stays fixed as u8.
  • pbc stays fixed as boolean triplets.
  • This PR does not add per-record mixed builtin schemas inside one database.
  • This PR does not introduce float16 / bfloat16 storage.
  • This PR does not add an explicit user-declared schema API yet; schema is still inferred and locked on first write.

Validation

Ran:

  • rtk make ci-rust
  • rtk make ci-py

These include formatting, linting, Rust tests, editable Python build, and the Python test suite.

Follow-ups Worth Considering

  • Add an explicit schema declaration option at database creation time instead of relying only on first-write locking.
  • Add higher-level fixture-based compatibility tests for old on-disk files instead of only synthetic header/raw-record regressions.
  • Consider whether schema blobs should eventually expose more explicit user-facing metadata than the current internal lock representation.

@Ramlaoui force-pushed the codex/feat-v3-typed-floats branch from 52f8c2a to c3aa0e9 on May 9, 2026 21:35
@Ramlaoui marked this pull request as ready for review on May 9, 2026 21:36