@@ -201,6 +201,44 @@ The longer file-backed dataset workflow is intentionally kept in the training
 notebook above so the ``torch_datasets`` page can stay focused on compact
 API-facing examples.
 
+Execution Model
+~~~~~~~~~~~~~~~~
+
+The current trainer has two distinct runtime stages:
+
+1. Sample preparation happens in the main process when ``num_workers=0``, or
+   in ``DataLoader`` workers when ``num_workers > 0``. Structures are
+   converted to tensors on ``descriptor.device``, and descriptor
+   featurization, neighbor reuse, graph/triplet construction, and lazy HDF5
+   cache reads happen there.
+2. The collated batch is then moved onto ``config.device`` inside the
+   training loop. Model forward passes, normalization, loss computation, and
+   optimizer steps run on that device.
+
+In practice, GPU training with ``num_workers > 0`` is best understood as
+worker-side data preparation feeding a training loop on the selected device.
+It is not currently a separate mixed CPU/GPU execution pipeline.
+
+If ``descriptor.device`` and ``config.device`` match, featurization and model
+compute happen on the same device. If they differ, samples are materialized on
+``descriptor.device`` and transferred before the forward pass. The compact
+examples on this page create the descriptor on CPU, so later
+``device='cuda'`` examples describe CPU-side sample preparation feeding GPU
+training unless you also move the descriptor to CUDA.
+
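The two-stage handoff described above can be sketched in plain Python. This is an illustrative sketch only: ``prepare_sample`` and ``plan_batch_transfer`` are hypothetical names standing in for trainer internals, and plain strings stand in for real torch devices.

```python
# Illustrative sketch of the two-stage execution model described above.
# `prepare_sample` and `plan_batch_transfer` are hypothetical names, not
# trainer API; strings stand in for torch devices.

def prepare_sample(structure, descriptor_device):
    """Stage 1: featurize the structure on the descriptor's device.

    Runs in the main process (num_workers=0) or in a DataLoader worker
    (num_workers > 0).
    """
    return {"features": f"features({structure})", "device": descriptor_device}

def plan_batch_transfer(samples, train_device):
    """Stage 2: the collated batch moves to the training device if the
    preparation device differs from it."""
    prep_devices = {s["device"] for s in samples}
    return {
        "batch_device": train_device,
        "transferred": prep_devices != {train_device},
    }

# CPU-side preparation feeding GPU training: a transfer precedes the forward pass.
batch = [prepare_sample(s, "cpu") for s in ("H2O", "CO2")]
plan = plan_batch_transfer(batch, "cuda")
```

When ``descriptor.device`` and the training device match, the same sketch reports that no transfer is needed.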
+For HDF5-backed datasets, each worker reopens its own read-only file handle
+and keeps its own bounded ``in_memory_cache_size`` LRU cache. Trainer-owned
+runtime caches (``cache_features``, ``cache_neighbors``,
+``cache_force_triplets``) are also per process/worker, so
+``cache_warmup=True`` is skipped automatically when ``num_workers > 0``. See
+:doc:`torch_datasets` for persisted HDF5 cache precedence and for the
+distinction between build-time ``build_workers`` and training-time
+``num_workers``.
+
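The bounded, worker-local LRU behavior can be illustrated with a small sketch. ``BoundedLRUCache`` is a hypothetical stand-in, not the library's actual class; it only mirrors the documented entry-limit semantics (``None`` for unbounded, ``0`` to suppress storage, least-recently-used eviction otherwise).

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Sketch of a bounded LRU cache like the per-worker HDF5 read cache.
    Hypothetical helper, not the library's actual implementation."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # None = unbounded, 0 = no storage
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        if self.max_entries == 0:
            return                           # caching suppressed entirely
        self._data[key] = value
        self._data.move_to_end(key)
        if self.max_entries is not None and len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict least recently used

cache = BoundedLRUCache(max_entries=2)
cache.put("frame_0", "payload0")
cache.put("frame_1", "payload1")
cache.get("frame_0")                         # refresh frame_0
cache.put("frame_2", "payload2")             # evicts frame_1, the LRU entry

off = BoundedLRUCache(max_entries=0)
off.put("frame_0", "payload0")               # max_entries=0: nothing stored
```

Each DataLoader worker would hold its own such instance, which is why a main-process warmup cannot fill worker-local caches.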
+``memory_mode='mixed'`` is reserved for a future real mixed-memory mode and
+currently raises ``NotImplementedError`` if requested. Today, the supported
+execution modes remain ``'cpu'`` and ``'gpu'``.
+
 Performance Optimization Tips
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -245,6 +283,10 @@ Performance Optimization Tips
   and legacy non-graph paths
 * **cache_force_triplets**: Cache CSR graphs and triplets for the default sparse
   force-training path instead of rebuilding them on demand
+* **cache_*_max_entries**: Bound the trainer-owned runtime caches per split
+  and per process/worker instead of letting them grow without limit
+* **cache_warmup**: Optional single-process prefill of trainer-owned runtime
+  caches before epoch 0; skipped automatically when ``num_workers > 0``
 
 These runtime caches are distinct from the on-disk HDF5 persisted cache
 sections created with ``HDF5StructureDataset.build_database(...)``. For HDF5
@@ -339,10 +381,11 @@ Large Dataset (> 500 structures)
     method=Adam(mu=0.001, batchsize=64),  # Larger batches
     testpercent=10,
     force_weight=0.1,
-    device='cuda',  # Use GPU for speedup
+    device='cuda',  # Model/loss on GPU
     # Performance optimizations
     cache_features=True,  # Runtime in-memory feature cache
-    num_workers=8,  # Parallel data loading
+    cache_feature_max_entries=1024,
+    num_workers=8,  # Parallel CPU-side sample preparation
     prefetch_factor=4
 )
 
@@ -356,7 +399,8 @@ Energy-Only with Maximum Speed
     method=Adam(mu=0.001, batchsize=32),
     testpercent=10,
     force_weight=0.0,  # Energy-only
-    cache_features=True,  # Eager/runtime feature cache for this run
+    cache_features=True,  # Bounded runtime feature cache for this run
+    cache_warmup=True,  # Optional single-process prefill
     device='cuda'
 )
 
@@ -371,8 +415,8 @@ Force Training with Optimizations
     testpercent=10,
     force_weight=0.1,
     force_fraction=0.3,  # Use 30% of forces (3× faster)
-    cache_neighbors=True,  # Cache neighbor lists
-    num_workers=4,
+    cache_neighbors=True,  # Cache worker-local neighbor lists
+    num_workers=4,  # Parallel CPU-side sample preparation
     device='cuda'
 )
 
@@ -410,6 +454,13 @@ To resume training from a checkpoint, pass the checkpoint path to
 ``train(..., resume_from="checkpoints/checkpoint_epoch_0050.pt")``. The
 notebook above contains the maintained checkpoint workflow.
 
+When ``resume_from`` is provided, ``config.iterations`` means the number of
+additional epochs to run in that ``train()`` call. For example, resuming a
+checkpoint with ``iterations=10`` runs 10 more epochs after the saved
+checkpoint epoch, regardless of how many epochs were completed in the
+original run. This applies to numbered checkpoints and ``best_model.pt``
+alike.
+
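The resume arithmetic above can be sketched in one line; ``target_epoch`` is a hypothetical helper for illustration, not trainer API.

```python
def target_epoch(checkpoint_epoch, iterations):
    """Resume semantics described above: `iterations` counts additional
    epochs on top of the checkpoint, not a total epoch budget."""
    return checkpoint_epoch + iterations

# Resuming checkpoint_epoch_0050.pt with iterations=10 trains epochs 51-60.
final_epoch = target_epoch(50, 10)
```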
 The trainer will automatically:
 
 * Load model and optimizer state
@@ -512,26 +563,51 @@ Performance & Caching
   * For force training (``force_weight > 0``): Caches features for structures not
     selected for force supervision in current epoch (useful with ``force_fraction < 1.0``)
 
+**cache_feature_max_entries** : int or None (default: 1024)
+  Maximum number of trainer-owned energy-view feature entries to retain per
+  split and per process/worker when ``cache_features=True``. Use ``None`` for
+  an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_neighbors** : bool (default: False)
   Cache per-structure neighbor graphs (indices, displacement vectors) across
   epochs. Avoids repeated neighbor searches for fixed geometries on
   energy-view reuse and legacy non-graph paths. Supported force training
   does not require this option.
 
+**cache_neighbor_max_entries** : int or None (default: 512)
+  Maximum number of trainer-owned neighbor payload entries to retain per
+  split and per process/worker when ``cache_neighbors=True``. Use ``None`` for
+  an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_force_triplets** : bool (default: False)
   Cache CSR neighbor graphs and precompute angular triplet indices for the
   default sparse force-training path. Leaving this disabled still uses the
   sparse graph/triplet path, but rebuilds those graph payloads on demand.
 
+**cache_force_triplet_max_entries** : int or None (default: 256)
+  Maximum number of trainer-owned graph/triplet payload entries to retain per
+  split and per process/worker when ``cache_force_triplets=True``. Use
+  ``None`` for an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_persist_dir** : str (default: None)
   Directory for persisting graph/triplet caches to disk for reuse across runs.
 
 **cache_scope** : str (default: 'all')
   Which dataset splits to cache: ``'train'``, ``'val'``, or ``'all'``.
 
+**cache_warmup** : bool (default: False)
+  If True, pre-populate trainer-owned runtime caches before the first epoch
+  in single-process training. When all enabled caches have finite entry
+  limits, warmup stops once those limits are filled. Warmup is skipped
+  automatically when ``num_workers > 0`` because workers own their own cache
+  instances and the main-process warmup would not populate those worker-local
+  caches.
+
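The warmup gate just described reduces to a small predicate. ``should_run_cache_warmup`` is a hypothetical helper sketching the documented behavior, not the trainer's actual function.

```python
def should_run_cache_warmup(cache_warmup, num_workers):
    """Sketch of the warmup gate described above: warmup only runs when
    the main process owns the caches. Hypothetical helper."""
    if not cache_warmup:
        return False
    # DataLoader workers own their own cache instances, so a main-process
    # warmup would fill caches the workers never read. Skip in that case.
    return num_workers == 0
```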
 **num_workers** : int (default: 0)
-  Number of parallel DataLoader workers for on-the-fly featurization.
-  0 = main process only. Values >0 enable parallel data loading.
+  Number of parallel ``DataLoader`` workers for structure loading, HDF5
+  reads, and on-the-fly featurization. ``0`` keeps sample preparation in the
+  main process. Values ``>0`` parallelize worker-side sample preparation; they
+  do not parallelize model compute.
 
 **prefetch_factor** : int (default: 2)
   Number of batches to prefetch per worker when ``num_workers > 0``.
@@ -540,7 +616,9 @@ Performance & Caching
   Keep DataLoader workers alive between epochs for faster iteration.
   During training, this is disabled automatically when
   ``force_sampling='random'`` uses epoch-level resampling, because worker
-  copies would otherwise keep a stale force-supervision subset.
+  copies would otherwise keep a stale force-supervision subset. Trainer-owned
+  runtime caches and HDF5 ``in_memory_cache_size`` state are also
+  worker-local when ``num_workers > 0``.
 
 
 Data Filtering & Quality Control
@@ -593,7 +671,11 @@ Output & Diagnostics
   Save predicted energies for train/test sets to disk. The
   ``Path-of-input-file`` column preserves the original structure path or
   name when available; otherwise it uses a stable ``structure_XXXXXX``
-  identifier from the pre-split input order.
+  identifier from the pre-split input order. For HDF5-backed datasets,
+  the identifier is reconstructed from persisted metadata as
+  ``path#frame=N`` when the source path is available, ``name#frame=N``
+  when only the persisted name is available, and
+  ``structure_XXXXXX#frame=N`` as the final fallback.
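That fallback chain is a three-way precedence. A minimal sketch, assuming a hypothetical ``hdf5_identifier`` helper and illustrative input values (neither is library API):

```python
def hdf5_identifier(index, path=None, name=None, frame=0):
    """Sketch of the documented identifier precedence for HDF5-backed
    datasets: source path, then persisted name, then positional fallback.
    Hypothetical helper, not library API."""
    if path is not None:
        return f"{path}#frame={frame}"
    if name is not None:
        return f"{name}#frame={frame}"
    return f"structure_{index:06d}#frame={frame}"

hdf5_identifier(3, path="data/water.xyz", frame=2)  # 'data/water.xyz#frame=2'
hdf5_identifier(3, frame=2)                         # 'structure_000003#frame=2'
```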
 
 **save_forces** : bool (default: False)
   Save predicted forces for train/test sets to disk.
@@ -618,11 +700,17 @@ Advanced Options
 
 **memory_mode** : str (default: 'gpu')
   Memory management strategy: ``'cpu'``, ``'gpu'``, or ``'mixed'``.
-  Controls where data and intermediate results are stored.
+  ``'mixed'`` is reserved for a future real mixed-memory implementation and
+  currently raises ``NotImplementedError``. Use ``'cpu'`` or ``'gpu'`` with
+  ``descriptor.device`` and ``device`` set explicitly to control the current
+  execution path.
 
 **device** : str (default: None)
   PyTorch device: ``'cpu'``, ``'cuda'``, or ``'cuda:0'``. Auto-detected if
-  None.
+  None. This selects the model/training-loop device. ``descriptor.device``
+  separately controls where structures are featurized. When the two differ,
+  samples are prepared on ``descriptor.device`` and moved to ``device``
+  before the forward pass.
 
 
 Monitoring Training Progress