@@ -201,6 +201,44 @@ The longer file-backed dataset workflow is intentionally kept in the training
 notebook above so the ``torch_datasets`` page can stay focused on compact
 API-facing examples.
 
+Execution Model
+~~~~~~~~~~~~~~~~
+
+The current trainer has two distinct runtime stages:
+
+1. Sample preparation happens in the main process when ``num_workers=0``, or
+   in ``DataLoader`` workers when ``num_workers > 0``. Structures are
+   converted to tensors on ``descriptor.device``, and descriptor
+   featurization, neighbor reuse, graph/triplet construction, and lazy HDF5
+   cache reads happen there.
+2. The collated batch is then moved onto ``config.device`` inside the
+   training loop. Model forward passes, normalization, loss computation, and
+   optimizer steps run on that device.
+
+In practice, GPU training with ``num_workers > 0`` is best understood as
+worker-side data preparation feeding a training loop on the selected device.
+It is not currently a separate mixed CPU/GPU execution pipeline.
+
+If ``descriptor.device`` and ``config.device`` match, featurization and model
+compute happen on the same device. If they differ, samples are materialized on
+``descriptor.device`` and transferred before the forward pass. The compact
+examples on this page create the descriptor on CPU, so later
+``device='cuda'`` examples describe CPU-side sample preparation feeding GPU
+training unless you also move the descriptor to CUDA.
+
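The two-stage handoff described above can be sketched in plain Python. This is an illustrative sketch only: ``prepare_sample`` and ``plan_batch_transfer`` are hypothetical names standing in for trainer internals, and plain strings stand in for real torch devices.

```python
# Illustrative sketch of the two-stage execution model described above.
# `prepare_sample` and `plan_batch_transfer` are hypothetical names, not
# trainer API; strings stand in for torch devices.

def prepare_sample(structure, descriptor_device):
    """Stage 1: featurize the structure on the descriptor's device.

    Runs in the main process (num_workers=0) or in a DataLoader worker
    (num_workers > 0).
    """
    return {"features": f"features({structure})", "device": descriptor_device}

def plan_batch_transfer(samples, train_device):
    """Stage 2: the collated batch moves to the training device if the
    preparation device differs from it."""
    prep_devices = {s["device"] for s in samples}
    return {
        "batch_device": train_device,
        "transferred": prep_devices != {train_device},
    }

# CPU-side preparation feeding GPU training: a transfer precedes the forward pass.
batch = [prepare_sample(s, "cpu") for s in ("H2O", "CO2")]
plan = plan_batch_transfer(batch, "cuda")
```

When ``descriptor.device`` and the training device match, the same sketch reports that no transfer is needed.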
+For HDF5-backed datasets, each worker reopens its own read-only file handle
+and keeps its own bounded ``in_memory_cache_size`` LRU cache. Trainer-owned
+runtime caches (``cache_features``, ``cache_neighbors``,
+``cache_force_triplets``) are also per process/worker, so
+``cache_warmup=True`` is skipped automatically when ``num_workers > 0``. See
+:doc:`torch_datasets` for persisted HDF5 cache precedence and for the
+distinction between build-time ``build_workers`` and training-time
+``num_workers``.
+
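The bounded, worker-local LRU behavior can be illustrated with a small sketch. ``BoundedLRUCache`` is a hypothetical stand-in, not the library's actual class; it only mirrors the documented entry-limit semantics (``None`` for unbounded, ``0`` to suppress storage, least-recently-used eviction otherwise).

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Sketch of a bounded LRU cache like the per-worker HDF5 read cache.
    Hypothetical helper, not the library's actual implementation."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # None = unbounded, 0 = no storage
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        if self.max_entries == 0:
            return                           # caching suppressed entirely
        self._data[key] = value
        self._data.move_to_end(key)
        if self.max_entries is not None and len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict least recently used

cache = BoundedLRUCache(max_entries=2)
cache.put("frame_0", "payload0")
cache.put("frame_1", "payload1")
cache.get("frame_0")                         # refresh frame_0
cache.put("frame_2", "payload2")             # evicts frame_1, the LRU entry

off = BoundedLRUCache(max_entries=0)
off.put("frame_0", "payload0")               # max_entries=0: nothing stored
```

Each DataLoader worker would hold its own such instance, which is why a main-process warmup cannot fill worker-local caches.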
+``memory_mode='mixed'`` is reserved for a future real mixed-memory mode and
+currently raises ``NotImplementedError`` if requested. Today, the supported
+execution modes remain ``'cpu'`` and ``'gpu'``.
+
 Performance Optimization Tips
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -245,6 +283,10 @@ Performance Optimization Tips
   and legacy non-graph paths
 * **cache_force_triplets**: Cache CSR graphs and triplets for the default sparse
   force-training path instead of rebuilding them on demand
+* **cache_*_max_entries**: Bound the trainer-owned runtime caches per split
+  and per process/worker instead of letting them grow without limit
+* **cache_warmup**: Optional single-process prefill of trainer-owned runtime
+  caches before epoch 0; skipped automatically when ``num_workers > 0``
 
 These runtime caches are distinct from the on-disk HDF5 persisted cache
 sections created with ``HDF5StructureDataset.build_database(...)``. For HDF5
@@ -339,10 +381,11 @@ Large Dataset (> 500 structures)
     method=Adam(mu=0.001, batchsize=64),  # Larger batches
     testpercent=10,
     force_weight=0.1,
-    device='cuda',  # Use GPU for speedup
+    device='cuda',  # Model/loss on GPU
     # Performance optimizations
     cache_features=True,  # Runtime in-memory feature cache
-    num_workers=8,  # Parallel data loading
+    cache_feature_max_entries=1024,
+    num_workers=8,  # Parallel CPU-side sample preparation
     prefetch_factor=4
 )
 
@@ -356,7 +399,8 @@ Energy-Only with Maximum Speed
     method=Adam(mu=0.001, batchsize=32),
     testpercent=10,
     force_weight=0.0,  # Energy-only
-    cache_features=True,  # Eager/runtime feature cache for this run
+    cache_features=True,  # Bounded runtime feature cache for this run
+    cache_warmup=True,  # Optional single-process prefill
     device='cuda'
 )
 
@@ -371,8 +415,8 @@ Force Training with Optimizations
     testpercent=10,
     force_weight=0.1,
     force_fraction=0.3,  # Use 30% of forces (3× faster)
-    cache_neighbors=True,  # Cache neighbor lists
-    num_workers=4,
+    cache_neighbors=True,  # Cache worker-local neighbor lists
+    num_workers=4,  # Parallel CPU-side sample preparation
     device='cuda'
 )
 
@@ -410,6 +454,13 @@ To resume training from a checkpoint, pass the checkpoint path to
 ``train(..., resume_from="checkpoints/checkpoint_epoch_0050.pt")``. The
 notebook above contains the maintained checkpoint workflow.
 
+When ``resume_from`` is provided, ``config.iterations`` means the number of
+additional epochs to run in that ``train()`` call. For example, resuming a
+checkpoint with ``iterations=10`` runs 10 more epochs after the saved
+checkpoint epoch, regardless of how many epochs were completed in the
+original run. This applies to numbered checkpoints and ``best_model.pt``
+alike.
+
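The resume arithmetic above can be sketched in one line; ``target_epoch`` is a hypothetical helper for illustration, not trainer API.

```python
def target_epoch(checkpoint_epoch, iterations):
    """Resume semantics described above: `iterations` counts additional
    epochs on top of the checkpoint, not a total epoch budget."""
    return checkpoint_epoch + iterations

# Resuming checkpoint_epoch_0050.pt with iterations=10 trains epochs 51-60.
final_epoch = target_epoch(50, 10)
```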
 The trainer will automatically:
 
 * Load model and optimizer state
@@ -512,26 +563,51 @@ Performance & Caching
   * For force training (``force_weight > 0``): Caches features for structures not
     selected for force supervision in current epoch (useful with ``force_fraction < 1.0``)
 
+**cache_feature_max_entries** : int or None (default: 1024)
+  Maximum number of trainer-owned energy-view feature entries to retain per
+  split and per process/worker when ``cache_features=True``. Use ``None`` for
+  an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_neighbors** : bool (default: False)
   Cache per-structure neighbor graphs (indices, displacement vectors) across
   epochs. Avoids repeated neighbor searches for fixed geometries on
   energy-view reuse and legacy non-graph paths. Supported force training
   does not require this option.
 
+**cache_neighbor_max_entries** : int or None (default: 512)
+  Maximum number of trainer-owned neighbor payload entries to retain per
+  split and per process/worker when ``cache_neighbors=True``. Use ``None`` for
+  an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_force_triplets** : bool (default: False)
   Cache CSR neighbor graphs and precompute angular triplet indices for the
   default sparse force-training path. Leaving this disabled still uses the
   sparse graph/triplet path, but rebuilds those graph payloads on demand.
 
+**cache_force_triplet_max_entries** : int or None (default: 256)
+  Maximum number of trainer-owned graph/triplet payload entries to retain per
+  split and per process/worker when ``cache_force_triplets=True``. Use
+  ``None`` for an explicit unbounded cache or ``0`` to suppress storage.
+
 **cache_persist_dir** : str (default: None)
   Directory for persisting graph/triplet caches to disk for reuse across runs.
 
 **cache_scope** : str (default: 'all')
   Which dataset splits to cache: ``'train'``, ``'val'``, or ``'all'``.
 
+**cache_warmup** : bool (default: False)
+  If True, pre-populate trainer-owned runtime caches before the first epoch
+  in single-process training. When all enabled caches have finite entry
+  limits, warmup stops once those limits are filled. Warmup is skipped
+  automatically when ``num_workers > 0`` because workers own their own cache
+  instances and the main-process warmup would not populate those worker-local
+  caches.
+
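The warmup gate just described reduces to a small predicate. ``should_run_cache_warmup`` is a hypothetical helper sketching the documented behavior, not the trainer's actual function.

```python
def should_run_cache_warmup(cache_warmup, num_workers):
    """Sketch of the warmup gate described above: warmup only runs when
    the main process owns the caches. Hypothetical helper."""
    if not cache_warmup:
        return False
    # DataLoader workers own their own cache instances, so a main-process
    # warmup would fill caches the workers never read. Skip in that case.
    return num_workers == 0
```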
 **num_workers** : int (default: 0)
-  Number of parallel DataLoader workers for on-the-fly featurization.
-  0 = main process only. Values >0 enable parallel data loading.
+  Number of parallel ``DataLoader`` workers for structure loading, HDF5
+  reads, and on-the-fly featurization. ``0`` keeps sample preparation in the
+  main process. Values ``>0`` parallelize worker-side sample preparation; they
+  do not parallelize model compute.
 
 **prefetch_factor** : int (default: 2)
   Number of batches to prefetch per worker when ``num_workers > 0``.
@@ -540,7 +616,9 @@ Performance & Caching
   Keep DataLoader workers alive between epochs for faster iteration.
   During training, this is disabled automatically when
   ``force_sampling='random'`` uses epoch-level resampling, because worker
-  copies would otherwise keep a stale force-supervision subset.
+  copies would otherwise keep a stale force-supervision subset. Trainer-owned
+  runtime caches and HDF5 ``in_memory_cache_size`` state are also
+  worker-local when ``num_workers > 0``.
 
 
 Data Filtering & Quality Control
@@ -593,7 +671,11 @@ Output & Diagnostics
   Save predicted energies for train/test sets to disk. The
   ``Path-of-input-file`` column preserves the original structure path or
   name when available; otherwise it uses a stable ``structure_XXXXXX``
-  identifier from the pre-split input order.
+  identifier from the pre-split input order. For HDF5-backed datasets,
+  the identifier is reconstructed from persisted metadata as
+  ``path#frame=N`` when the source path is available, ``name#frame=N``
+  when only the persisted name is available, and
+  ``structure_XXXXXX#frame=N`` as the final fallback.
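That fallback chain is a three-way precedence. A minimal sketch, assuming a hypothetical ``hdf5_identifier`` helper and illustrative input values (neither is library API):

```python
def hdf5_identifier(index, path=None, name=None, frame=0):
    """Sketch of the documented identifier precedence for HDF5-backed
    datasets: source path, then persisted name, then positional fallback.
    Hypothetical helper, not library API."""
    if path is not None:
        return f"{path}#frame={frame}"
    if name is not None:
        return f"{name}#frame={frame}"
    return f"structure_{index:06d}#frame={frame}"

hdf5_identifier(3, path="data/water.xyz", frame=2)  # 'data/water.xyz#frame=2'
hdf5_identifier(3, frame=2)                         # 'structure_000003#frame=2'
```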
 
 **save_forces** : bool (default: False)
   Save predicted forces for train/test sets to disk.
@@ -618,11 +700,17 @@ Advanced Options
 
 **memory_mode** : str (default: 'gpu')
   Memory management strategy: ``'cpu'``, ``'gpu'``, or ``'mixed'``.
-  Controls where data and intermediate results are stored.
+  ``'mixed'`` is reserved for a future real mixed-memory implementation and
+  currently raises ``NotImplementedError``. Use ``'cpu'`` or ``'gpu'`` with
+  ``descriptor.device`` and ``device`` set explicitly to control the current
+  execution path.
 
 **device** : str (default: None)
   PyTorch device: ``'cpu'``, ``'cuda'``, or ``'cuda:0'``. Auto-detected if
-  None.
+  None. This selects the model/training-loop device. ``descriptor.device``
+  separately controls where structures are featurized. When the two differ,
+  samples are prepared on ``descriptor.device`` and moved to ``device``
+  before the forward pass.
 
 
 Monitoring Training Progress