Ecosystem: High-Throughput Data Generation for ML Surrogates #225

@jameslehoux

Labels: ecosystem, machine-learning, phase:4-hpc
Priority: Medium (Strategic for AI citations)

Description

Machine learning researchers are actively training surrogate models to predict tortuosity, but they lack large datasets of 3D microstructures paired with accurate, physics-based ground truth.

OpenImpala is well positioned to be the "ground truth generator" for the AI battery community. We should provide an out-of-the-box script or pipeline for high-throughput synthetic data generation.

Acceptance Criteria

  • Expand data/create_sample_structure.py to support parameterized generation of stochastic porous media (e.g., overlapping spheres, Gaussian random fields).
  • Create an MPI-enabled batch script (examples/generate_ml_dataset.py) that generates $N$ structures, solves for the tortuosity of each, and exports a paired dataset (e.g., in HDF5 or WebDataset format) containing each 3D array and its scalar labels.
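
As a starting point for discussion, here is a minimal sketch of the two pieces the criteria above describe: a parameterized overlapping-spheres generator and a round-robin split of sample indices across MPI ranks. Everything in it is an assumption for illustration: the function names `overlapping_spheres`, `porosity`, and `indices_for_rank` are hypothetical, the convention that `True` marks pore space is a guess, and the tortuosity solve and HDF5/WebDataset export are deliberately left out because this issue does not specify how OpenImpala would be called from Python.

```python
import numpy as np


def overlapping_spheres(shape, n_spheres, radius, seed):
    """Hypothetical generator: boolean 3D array, True = pore, False = solid.

    Solid spheres of a fixed radius are placed at uniformly random centers
    and allowed to overlap. The seed makes each sample reproducible.
    """
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 1.0, size=(n_spheres, 3)) * np.array(shape)
    # Open (broadcastable) coordinate grids avoid allocating three full volumes.
    zz, yy, xx = np.ogrid[: shape[0], : shape[1], : shape[2]]
    solid = np.zeros(shape, dtype=bool)
    for cz, cy, cx in centers:
        dist_sq = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        solid |= dist_sq <= radius ** 2
    return ~solid


def porosity(pore):
    """Volume fraction of pore voxels, a cheap scalar label to store per sample."""
    return float(pore.mean())


def indices_for_rank(n_total, rank, size):
    """Round-robin assignment of sample indices to one MPI rank.

    In a real batch script, rank and size would come from the MPI
    communicator (for example mpi4py's Get_rank() and Get_size()).
    """
    return range(rank, n_total, size)


if __name__ == "__main__":
    # Serial stand-in for one rank out of four; the solver call is omitted.
    for i in indices_for_rank(n_total=8, rank=0, size=4):
        volume = overlapping_spheres((32, 32, 32), n_spheres=20, radius=5.0, seed=i)
        print(i, volume.shape, round(porosity(volume), 3))
```

Using the sample index as the seed keeps every structure reproducible regardless of how many ranks the job runs on, which may make it easier to regenerate or extend a dataset later.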
