Error at 256GPU/64Nodes on Lassen #777

@MTCam

This is a weak-scaling case (periodic 3D box, p=3) to see how far we can push running real simulations on Lassen GPUs. It has 48k elements/rank and exercises most of the components used in our prediction. It runs successfully for all of the mesh sizes (nelem, ngpus) = [(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64), (6M, 128)], but fails at (nelem, ngpus) = (12M, 256).
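For reference, the weak-scaling ladder above can be tabulated with a short sketch. It assumes exactly 48,000 elements per rank (the description only gives the rounded "48k", so the totals are nominal) and 4 GPUs per node, inferred from the 256 GPUs on 64 nodes in the title. The function name `weak_scaling_table` is made up for illustration and is not part of mirgecom.

```python
# Nominal weak-scaling ladder: fixed work per rank, doubling rank count.
ELEMENTS_PER_RANK = 48_000  # assumption: "48k elements/rank", rounded
GPUS_PER_NODE = 4           # assumption: 256 GPUs / 64 nodes from the title


def weak_scaling_table(max_gpus=256):
    """Return (ngpus, nnodes, nelem) rows for ngpus = 1, 2, 4, ..., max_gpus."""
    rows = []
    ngpus = 1
    while ngpus <= max_gpus:
        nelem = ELEMENTS_PER_RANK * ngpus
        nnodes = -(-ngpus // GPUS_PER_NODE)  # ceiling division
        rows.append((ngpus, nnodes, nelem))
        ngpus *= 2
    return rows


if __name__ == "__main__":
    for ngpus, nnodes, nelem in weak_scaling_table():
        print(f"{ngpus:4d} GPUs  {nnodes:3d} nodes  {nelem:>11,d} elements")
```

The last row corresponds to the failing case: 256 GPUs on 64 nodes with roughly 12M elements in total.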

Note: To reproduce, request at least 5 hours on 64 Lassen batch nodes.

Instructions for reproducing on Lassen:

Branch: mirgecom@production
Driver: examples/combozzle-mpi.py
Setup:

  • Set the following parameters to control the mesh size, either hard-coded or in a YAML file (e.g. setup.yaml):
x_scale: 4
y_scale: 1
z_scale: 1
weak_scale: 20
  • Set the following environment variables:
export PYOPENCL_CTX="port:tesla"
export XDG_CACHE_HOME="/tmp/$USER/xdg-scratch"
export POCL_CACHE_DIR_ROOT="/tmp/$USER/pocl-cache"
export LOOPY_NO_CACHE=1
export CUDA_CACHE_DISABLE=1
  • Run the case:
jsrun -a 1 -g 1 -n 256 bash -c 'POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$ python -u -O -m mpi4py ./combozzle-mpi.py -i setup.yaml --lazy'
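The run command gives each launched process its own PoCL cache directory by appending the shell's process ID (`$$`) to `POCL_CACHE_DIR_ROOT`. A minimal Python sketch of the same path construction follows; the helper name `per_process_pocl_cache_dir` and the fallback root are assumptions made for illustration, not part of the driver.

```python
import os


def per_process_pocl_cache_dir(root=None, pid=None):
    """Build a cache path of the form <root>/<pid>, one directory per process.

    root defaults to $POCL_CACHE_DIR_ROOT; the fallback mirrors the export
    shown in the setup steps. pid defaults to the current process ID.
    """
    if root is None:
        user = os.environ.get("USER", "unknown")
        root = os.environ.get("POCL_CACHE_DIR_ROOT", f"/tmp/{user}/pocl-cache")
    if pid is None:
        pid = os.getpid()
    return os.path.join(root, str(pid))


if __name__ == "__main__":
    # Print the directory this process would use.
    print(per_process_pocl_cache_dir())
```

Keeping one directory per process avoids ranks on the same node sharing a single cache location.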

The case has an unexplained hang at array context creation time, and the hang grows with the number of ranks. For this case, it hangs for about 3 hours, then compiles for about 10 minutes before finally ending with this error:

  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 66, in euler_step
    return lsrk_step(EulerCoefs, state, t, dt, rhs)
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 53, in lsrk_step
    k = coefs.A[i]*k + dt*rhs(t + coefs.C[i]*dt, state)
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 365, in __call__
    compiled_func = self._dag_to_compiled_func(
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/grudge/grudge/array_context.py", line 286, in _dag_to_compiled_func
    ) = self._dag_to_transformed_pytato_prg(
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 441, in _dag_to_transformed_pytato_prg
    pytato_program = (pytato_program
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/pytato/pytato/target/loopy/__init__.py", line 142, in with_transformed_program
    return self.copy(program=f(self.program))
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1726, in transform_loopy_program
    raise err
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1723, in transform_loopy_program
    iel_to_idofs = _get_iel_to_idofs(knl)
  File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1066, in _get_iel_to_idofs
    raise NotImplementedError(f"Cannot fit loop nest '{insn.within_inames}'"

Gist with the entire batch output file:
gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
