This is a weak-scaling case (periodic 3D box, p=3) intended to see how far we can push real simulations on Lassen GPUs. It runs 48k elements per rank and exercises most of the components used in our prediction. After running successfully for all mesh sizes (nelem, ngpus) = [(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64), (6M, 128)], it fails at (nelem, ngpus) = (12M, 256).
*Note:* To reproduce, make sure to request at least 5 hours on 64 Lassen batch nodes.
Instructions for reproducing on Lassen:
Branch: mirgecom@production
Driver: examples/combozzle-mpi.py
Setup:
- Set the following parameters to control the mesh size, either hard-coded in the driver or via a yaml file (e.g. setup.yaml; a small sanity-check sketch follows the launch command below):
x_scale: 4
y_scale: 1
z_scale: 1
weak_scale: 20
- Set the following environment variables:
export PYOPENCL_CTX="port:tesla"
export XDG_CACHE_HOME="/tmp/$USER/xdg-scratch"
export POCL_CACHE_DIR_ROOT="/tmp/$USER/pocl-cache"
export LOOPY_NO_CACHE=1
export CUDA_CACHE_DISABLE=1
- Launch with:
jsrun -a 1 -g 1 -n 256 bash -c 'POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$ python -u -O -m mpi4py ./combozzle-mpi.py -i setup.yaml --lazy'
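Before submitting at full scale, it can help to confirm that the yaml inputs parse as intended. Below is a minimal sanity-check sketch, assuming setup.yaml contains exactly the flat keys listed above (how combozzle-mpi.py itself consumes the file is not shown here):

# Hypothetical pre-flight check: confirm the scaling parameters in setup.yaml
# parse as flat keys before submitting the full 256-rank job.
import yaml

with open("setup.yaml") as f:
    params = yaml.safe_load(f)

for key in ("x_scale", "y_scale", "z_scale", "weak_scale"):
    # A missing key prints as None, which usually indicates a typo in the file.
    print(f"{key} = {params.get(key)}")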
The case also has an unexplained hang at array-context creation time that grows with the number of ranks; see the instrumentation sketch after the gist link below. For this case it hangs for about 3 hours, then compiles for about 10 minutes before finally ending with this error:
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 66, in euler_step
return lsrk_step(EulerCoefs, state, t, dt, rhs)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 53, in lsrk_step
k = coefs.A[i]*k + dt*rhs(t + coefs.C[i]*dt, state)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 365, in __call__
compiled_func = self._dag_to_compiled_func(
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/grudge/grudge/array_context.py", line 286, in _dag_to_compiled_func
) = self._dag_to_transformed_pytato_prg(
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 441, in _dag_to_transformed_pytato_prg
pytato_program = (pytato_program
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/pytato/pytato/target/loopy/__init__.py", line 142, in with_transformed_program
return self.copy(program=f(self.program))
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1726, in transform_loopy_program
raise err
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1723, in transform_loopy_program
iel_to_idofs = _get_iel_to_idofs(knl)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1066, in _get_iel_to_idofs
raise NotImplementedError(f"Cannot fit loop nest '{insn.within_inames}'"
Gist with the entire batch output file:
gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
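To narrow down where the per-rank hang occurs during array-context creation, one option is to wrap the suspect calls in the driver with barrier-synchronized timers. A minimal sketch, assuming a timed() helper is patched into combozzle-mpi.py around the array-context setup (the helper and the call site shown are hypothetical, not existing driver code):

# Hypothetical instrumentation for locating the per-rank hang; not part of
# combozzle-mpi.py. Wrap suspect calls (e.g. array-context creation) with it.
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed(label, fn, *args, **kwargs):
    comm.Barrier()  # line the ranks up so the timing isolates this one call
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"rank {comm.rank}: {label} took {elapsed:.1f} s", flush=True)
    return result

# Example call site (hypothetical):
#   actx = timed("array context creation", actx_class, comm, queue, allocator=alloc)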