This is a weak-scaling case (periodic 3D box, p=3) intended to see how far we can push real simulations on Lassen GPUs. It runs 48k elements per rank and exercises most of the components used in our prediction. After running successfully for all mesh sizes (nelem, ngpus) = [(48k, 1), (96k, 2), (192k, 4), (384k, 8), (768k, 16), (1.5M, 32), (3M, 64), (6M, 128)], it fails at (nelem, ngpus) = (12M, 256).
*Note:* To reproduce, make sure to request at least 5 hours on 64 Lassen batch nodes.
Instructions for reproducing on Lassen:
Branch: mirgecom@production
Driver: examples/combozzle-mpi.py
Setup:
- Set the following parameters to control the mesh size, either hard-coded in the driver or via a yaml file (e.g. setup.yaml; a small sanity-check sketch follows the launch command below):
x_scale: 4
y_scale: 1
z_scale: 1
weak_scale: 20
- Set the following environment variables:
export PYOPENCL_CTX="port:tesla"
export XDG_CACHE_HOME="/tmp/$USER/xdg-scratch"
export POCL_CACHE_DIR_ROOT="/tmp/$USER/pocl-cache"
export LOOPY_NO_CACHE=1
export CUDA_CACHE_DISABLE=1
- Launch with:
jsrun -a 1 -g 1 -n 256 bash -c 'POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$ python -u -O -m mpi4py ./combozzle-mpi.py -i setup.yaml --lazy'
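Before submitting at full scale, it can help to confirm that the yaml inputs parse as intended. Below is a minimal sanity-check sketch, assuming setup.yaml contains exactly the flat keys listed above (how combozzle-mpi.py itself consumes the file is not shown here):

# Hypothetical pre-flight check: confirm the scaling parameters in setup.yaml
# parse as flat keys before submitting the full 256-rank job.
import yaml

with open("setup.yaml") as f:
    params = yaml.safe_load(f)

for key in ("x_scale", "y_scale", "z_scale", "weak_scale"):
    # A missing key prints as None, which usually indicates a typo in the file.
    print(f"{key} = {params.get(key)}")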
The case also has an unexplained hang at array-context creation time that grows with the number of ranks; see the instrumentation sketch after the gist link below. For this case it hangs for about 3 hours, then compiles for about 10 minutes before finally ending with this error:
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 66, in euler_step
return lsrk_step(EulerCoefs, state, t, dt, rhs)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/mirgecom/mirgecom/integrators/lsrk.py", line 53, in lsrk_step
k = coefs.A[i]*k + dt*rhs(t + coefs.C[i]*dt, state)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 365, in __call__
compiled_func = self._dag_to_compiled_func(
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/grudge/grudge/array_context.py", line 286, in _dag_to_compiled_func
) = self._dag_to_transformed_pytato_prg(
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/arraycontext/arraycontext/impl/pytato/compile.py", line 441, in _dag_to_transformed_pytato_prg
pytato_program = (pytato_program
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/pytato/pytato/target/loopy/__init__.py", line 142, in with_transformed_program
return self.copy(program=f(self.program))
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1726, in transform_loopy_program
raise err
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1723, in transform_loopy_program
iel_to_idofs = _get_iel_to_idofs(knl)
File "/p/gpfs1/mtcampbe/CEESD/Experimental/svm/production/meshmode/meshmode/array_context.py", line 1066, in _get_iel_to_idofs
raise NotImplementedError(f"Cannot fit loop nest '{insn.within_inames}'"
Gist with the entire batch output file:
gist.github.com/MTCam/f48991f584755b6a8530dd9345dc2de4
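To narrow down where the per-rank hang occurs during array-context creation, one option is to wrap the suspect calls in the driver with barrier-synchronized timers. A minimal sketch, assuming a timed() helper is patched into combozzle-mpi.py around the array-context setup (the helper and the call site shown are hypothetical, not existing driver code):

# Hypothetical instrumentation for locating the per-rank hang; not part of
# combozzle-mpi.py. Wrap suspect calls (e.g. array-context creation) with it.
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed(label, fn, *args, **kwargs):
    comm.Barrier()  # line the ranks up so the timing isolates this one call
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"rank {comm.rank}: {label} took {elapsed:.1f} s", flush=True)
    return result

# Example call site (hypothetical):
#   actx = timed("array context creation", actx_class, comm, queue, allocator=alloc)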