GPU offloading preconditioner #4953
base: main
Changes from all commits: f084463, 7a4ef2e, 34e7b21, f887021, 5c191d7, 4e8ca5b, f2af668, b59f1d9, 04d12d9, 336f1a7, 4375ed5, 9ea7a8a, 529ba79, 221193a, fc7215b
@@ -0,0 +1,4 @@

```yaml
self-hosted-runner:
  labels:
    # Custom label for GPU-enabled self-hosted runners
    - gpu
```
@@ -23,6 +23,10 @@ on:

```yaml
      description: Whether to test using macOS
      type: boolean
      default: false
    test_gpu:
      description: Whether to test using CUDA-enabled PETSc
      type: boolean
      default: false
    deploy_website:
      description: Whether to deploy the website
      type: boolean
```

@@ -54,6 +58,10 @@ on:

```yaml
      description: Whether to test using macOS
      type: boolean
      default: false
    test_gpu:
      description: Whether to test using CUDA-enabled PETSc
      type: boolean
      default: false
    deploy_website:
      description: Whether to deploy the website
      type: boolean
```
@@ -465,6 +473,141 @@ jobs:

```yaml
        run: |
          find . -delete

  test_gpu:
    name: Build and test Firedrake (Linux CUDA)
    runs-on: [self-hosted, Linux, gpu]
    container:
      image: ubuntu:latest
      options: --gpus all
    if: inputs.test_gpu
    env:
      OMPI_ALLOW_RUN_AS_ROOT: 1
      OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: 1
      OMP_NUM_THREADS: 1
      OPENBLAS_NUM_THREADS: 1
      FIREDRAKE_CI: 1
      PYOP2_SPMD_STRICT: 1
      # Disable fast math as it exposes compiler bugs
      PYOP2_CFLAGS: -fno-fast-math
      # NOTE: One should occasionally update test_durations.json by running
      # 'make test_durations' inside a 'firedrake:latest' Docker image.
      EXTRA_PYTEST_ARGS: --splitting-algorithm least_duration --timeout=600 --timeout-method=thread -o faulthandler_timeout=660 --durations-path=./firedrake-repo/tests/test_durations.json --durations=50
      PYTEST_MPI_MAX_NPROCS: 8
      PETSC_OPTIONS: -use_gpu_aware_mpi 0
      EXTRA_OPTIONS: -use_gpu_aware_mpi 0
    steps:
      - name: Confirm Nvidia GPUs are enabled
        # The presence of the nvidia-smi command indicates that the Nvidia
        # drivers have been successfully imported into the container; there is
        # no point continuing if nvidia-smi is not present.
        run: nvidia-smi

      - name: Fix HOME
        # For unknown reasons GitHub Actions overwrites HOME to /github/home,
        # which will break everything unless fixed
        # (https://github.com/actions/runner/issues/863).
        run: echo "HOME=/root" >> "$GITHUB_ENV"

      # Git is needed for actions/checkout and Python for firedrake-configure;
      # curl is needed for adding new deb repositories to Ubuntu.
      - name: Install system dependencies (1)
        run: |
          apt-get update
          apt-get -y install git python3 curl

      - name: Pre-run cleanup
        # Make sure the current directory is empty
        run: find . -delete

      - uses: actions/checkout@v5
        with:
          path: firedrake-repo
          ref: ${{ inputs.source_ref }}

      - name: Add Nvidia CUDA deb repositories
        run: |
          deburl=$(python3 ./firedrake-repo/scripts/firedrake-configure --show-extra-repo-pkg-url --gpu-arch cuda)
          debfile=$(basename "${deburl}")
          curl -fsSLO "${deburl}"
          dpkg -i "${debfile}"
          apt-get update
```

Contributor (on the deb repository step): This process definitely seems better.
```yaml
      - name: Install system dependencies (2)
        run: |
          apt-get -y install \
            $(python3 ./firedrake-repo/scripts/firedrake-configure --arch default --gpu-arch cuda --show-system-packages)
          apt-get -y install python3-venv
          : # Dependencies needed to run the test suite
          apt-get -y install fonts-dejavu graphviz graphviz-dev parallel poppler-utils

      - name: Install PETSc
        run: |
          if [ ${{ inputs.target_branch }} = 'release' ]; then
            git clone --depth 1 \
              --branch $(python3 ./firedrake-repo/scripts/firedrake-configure --gpu-arch cuda --show-petsc-version) \
              https://gitlab.com/petsc/petsc.git
          else
            git clone --depth 1 https://gitlab.com/petsc/petsc.git
          fi
          cd petsc
          python3 ../firedrake-repo/scripts/firedrake-configure \
            --arch default --gpu-arch cuda --show-petsc-configure-options | \
            xargs -L1 ./configure --with-make-np=4
          make
          make check
          {
            echo "PETSC_DIR=/__w/firedrake/firedrake/petsc"
            echo "PETSC_ARCH=arch-firedrake-default-cuda"
            echo "SLEPC_DIR=/__w/firedrake/firedrake/petsc/arch-firedrake-default-cuda"
          } >> "$GITHUB_ENV"

      - name: Install Firedrake
        id: install
        run: |
          export $(python3 ./firedrake-repo/scripts/firedrake-configure --arch default --gpu-arch cuda --show-env)
          python3 -m venv venv
          . venv/bin/activate

          : # Empty the pip cache to ensure that everything is compiled from scratch
          pip cache purge

          if [ ${{ inputs.target_branch }} = 'release' ]; then
            EXTRA_PIP_FLAGS=''
          else
            : # Install build dependencies
            pip install "$PETSC_DIR"/src/binding/petsc4py
            pip install -r ./firedrake-repo/requirements-build.txt

            : # We have to pass '--no-build-isolation' to use a custom petsc4py
            EXTRA_PIP_FLAGS='--no-build-isolation'
          fi

          pip install --verbose $EXTRA_PIP_FLAGS \
            --no-binary h5py \
            './firedrake-repo[check]'

          firedrake-clean
          pip list
```
```yaml
      - name: Run smoke tests
        run: |
          . venv/bin/activate
          firedrake-check
        timeout-minutes: 10
```
```yaml
      - name: Verify GPU usage
        run: |
          . venv/bin/activate
          export PETSC_OPTIONS="${PETSC_OPTIONS} -log_view_gpu_time -log_view"
          python3 ./firedrake-repo/tests/firedrake/offload/test_poisson_offloading_pc.py
```

Contributor: What is this doing?

Contributor (Author): This is a final sanity check on my end: I wanted to confirm that the GPU was being used as expected. The final column of the profiling output (https://github.com/firedrakeproject/firedrake/actions/runs/23132801814/job/67189609416) shows the fraction of work done on the GPU, and I wanted to make sure that the low-level vector and matrix operations were actually happening on the GPU, which they are. Besides, this change is in a 'DROP BEFORE MERGE' commit, so it will be going away.
```yaml
      - name: Post-run cleanup
        if: always()
        run: |
          find . -delete

  lint:
    name: Lint codebase
    runs-on: ubuntu-latest
```
@@ -0,0 +1,99 @@

```python
from firedrake.preconditioners.assembled import AssembledPC
from firedrake.petsc import PETSc
from firedrake.utils import device_matrix_type
from firedrake.logging import logger
from functools import cache
import warnings

import firedrake.dmhooks as dmhooks

__all__ = ("OffloadPC",)


@cache
def offload_mat_type(pc_comm_rank) -> str | None:
    mat_type = device_matrix_type()
    if mat_type is None:
        if pc_comm_rank == 0:
            warnings.warn(
                "This installation of Firedrake is not GPU-enabled, therefore OffloadPC "
                "will do nothing. For this preconditioner to function correctly PETSc "
                "will need to be rebuilt with some GPU capability (e.g. '--with-cuda=1')."
            )
        return None
    try:
        dev = PETSc.Device.create()
    except PETSc.Error:
        if pc_comm_rank == 0:
            logger.warning(
                "This installation of Firedrake is GPU-enabled, but no GPU device has "
                "been detected. OffloadPC will do nothing on this host."
            )
        return None
    if dev.getDeviceType() == "HOST":
        raise RuntimeError(
            "A GPU-enabled Firedrake build has been detected, and GPU hardware has "
            "been detected, but a GPU device could not be initialised."
        )
    dev.destroy()
    return mat_type
```

Contributor (on `warnings.warn`): For Python warnings there is a difference between …

Contributor (Author): Thanks, I wasn't aware of the distinction. Ironic given that if you set …
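The reviewer's point about `warnings.warn` versus `logger.warning` can be illustrated with the standard library alone (a minimal sketch, independent of Firedrake): `warnings.warn` targets the developer calling the code, is deduplicated per call site by the default filter, and can be silenced or escalated by the caller (e.g. `python -W error`), whereas `logger.warning` targets the operator and is emitted on every call via the logging configuration.

```python
import logging
import warnings

logging.basicConfig()
logger = logging.getLogger("demo")

def do_work():
    # warnings.warn: filterable by the caller; under the default filter an
    # identical warning from the same call site is reported only once.
    warnings.warn("config is deprecated", DeprecationWarning, stacklevel=2)
    # logger.warning: emitted on every call, routed by logging config.
    logger.warning("falling back to CPU")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning for the demo
    do_work()
    do_work()

print(len(caught))  # → 2
```

Under the default filter (no `simplefilter("always")`) only the first `DeprecationWarning` would be recorded, while the logger line would still fire twice.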
```python
class OffloadPC(AssembledPC):
    """Offload PC from CPU to GPU and back.

    Internally this makes a PETSc PC object that can be controlled by
    options using the extra options prefix ``offload_``.
    """

    _prefix = "offload_"

    def initialize(self, pc):
        # Check whether our PETSc installation is GPU-enabled
        super().initialize(pc)
        self.offload_mat_type = offload_mat_type(pc.comm.rank)
        if self.offload_mat_type is not None:
            with PETSc.Log.Event("Event: initialize offload"):
                A, P = pc.getOperators()

                # Convert the matrix to the device format (e.g. aijcusparse)
                with PETSc.Log.Event("Event: matrix offload"):
                    P_cu = P.convert(self.offload_mat_type)  # TODO

                # Transfer nullspaces
                P_cu.setNullSpace(P.getNullSpace())
                P_cu.setTransposeNullSpace(P.getTransposeNullSpace())
                P_cu.setNearNullSpace(P.getNearNullSpace())

                # Update the preconditioner with the GPU matrix
                self.pc.setOperators(A, P_cu)
```

Contributor (on the class name): Should the name be more precise and refer to "GPU"?

Contributor (on `super().initialize(pc)`): Should we be calling …

Contributor (Author): I'm working my way through understanding how PETSc manages GPU devices, and it does appear that when I drop … So the matrix does have to be assembled, but the previous PC might have performed assembly, so it wouldn't be necessary in that case. I think this might be an argument in favour of having device offload as an optional step in …

Contributor (on `setOperators`): Is this because …

Contributor (Author): I genuinely don't know. On the surface it makes sense, but are there cases in which …

Contributor: We certainly want PETSc options to give the user finer control over what to offload. If the original A and P point to the same Mat instance then it is reasonable to offload them both (as the same instance). This could be the default.
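As a usage sketch (hypothetical, not from this PR: the inner PC choice and the assumption that `OffloadPC` is re-exported at the top level like Firedrake's other preconditioners are mine), the class would be selected through the Python-preconditioner mechanism, with the inner GPU-side PC configured under the documented `offload_` options prefix:

```python
# Hypothetical solver-parameters dict selecting OffloadPC; the inner PC
# (here Jacobi) would run on the converted device matrix and is configured
# under the 'offload_' options prefix from the class docstring.
solver_parameters = {
    "ksp_type": "cg",
    "pc_type": "python",
    "pc_python_type": "firedrake.OffloadPC",
    # Options consumed by the inner (GPU-side) PC:
    "offload_pc_type": "jacobi",
}

# With Firedrake installed this would be passed to a solve, e.g.:
#   solve(a == L, u, solver_parameters=solver_parameters)
print(solver_parameters["pc_python_type"])  # → firedrake.OffloadPC
```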
```python
    # Convert vectors to CUDA, solve, then copy the solution back to the CPU
    def apply(self, pc, x, y):
        if self.offload_mat_type is None:
            self.pc.apply(x, y)
        else:
            with PETSc.Log.Event("Event: apply offload"):
                dm = pc.getDM()
                with dmhooks.add_hooks(dm, self, appctx=self._ctx_ref):
                    with PETSc.Log.Event("Event: vectors offload"):
                        y_cu = PETSc.Vec()
                        y_cu.createCUDAWithArrays(y)
                        x_cu = PETSc.Vec()
                        # Passing a Vec into another Vec doesn't work because
                        # the original is locked, so wrap its read-only array
                        x_cu.createCUDAWithArrays(x.array_r)
                    with PETSc.Log.Event("Event: solve"):
                        self.pc.apply(x_cu, y_cu)
                        # Access the data to synchronise the vector
                        tmp = y_cu.array_r  # noqa: F841
                    with PETSc.Log.Event("Event: vectors copy back"):
                        y.copy(y_cu)

    def applyTranspose(self, pc, X, Y):
        raise NotImplementedError

    def view(self, pc, viewer=None):
        super().view(pc, viewer)
        if hasattr(self, "pc"):
            viewer.printfASCII("PC to solve on GPU\n")
            self.pc.view(viewer)
```
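The `apply` path above is a wrap → solve-on-device → synchronise → copy-back pattern. As a rough CPU-only sketch of that data flow (NumPy arrays standing in for PETSc host/device vectors, and a trivial stand-in `solve`; none of these names are the real API):

```python
import numpy as np

def apply_offloaded(solve, x, y):
    # Wrap the (read-only) input and output in fresh "device" buffers,
    # mimicking createCUDAWithArrays on x.array_r and y.
    x_dev = np.array(x, copy=True)
    y_dev = np.array(y, copy=True)
    # The inner preconditioner runs entirely on the device copies.
    solve(x_dev, y_dev)
    # Copy the solution back into the host vector, mimicking the
    # "vectors copy back" event.
    y[:] = y_dev

x = np.ones(4)
y = np.zeros(4)
# Stand-in "preconditioner": y <- 2 * x
apply_offloaded(lambda a, b: np.copyto(b, 2.0 * a), x, y)
print(y)  # → [2. 2. 2. 2.]
```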
Contributor (on the `# noqa: F841`): What does this do?

Contributor (Author): That's to address this linting error: https://github.com/firedrakeproject/firedrake/actions/runs/22885137596/job/66433010084. I will add a comment to the file to that effect.