diff --git a/README.md b/README.md
index f499dee..da90a78 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,6 @@
+***Globus Data Transfer outage scheduled for February 4th 2026, 09:00-13:00 PST. [More details here](changelog.md)***
+
+
 Welcome to the SLAC Shared Scientific Data Facility (S3DF)
 at SLAC National Accelerator Laboratory. S3DF is a compute,
 storage, and network architecture designed to support
diff --git a/accounts.md b/accounts.md
index 972fee0..f7c2281 100644
--- a/accounts.md
+++ b/accounts.md
@@ -53,5 +53,18 @@ You can change your password via the SLAC Account self-service password update s
 
 If you have forgotten your password and need to reset it, please contact the [SLAC IT Service Desk](https://it.slac.stanford.edu/support).
 
+## Support for urgent account-related issues
+
+Staff and users needing assistance outside of business hours should call the main IT Service Desk line at (650) 926-4357. Callers are presented with a new menu for after-hours support:
+
+Option 1: For Account Lockouts and Password Resets
+
+Option 2: For all other issues
+
+When a caller selects Option 1, the system is designed to maximize the chance of reaching an on-call technician promptly. The caller can choose to wait on hold or go directly to voicemail. While they wait, the system cycles between the primary and secondary on-call staff members in 15-second intervals to avoid rolling over to personal voicemail. If the caller is unable to reach the scheduled agents, they will be asked to leave a detailed voicemail. Total hold time is two minutes if no scheduled agent is able to answer the call.
+
+The service level objective for these urgent off-hours account issues is to provide a response within 30 minutes during non-business hours (5 PM to midnight on weekdays and 8 AM to midnight on weekends).
+Between the hours of midnight and 8:00 AM, support will be provided on a best-effort basis.
+
diff --git a/assets/S3DF_container_lifecycle.png b/assets/S3DF_container_lifecycle.png
new file mode 100644
index 0000000..fbf7f2b
Binary files /dev/null and b/assets/S3DF_container_lifecycle.png differ
diff --git a/batch-compute.md b/batch-compute.md
index 21ada7f..017a1ce 100644
--- a/batch-compute.md
+++ b/batch-compute.md
@@ -79,6 +79,8 @@ See the table below to determine the specifications for each cluster (slurm part
 
 | Partition name | CPU model | Useable cores per node | Useable memory per node | GPU model | GPUs per node | Local scratch | Number of nodes |
 | --- | --- | --- | --- | --- | --- | --- | --- |
+| torino | AMD Turin 9555 | 120 | 720 GB | - | - | 6 TB | 52 |
+| hopper | AMD Turin 9575F | 224 (hyperthreaded) | 1344 GB | NVIDIA H200 | 4 | 21 TB | 3 |
 | roma | AMD Rome 7702 | 120 | 480 GB | - | - | 300 GB | 131 |
 | milano | AMD Milan 7713 | 120 | 480 GB | - | - | 6 TB | 270 |
 | ampere | AMD Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 |
diff --git a/changelog.md b/changelog.md
index 026676e..0426c6e 100644
--- a/changelog.md
+++ b/changelog.md
@@ -6,6 +6,10 @@
 
 ### Upcoming
 
+|When |Duration | What |
+| --- | --- | --- |
+| February 4th 2026 | 9:00-13:00 PST (planned) | Shut down the Globus node `sdfdtn004` for a network card upgrade. |
+
 ### Past
 
 |When |Duration | What |
diff --git a/conda.md b/conda.md
index ea5035a..0f10990 100644
--- a/conda.md
+++ b/conda.md
@@ -1,26 +1,46 @@
 # Conda
 
-> [!IMPORTANT]
-> It is not recommended to store your conda environments in your $HOME due to 1) quota limits, and 2) an inability to share conda environments across groups. We generally recommend that you install software into your facility's group space (e.g., `/sdf/group//sw` - please see [facility storage](getting-started.md#group)). The following instructions for deploying Miniconda assume that you have write permissions in those directories. If you require write access, please consult with your facility's computing czar.
The czar(s) for an S3DF facility can be found [here](https://coact.slac.stanford.edu/facilities).
+[Conda](https://docs.conda.io/projects/conda/en/latest/index.html) and other Python package management tools such as [UV](https://docs.astral.sh/uv/) and [Poetry](https://python-poetry.org/) allow for easier installation and deployment of Python applications with complex dependencies. The following guide covers common usage patterns for Conda on S3DF.
 
-## Option 1) Install Miniconda
+## Conda Deployment Methods
 
-Download the latest version of Miniconda from the [conda](https://docs.conda.io/en/latest/miniconda.html) website and follow the [Instructions](https://conda.io/projects/conda/en/latest/user-guide/install/linux.html#installing-on-linux). Change the installion `prefix` to point to an appropriate [facility directory](getting-started.md#group) replacing `` with the name of your facility as follows:
+### Facility installation of Miniconda
+
+An [S3DF facility](https://s3df.slac.stanford.edu/#/contact-us?id=poc) typically maintains software deployments relevant to its users under a path in that facility's group space, such as `/sdf/group//sw`. The computing czar for a facility, and those they grant permissions to, can install and manage software such as Conda under that facility's group space.
+
+Miniconda is a minimal distribution of Anaconda that contains Python, Conda, their dependencies, and a small set of other useful packages, giving it a much smaller footprint than a full Anaconda deployment. Maintainers can download [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and follow the [installation instructions](https://conda.io/projects/conda/en/latest/user-guide/install/linux.html#installing-on-linux).
+The installation path can be overridden to point to an appropriate path under the facility group space as follows (replace `` with the name of the appropriate S3DF facility):
 
 ```bash
-wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/Miniconda3-latest-Linux-x86_64.sh
-bash /tmp/Miniconda3-latest-Linux-x86_64.sh -p /sdf/group//sw/conda/
+$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/Miniconda3-latest-Linux-x86_64.sh
+$ bash /tmp/Miniconda3-latest-Linux-x86_64.sh -p /sdf/group//sw/conda/
 ```
 
-Add the newly installed conda binary to your `$PATH` environment variable:
+Users can add the newly installed Conda binary to their `$PATH` environment variable by running:
 
 ```bash
-export PATH=${PATH}:/sdf/group//sw/conda/bin
+$ export PATH=${PATH}:/sdf/group//sw/conda/bin
 ```
 
-Modify your local [~/.condarc](https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html) file to enable the creation of new envs and installation of packages to your facility's conda install directory:
+## Conda Environments
+
+The most common use case for Conda is to resolve and manage Python application dependencies. [Conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) allow Python applications to have their own installation environments, isolated from each other, with the goal of reducing conflicts between applications that require different versions of the same packages. Users can invoke the environment that holds the dependencies required for a specific Python application and run the application within that environment, in isolation from other applications.
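As a brief, hypothetical illustration of this isolation (the environment names and version pins below are examples chosen for this sketch, not part of any S3DF deployment), two environments can hold conflicting versions of the same package side by side:

```shell
# create two throwaway environments pinned to different numpy major versions
$ conda create -n app-legacy 'numpy<2' -y
$ conda create -n app-modern 'numpy>=2' -y

# each environment resolves imports against its own package set;
# conda run executes a command inside an environment without
# activating it in the current shell
$ conda run -n app-legacy python -c 'import numpy; print(numpy.__version__)'
$ conda run -n app-modern python -c 'import numpy; print(numpy.__version__)'
```

The two `python` invocations report different `numpy` versions, even though they run from the same login shell.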
+
+Deploying Conda environments requires the following:
+* Access to a `conda` binary executable
+* Filesystem permissions to write to the directory where the Conda environment(s) will be stored
+* Access (usually over the network) to any Python package repositories required by the application (e.g., [the Python Package Index (PyPI)](https://pypi.org/))
+* Sufficient disk space in the environment location to store the package artifacts
+
+### Creating a Conda environment
+
+Users can modify their individual Conda startup configuration file ([~/.condarc](https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html)) to set the default Python package channels, default packages, the install path for new Conda envs, and other settings. The following example configures Conda to:
+* use the `defaults`, `anaconda`, `conda-forge`, and `pytorch` channels to access Python packages
+* install new environments under the `envs` directory of the facility's Conda instance
+* store packages downloaded for environments under the `pkgs` directory of the facility's Conda instance
+
 ```bash
+$ cat << EOF > ~/.condarc
 channels:
 - defaults
 - anaconda
@@ -31,19 +51,20 @@ envs_dirs:
 pkgs_dirs:
 - /sdf/group//sw/conda/pkgs
 auto_activate_base: false
-```
-
-> [!TIP]
-> This Miniconda installation should be used for an entire facility with conda environment(s) installed for the various facility users. Conda installations and packages can be quite large, so they are not recommended to be placed in $HOME.
-### Create a conda environment
+EOF
+```
-Conda environments are a nice way of switching between different software versions/packages without multiple conda installs.
+Conda environments can be created declaratively using YAML files (see the [conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file)).
-There is no unique way to create a conda environment, we illustrate here how to do so from a .yaml file (see the conda [documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) for more details).
+The following YAML manifest defines a Conda environment called `mytest` with these packages and their dependencies pre-installed (save the manifest as `mytest-env.yaml`):
-In order to create an environment called `mytest` with `python=3.12` and the `numpy` and `pandas` packages installed, create `mytest-env.yaml`:
-```bash
+* `python=3.12`
+* `numpy`
+* `pandas`
+
+```yaml
+---
 name: mytest
 dependencies:
   - python=3.12
@@ -51,114 +72,175 @@ dependencies:
   - pandas
 ```
-Then run the following command: `conda env create -f mytest-env.yaml`.
-
-If successful, you should see `mytest` when listing your environments: `conda env list`.
-
-You can now activate your environment and use it: `conda activate mytest`. To double-check that you have the right packages, you can type `conda list` once in the environment and check that you see `numpy` and `pandas`.
+Create the Conda environment:
-Additionally, existing conda environments can be exported into yaml-formatted files with `conda env export > my-existing-env.yaml`.
-
-## Option 2) Create and/or pull a Conda container
-
-### Method 1 - Using Apptainer
-
-Apptainer can be used directly on S3DF to build a specific Conda environment within a container and run it. In addition to providing a desired Conda enviroment as exampled above in `mytest-env.yaml`, an apptainer definition file specifies the image-building procedure for installing Conda and the desired environment packages.
To start, create `mytest-app.def`:
-```apptainer
-Bootstrap: docker
-From: continuumio/miniconda3:latest
+```bash
+$ conda env create -f mytest-env.yaml
+```
-%files
-    mytest-env.yaml /mytest-env.yaml
+Once created, the `mytest` env should appear when listing Conda environments (`` is a placeholder):
-%post
-    /opt/conda/bin/conda env create -f /mytest-env.yaml
+```bash
+$ conda env list
+# conda environments:
+#
+base      /sdf/group//sw/conda
+mytest    /sdf/group//sw/conda/envs/mytest
+```
-%environment
-    source /opt/conda/etc/profile.d/conda.sh
-    conda activate mytest
+The Conda environment may be activated by running:
-%runscript
-    exec "$@"
+```bash
+$ conda activate mytest
 ```
-To build the image (.sif file) run the command:
+Once the environment has been activated, the installed package list can be seen by running:
-`apptainer build --fakeroot mytest-env-image.sif mytest-env.def`
+```bash
+(mytest) $ conda list
+# packages in environment at /sdf/group//sw/conda/envs/mytest:
+#
+# Name      Version      Build                 Channel
+[...]
+numpy       2.1.0        py312h58c1407_1       conda-forge
+[...]
+pandas      2.2.2        pypi_0                pypi
+[...]
+python      3.12.5       h2ad013b_0_cpython    conda-forge
+[...]
+```
+Note that the package list will show not only the pre-defined packages from the environment's YAML manifest, but also any dependencies installed along with them.
-This should create the Singularity image file: `mytest-env-image.sif` which contains the desired conda environment settings.
+Existing Conda environments can be exported into YAML manifests by running the following (change the name of the output YAML file as desired):
-To use this apptainer interactively, simply run:
+```bash
+$ conda env export > my-existing-env.yaml
+```
-`apptainer shell mytest-env-image.sif`
+> [!NOTE]
+> Conda environments should not be stored in the user $HOME directory due to quota limits (30GB per user) and the inability to share the environment with other users.
+> It is recommended to install Conda and other software into the appropriate facility group space (e.g., `/sdf/group//sw` - please see [facility storage](getting-started.md#group)). To obtain proper filesystem permissions, please consult with the appropriate facility computing czar. The list of czars for S3DF facilities can be found at: https://coact.slac.stanford.edu/facilities.
-which opens a terminal prompt within the image with the conda environment activated. Then, from within the Apptainer shell, running `python` opens a python terminal in the installed conda environment.
+## Containerizing Conda environments
-Alternatively, the apptainer can be used as an executable to run a desired python script (e.g. `my_script.py`) by using the command:
+Conda environments are designed to isolate a Python application and its dependencies. However, scientific applications frequently have a large number of heavyweight dependencies (e.g., `numpy` and `pandas`) that can consume a large amount of disk space and may need to be built against specific platforms and architectures for compatibility and performance. Additionally, re-deploying a Python application and its accompanying Conda environment to other systems (e.g., deploying the same application to SLAC and NERSC) can be time-consuming and lead to maintenance issues as the different deployments fall out of sync.
+
+[Containerization](https://aws.amazon.com/what-is/containerization/) allows for the creation of an image that contains a full application stack, all the way down to the operating system, which can be run on any system with a compatible container runtime such as `Docker`, `Podman`, or `Apptainer`. A Python application, together with its accompanying Conda environment, Python packages, and the Conda installation itself, can be built into a container image. This offers much more portability, especially when running applications in multiple high performance computing environments such as S3DF.
-`apptainer run mytest-env-image.sif python my_script.py`
+### Creating, Publishing, and Running Conda Container Images in S3DF
-where the arguments following the .sif image are to be run as code within the container.
+Container images can be built with a variety of tools and in different container image formats (e.g. via a `Dockerfile`); it is recommended to use a format that conforms to the [Open Container Initiative (OCI)](https://opencontainers.org/) standard for portability and compatibility with the available userspace container runtime on S3DF ([Apptainer](https://apptainer.org/)). Docker images are supported by most container runtimes, including Apptainer, allowing the same container images to be used at multiple compute facilities.
-> [!TIP]
-> Only the Singularity image file (.sif) needs to be provided to other S3DF users as everything is self-contained!
+Because Docker's container build utility (`docker build ...`) requires admin privileges on the host build system, users must create their Docker container images on a non-S3DF host where they have admin privileges (e.g. a work laptop with `sudo` privileges). Once built, the container image can be uploaded to an online container repository such as [GitHub Container Registry (ghcr.io)](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry) or [GitLab Container Registry](https://docs.gitlab.com/user/packages/container_registry/), then pulled onto S3DF interactive or batch nodes using the `apptainer` container runtime. The following diagram shows the development lifecycle for an application container used on S3DF:
+![S3DF container lifecycle](assets/S3DF_container_lifecycle.png)
 
 > [!NOTE]
-> Container images are immutable and must be rebuilt any time changes are needed in the conda environment.
+> The host system platform and architecture where a container image is built may differ from the platform and architecture where the container is run.
+> For example, container images can be built on a macOS or Windows host system, while S3DF batch nodes currently run RHEL8/Rocky Linux 8 (and will eventually be migrated to Rocky Linux 9/10 and beyond). Docker and other container build tools can be configured to target different platforms and architectures, so ensure that the built container image is compatible with the target platform and architecture of S3DF nodes. For more information, see: https://docs.docker.com/build/building/multi-platform/.
-### Method 2 - Using Docker
+The following example shows the workflow for creating a Conda environment in a Docker container image.
-Since Docker is not installed on S3DF, this method relies on building a docker container image on a *different* machine (e.g. a laptop), uploading that image to a repository such as [Docker Hub](https://hub.dockerhub.com), and pulling the remotely hosted image onto S3DF with Apptainer. This method is more complex than building and running a container image locally with Apptainer, but can be useful for users that already have docker container images built elsewhere.
+1. Create an example Conda environment using a YAML manifest (the Dockerfile below expects it to be saved as `test-env.yaml`):
-To begin, install docker on your local machine (not S3DF) by visiting the [Docker homepage](https://www.docker.com/get-started). Additionally, sign up for a free Docker Hub account if an online repository is needed.
-
-> [!TIP]
-> To build a docker container on a Windows system, consider installing WSL2 (Windows Subsystem for Linux 2) first and then have Docker Desktop link to your WSL2 distribution. Then use docker commands from within WSL2 to ensure container compatibility since S3DF runs on a Linux system.
+```yaml
+---
+name: test-env
+channels:
+  - defaults
+  - conda-forge
+dependencies:
+  - python=3.9
+  - bokeh=2.4.2
+  - conda-forge::numpy=1.21.*
+  - nodejs=16.13.*
+  - flask
+  - pip
+  - pip:
+    - Flask-Testing
+```
-> [!WARNING]
-> Docker Hub repositories are *PUBLIC* by default, so your container images will be visable by anyone on the internet. Be sure to make a private repo if needed, but doing so will require authentication when pushing/pulling images to/from Docker Hub.
+2. Create an entrypoint script that will be run whenever the image is invoked by a container runtime. Note the quoted heredoc delimiter (`'EOF'`), which prevents `${...}` variables and `"$@"` from being expanded when the file is written:
-Once an appropriate repo has been created (e.g. `docker.io//`), a docker image can be built with a provided Dockerfile:
+```bash
+$ cat << 'EOF' > entrypoint.sh
+#!/bin/bash --login
+# The --login flag ensures the bash profile is loaded, enabling Conda.
+# Enable strict mode.
+set -euo pipefail
+# Make the conda command available (conda is installed under /opt/conda in the miniconda3 base image).
+source /opt/conda/etc/profile.d/conda.sh
+# Temporarily disable strict mode and activate the environment:
+set +euo pipefail
+conda activate test-env
+# Re-enable strict mode:
+set -euo pipefail
+# Run whatever command was passed to the container:
+exec "$@"
+EOF
+$ chmod +x entrypoint.sh
+```
+
+3. Create a Dockerfile that copies the Conda environment manifest, creates the Conda environment, and copies the entrypoint script into the container image (again with a quoted heredoc delimiter, so that `$PATH` is not expanded when the file is written):
+
+```bash
+$ cat << 'EOF' > Dockerfile
 FROM continuumio/miniconda3:latest
 
 WORKDIR /app
 
-COPY mytest-env.yaml .
+COPY test-env.yaml .
 
-RUN conda env create -f /app/mytest-env.yaml && conda clean -afy
+RUN conda env create -f /app/test-env.yaml && conda clean -afy
 
 ENV PATH=/opt/conda/envs/test-env/bin:$PATH
 
-CMD ["python"]
-```
-
-> [!NOTE]
-> This file must be called `Dockerfile` and is used to build the container in the same directory as the `mytest-env.yaml` file.
+COPY entrypoint.sh /entrypoint.sh
-To build the container image, run the command:
+ENTRYPOINT ["/entrypoint.sh"]
+EOF
+```
-`docker build -t / .`
+4.
On a build host with appropriate privileges, use the Docker runtime to build the image (replace the `` and `` placeholders as appropriate). For further details, see [https://docs.docker.com/get-started/docker-concepts/building-images/build-tag-and-publish-an-image/](https://docs.docker.com/get-started/docker-concepts/building-images/build-tag-and-publish-an-image/):
-This will build the image within the Dockerfile directory and tag it to the name given by `/` with the Docker Hub username and repo name created earlier. Custom version tags can be appended as well for version control (e.g. `/:`), and if not specified, the default tag is `latest`.
+
+```bash
+# '.' builds and tags an image using the Dockerfile in the current working directory
+$ docker build -t / .
+```
-> [!TIP]
-> Once the docker container is built, it can be tested/debugged locally with `docker run -it /` which runs it in interactive mode.
+
+5. Publish the image (this may require authentication for private container repos; see [https://docs.docker.com/get-started/docker-concepts/building-images/build-tag-and-publish-an-image/#publishing-images](https://docs.docker.com/get-started/docker-concepts/building-images/build-tag-and-publish-an-image/#publishing-images)):
-Next, *push* (upload) the docker image to an online repository with the command:
+
+```bash
+$ docker push /
+```
-`docker push /`
+
+6. Pull the published image onto an S3DF [interactive](https://s3df.slac.stanford.edu/#/interactive-compute?id=interactive-pools) or [batch](https://s3df.slac.stanford.edu/#/interactive-compute?id=interactive-compute-session-using-slurm) node in a specified path with the Apptainer container runtime (see the [S3DF Apptainer Usage documentation](https://s3df.slac.stanford.edu/#/apptainer?id=apptainer)):
-This will upload the container image to Docker Hub (or other specified online repository).
+```bash
+$ apptainer pull /test_img.sif docker:///
+```
-Now on S3DF, *pull* (download) the docker container image with Apptainer by using the command:
+The Apptainer container image (`.sif`) can now be launched within an S3DF batch job or interactive batch session:
-`apptainer pull docker:///`
+* Submit an S3DF batch job that runs a command inside the container's Conda environment (see: [https://s3df.slac.stanford.edu/#/slurm?id=create-a-batch-script](https://s3df.slac.stanford.edu/#/slurm?id=create-a-batch-script)):
-This will create a Singularity image file (.sif) with the automatically generated name based on the repo and tag name: `_latest.sif`.
+```bash
+$ cat << 'EOF' > submit_job.bash
+#!/bin/bash
+
+#SBATCH --partition=ampere
+#SBATCH --job-name=test
+#SBATCH --output=output-%j.txt
+#SBATCH --error=output-%j.txt
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=12
+#SBATCH --mem-per-cpu=1g
+#SBATCH --time=0-00:10:00
+#SBATCH --gpus 1
+
+# run a command inside the container's Conda environment
+# (apptainer run invokes the image entrypoint, which activates the environment;
+# apptainer shell would not work here since batch jobs are non-interactive)
+apptainer run /path/to/test_img.sif python --version
+EOF
+$ sbatch submit_job.bash
+```
-Lastly, run the container image with Apptainer with the command:
+* Create an S3DF interactive batch session and invoke the Conda environment (see: [https://s3df.slac.stanford.edu/#/interactive-compute?id=interactive-compute-session-using-slurm](https://s3df.slac.stanford.edu/#/interactive-compute?id=interactive-compute-session-using-slurm)):
-`apptainer run _latest.sif`
+```bash
+$ srun --partition --account : -n 1 --time=01:00:00 --pty /bin/bash
+# once the interactive session has been scheduled, invoke a shell into the container image to load the Conda environment
+$ apptainer shell /path/to/test_img.sif
+```
diff --git a/data-and-storage.md b/data-and-storage.md
index 0391a3e..eb4690f 100644
--- a/data-and-storage.md
+++ b/data-and-storage.md
@@ -43,7 +43,7 @@ can be found at\
 `/sdf/{home, sw, group}/.snapshots//`
 GMT_time indicates the time the snapshot directory was created.
 Choose a time that corresponds to the file versions you want and simply copy back the files.
 
-- Files/objects under `/sdf/data` will be backed up or archived according to a data retention policy defined by the facility. Facilities will be responsible for covering the media costs and overhead required by their policy. Similar to the /sdf/home area, you can also check in /sdf/data/\/.snapshots to see if snapshots are enabled for self-service restores.
+- Files/objects under `/sdf/data` will be backed up or archived according to a data retention policy defined by the facility. Facilities will be responsible for covering the media costs and overhead required by their policy. Similar to the /sdf/home area (but with a slightly different path structure), you can also check in /sdf/data/\/.snapshots to see if snapshots are enabled for self-service restores.
 
 - The scratch spaces under `/sdf/scratch` and all directories named "nobackup" (located *anywhere* in any /sdf path) will not be backed up or archived. Please use as many "nobackup" subdirectory locations as required for any files that do not need backup. That can save significant tape and processing resources.
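The snapshot-based self-service restore described in the hunk above can be sketched as a short shell session. The snapshot timestamp and file path below are hypothetical placeholders, and snapshot availability depends on the facility's retention policy:

```shell
# list the available snapshot directories (directory names are GMT timestamps)
$ ls /sdf/home/.snapshots/

# pick the snapshot whose timestamp matches the file version you want,
# then simply copy the file back into place
$ cp /sdf/home/.snapshots/<GMT_time>/<path-to-your-file> ~/
```

No special restore tooling is needed; the snapshots are read-only directory trees, so an ordinary `cp` (or `rsync`) recovers the old version.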
diff --git a/interactive-compute.md b/interactive-compute.md
index 38c9574..01665fe 100644
--- a/interactive-compute.md
+++ b/interactive-compute.md
@@ -13,12 +13,12 @@ The currently available pools are shown in the table below (The facility can be
 |Pool name | Facility | Resources |
 | --- | --- | --- |
 |iana | For all S3DF users | 4 servers, 40 HT cores and 384 GB per server |
-|rubin-devl | Rubin | 4 servers, 128 cores and 512 GB per server |
-|psana | LCLS | 4 servers, 40 HT cores and 384 GB per server |
-|fermi-devl | Fermi | 1 server, 64 HT cores and 512 GB per server |
+|rubin-devl | Rubin | 11 servers, 128 cores and 512 GB per server |
+|psana | LCLS | 7 servers, 40 HT cores and 384 GB per server |
+|fermi-devl | Fermi | 2 servers, 64 HT cores and 512 GB per server |
 |faders | FADERS | 1 server, 128 HT cores and 512 GB per server |
 |ldmx | LDMX | 1 server, 128 HT cores and 512 GB per server |
-|ad | AD | 3 servers, 128 HT cores and 512 GB per server |
+|ad | AD | 2 servers, 128 HT cores and 512 GB per server |
 |epptheory | EPPTheory | 2 servers, 128 HT cores and 512 GB per server |
 |cdms | SuperCDMS | (points to iana) |
 |suncat | SUNCAT | (points to iana) |
@@ -67,4 +67,4 @@ Users are welcome to submit a github pull-request to have their Jupyter environm
 
 ### Other Custom Ondemand Applications
 
-If you wish to deploy your own custom Open Ondemand applications/services to the SLAC Ondemand Service, please [contact us](contact-us.md).
\ No newline at end of file
+If you wish to deploy your own custom Open Ondemand applications/services to the SLAC Ondemand Service, please [contact us](contact-us.md).
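For reference, reaching one of the interactive pools in the table above is typically a two-step login. The username below is a placeholder, and the `s3dflogin` bastion and `iana` pool hostnames follow the pattern documented elsewhere in the S3DF guide; check the interactive-compute page for the canonical names for your facility:

```shell
# first hop: the S3DF login (bastion) nodes
$ ssh <username>@s3dflogin.slac.stanford.edu

# second hop: an interactive pool appropriate for your facility,
# e.g. the shared pool available to all S3DF users
$ ssh iana
```

Pools marked "(points to iana)" in the table land on the shared `iana` servers rather than dedicated hardware.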