diff --git a/GettingStarted.md b/GettingStarted.md new file mode 100644 index 0000000..b4d12f7 --- /dev/null +++ b/GettingStarted.md @@ -0,0 +1,47 @@ +# Getting Started at S3DF + +This document will guide you through the basics of using S3DF's clusters, storage systems, and services. + +## Get a S3DF Account + +To utilize the S3DF facilities, you must first [acquire a S3DF account](accounts.md#account), and your user account should be associated with a S3DF allocation to run jobs + +## Connect to S3DF +There are three different ways to [access S3DF](accounts.md#connect) + +## Computing Resources +S3DF offers a variety of high-performance computing resources that are accessible. +Refer to the table below to find the specifications for each cluster + +| Partition name | CPU model | Useable cores per node | Useable memory per node | GPU model | GPUs per node | Local scratch | Number of nodes | +| --- | --- | --- | --- | --- | --- | --- | --- | +| [roma](systems.md#roma) | Rome 7702 | 120 | 480 GB | - | - | 300 GB | 129 | +| [milano](systems.md#milano)| Milan 7713 | 120 | 480 GB | - | - | 6 TB | 193 | +| [ampere](systems.md#ampere) | Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 | +| [turing](systems.md#turing) | Intel Xeon Gold 5118 | 40 (hyperthreaded) | 160 GB | NVIDIA GeForce 2080Ti | 10 | 300 GB | 27 | +| [ada](systems.md#ada) | AMD EPYC 9454 | 72 (hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 | + +## Storage Resources +To ensure long-term consistency, the [S3DF directory structure](storage.md) features immutable paths that are independent of the underlying file system organization and technology. + +## Software +- In addition, S3DF utilizes Lmod to manage software packages installed through alternative methods. Through Lmod, S3DF provides support for a select number of software packages that are widely utilized by the SLAC communities. + +- S3DF encourages experts outside of the SCS to leverage Lmod for providing, supporting, maintaining, and sharing the software tools they develop. + +## Running Jobs +There are three different ways of [run jobs](run.md) on S3DF +- [Interactive](interactive-compute.md): Commands that you issue are executed immediately. +- [Batch](batch-compute.md): Jobs are submitted to a queue and are executed as soon as resources become available. +- [Service](service-compute.md): Long-lived jobs that run in the background waiting for data to analyze. + +## Data Transfers +s3dfdtn.slac.stanford.edu is a load-balanced DNS name which points to a pool of dedicated data transfer nodes. It is open to everyone with an S3DF account. Common tools like scp/sftp/rsync are available for casual data transfers. For serious large volume data transfer, you may consider bbcp and globus. You can refer to [this](managedata.md) for detailed information on data transfers in S3DF + +# Overview of the S3DF facilities + +![Resource](assets/Resource.png) + +## Getting Help +There are many [resources](help.md) available to assist you in utilizing S3DF effectively. The S3DF support team is always here to help you with any questions or challenges you may encounter. + diff --git a/README.md b/README.md index 0cfd0f8..441be34 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,25 @@ -Welcome to the SLAC Shared Scientific Data Facility (S3DF). The S3DF -is a compute, storage and network architecture designed to support +# S3DF Documentation + +Welcome to the SLAC Shared Scientific Data Facility (S3DF). 
+ +- The S3DF is a compute, storage and network architecture designed to support massive scale analytics required by all SLAC experimental facilities and programs, including LCLS/LCLS-II, UED, cryo-EM, the accelerator, -and the Rubin observatory. The S3DF infrastructure is optimized for -data analytics and is characterized by large, massive throughput, high +and the Rubin observatory. +- The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. ## Quick Reference +- [Get Started](GettingStarted.md) - Information for new and existing S3DF users +- [Beginner's Guide](beginner-user.md) - Step by step guide for new users +- [Get Help](help.md) - How to get help +- [OnDemand Document](interactive-compute.md#ondemand) - Access S3DF through Open OnDemand via any (modern) browser +- [Jupyter](interactive-compute.md#jupyter) - Interactive Jupyter Notebooks at S3DF +- [Example Job Scripts](examplescripts.md) - example job scripts +- [Multi-Factor Authentication (MFA)](multifactor.md) - How to set up and use MFA for your S3DF account +- [Systems](systems.md) - Computing resources at S3DF + | Access | Address | | :--- | :--- | @@ -22,4 +34,4 @@ concurrency storage systems. | S3DF Dashboard & Monitoring | https://grafana.slac.stanford.edu| -![SRCF-II](assets/srcf-ii.png) +![Resource](assets/Resource.png) diff --git a/_sidebar.md b/_sidebar.md index effe5be..21f5dbf 100644 --- a/_sidebar.md +++ b/_sidebar.md @@ -1,14 +1,11 @@ * [Welcome](/) -* [Access](accounts-and-access.md) -* [Usage](getting-started.md) -* [Interactive Compute](interactive-compute.md) -* [Batch Compute](batch-compute.md) -* [Service Compute](service-compute.md) +* [Get Started](GettingStarted.md) +* [Beginner's Guide](beginner-user.md) +* [Accounts & Access](accounts.md) +* [Systems](systems.md) +* [Storage](storage.md) * [Software](software.md) -* [Storage](data-and-storage.md) -* [Transferring Data](data-transfer.md) -* [Tutorials](tutorials.md) -* [Business Model](business-model.md) -* [Reference](reference.md) -* [Status & Outages](changelog.md) -* [Contact Us](contact-us.md) +* [Run Jobs](run.md) +* [Manage Data](managedata.md) +* [Get Help](help.md) + diff --git a/accounts-and-access.md b/accounts-and-access.md index 7ccd4c4..7712fa6 100644 --- a/accounts-and-access.md +++ b/accounts-and-access.md @@ -40,38 +40,36 @@ If you have forgotten your password and need to reset it, [please contact the IT Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email. -## How to connect - -There are three mechanisms to access S3DF: - -1. **SSH**: You can connect using any SSH client, such as -[OpenSSH](www.openssh.com) or -[PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/), on the -standard TCP port 22, to connect to the S3DF load balanced bastion pool -`s3dflogin.slac.stanford.edu`. Note that these nodes do not have access to -storage (except for your home directory). From these bastion hosts, you -should hop onto an [Interactive -Node](interactive-compute.md#interactive-pools) to access S3DF batch compute -and storage. - -?> Windows users may see an error message about a "*Corrupted MAC on -input*" or "*message authentication code incorrect.*" -The workaround is to add "*-m hmac-sha2-512*" to the ssh command, i.e. -`ssh -m hmac-sha2-512 @s3dflogin.slac.stanford.edu` - -2. 
**NoMachine**: NoMachine provides a special remote desktop that is -specifically designed to improve, compared to ssh, the performance of -X11 graphics over slow connection speeds. Another important feature is -that it preserves the state of your desktop across multiple -sessions, including when your internet session unexpectedly gets dropped. The login pool for NoMachine is -`s3dfnx.slac.stanford.edu`. You can find more information about this -access mode in the [NoMachine reference](reference.md#nomachine). - -3. **OnDemand**: If you do not have a terminal handy or you want to -use applications like Jupyter, you can also launch a web-based -terminal using OnDemand:\ -[`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand).\ -You can find more information about using OnDemand in the [OnDemand +## How to connect :id=connect + +There are three primary methods to access S3DF: + +1. **SSH** (Secure Shell): + + - You can connect using any SSH client, such as [OpenSSH](www.openssh.com) or [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/), via standard TCP port 22 to reach the S3DF load-balanced bastion pool at s3dflogin.slac.stanford.edu + + ssh username@login-node-address + + - Please note that these bastion hosts do not have storage access except for your home directory. After connecting, you must hop onto an [Interactive +Node](interactive-compute.md#interactive-pools)to access S3DF batch compute resources and storage. + + ssh username@pool-node-address + + - For Windows Users: If you encounter an error message regarding a “Corrupted MAC on input” or “message authentication code incorrect,” you can resolve this by adding “-m hmac-sha2-512” to your SSH command. For example: + + ssh -m hmac-sha2-512 @s3dflogin.slac.stanford.edu + +2. **NoMachine**: + + - NoMachine offers a specialized remote desktop solution that enhances X11 graphics performance over slow connections compared to SSH. + - An added benefit is that it maintains your desktop state across sessions, even if your internet connection is dropped unexpectedly. + - Use the login pool for NoMachine at s3dfnx.slac.stanford.edu. Additional details about this access method can be found in the NoMachine reference documentation [NoMachine reference](reference.md#nomachine) + +3. **OnDemand**: + + - If you prefer not to use a terminal or want to run applications such as Jupyter, you can access a web-based terminal via OnDemand [`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand). + - For further information on using OnDemand, please refer to the OnDemand reference documentation [OnDemand reference](interactive-compute.md#ondemand). + ![S3DF users access](assets/S3DF_users_access.png) diff --git a/accounts.md b/accounts.md new file mode 100644 index 0000000..a9d81bc --- /dev/null +++ b/accounts.md @@ -0,0 +1,50 @@ +# Accounts and Access + +## How to get an account :id=account + +### Eligibility for S3DF Accounts +SLAC employees, affiliated researchers, and experimental facility users are eligible for an S3DF account. +?> Please note that S3DF authentication requires a SLAC UNIX account. 
+ +### Steps to Acquire a S3DF Account + +#### Step 1: Obtain a SLAC UNIX Account +If you do not already have a SLAC UNIX account, follow these steps to [get a SLAC UNIX account](slac-unix-account.md) + +#### Step 2: Get A S3DF Account +After you get a SLAC UNIX account, [register Your SLAC UNIX Account in S3DF](slac-unix-account.md#register) + + +## How to connect :id=connect + +There are three primary methods to access S3DF: + +1. **SSH** (Secure Shell): + + - You can connect using any SSH client, such as [OpenSSH](www.openssh.com) or [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/), via standard TCP port 22 to reach the S3DF load-balanced bastion pool at s3dflogin.slac.stanford.edu + + ssh username@s3dflogin.slac.stanford.edu + + - Please note that these bastion hosts do not have storage access except for your home directory. After connecting, you must hop onto an [Interactive +Node](interactive-compute.md#interactive-pools)to access S3DF batch compute resources and storage. + + ssh pool-node-address + + - For Windows Users: If you encounter an error message regarding a “Corrupted MAC on input” or “message authentication code incorrect,” you can resolve this by adding “-m hmac-sha2-512” to your SSH command. For example: + + ssh -m hmac-sha2-512 @s3dflogin.slac.stanford.edu + +2. **NoMachine**: + + - NoMachine offers a specialized remote desktop solution that enhances X11 graphics performance over slow connections compared to SSH. + - An added benefit is that it maintains your desktop state across sessions, even if your internet connection is dropped unexpectedly. + - Use the login pool for NoMachine at s3dfnx.slac.stanford.edu. Additional details about this access method can be found in the NoMachine reference documentation [NoMachine reference](reference.md#nomachine) + +3. **OnDemand**: + + - If you prefer not to use a terminal or want to run applications such as Jupyter, you can access a web-based terminal via OnDemand [`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand). + - For further information on using OnDemand, please refer to the OnDemand reference documentation [OnDemand +reference](interactive-compute.md#ondemand). + + +![S3DF users access](assets/S3DF_users_access.png) diff --git a/assets/Resource.png b/assets/Resource.png new file mode 100644 index 0000000..4da8c01 Binary files /dev/null and b/assets/Resource.png differ diff --git a/assets/ada.png b/assets/ada.png new file mode 100644 index 0000000..311f8f3 Binary files /dev/null and b/assets/ada.png differ diff --git a/assets/ampere.png b/assets/ampere.png new file mode 100644 index 0000000..de63409 Binary files /dev/null and b/assets/ampere.png differ diff --git a/assets/milano.png b/assets/milano.png new file mode 100644 index 0000000..e83c970 Binary files /dev/null and b/assets/milano.png differ diff --git a/assets/roma.png b/assets/roma.png new file mode 100644 index 0000000..4252cb8 Binary files /dev/null and b/assets/roma.png differ diff --git a/assets/turing.png b/assets/turing.png new file mode 100644 index 0000000..5e8de8b Binary files /dev/null and b/assets/turing.png differ diff --git a/beginner-guide.md b/beginner-guide.md new file mode 100644 index 0000000..fa38778 --- /dev/null +++ b/beginner-guide.md @@ -0,0 +1,61 @@ +# A Beginner's Guide to using S3DF + +Welcome to S3DF! This guide provides a clear, step-by-step workflow for all users, particularly those with limited computing experience. 
In this document, we will walk you through how to: + +- Log in to the S3DF system +- Navigate directories and storage spaces +- Access supported applications +- Prepare and submit a job script + +These items illustrate a typical workflow for many S3DF users, particularly those utilizing our systems for extensive calculations. These calculations may encompass simulations of physical phenomena, data pre-processing or post-processing, and various forms of data generation or analysis. + +Before we dive into the details, please remember that you can always reach out for [assistance](contact-us.md) + +## Connect to S3DF: there are three primary methods to [access](accounts-and-access.md#connect) S3DF + - **SSH** (Secure Shell): + - You can connect Login Node using any SSH client + + ssh username@login-node-address + + - After successfully connecting to the Login Node, establish a second connection to a [Pool Node] (interactive-compute.md#interactive-pools) using SSH to access S3DF batch compute resources and storage. + + ssh username@pool-node-address + - **NoMachine**: [NoMachine reference](reference.md#nomachine) offers a specialized remote desktop solution that enhances X11 graphics performance over slow connections compared to SSH. + - **OnDemand**: you can access a web-based terminal via OnDemand [`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand). For further information, please refer to [OnDemand +reference](interactive-compute.md#ondemand). + +## Storage System + +- You can run your desired software interactively. For instance, if you need to use HFSS, launch it from the pool node. +- Alternatively, if you're configuring input files for other software, such as ACE3P, proceed to the next step. + +## Compute Nodes + +- Prepare and configure the necessary input files for the software you intend to use. Ensure all files are correctly set up for your simulations. + +## Prepare and Submit Slurm Job Scripts + +- Use the sbatch command to submit your jobs to a compute node for execution. +- Example command: + + sbatch your-job-script.sbatch + +## Accessing Supported Applications Check Status of Running Jobs (Optional) + +- To monitor the status of your submitted jobs, use the following command: + + squeue -u username + +## Get Help + + - Once your jobs have completed, you can view the data output directly on the pool node to ensure results are as expected. + +## 8. Transfer Data (if necessary) + +- If you need to transfer data, connect to a data transfer node to facilitate the movement of your files. +- Use appropriate file transfer commands (e.g., scp, rsync) to move your data to the desired location. + + +By following this workflow, you can effectively utilize the S3DF system for your computational needs. +Ensure you have all necessary software and dependencies installed before starting, +and refer to additional documentation for specific software setup if needed. diff --git a/beginner-user.md b/beginner-user.md new file mode 100644 index 0000000..9d576e8 --- /dev/null +++ b/beginner-user.md @@ -0,0 +1,115 @@ +# Beginner's Guide + +Welcome to S3DF! This guide provides a clear, step-by-step workflow for all users, particularly those with limited computing experience. In this document, we will walk you through how to: + +- Log in to the S3DF system +- Navigate directories and storage spaces +- Access supported applications +- Prepare and submit a job script +Follow these instructions to efficiently connect to the S3DF environment and run your desired software. 
Let's get started! + + +## 1. Connect to a Login Node + +- Use SSH or NoMachine to connect to a login node. This is your initial access point to the system. +- Example command for SSH: + + ssh username@login-node-address + +## 2. Connect to a Pool Node + +- After successfully connecting to the login node, establish a second connection to a pool node using SSH. +- Example command: + + ssh username@pool-node-address + +## 3. Setup Running Environment + +- S3DF uses the Lmod Module system to administrate common software packages +- There are default modules that are loaded into your environment upon logging in +- S3DF encourages experts from non-SCS to use Lmod to provide, support, maintain and share software tools they build. +- +## 4. Slurm Job Script + +- [Slurm](refernece.md#slurm-faq) is a batch scheduler that enables users to submit compute jobs of varying scope to our compute clusters. +- It will queue up jobs such that the compute resources available in S3DF are fairly and efficiently shared and distributed for all users. +- Prepare a slurm job script + +## 5. Submit Jobs to a Compute Node + +- Use the sbatch command to submit your jobs to a compute node for execution. +- Example command: + + sbatch your-job-script.sbatch + +## 6. Check Status of Running Jobs (Optional) + +- To monitor the status of your submitted jobs, use the following command: + + squeue -u username + +## 7. View Data Output + + - Once your jobs have completed, you can view the data output directly on the pool node to ensure results are as expected. + +## 8. Transfer Data (if necessary) + +- If you need to transfer data, connect to a data transfer node to facilitate the movement of your files. +- Use appropriate file transfer commands (e.g., scp, rsync) to move your data to the desired location. + + +By following this workflow, you can effectively utilize the S3DF system for your computational needs. +Ensure you have all necessary software and dependencies installed before starting, +and refer to additional documentation for specific software setup if needed. + +# Examples + +## Logging In Through SSH + +This example provides a clear, step-by-step workflow for running software, ACE3P (Advanced Computational Electromagnetics 3D Parallel), on S3DF throgh SSH. + +- 1. Connect to a Login Node +To start, connect to the login node using the following command: + + ssh username@s3dflogin.slac.stanford.edu + +- 2. Connect to a Pool Node +After successfully connecting to the login node, establish a second connection to a pool node using SSH. For example: + + ssh iana + +- 3. Set Up the Running Environment +To set up the running environment, create a bash file containing all necessary commands, and then execute the bash file. + +- 4. Configure an SLURM Job Script +Here is an example SLURM job script named run.sbatch: + + + #!/bin/bash + #SBATCH --partition=milano + #SBATCH --account=rfar + #SBATCH --job-name=test + #SBATCH --output=output-%j.txt + #SBATCH --error=error-%j.txt + #SBATCH --nodes=1 + #SBATCH --ntasks-per-node=16 + #SBATCH --time=0-00:10:00 + mpirun /sdf/group/rfar/ace3p/bin/omega3p pillbox.omega3p + + + - 5. Submit Jobs to a Compute Node +Use the sbatch command to submit your job to a compute node for execution: + + sbatch run.sbatch + + - 6. Check the Status of Running Jobs (Optional) +To monitor the status of your submitted jobs, run the following command: + + squeue -u username + +- 7. 
View Data Output +Once your jobs have completed, you can view the data output directly on the pool node to verify that the results are as expected. + +- 8. Transfer Data (If Necessary) +If you need to transfer data, connect to a data transfer node to facilitate the movement of your files. Use appropriate file transfer commands (e.g., scp, rsync) to move your data to the desired location. + diff --git a/gettingstarted/clusters-and-repos.md b/gettingstarted/clusters-and-repos.md new file mode 100644 index 0000000..63e43dc --- /dev/null +++ b/gettingstarted/clusters-and-repos.md @@ -0,0 +1,49 @@ +# S3DF Compute Clusters Overview +The S3DF environment consists of several compute clusters designed to support a variety of computational needs. Below is a detailed breakdown of the different types of nodes and their specific characteristics. + +## Node Types +- Login/OnDemand Nodes + + - Purpose: Access to other resources within the S3DF environment. + +- Data Transfer Nodes + + - Purpose: Facilitates the downloading and uploading of files. + +- Interactive Pool Nodes + + - Purpose: Used for compiling code, submitting jobs, and executing tasks interactively. + +- Compute/Batch Nodes + + - Purpose: Dedicated to running High-Performance Computing (HPC) jobs utilizing either CPUs or GPUs. + + ![Node types](nodetype.png) + +## Compute Node Clusters +The compute nodes are partitioned into three distinct clusters: + +### 1. Milano Cluster +- Number of Nodes: 120 +- Node Type: Dual-CPU Node +- Memory: 512 GB +- CPUs: + - 2x AMD Milan 7713 (64 cores each) + +### 2. Roma Cluster + - Number of Nodes: 39 + - Node Type: Dual-CPU Node + - Memory: 512 GB + - CPUs: + - 2x AMD Rome 7702 (64 cores each) + +### 3. Ampere Cluster + - Number of Nodes: 23 + - Node Type: CPU/GPU Hybrid Node + - Memory: 1024 GB + - CPUs: + - AMD Rome 7542 (64 cores) + - GPUs: + - 4x Nvidia Tesla A100 + +This structure of clusters and node types ensures that S3DF can meet a wide range of computational demands efficiently. Please refer to additional documentation for specific usage guidelines and best practices for optimizing performance in your computational tasks. diff --git a/gettingstarted/index.md b/gettingstarted/index.md new file mode 100644 index 0000000..682c276 --- /dev/null +++ b/gettingstarted/index.md @@ -0,0 +1,14 @@ +# A Beginner’s Guide to Using S3DF + +Welcome to the SLAC Shared Scientific Data Facility (S3DF)! This guide is designed for all users—especially those new to high-performance computing. Whether you're just getting started or need a quick refresher, this guide will walk you through the essentials: + + +## Table of Contents +- [Logging on to S3DF](.//logging-on-to-s3df.md) +- [Clusters & Repos](./gettingstarted/clusters-and-repos.md) +- [Preparing and Submitting Slurm Job Scripts](/gettingstarted/preparing-and-submitting-slurm-job-scripts.md) +- [Examples](../examples/) + +Let’s dive in and make your first S3DF experience smooth and productive! + +For further details, refer to [S3DF Documentation](https://s3df.slac.stanford.edu/#/documentation). diff --git a/gettingstarted/logging-on-to-s3df.md b/gettingstarted/logging-on-to-s3df.md new file mode 100644 index 0000000..d379149 --- /dev/null +++ b/gettingstarted/logging-on-to-s3df.md @@ -0,0 +1,72 @@ + +# 🔑 How to Access S3DF + +S3DF supports three main access methods depending on your needs: terminal (SSH), remote desktop (NoMachine), and browser-based access (OnDemand). Below is a breakdown of each option: + + +## 1. 
🖥️ SSH (Terminal Access) + +If you're comfortable using a terminal, SSH is the most direct way to access S3DF. + +- Use any SSH client such as: + + - macOS/Linux: Built-in terminal with ssh + + - Windows: PuTTY or Windows Terminal with OpenSSH + +- Connect to the S3DF login pool using this command: + + ssh your_username@s3dflogin.slac.stanford.edu + +- These login nodes are bastion hosts and only give access to your home directory. + +- To use storage or run compute jobs, you’ll need to SSH again from the login node to an interactive compute node. + +### ⚠️ Windows Users: +If you see an error like: + + Corrupted MAC on input or message authentication code incorrect + +try adding this flag to your SSH command: + + ssh -m hmac-sha2-512 your_username@s3dflogin.slac.stanford.edu + +## 2. 🖼️ NoMachine (Remote Desktop Access) + +NoMachine offers a graphical desktop environment that works well even on slower internet connections. It’s especially useful for applications that require graphical interfaces (e.g., X11-based tools). + + - Benefits: + + - Better performance for remote graphics + + - Preserves session state if your connection drops + + - Connect to: + + s3dfnx.slac.stanford.edu + + - Download and install the NoMachine client for your system. + + - For detailed instructions, refer to the NoMachine access guide (). + +## 3. 🌐 OnDemand (Web Portal Access) + +OnDemand provides a browser-based interface for users who prefer not to use the terminal. + +- Access it here: + + 👉 https://s3df.slac.stanford.edu/ondemand + +- Features available after login: + + - Launch a web-based terminal + + - Start Jupyter notebooks + + - Access remote desktops + + - Manage SLURM jobs and file browsing + +### 💡 Ideal for beginners or anyone needing quick access without configuring SSH or desktop clients. + + ![Login Screenshot](access.png) diff --git a/gettingstarted/preparing-and-submitting-slurm-job-scripts.md b/gettingstarted/preparing-and-submitting-slurm-job-scripts.md new file mode 100644 index 0000000..6dca0e8 --- /dev/null +++ b/gettingstarted/preparing-and-submitting-slurm-job-scripts.md @@ -0,0 +1,94 @@ +# Running Jobs + +S3DF provides two main ways to run jobs: +- Interactive Jobs +- Batch Jobs. +This guide will help you understand how to use both methods effectively. + +## 1. Interactive Jobs +Interactive jobs allow you to access compute resources for tasks such as building, debugging, running analyses, or submitting jobs to the batch system. + +### Steps to Run an Interactive Job + +#### 1. Log in to the Bastion Host: + +Use an SSH terminal session (or NoMachine) to log into the bastion host: + + ssh s3dflogin.slac.stanford.edu + +#### 2. Connect to an Interactive Pool: +After logging into the bastion host, SSH into an interactive pool: + + ssh + +#### 3. Run Your Commands: +You can execute commands directly in the interactive session. For example: + + ./your_program + +### Additional Notes: +- Ensure that you have sufficient resources for your tasks. +- When finished, simply type exit to end your interactive session. + +## 2. Batch Jobs +Batch jobs in S3DF are managed through Slurm, a batch scheduler that allows users to submit compute jobs across clusters. This system ensures fair and efficient sharing of resources among all users. + +### Why Use Batch Jobs? + - Enhanced Resources: Batch jobs can utilize significantly more CPU, GPU, and memory than personal computers, enabling large computations and data processing tasks. 
+ - Efficient Processing: S3DF servers offer rapid access to centralized storage and have a variety of pre-installed software, facilitating quick and large-scale computation without impacting personal devices. + - Slurm Transition: S3DF uses Slurm due to its compatibility with modern systems, including NVIDIA GPUs, improving scheduling efficiency and user experience compared to previous batch systems. + +### Key Concepts in Batch Jobs + - Batch Nodes: These are servers configured for running batch jobs. + - Slurm Partition: A logical grouping of batch nodes with similar specifications (e.g., CPU types). Example partitions include roma and milano. + - Resource Monitoring: Use the following command to check the status of nodes: + + sinfo --Node --format="%10N %.6D %10P %10T %20E %.4c %.8z %8O %.6m %10e %.6w %.60f" + +### Submitting a Batch Job + +#### 1. Create a Batch Script: +Write a script file (e.g., script.sh) with Slurm commands and the job commands you want to execute: + + #!/bin/bash + + #SBATCH --partition=milano + #SBATCH --job-name=test + #SBATCH --output=output-%j.txt + #SBATCH --error=output-%j.txt + #SBATCH --ntasks=1 + #SBATCH --cpus-per-task=12 + #SBATCH --mem-per-cpu=1g + #SBATCH --time=0-00:10:00 + #SBATCH --gpus 1 + + + +- Replace with the specific commands for your job. +- The script will log output and error messages to output-%j.txt, where %j is replaced by the job ID. + +#### 2. Submit the Job: +Use the sbatch command to submit your batch script: + + sbatch script.sh + +#### 3. Monitor Your Job: +Check the status of your job using: + + scontrol show job + +### Submitting Jobs Without a Batch Script +Alternatively, you can submit jobs directly from the command line using the --wrap option: + + sbatch --wrap="your_command_here" + +### Specifying Job Duration +It is crucial to set a meaningful duration for your job, allowing the Slurm scheduler to prioritize jobs effectively. Use the --time option with formats such as: + +- M (minutes) +- H:M:S (hours, minutes, seconds) +- D-H (days, hours) + +Jobs exceeding the specified time will terminate, potentially wasting computational resources. + +This guide provides an overview of how to run both interactive and batch jobs in S3DF. Using these resources effectively can enhance your computational efficiency and overall experience on the system. If you have further questions, please refer to the S3DF documentation or reach out for support. diff --git a/gettingstarted/quickstart.md b/gettingstarted/quickstart.md new file mode 100644 index 0000000..2b9ed0f --- /dev/null +++ b/gettingstarted/quickstart.md @@ -0,0 +1,53 @@ +# S3DF General Workflow Guide + +This guide provides a clear step-by-step workflow for using the S3DF system. Follow these instructions to efficiently connect to the S3DF environment and run your desired software. + +## 1. Connect to a Login Node + +- Use SSH or NoMachine to connect to a login node. This is your initial access point to the system. +- Example command for SSH: + + ssh username@login-node-address + +## 2. Connect to a Pool Node + +- After successfully connecting to the login node, establish a second connection to a pool node using SSH. +- Example command: + + ssh username@pool-node-address + +## 3. Run Desired Software + +- You can run your desired software interactively. For instance, if you need to use HFSS, launch it from the pool node. +- Alternatively, if you're configuring input files for other software, such as ACE3P, proceed to the next step. + +## 4. 
Configure Input Files + +- Prepare and configure the necessary input files for the software you intend to use. Ensure all files are correctly set up for your simulations. + +## 5. Submit Jobs to a Compute Node + +- Use the sbatch command to submit your jobs to a compute node for execution. +- Example command: + + sbatch your-job-script.sbatch + +## 6. Check Status of Running Jobs (Optional) + +- To monitor the status of your submitted jobs, use the following command: + + squeue -u username + +## 7. View Data Output + + - Once your jobs have completed, you can view the data output directly on the pool node to ensure results are as expected. + +## 8. Transfer Data (if necessary) + +- If you need to transfer data, connect to a data transfer node to facilitate the movement of your files. +- Use appropriate file transfer commands (e.g., scp, rsync) to move your data to the desired location. + + +By following this workflow, you can effectively utilize the S3DF system for your computational needs. +Ensure you have all necessary software and dependencies installed before starting, +and refer to additional documentation for specific software setup if needed. diff --git a/help.md b/help.md new file mode 100644 index 0000000..2594e15 --- /dev/null +++ b/help.md @@ -0,0 +1,50 @@ +# Contact Us + +For requests made during office hours you can expect a response from +us within 2 hours and a resolution within 24 hours. Outside of office +hours, please contact your facility [PoC](contact-us.md#facpoc) if an +S3DF problem is preventing one of the SLAC experimental facilities +from taking data or the accelerator from operating and the PoC will +contact us directly on the phone. + +| | | +|--- |--- | +| Email for reporting problems and getting help | s3df-help@slac.stanford.edu | +| Slack channel for general discussion | [slac.slack.com #comp-sdf](https://slac.slack.com/app_redirect?channel=comp-sdf) | + + +### Facilities and Point of Contacts + +The table below shows the organizations, programs, projects, or groups +that own resources within the S3DF. Contact us if you want to use the +S3DF and you don't see your facility in this table. + +|Facility | PoC | Primary POSIX group| +|--- |--- |--- | +|Rubin | James Chiang, Adam Bolton | rubin_users | +|SuperCDMS | Concetta Cartaro | cdms | +|LCLS | pcds-datamgt-l@slac.stanford.edu | ps-users | +|MLI| Daniel Ratner | mli | +|Neutrino| Kazuhiro Terao | nu | +|AD | Greg White | cd | +|SUNCAT | Johannes Voss| suncat-norm | +|Fermi | Seth Digel, Nicola Omodei| glast-pipeline | +|EPPTheory | Tom Rizzo | theorygrp | +|FACET | Nathan Majernik | facet | +|DESC | Heather Kelly | desc | +|KIPAC | Marcelo Alvarez | ki | +|RFAR | David Bizzozero | rfar | +|SIMES | Tom Devereaux, Brian Moritz | simes | +|CryoEM | Patrick Pascual | cryo-data | +|SSRL | Riti Sarangi | ssrl | +|LDMX | Omar Moreno | ldmx | +|HPS | Mathew Graham | hps | +|EXO | Brian Mong | exo | +|ATLAS | Wei Yang, Michael Kagan | atlas | +|CDS | Ernest Williams | cds | +|SRS | Tony Johnson | srs | +|FADERS | Ryan Herbst | faders | +|TOPAS | Joseph Perl | topas | +|RP | Thomas Frosio | esh-rp | +|Projects | Yemi Adesanya, Ryan Herbst | - | +|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - | diff --git a/managedata.md b/managedata.md new file mode 100644 index 0000000..fe22b26 --- /dev/null +++ b/managedata.md @@ -0,0 +1,74 @@ +# Data transfer + +`s3dfdtn.slac.stanford.edu` is a load-balanced DNS name which points +to a pool of dedicated data transfer nodes. It is open to everyone +with an S3DF account. 
Common tools like scp/sftp/rsync are available +for casual data transfers. For serious large volume data transfer, you +may consider `bbcp` and `globus`. + +## bbcp + +This is a high performance multi-stream data transfer tool developed +at SLAC. In its simplest form, the `bbcp` command line is similar to +that of `scp`. A simple command using bbcp at SDF looks like this:\ +`bbcp me@remote.univ.edu:/tmp/myfile ./myfile`\ +You may need to type your password for `me@remote.univ.edu`, unless +you setup password-less login to `remote.univ.edu` (e.g. ssh key). + +To achieve high performance, bbcp opens an additional TCP port. This +sometime won't work if there is a firewall. The `-Z` option allows +you to specify a range of TCP ports that are not blocked by +firewall. The `-z` is another commonly used option to work with +firewall. Type `bbcp --help` or go to the [bbcp web +page](https://www.slac.stanford.edu/~abh/bbcp/) for more info. + +Both source and destination must have the bbcp executable in +$PATH. The bbcp executable can be downloaded by following the link in +the bbcp web page. If bbcp is not in `$PATH`, use the `-S` or `-T` +option to specify the non-standard location. Please carefully read the +bbcp web page with regards to these options as they are not as +intuitive as you may think. Also, sometimes a cut-n-paste of dash +(`-`) from the web page end up with something that looks like a dash +but not a dash. In that case, just replace it with a real dash. + +Using the above command line as a example, if you copy bbcp to your +home directory at `remote.univ.edu`, enter:\ +`bbcp -S 'ssh -l %U %H ~/bbcp' me@remote.univ.edu:/tmp/myfile ./myfile`\ +Here we use option `-S` because `remote.univ.edu` is the data +source. bbcp will substitute `%U` and `%H` with `me` and +`remote.univ.edu` respectively. + +[More examples from NERSC](https://docs.nersc.gov/services/bbcp/). You +can find more information at the [bbcp +page](https://www.slac.stanford.edu/~abh/bbcp/). + +## Globus + +S3DF has a Globus 5 testing endpoint `slac#s3df_globus5`. This service is +available to everyone with an S3DF account. You can find more information +at the [Globus page](https://www.globus.org). + +## Trouble shooting + +A common issue with data transfer is that "it is slow". The performance +of the wide area network data tranfers involves the SLAC storage, the +storage at the other side, and the network in between. + +### Checking the storage + +The following example assumes a posix storage. The storage at both ends +should be checked: + + - dd if=/dev/zero of=$HOME/zeros bs=2k count=65536 oflag=direct + - dd if=$HOME/zeros of=/dev/null bs=2k iflag=direct + +At SLAC, this can be done on a data transfer node (s3dfdtn.slac.stanford.edu). +The speed of the write/read from the above commands aren't the very important +(as long as they are not below single MB/s range). The change overtime is +important (indicating potential problems). + +### Checking the WAN (wide area network) + +SLAC DTNs have `iperf3` installed. One can run an `iperf3` server/client at +the SLAC DTN and run `iperf3` client/server at the other end. This will give +an estimation of expected network performance. diff --git a/quickstart.md b/quickstart.md new file mode 100644 index 0000000..a5a6528 --- /dev/null +++ b/quickstart.md @@ -0,0 +1,54 @@ +# S3DF General Workflow Guide + +This guide provides a clear step-by-step workflow for using the S3DF system. 
Follow these instructions to efficiently connect to the S3DF environment and run your desired software. + +## 1. Connect to a Login Node + +- Use SSH or NoMachine to connect to a login node. This is your initial access point to the system. +- Example command for SSH: + + ssh username@login-node-address + +## 2. Connect to a Pool Node + +- After successfully connecting to the login node, establish a second connection to a pool node using SSH. +- Example command: + + ssh username@pool-node-address + +## 3. Run Desired Software + +- You can run your desired software interactively. For instance, if you need to use HFSS, launch it from the pool node. +- Alternatively, if you're configuring input files for other software, such as ACE3P, proceed to the next step. + +## 4. Configure Input Files + +- Prepare and configure the necessary input files for the software you intend to use. Ensure all files are correctly set up for your simulations. + +## 5. Submit Jobs to a Compute Node + +- Use the sbatch command to submit your jobs to a compute node for execution. +- Example command: + + sbatch your-job-script.sbatch + +## 6. Check Status of Running Jobs (Optional) + +- To monitor the status of your submitted jobs, use the following command: + + squeue -u username + +## 7. View Data Output + + - Once your jobs have completed, you can view the data output directly on the pool node to ensure results are as expected. + +## 8. Transfer Data (if necessary) + +- If you need to transfer data, connect to a data transfer node to facilitate the movement of your files. +- Use appropriate file transfer commands (e.g., scp, rsync) to move your data to the desired location. + + +By following this workflow, you can effectively utilize the S3DF system for your computational needs. +Ensure you have all necessary software and dependencies installed before starting, +and refer to additional documentation for specific software setup if needed. + diff --git a/run.md b/run.md new file mode 100644 index 0000000..42bbdf7 --- /dev/null +++ b/run.md @@ -0,0 +1,38 @@ +# Getting Started + +## Modes of Operation + +There are three different ways of utilizing S3DF: + +1. [**Interactive**](interactive-compute.md): Commands that you issue are executed immediately. This is the most common approach for activities like building and debugging code, running simple analysis which require limited resources and/or require interaction with plots and logs. There are two modes for doing interactive work in S3DF: through a terminal, by opening a shell on one of the interactive pools, or through a browser, via OnDemand. + +2. [**Batch**](batch-compute.md): Jobs are submitted to a queue and are executed as soon as resources become available. This is the most common approach for running large jobs or for executing many jobs. Note that this is the most efficient mechanism form a facility perspective because it provides the best use of the available resources. Also, note that, by far, the largest fraction of S3DF computing cycles is in the batch system. S3DF uses SLURM as workload manager for batch jobs. + +3. [**Service**](service-compute.md): This approach is for running long-lived jobs that run in the background waiting for data to analyze. This method is for service activities and require specific resources. These resources may be dedicated hardware acquired by your organization or may be dynamically allocated from a larger pool. 
For the latter approach, S3DF uses Kubernetes, an open source framework for automating deployment, scaling, and management of containerized applications.
+
+![S3DF users view](assets/S3DF_users_view.png)
+
+Users can use SSH, NoMachine or a browser to log into the system. The login nodes are designed to do just that - to let you into the system. In order to actually analyze the data, you will need to access one of the [interactive pools](interactive-compute.md) or one of the [batch partitions](batch-compute.md).
+
+
+## Do's and Don'ts
+
+- Do [talk to us](contact-us.md) about your requirements.
+
+- Don't perform any compute tasks on the login nodes as those are meant to operate only as bastion hosts, not for doing analysis or accessing data.
+
+- Don't perform compute-intensive tasks on the interactive nodes; use the batch system instead.
+
+- Do be respectful of other users' jobs - you will be sharing a limited set of nodes with many other users. Please consider the type, size and quantity of jobs that you submit so that you do not starve others of compute resources. We do implement fair sharing to limit the impact upon others; however, there are ways to game the system, and your organization is charged time for the resources you utilize, see [batch banking](batch-compute.md#banking).
+
+- Don't run interactive sessions on the batch system for a long time. Opening an interactive session on SLURM (using `srun --pty bash`) and not actually running any heavy processes can be wasteful of resources and could prevent others from doing their work. Consider using the interactive pools or OnDemand for these activities.
+
+- Avoid keeping a large number (thousands to millions) of files in a single directory if possible - file systems typically do a lot better when you use a small number of large files.
+
+- Do keep an eye on your file system [quotas](reference.md#storagequota) - your jobs will likely fail if they cannot write to disk due to a full quota. You can either choose a different file system to write to, request a quota increase, or remove files you don't need anymore.
+
+- Limit I/O-intensive sessions where your jobs read or write a lot of data, or perform intensive metadata operations such as stat'ing many files or directories and opening and closing files in quick succession.
+
+- Do test your jobs before launching many (potentially hundreds) of them for your actual analysis.
+
+- Do request only the resources that you need. If you ask for more time or more CPUs or GPUs than you can actually use in your job, then it will take longer for your job to start, your fairshare will be reduced so that your later jobs may be de-prioritised, and you prevent others from using otherwise idle resources.
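As a sketch of what "requesting only what you need" can look like, the commands below ask for a short interactive session and a modest batch allocation. The partition, account name, and limits are placeholders; adjust them to your own allocation and workload.

    # Short interactive session with an explicit time limit (placeholders: partition, account)
    srun --partition=milano --account=<your-account> --cpus-per-task=4 --mem=8g --time=01:00:00 --pty /bin/bash

    # Equivalent batch submission: request only the cores, memory, and walltime the job actually uses
    sbatch --partition=milano --account=<your-account> --cpus-per-task=4 --mem-per-cpu=2g --time=0-02:00 my_job.sh

Exiting the interactive shell (or the batch job finishing) releases the resources back to the pool.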
diff --git a/sidebar-backup.md b/sidebar-backup.md
new file mode 100644
index 0000000..b876bad
--- /dev/null
+++ b/sidebar-backup.md
@@ -0,0 +1,24 @@
+* [Welcome](/)
+* [Getting Started](GettingStarted.md)
+* [Beginner Guide](beginner-guide.md)
+* [Systems](systems.md)
+* [Account](accounts.md)
+* [Connect](connect.md)
+* [Run Jobs](run.md)
+* [Software](software.md)
+* [Manage Data](managedata.md)
+* [Status & Outages](log.md)
+* [Help](help.md)
+* [Access](accounts-and-access.md)
+* [Usage](getting-started.md)
+* [Interactive Compute](interactive-compute.md)
+* [Batch Compute](batch-compute.md)
+* [Service Compute](service-compute.md)
+* [Software](software.md)
+* [Storage](data-and-storage.md)
+* [Transferring Data](data-transfer.md)
+* [Tutorials](tutorials.md)
+* [Business Model](business-model.md)
+* [Reference](reference.md)
+* [Status & Outages](changelog.md)
+* [Contact Us](contact-us.md)
diff --git a/slac-unix-account.md b/slac-unix-account.md
new file mode 100644
index 0000000..3fd9ad2
--- /dev/null
+++ b/slac-unix-account.md
@@ -0,0 +1,39 @@
+# SLAC UNIX account
+
+## Get a SLAC UNIX ID
+ - Affiliated users/experimental facility users: Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) form
+ - SLAC employees: You should already have a SLAC ID number.
+
+## Take training
+Take the appropriate cybersecurity SLAC training course via the [SLAC training portal](https://slactraining.slac.stanford.edu/how-access-the-web-training-portal):
+ - All lab users and non-SLAC/Stanford employees: "CS100: Cyber Security for Laboratory Users Training".
+ - All SLAC/Stanford employees or term employees of SLAC or the University: "CS200: Cyber Security Training for Employees".
+ - Depending on role, you may be required to take additional cybersecurity training. Consult with your supervisor or SLAC Point of Contact (POC) for more details.
+
+## Request a UNIX account
+Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account.
+In your request indicate your SLAC ID and your preferred account name (include a second choice in case your first choice is taken).
+
+## Managing your UNIX account password
+
+ - You can change your password via [the SLAC UNIX self-service password update site](https://unix-password.slac.stanford.edu/).
+
+ - If you have forgotten your password and need to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support).
+
+ - Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email.
+
+# Register Your SLAC UNIX Account in S3DF :id=register
+
+ - Log into the [Coact S3DF User Portal](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account via the "Log in with S3DF (unix)" option.
+ - Click on "Repos" in the menu bar.
+ - Click the "Request Access to Facility" button and select a facility from the dropdown.
+ - Include your affiliation and other contextual information for your request in the "Notes" field, then submit.
+ - A czar for the S3DF facility you requested access to will review your request. **Once approved by a facility czar**, the registration process should be completed in about 1 hour.
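Once registration has been approved, a quick sanity check from a login or interactive node is to confirm your UNIX identity and POSIX group memberships. This is only a sketch using standard Linux tools, and the facility group name shown is illustrative.

    # Show your UID/GID and the POSIX groups your account belongs to
    id $USER
    groups

    # Check whether a specific facility group exists and who its members are (example group name)
    getent group rubin_users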
+ +?> To access files and folders in facilities such as Rubin and LCLS, you will need to ask your +SLAC POC to add your username to the [POSIX +group](contact-us.md#facpoc) that manages access to that facility's +storage space. In the future, access to facility storage will be part of the S3DF registration process in Coact. + + + diff --git a/storage.md b/storage.md new file mode 100644 index 0000000..260982c --- /dev/null +++ b/storage.md @@ -0,0 +1,62 @@ +# Storage + +## Directory Structure + +To promote long term consistency, the S3DF directory structure provides immutable paths, independent from the underlying file system organization and technology: + +* `/sdf`: Root mount point. + +* `/sdf/home//`: Home directories. Space quotas imposed for all users. + +* `/sdf/sw//`: For general purpose software not installed on each node, e.g., EPICS, Matlab, matplotlib, GEANT4, etc. Not meant for software that is used by only one group. + +* `/sdf/group//` or `/sdf/group/`: For group/project specific software (e.g., lcls/psdm, ad/hla, etc.) + +* `/sdf/data//…`: For science data (as opposed to code, documents, etc), including raw, calibrated data, and results. Some examples: + - LCLS experimental: `/sdf/data/lcls//` + - LCLS accelerator: `/sdf/data/lcls/accel/` + - FACET experimental: `/sdf/data/facet//` + - FACET accelerator: `/sdf/data/facet/accel/` + - CryoEM: `/sdf/data/cryoem//` + +* `/sdf/scratch//…`: 3 months retention on a best effort basis (actual retention can be shorter or longer depending on actual usage. NOTE: as of July 2024, the auto-purge policy is not in effect.) + +?> Access to AFS, GPFS, and SDF Lustre from S3DF is described in this +[reference section on legacy file systems](reference.md#legacyfs). + +## Policies + +- Home directory permissions will be delegated to each user. By default, home folders will be readable by everyone, though you can change that by changing UNIX permissions on one or more of your folders. Everyone will be able to list `/sdf/home/`. + +- General purpose software will go under sw. Group specific software will go under `/sdf/group` and will be maintained by each group. + +- Some groups may decide to logically hold all their information under `/sdf/group/`. Such a structure may be implemented by each group via symlinks. The actual mount points and relative backup and archive policies will be based on the structure shown above. + +?> __TODO__ Desktop/endpoint access to S3DF file systems will likely be via authenticated NFS v4. This is currently a topic of investigation as we wait for an updated WekaFS release. + + +## Backup and Archiving + +- Everything under `/sdf/{home, sw, group}` will be backed up. We currently use snapshots taken at regular intervals (e.g., a few times a day) that users can access with no intervention from system administrators. A subset of the snapshots will be copied to tape at a lower rate (e.g., once a day). Snapshots for\ +`/sdf/{home, sw, group}/`\ +can be found at\ +`/sdf/{home, sw, group}/.snapshots//` + +- Files/objects under `/sdf/data` will be backed up or archived according to a data retention policy defined by the facility. Facilities will be responsible for covering the media costs and overhead required by their policy. Similar to the /sdf/home area, you can also check in /sdf/data/\/.snapshots to see if snapshots are enabled for self-service restores. + +- The scratch spaces under `/sdf/scratch` and all directories named "nobackup" (located *anywhere* in an /sdf path) will not be backed up or archived. 
Please use as many "nobackup" subdirectory locations as required for any files that do not need backup. That can save significant tape and processing resources. + +- A subset of users in some groups will be able to access the command line interface to HPSS for the purpose of archiving/retrieving data to/from tape. Unlike backups, which will be automatically performed by the storage team within SCS, archiving will be the responsibility of each group (contact SCS for assistance). + +?> The current and target backup and archiving policies are summarized in this [reference section on data backup](reference.md#backup). + +## Change to AFS Tape Backup Retention Policy +March 31, 2025 + +Summary: due to the upcoming retirement of a legacy tape library, the tape backup retention policy for the legacy AFS file system will be reduced from one year to one month (31 days), effective May 14, 2025 (the deadline date). This policy change means any request to restore accidentally deleted files must be created within one month (rather than one year) of the deletion event. In addition, if a user deleted an AFS file more than one month prior to the deadline date and needs it restored, they have until the deadline date to create a restore request via email to s3df-help@slac.stanford.edu. + +Details: the AFS file system used for legacy Unix home directory and group file storage originally started with a one-year tape backup retention policy. This meant that the tape backup system used to store backup copies of files had the ability to go back approximately one year and restore files that were accidentally deleted. + +The legacy AFS file system will be retired before the end of 2025. Since a restore request for an AFS file requires the AFS file system itself to be online, the current retention policy would require the AFS file system to remain running for one year beyond its public retirement date just to satisfy restore requests. Since that will not be possible, the tape backup retention policy for legacy AFS files is being reduced from one year to one month (defined as 31 days) effective May 14, 2025 (the deadline date). + +This policy change means any request to restore accidentally deleted files must be created within one month (rather than one year) of the deletion event. In addition, if a user deleted an AFS file more than one month prior to the deadline date and needs it restored, they have until the deadline date to create a restore request via email to s3df-help@slac.stanford.edu. After the deadline date, the backup system will be able to restore an AFS file only if it was deleted within the last one month. diff --git a/systems.md b/systems.md new file mode 100644 index 0000000..c16feb4 --- /dev/null +++ b/systems.md @@ -0,0 +1,57 @@ +# Computing Resources + + +## roma :roma + + - CPU Model: Rome 7702 + - Usable Cores per Node: 120 + - Usable Memory per Node: 480 GB + - GPU Model: None + - GPUs per Node: None + - Local Scratch: 300 GB + - Number of Nodes: 129 + Overview: The Roma cluster is equipped with 120 cores and 480 GB of memory per node, making it suitable for a variety of computational tasks requiring substantial processing power. 
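If you want to gauge how busy this partition is before submitting work, `sinfo` can summarize node states; the commands below are a minimal sketch using the partition name above.

    # One-line summary of node availability in the roma partition
    sinfo --partition=roma

    # Per-node detail: state, CPU count, memory, and free memory
    sinfo --Node --partition=roma --format="%10N %10P %10T %.4c %.8m %.10e"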
## milano :id=milano
+
+ - CPU Model: Milan 7713
+ - Usable Cores per Node: 120
+ - Usable Memory per Node: 480 GB
+ - GPU Model: None
+ - GPUs per Node: None
+ - Local Scratch: 6 TB
+ - Number of Nodes: 193
+ Overview: The Milano cluster features core and memory specifications similar to the Roma cluster but offers significantly larger local scratch space (6 TB), making it ideal for data-intensive applications.
+
+## ampere :id=ampere
+
+ - CPU Model: Rome 7542
+ - Usable Cores per Node: 112 (hyperthreaded)
+ - Usable Memory per Node: 952 GB
+ - GPU Model: Tesla A100 (40 GB)
+ - GPUs per Node: 4
+ - Local Scratch: 14 TB
+ - Number of Nodes: 42
+Overview: The Ampere cluster offers high memory and GPU capabilities with 4 Tesla A100 GPUs per node, making it well-suited for machine learning and high-performance computing tasks that require both substantial memory and processing power.
+
+
+## turing :id=turing
+
+ - CPU Model: Intel Xeon Gold 5118
+ - Usable Cores per Node: 40 (hyperthreaded)
+ - Usable Memory per Node: 160 GB
+ - GPU Model: NVIDIA GeForce 2080Ti
+ - GPUs per Node: 10
+ - Local Scratch: 300 GB
+ - Number of Nodes: 27
+Overview: The Turing cluster combines a moderate number of cores with multiple NVIDIA GeForce 2080Ti GPUs, making it suitable for graphical computations, simulations, and parallel processing tasks.
+
+## ada :id=ada
+
+ - CPU Model: AMD EPYC 9454
+ - Usable Cores per Node: 72 (hyperthreaded)
+ - Usable Memory per Node: 702 GB
+ - GPU Model: NVIDIA L40S
+ - GPUs per Node: 10
+ - Local Scratch: 21 TB
+ - Number of Nodes: 6
+Overview: The Ada cluster features high core counts and ample memory, along with 10 NVIDIA L40S GPUs per node, providing excellent resources for advanced computation and research requiring both CPU and GPU resources.
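For illustration, a minimal Slurm batch script targeting one of the GPU clusters above might look like the sketch below; the account name is a placeholder, and `ada` can be substituted for the partition when the L40S nodes are the better fit.

    #!/bin/bash
    #SBATCH --partition=ampere        # or ada for the L40S nodes
    #SBATCH --account=<your-account>  # placeholder: your S3DF allocation
    #SBATCH --gpus=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=0-02:00:00

    nvidia-smi   # confirm which GPU was allocated before starting the real workload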