86 changes: 63 additions & 23 deletions rucio_stress_test/README.rucio_stress_test
@@ -1,25 +1,65 @@
For the rucio stress test, do the following:

1) it is recommended that you have a standalone rucio client
2) create a python virtual environment, activate it, and then
pip install rucio-clients
3) copy the files from this etc directory into the etc directory of the
   rucio environment. There are two: one for dune-int-rucio and one
   for the regular dune-rucio
4) copy the shell scripts into the main level of the rucio directory
5) There is a set of shell scripts (launch0100_f.sh and so forth);
   each one of them will make 100 simultaneous rucio uploads of a file,
   in the background
6) For the full stress test, log into each of dunegpvm01-16
7) do "htgettoken -i dune -r production -a htvaultprod.fnal.gov"
   on each of those dunegpvms; you will need to authenticate with the
   browser on each one and then stay logged in
8) make a test file on each of the machines:
   dd if=/dev/zero of=/tmp/1gbtestfile.<YYYYMMDD> bs=1024 count=1000000
   where YYYYMMDD is today's date
9) if necessary, modify the launch script so that it uses the file names
   you want
10) each launch script can only be run once per day; if you need to make
    another set of files, you have to change the suffix by hand to
    something else
11) modify the atjob_f.sh file to set the actual time that you want the
    job to run
12) launch the at jobs
13) use the window you have on each vm to monitor the load; typically the
    load will go up to 80 or 90 on each vm while the rucio uploads are
    happening
14) once all is done, check that the right number of rules were made,
    and grep all the individual error files for errors
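The dd command in step 8 can be sanity-checked at small scale before the real run. This sketch uses the same flags but writes only 10 KB into a scratch directory instead of /tmp (the scratch path and size are illustrative, not part of the test procedure):

```shell
# Same dd invocation shape as step 8, scaled down from ~1 GB (count=1000000)
# to 10 KB (count=10) so it can be checked quickly on any machine.
scratch="$(mktemp -d)"
testfile="${scratch}/1gbtestfile.$(date +%Y%m%d)"
dd if=/dev/zero of="${testfile}" bs=1024 count=10 2>/dev/null
size="$(wc -c < "${testfile}")"
echo "wrote ${size} bytes to ${testfile}"
```

With bs=1024 and count=1000000 the real command produces a file of 1,024,000,000 bytes, which is what the launch scripts expect to upload.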

3) Sync token on all dunegpvm hosts
From the repo top directory (data-mgmt-ops/rucio_stress_test), run:
./bin/sync_oidc_token_local.sh [username]
Example:
./bin/sync_oidc_token_local.sh timm
What it does:
- loops over dunegpvm01..16
- runs htgettoken on each host (interactive retry when needed)
- writes token to /tmp/.rucio/duneprod.token on each host
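The per-host loop in sync_oidc_token_local.sh can be sketched as below. The ssh/htgettoken commands are only echoed, not executed, so the sketch runs anywhere; the htgettoken flags are the ones quoted in the script comments:

```shell
# Build the dunegpvm01..16 host list (seq -w zero-pads to equal width)
# and print the command the sync script would run on each host.
hosts=""
for i in $(seq -w 1 16); do
  hosts="${hosts} dunegpvm${i}"
done
for h in ${hosts}; do
  echo "would run: ssh ${h} htgettoken -i dune -r production -a htvaultprod.fnal.gov"
done
```

The real script additionally retries interactively when a host needs browser authentication, and copies the resulting token to /tmp/.rucio/duneprod.token.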

4) Run the stress test
Launch all 16 host jobs:

./bin/atjob_f.sh [username] [at_time] [file_base]

Examples:

./bin/atjob_f.sh timm "now + 2 minutes" runA
./bin/atjob_f.sh timm 15:11 runB

Arguments:
- username: SSH user for dunegpvm hosts (default: current user)
- at_time: at(1) schedule time (default: now)
- file_base: base name for /tmp test files and uploaded DID names
(default: 1gbtestfile)

Notes:
- Each launchXXXX_f.sh starts 100 background uploads on one host.
- The launcher creates /tmp/<file_base>.<YYYYMMDD> once if missing.
- Worker uploads use suffixes like .f0100 ... .f1699.
- Logs are written under output/test.out.* and output/test.err.*
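The suffix ranges in the notes appear to map each host's launch script to a 100-wide block (launch0100_f.sh covering .f0100...f0199, up through launch1600_f.sh covering .f1600...f1699). That mapping is an inference from the script names, sketched here:

```shell
# Generate the assumed upload suffixes for one host: host_index 1..16
# yields .f0100...f0199 for host 1, up to .f1600...f1699 for host 16.
host_index=1
start=$(( host_index * 100 ))
suffixes=""
for j in $(seq "${start}" $(( start + 99 ))); do
  suffixes="${suffixes} $(printf '.f%04d' "${j}")"
done
echo "host ${host_index}: $(echo ${suffixes} | wc -w) suffixes"
```

If the mapping holds, 16 hosts times 100 suffixes gives the 1600 uploads of the full stress test.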

5) Validate results

Check for errors:
rg -n "ERROR|Invalid header value|not successful|Traceback" output/test.err.*

Optional quick check of files in Rucio:
rucio did list --filter 'type=file' 'test:runA.20260324.*' | wc -l
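The same error scan works with plain grep -E where rg is not installed. This sketch runs the pattern from step 5 against two throwaway log files (the file names are illustrative):

```shell
# Create one clean and one failing log, then count the files that match
# the step-5 error pattern; grep -El lists the names of matching files.
logs="$(mktemp -d)"
printf 'upload done\n' > "${logs}/test.err.0100"
printf 'Traceback (most recent call last):\n' > "${logs}/test.err.0101"
bad="$(grep -El 'ERROR|Invalid header value|not successful|Traceback' "${logs}"/test.err.* | wc -l)"
echo "${bad} log file(s) with errors"
```

A nonzero count means at least one worker's stderr needs a closer look.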


6) Clean up test data files from /tmp on all hosts

Run cleanup with explicit date:

./bin/cleanup_tmp_test_files.sh <file_base> <yyyymmdd>

Example:

./bin/cleanup_tmp_test_files.sh runA 20260324

Or default to today:

./bin/cleanup_tmp_test_files.sh runA

Cleanup removes on each host:
- /tmp/<file_base>.<yyyymmdd>.*
- /tmp/<file_base>.<yyyymmdd>
64 changes: 48 additions & 16 deletions rucio_stress_test/bin/atjob_f.sh
@@ -1,18 +1,50 @@
#!/bin/bash
# Usage:
# ./bin/atjob_f.sh [username] [at_time] [file_base]
# Examples:
# ./bin/atjob_f.sh timm 15:11
# ./bin/atjob_f.sh timm "now + 5 minutes"
# ./bin/atjob_f.sh timm "now + 2 minutes" runA
# ./bin/atjob_f.sh timm # defaults to "now" and "1gbtestfile"
# ./bin/atjob_f.sh # defaults to current user, "now" and "1gbtestfile"
#
# Token prerequisite (run once before scheduling launches):
# htgettoken -i dune -r production -a htvaultprod.fnal.gov
# ./bin/sync_oidc_token_local.sh
#
# The launch scripts expect the shared token at:
# /tmp/.rucio/duneprod.token
# Optional test file base name (arg 3):
# file_base (default: 1gbtestfile)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
USERNAME="${1:-$USER}"
AT_TIME="${2:-now}"
FILE_BASE="${3:-1gbtestfile}"

echo "Scheduling as user: ${USERNAME}"
echo "Schedule time: ${AT_TIME}"
echo "Test file base: ${FILE_BASE}"

schedule_host() {
  local host="$1"
  local launch_script="$2"
  echo " -> ${host}: ${launch_script}"
  ssh "${USERNAME}@${host}" "printf '%s\n' 'bash ${SCRIPT_DIR}/${launch_script} ${FILE_BASE}' | at '${AT_TIME}'"
}

schedule_host dunegpvm01 launch0100_f.sh
schedule_host dunegpvm02 launch0200_f.sh
schedule_host dunegpvm03 launch0300_f.sh
schedule_host dunegpvm04 launch0400_f.sh
schedule_host dunegpvm05 launch0500_f.sh
schedule_host dunegpvm06 launch0600_f.sh
schedule_host dunegpvm07 launch0700_f.sh
schedule_host dunegpvm08 launch0800_f.sh
schedule_host dunegpvm09 launch0900_f.sh
schedule_host dunegpvm10 launch1000_f.sh
schedule_host dunegpvm11 launch1100_f.sh
schedule_host dunegpvm12 launch1200_f.sh
schedule_host dunegpvm13 launch1300_f.sh
schedule_host dunegpvm14 launch1400_f.sh
schedule_host dunegpvm15 launch1500_f.sh
schedule_host dunegpvm16 launch1600_f.sh
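schedule_host builds a one-line command and pipes it into at(1) on the remote host. The string it constructs can be checked locally with example values for the variables (the path below is a placeholder, not a real install location):

```shell
# Reconstruct the command string schedule_host sends to at(1), using
# placeholder values; nothing is scheduled or executed here.
SCRIPT_DIR="/path/to/rucio_stress_test/bin"   # placeholder for the real bin dir
FILE_BASE="runA"
launch_script="launch0100_f.sh"
cmd="bash ${SCRIPT_DIR}/${launch_script} ${FILE_BASE}"
echo "at-job command: ${cmd}"
```

Because the command is queued via at(1), the ssh session can close immediately and the launch still fires at the scheduled time on each host.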
41 changes: 41 additions & 0 deletions rucio_stress_test/bin/cleanup_tmp_test_files.sh
@@ -0,0 +1,41 @@
#!/bin/bash
# Cleanup stress-test files from /tmp on all dune GPVM hosts.
#
# Usage:
# ./bin/cleanup_tmp_test_files.sh <file_base> [yyyymmdd]
#
# Examples:
# ./bin/cleanup_tmp_test_files.sh runC 20260319
# ./bin/cleanup_tmp_test_files.sh runC
#
# Removes on each host:
# /tmp/<file_base>.<yyyymmdd>.*
# /tmp/<file_base>.<yyyymmdd>

set -euo pipefail

if [ $# -lt 1 ] || [ $# -gt 2 ]; then
  echo "Usage: $0 <file_base> [yyyymmdd]" >&2
  exit 1
fi

FILE_BASE="$1"
DATE_TAG="${2:-$(date +%Y%m%d)}"
PATTERN="/tmp/${FILE_BASE}.${DATE_TAG}.*"
BASE_FILE="/tmp/${FILE_BASE}.${DATE_TAG}"

HOSTS=(
  dunegpvm01 dunegpvm02 dunegpvm03 dunegpvm04
  dunegpvm05 dunegpvm06 dunegpvm07 dunegpvm08
  dunegpvm09 dunegpvm10 dunegpvm11 dunegpvm12
  dunegpvm13 dunegpvm14 dunegpvm15 dunegpvm16
)

echo "Cleaning file base '${FILE_BASE}' for date '${DATE_TAG}'"

for host in "${HOSTS[@]}"; do
  echo "[$host] removing ${PATTERN} and ${BASE_FILE}"
  ssh "${USER}@${host}" "set -euo pipefail; rm -f ${PATTERN} ${BASE_FILE}; ls -l ${PATTERN} ${BASE_FILE} 2>/dev/null || true"
done

echo "Cleanup finished on all hosts."
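The removal globs can be dry-run against a scratch directory before touching /tmp on the GPVMs. This sketch uses the same <file_base>.<yyyymmdd> naming as the script, with the date tag pinned for reproducibility:

```shell
# Create files matching both cleanup patterns in a scratch dir, remove
# them with the same globs the script uses, and confirm nothing is left.
scratch="$(mktemp -d)"
FILE_BASE="runA"
DATE_TAG="20260324"
touch "${scratch}/${FILE_BASE}.${DATE_TAG}" \
      "${scratch}/${FILE_BASE}.${DATE_TAG}.f0100" \
      "${scratch}/${FILE_BASE}.${DATE_TAG}.f0101"
rm -f "${scratch}/${FILE_BASE}.${DATE_TAG}".* "${scratch}/${FILE_BASE}.${DATE_TAG}"
left="$(ls -A "${scratch}" | wc -l)"
echo "${left} file(s) remaining after cleanup"
```

Note that the unsuffixed base file and the .f-suffixed worker copies need separate arguments to rm, which is why the script removes both PATTERN and BASE_FILE.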