Add Habrok-to-Kapteyn data transfer guide #660
base: main
Changes from all commits
e0101a4
21440a9
b071f92
63c30c9
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,8 +13,8 @@ ssh-copy-id -i ~/.ssh/id_rsa.pub YOUR_USERNAME@login1.hb.hpc.rug.nl | |
|
|
||
| Once you have added your SSH key to Habrok, modify the entry below and insert it into your `~/.ssh/config` file | ||
| ``` | ||
| Host habrok1 | ||
| HostName interactive1.hb.hpc.rug.nl | ||
| Host habrok | ||
| HostName login1.hb.hpc.rug.nl | ||
| User YOUR_USERNAME | ||
| IdentityFile ~/.ssh/id_rsa | ||
| ServerAliveInterval 120 | ||
|
|
@@ -80,3 +80,100 @@ You can also submit single PROTEUS runs to the nodes. For example: | |
| ```console | ||
| sbatch --mem-per-cpu=3G --time=1440 --wrap "proteus start -oc input/all_options.toml" | ||
| ``` | ||
|
|
||
| ## Transferring data from Habrok to Kapteyn | ||
|
|
||
| Habrok and Kapteyn are on different networks. Habrok cannot reach Kapteyn (the firewall blocks outgoing SSH), and although Kapteyn can reach Habrok, Habrok requires two-factor authentication (2FA) for every connection, which makes automated transfers from Kapteyn difficult. | ||
|
|
||
| So you cannot simply run `rsync` or `scp` in either direction between the two clusters. The workaround is to relay data through a machine that can reach both, like your laptop: | ||
|
|
||
| ``` | ||
| Habrok --> your laptop --> Kapteyn (norma2) | ||
| pull push | ||
| ``` | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| You need SSH access to both clusters configured on your laptop. See the [Habrok SSH setup](#access-the-habrok-cluster) above and the [Kapteyn cluster guide](kapteyn_cluster_guide.md) for SSH config instructions, including the ProxyJump setup needed to reach `norma2`. | ||
|
|
||
| Test that both connections work before proceeding: | ||
|
|
||
| ```console | ||
| ssh habrok # will ask for your TOTP code | ||
|
timlichtenberg marked this conversation as resolved.
|
||
| ssh norma2 # key-based, no 2FA | ||
|
Member
Can you separate these two commands into different boxes? If someone copies the full block and pastes it into a terminal, it will call norma2 from Habrok, and I don't know whether that would break the steps that follow. Just to be safe.
||
| ``` | ||
|
timlichtenberg marked this conversation as resolved.
|
||
|
|
||
| ### Step 1: Pull data from Habrok to your laptop | ||
|
|
||
| On Habrok, PROTEUS output typically lives in `/scratch/<habrok_user>/proteus_output/`. Check what is there: | ||
|
|
||
| ```console | ||
| ssh habrok 'ls -lh /scratch/<habrok_user>/proteus_output/' | ||
| ``` | ||
|
|
||
| Pull it to a temporary folder on your laptop: | ||
|
|
||
| ```console | ||
| mkdir -p /tmp/habrok_transfer | ||
| rsync -avz habrok:/scratch/<habrok_user>/proteus_output/my_run/ /tmp/habrok_transfer/my_run/ | ||
| ``` | ||
|
|
||
| Replace `<habrok_user>` with your Habrok username (e.g., `p000000`) and `my_run` with your simulation directory name. | ||
|
|
||
| If you only need the CSV and plots (not the raw per-timestep data), add `--exclude=data/` to save time and disk space: | ||
|
|
||
| ```console | ||
| rsync -avz --exclude=data/ habrok:/scratch/<habrok_user>/proteus_output/my_run/ /tmp/habrok_transfer/my_run/ | ||
| ``` | ||
|
|
||
| ### Step 2: Push data from your laptop to Kapteyn | ||
|
|
||
| Push the staged data to the Kapteyn dataserver: | ||
|
|
||
| ```console | ||
| ssh norma2 'mkdir -p /dataserver/users/formingworlds/<kapteyn_user>/proteus_output/my_run' | ||
| rsync -avz /tmp/habrok_transfer/my_run/ norma2:/dataserver/users/formingworlds/<kapteyn_user>/proteus_output/my_run/ | ||
| ``` | ||
|
|
||
| Replace `<kapteyn_user>` with your Kapteyn username. | ||
|
|
||
| ### Step 3: Clean up | ||
|
|
||
| Remove the temporary staging data from your laptop: | ||
|
|
||
| ```console | ||
| rm -rf /tmp/habrok_transfer/my_run | ||
| ``` | ||
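Steps 1 and 2 can be chained in a small shell function. The following is a minimal sketch, not part of PROTEUS: `relay_run`, `HABROK_USER`, `KAPTEYN_USER`, `STAGE`, and `DRY_RUN` are hypothetical names chosen for illustration, while the host aliases and paths are the ones used in the steps above. With `DRY_RUN=1` it only prints the commands it would run, which is a cheap way to check the paths before moving any data. Cleanup (Step 3) is deliberately left as a manual step.

```shell
# Sketch only: relay one run directory Habrok -> laptop -> Kapteyn.
# relay_run, HABROK_USER, KAPTEYN_USER, STAGE and DRY_RUN are hypothetical names.
relay_run() (
    run="$1"
    src="habrok:/scratch/${HABROK_USER}/proteus_output/${run}/"
    stage="${STAGE:-/tmp/habrok_transfer}/${run}/"
    dest="/dataserver/users/formingworlds/${KAPTEYN_USER}/proteus_output/${run}"

    # With DRY_RUN=1, prefix every command with echo so nothing is executed.
    prefix=""
    if [ "${DRY_RUN:-0}" = "1" ]; then prefix="echo"; fi

    # Order: create staging dir, pull from Habrok (asks for TOTP),
    # create target dir on Kapteyn, push. The chain stops at the first failure.
    $prefix mkdir -p "$stage" &&
    $prefix rsync -avz "$src" "$stage" &&
    $prefix ssh norma2 "mkdir -p '$dest'" &&
    $prefix rsync -avz "$stage" "norma2:$dest/"
)
```

For example, `HABROK_USER=p000000 KAPTEYN_USER=<kapteyn_user> DRY_RUN=1 relay_run my_run` prints the four commands without connecting to either cluster.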
|
|
||
| ### Alternative: direct pipe (no staging on your laptop) | ||
|
|
||
| Instead of storing data on your laptop in between, you can pipe the data straight through in a single command using SSH and `tar`. | ||
|
|
||
| First, make sure the target directory exists on Kapteyn: | ||
|
|
||
| ```console | ||
| ssh norma2 'mkdir -p /dataserver/users/formingworlds/<kapteyn_user>/proteus_output' | ||
| ``` | ||
|
|
||
| Then pipe the data through: | ||
|
|
||
| ```console | ||
| ssh habrok 'tar -cf - -C /scratch/<habrok_user>/proteus_output my_run' \ | ||
| | ssh norma2 'tar -xf - -C /dataserver/users/formingworlds/<kapteyn_user>/proteus_output' | ||
|
timlichtenberg marked this conversation as resolved.
|
||
| ``` | ||
|
timlichtenberg marked this conversation as resolved.
|
||
|
|
||
| This streams data from Habrok through your laptop to Kapteyn without writing anything to disk locally. The downside is that if the connection drops, you have to start over from scratch (unlike `rsync`, which can resume). This approach is best for smaller transfers. | ||
|
|
||
| To exclude the `data/` directory (slim transfer): | ||
|
|
||
| ```console | ||
| ssh habrok 'tar -cf - --exclude=data -C /scratch/<habrok_user>/proteus_output my_run' \ | ||
| | ssh norma2 'tar -xf - -C /dataserver/users/formingworlds/<kapteyn_user>/proteus_output' | ||
| ``` | ||
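To see what the slim pipe keeps before running it across the clusters, the same `tar` pipe can be rehearsed locally. This is a minimal sketch that assumes a throwaway directory layout: the file names `step_0.json` and `plot.png` are made up for illustration, and both directories are temporary. It also assumes `--exclude=data` behaves as in GNU tar, where the pattern is not anchored, so a file or directory named `data` at any depth would be skipped as well.

```shell
# Local rehearsal of the tar pipe; all paths are throwaway temp directories.
src=$(mktemp -d)
dst=$(mktemp -d)

# Build a fake run directory (file names are hypothetical).
mkdir -p "$src/my_run/data" "$src/my_run/plots"
echo "t,T" > "$src/my_run/runtime_helpfile.csv"
echo "raw" > "$src/my_run/data/step_0.json"
echo "png" > "$src/my_run/plots/plot.png"

# Same shape as the cluster command, minus the two ssh hops.
tar -cf - --exclude=data -C "$src" my_run | tar -xf - -C "$dst"

# List what arrived on the receiving side.
(cd "$dst" && find my_run | sort)
```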
|
timlichtenberg marked this conversation as resolved.
|
||
|
|
||
| ### Tips | ||
|
Member
Maybe add a tip here to verify that the transfer completed correctly. I check sizes in both places before and after the transfer, but other types of checks can be done.
||
|
|
||
| - **rsync is incremental.** If the transfer gets interrupted (laptop goes to sleep, WiFi drops), re-run the same `rsync` command. It picks up where it left off and only transfers new or changed files. | ||
| - **Check sizes first.** Before pulling, check how large the data is: `ssh habrok 'du -sh /scratch/<habrok_user>/proteus_output/my_run/'`. Large runs can be tens of GB. | ||
| - **The `data/` directory is often not needed.** It contains raw NetCDF/JSON output at every timestep. The `runtime_helpfile.csv` and `plots/` directory are usually sufficient for analysis. | ||
| - **Kapteyn storage quotas.** The formingworlds dataserver also has limited space. Check your usage with `ssh norma2 'du -sh /dataserver/users/formingworlds/<kapteyn_user>/'` before transferring large datasets. | ||
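One possible form of the post-transfer check requested in the review comment is to compare checksum manifests of the two directory trees. The sketch below works on two local directories. `manifest` and `compare_trees` are hypothetical helper names, and `sha256sum` is assumed to be available (it is part of GNU coreutils; on macOS, `shasum -a 256` is the usual substitute).

```shell
# Sketch only: manifest and compare_trees are hypothetical helper names.

# Print a checksum line for every file under a directory, sorted by path.
manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

# Compare two local directory trees; prints MATCH or MISMATCH.
compare_trees() {
    # Refuse to compare if either directory is missing.
    if [ ! -d "$1" ] || [ ! -d "$2" ]; then
        echo "MISMATCH"
        return 1
    fi
    if [ "$(manifest "$1")" = "$(manifest "$2")" ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

For the remote side, the body of `manifest` can be run over SSH instead, for example `ssh norma2 'cd /dataserver/users/formingworlds/<kapteyn_user>/proteus_output/my_run && find . -type f -exec sha256sum {} + | sort -k 2'`, and its output compared with the local manifest. If `--exclude=data/` was used, the excluded files will show up as differences against the Habrok copy.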
This is already covered in other pages of the documentation; maybe you can refer directly to those pages? That avoids repetition, since a student could end up creating an SSH key twice, which could get messy.