Skip to content

Optimize zstash update by using scandir#420

Merged
forsyth2 merged 2 commits into
mainfrom
issue-409-optimize-update
Jan 22, 2026
Merged

Optimize zstash update by using scandir#420
forsyth2 merged 2 commits into
mainfrom
issue-409-optimize-update

Conversation

@forsyth2
Copy link
Copy Markdown
Collaborator

@forsyth2 forsyth2 commented Jan 13, 2026

Summary

Objectives:

  • Optimize zstash update by using os.scandir-based directory scanning and by loading archived file metadata from the database in a single query. This reduces both filesystem syscalls and per-file DB queries, improving update throughput for large trees resulting 20-30x speed up in a production-like test.

What changed:

New scanner
    Added DirectoryScanner and get_files_to_archive_with_stats() in zstash/utils.py.
    Uses os.scandir() to collect file entries and stat information during the walk.
    Builds a dict mapping normalized relative path -> (size, mtime).
    Records empty directories (size 0) and normalizes paths to preserve compatibility with existing DB keys.
    Filters include/exclude patterns while preserving the same directory-first, filename-second sort order as before.

Update logic
    zstash/update.py now calls get_files_to_archive_with_stats() and obtains an ordered mapping of filesystem paths -> (size, mtime).
    Loads archived rows once with a single query (SELECT name, size, mtime FROM files) and builds an in-memory dict, keeping the row with the latest mtime if duplicates exist.
    Compares each filesystem entry to the in-memory archived data using O(1) dict lookup:
        Considered unchanged if size matches and abs(mtime_new - mtime_archived) <= TIME_TOL.
        Otherwise considered new/changed and scheduled for archive.

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commits checks have passed.

@forsyth2 forsyth2 self-assigned this Jan 13, 2026
@forsyth2 forsyth2 added the semver: small improvement Small improvement (will increment patch version) label Jan 13, 2026
@forsyth2
Copy link
Copy Markdown
Collaborator Author

The list order issue found during performance profiling is now resolved.

Added unit tests
Perlmutter tests passed
Chrysalis tests passed
File gathering list order matches that of the main branch
@forsyth2 forsyth2 force-pushed the issue-409-optimize-update branch from 8234b99 to a704282 Compare January 14, 2026 00:55
Copy link
Copy Markdown
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review of code. I've confirmed the entire test suite passes and the file addition list is unchanged from main.

@@ -0,0 +1,657 @@
"""
Unit tests for the optimized file scanning implementation in zstash.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this review, I'm mainly focused on the non-test changes. I did confirm these Claude-generated tests pass though.

Comment thread zstash/update.py
logger.debug("Keep local tar files : {}".format(keep))

files: List[str] = get_files_to_archive(cache, args.include, args.exclude)
file_stats: Dict[str, Tuple[int, datetime]] = get_files_to_archive_with_stats(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we call the optimized get_files_to_archive

Comment thread zstash/update.py
archived_files: Dict[str, Tuple[int, datetime]] = {}

cur.execute("SELECT name, size, mtime FROM files")
db_rows = cur.fetchall()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetch all the rows at once

Comment thread zstash/update.py
size: int = row[1]
mtime: datetime = row[2]

# If file appears multiple times, keep the one with latest mtime
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Comment thread zstash/update.py
Comment on lines +224 to +225
# Get the stat info we already collected during filesystem walk
size_new, mdtime_new = file_stats[file_path]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we don't have to run os.lstat -- we already have the value in memory.

Comment thread zstash/utils.py
Comment on lines +69 to +74
# Normalize the path.
# By building from path + entry.name,
# we guarantee the argument to normpath is constructed
# identically to how os.walk did it in previous code iterations.
normalized_path = os.path.normpath(os.path.join(path, entry.name))
self.file_stats[normalized_path] = (size, mtime)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for maintaining sort order

Comment thread zstash/utils.py
logger.warning(f"Error accessing {entry.path}: {e}")
continue

# Handle empty directories BEFORE recursing into subdirs
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for maintaining sort order

Comment thread zstash/utils.py
Comment on lines +205 to +206
# Note: broken symlinks are caught by not os.path.isfile() but not by os.path.isdir(). We want the latter.
if size == 0 and os.path.isdir(path):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for maintaining sort order

Comment thread zstash/utils.py
Comment on lines +214 to +215
# Sort on directory and filename like original
path_components.sort(key=lambda x: (x[0], x[1]))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for maintaining sort order

Comment thread zstash/utils.py
LEGACY VERSION: Still used for `zstash create`.
Uses the optimized version but returns only the file list.
"""
file_stats = get_files_to_archive_with_stats(cache, include, exclude)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For zstash create, we just used reduced output from the optimized function

@forsyth2
Copy link
Copy Markdown
Collaborator Author

How to do a performance profiling

I'm using Perlmutter throughout these directions.

I'll use constants to refer to specific paths. Change these out to work on the machine you're on & to use your username.

HPSS_PATH=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151

PROFILE_PATH=/pscratch/sd/f/forsyth/performance_profiling/
EXTRACTED_DIRECTORY_PATH=${PROFILE_PATH}extraction/
LOG_PATH=${PROFILE_PATH}logs/
CACHE_PATH=${PROFILE_PATH}cache/
INDEX_COPIES_PATH=${PROFILE_PATH}index_copies/

Also note I'm using --dry-run in these directions, but you can remove that option to run the complete zstash update functionality.

Extract an archive

Skip this step if you already have an archive ready to use.

screen # This could run for a while, so let's use `screen`
screen -ls # We need to know what node this screen is on, so we can come back to it.
mkdir ${PROFILE_PATH}
mkdir ${CACHE_PATH}
mkdir ${EXTRACTED_DIRECTORY_PATH}
mkdir ${LOG_PATH}

cd ${EXTRACTED_DIRECTORY_PATH}
# We'll extract using the Unified environment's version of zstash
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
zstash extract --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v 2>&1 | tee ${LOG_PATH}$/extract.log

Be sure to save the index.db as it exists initially

cp ${CACHE_PATH}/index.db ${INDEX_COPIES_PATH}/index_original.db

Profile the before-case

Profile using zstash as it exists in the latest E3SM Unified environment.

source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
before_case_log_file=${LOG_PATH}update_before_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${before_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_before_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${before_case_log_file}
# This will show the real time spent

Profile the after-case

Profile using zstash as it exists on this branch.

# Skip this block if you already have a zstash repo:
cd ~
git clone git@github.com:E3SM-Project/zstash.git
cd zstash
git remote -v
# You should see:
# origin	git@github.com:E3SM-Project/zstash.git (fetch)
# origin	git@github.com:E3SM-Project/zstash.git (push)

# Now, get this branch:
git fetch upstream issue-409-optimize-update
git checkout -b issue-409-optimize-update upstream/issue-409-optimize-update

# Now, set up the conda environment
lcrc_conda # Activate conda. This is my function to do so. You may use a different method.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n profile-optimized-zstash-update
conda activate profile-optimized-zstash-update
python -m pip install . # Install the branch's changes into this conda environment.

# Now, we can actually profile, just as we did with the before-case
after_case_log_file=${LOG_PATH}update_after_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${after_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_after_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${after_case_log_file}
# This will show the real time spent

Comparing profiling results

To get the speedup, divide the real time spent: before_case_real_time / after_case_real_time = #x speedup. Assuming you used --dry-run, this will be measuring how much the pre-add-files steps improved. Otherwise, this will be measuring how much the total time improved.

To check the file list accuracy, you can just compare the logs directly (assuming you used --dry-run):

diff ${before_case_log_file} ${after_case_log_file}
# Should show no results (if --dry-run was used)

@forsyth2 forsyth2 marked this pull request as ready for review January 14, 2026 01:57
@forsyth2
Copy link
Copy Markdown
Collaborator Author

@chengzhuzhang @golaz @wlin7 I've created this PR as a simplified code change since #414 had a lot of performance logging in it that won't be necessary on the main branch. I've confirmed the entire test suite passes and the file addition list is unchanged from main. That means this PR is now ready for review.

@chengzhuzhang @golaz if you'd like to do a visual inspection of the code, you can review the file diffs and my explanatory notes above. The implementation is what we've been discussing -- the replacing of os.lstat with os.scandir.

@wlin7 I think we are ready to have you do a production/large/real use case test. I've included directions above. Please let me know if you have any questions.

Thanks all!

Copy link
Copy Markdown
Collaborator

@chengzhuzhang chengzhuzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2 I visually checked the code update and it looks good to me. I added to the PR summary, so that it provides more information for reviewers.

@forsyth2
Copy link
Copy Markdown
Collaborator Author

Thanks @chengzhuzhang!

@wlin7
Copy link
Copy Markdown

wlin7 commented Jan 19, 2026

Hi @forsyth2 . This is the test results using a production simulation (v3.LR.amip_0101). The previously archived (zstash create) data size is 8.4 TB for a total of 39K files. The tests are done by adding the full archive/atm/hist from v3.LR.amip_0151, with data size of 810 GB and 11K files.

Test      Time spent on file gathering Total zstash time
Current 556 seconds 78m 18s
New 152 seconds 72 m 28s

The Current test used zstash from released e3sm_unified. The New test using zstash installed from this PR branch. Time spent on gathering file was counted as the time before creating the first tar ball during zstash update. The total zstash time only counted the time until the final tar ball and the final push to globus. The new zstash reduces file gathering time by 63%.

Copy link
Copy Markdown

@wlin7 wlin7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test results verified the performance gain and the integrity of zstash update.

@forsyth2
Copy link
Copy Markdown
Collaborator Author

Fantastic, thank you for testing, @wlin7!

@forsyth2
Copy link
Copy Markdown
Collaborator Author

I think this should be ready to merge. After that, I can try to set up a dev environment others can access.

Completed:

  • Tests pass. My comment "I've confirmed the entire test suite passes and the file addition list is unchanged from main." was made after adding the last commit
  • Code review from Jill
  • Jill improved PR description
  • Wuyin ran a larger test case

@forsyth2 forsyth2 merged commit 880b9fd into main Jan 22, 2026
5 checks passed
@forsyth2
Copy link
Copy Markdown
Collaborator Author

@wlin7 @xuezhengllnl I've created an environment zstash-user-env-20260122 which I think should be available by running:

conda env create -f /home/ac.forsyth2/user_envs/zstash-user-env-20260122.yml

I typically only create conda environments for myself, so let me know if that doesn't work. In any case, I think @chengzhuzhang has been maintaining a working conda environment for others to use.

Steps I used to set up this environment

Chrysalis:

cd ~/ez/zstash
git status
# nothing to commit, working tree clean
git fetch upstream main
git checkout main
git reset --hard upstream/main
git log --oneline | head -n 2
# 880b9fd Optimize zstash update by using scandir (#420)
# 01ffe34 Add test for tar deletion (#404)
# => Good, we've included the optimization.
lcrc_conda # My function to activate Conda.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-user-env-20260122
conda activate zstash-user-env-20260122
pre-commit run --all-files # Optional since we didn't add any code
python -m pip install .

mkdir ~/user_envs
conda env export > ~/user_envs/zstash-user-env-20260122.yml
ls -l ~/user_envs/zstash-user-env-20260122.yml
# -rw-r--r-- ...

@forsyth2 forsyth2 mentioned this pull request Apr 6, 2026
@forsyth2 forsyth2 mentioned this pull request Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver: small improvement Small improvement (will increment patch version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: speed up zstash update

3 participants