Optimize zstash update by using scandir by forsyth2 · Pull Request #420 · E3SM-Project/zstash

forsyth2 · 2026-01-13T20:47:37Z

Summary

Objectives:

Optimize zstash update by using os.scandir-based directory scanning and by loading archived file metadata from the database in a single query. This reduces both filesystem syscalls and per-file DB queries, improving update throughput for large trees resulting 20-30x speed up in a production-like test.

What changed:

New scanner
    Added DirectoryScanner and get_files_to_archive_with_stats() in zstash/utils.py.
    Uses os.scandir() to collect file entries and stat information during the walk.
    Builds a dict mapping normalized relative path -> (size, mtime).
    Records empty directories (size 0) and normalizes paths to preserve compatibility with existing DB keys.
    Filters include/exclude patterns while preserving the same directory-first, filename-second sort order as before.

Update logic
    zstash/update.py now calls get_files_to_archive_with_stats() and obtains an ordered mapping of filesystem paths -> (size, mtime).
    Loads archived rows once with a single query (SELECT name, size, mtime FROM files) and builds an in-memory dict, keeping the row with the latest mtime if duplicates exist.
    Compares each filesystem entry to the in-memory archived data using O(1) dict lookup:
        Considered unchanged if size matches and abs(mtime_new - mtime_archived) <= TIME_TOL.
        Otherwise considered new/changed and scheduled for archive.

Issue resolution:

Closes [Bug]: speed up zstash update #409.
Implements optimization that was profiled in Performance profiling of zstash update #414.

Select one: This pull request is...

a bug fix: increment the patch version
a small improvement: increment the minor version
a new feature: increment the minor version
an incompatible (non-backwards compatible) API change: increment the major version

Small Change

To merge, I will use "Squash and merge". That is, this change should be a single commit.
Logic: I have visually inspected the entire pull request myself.
Pre-commit checks: All the pre-commits checks have passed.

forsyth2 · 2026-01-13T20:48:52Z

The list order issue found during performance profiling is now resolved.

Added unit tests Perlmutter tests passed Chrysalis tests passed File gathering list order matches that of the main branch

forsyth2

Self-review of code. I've confirmed the entire test suite passes and the file addition list is unchanged from main.

forsyth2 · 2026-01-14T01:45:37Z

@@ -0,0 +1,657 @@
+"""
+Unit tests for the optimized file scanning implementation in zstash.


For this review, I'm mainly focused on the non-test changes. I did confirm these Claude-generated tests pass though.

forsyth2 · 2026-01-14T01:46:42Z

    logger.debug("Keep local tar files  : {}".format(keep))

-    files: List[str] = get_files_to_archive(cache, args.include, args.exclude)
+    file_stats: Dict[str, Tuple[int, datetime]] = get_files_to_archive_with_stats(


Now, we call the optimized get_files_to_archive

forsyth2 · 2026-01-14T01:46:55Z

+    archived_files: Dict[str, Tuple[int, datetime]] = {}
+
+    cur.execute("SELECT name, size, mtime FROM files")
+    db_rows = cur.fetchall()


Fetch all the rows at once

forsyth2 · 2026-01-14T01:48:28Z

+        size: int = row[1]
+        mtime: datetime = row[2]
+
+        # If file appears multiple times, keep the one with latest mtime


forsyth2 · 2026-01-14T01:49:24Z

+        # Get the stat info we already collected during filesystem walk
+        size_new, mdtime_new = file_stats[file_path]


Now we don't have to run os.lstat -- we already have the value in memory.

forsyth2 · 2026-01-14T01:52:40Z

+                    # Normalize the path.
+                    # By building from path + entry.name,
+                    # we guarantee the argument to normpath is constructed
+                    # identically to how os.walk did it in previous code iterations.
+                    normalized_path = os.path.normpath(os.path.join(path, entry.name))
+                    self.file_stats[normalized_path] = (size, mtime)


Important for maintaining sort order

forsyth2 · 2026-01-14T01:52:45Z

+                logger.warning(f"Error accessing {entry.path}: {e}")
+                continue
+
+        # Handle empty directories BEFORE recursing into subdirs


Important for maintaining sort order

forsyth2 · 2026-01-14T01:53:19Z

+        # Note: broken symlinks are caught by not os.path.isfile() but not by os.path.isdir(). We want the latter.
+        if size == 0 and os.path.isdir(path):


Important for maintaining sort order

forsyth2 · 2026-01-14T01:53:33Z

+    # Sort on directory and filename like original
+    path_components.sort(key=lambda x: (x[0], x[1]))


Important for maintaining sort order

forsyth2 · 2026-01-14T01:54:09Z

+    LEGACY VERSION: Still used for `zstash create`.
+    Uses the optimized version but returns only the file list.
+    """
+    file_stats = get_files_to_archive_with_stats(cache, include, exclude)


For zstash create, we just used reduced output from the optimized function

forsyth2 · 2026-01-14T01:56:34Z

How to do a performance profiling

I'm using Perlmutter throughout these directions.

I'll use constants to refer to specific paths. Change these out to work on the machine you're on & to use your username.

HPSS_PATH=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151

PROFILE_PATH=/pscratch/sd/f/forsyth/performance_profiling/
EXTRACTED_DIRECTORY_PATH=${PROFILE_PATH}extraction/
LOG_PATH=${PROFILE_PATH}logs/
CACHE_PATH=${PROFILE_PATH}cache/
INDEX_COPIES_PATH=${PROFILE_PATH}index_copies/

Also note I'm using --dry-run in these directions, but you can remove that option to run the complete zstash update functionality.

Extract an archive

Skip this step if you already have an archive ready to use.

screen # This could run for a while, so let's use `screen`
screen -ls # We need to know what node this screen is on, so we can come back to it.
mkdir ${PROFILE_PATH}
mkdir ${CACHE_PATH}
mkdir ${EXTRACTED_DIRECTORY_PATH}
mkdir ${LOG_PATH}

cd ${EXTRACTED_DIRECTORY_PATH}
# We'll extract using the Unified environment's version of zstash
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
zstash extract --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v 2>&1 | tee ${LOG_PATH}$/extract.log

Be sure to save the index.db as it exists initially

cp ${CACHE_PATH}/index.db ${INDEX_COPIES_PATH}/index_original.db

Profile the before-case

Profile using zstash as it exists in the latest E3SM Unified environment.

source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
before_case_log_file=${LOG_PATH}update_before_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${before_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_before_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${before_case_log_file}
# This will show the real time spent

Profile the after-case

Profile using zstash as it exists on this branch.

# Skip this block if you already have a zstash repo:
cd ~
git clone git@github.com:E3SM-Project/zstash.git
cd zstash
git remote -v
# You should see:
# origin	git@github.com:E3SM-Project/zstash.git (fetch)
# origin	git@github.com:E3SM-Project/zstash.git (push)

# Now, get this branch:
git fetch upstream issue-409-optimize-update
git checkout -b issue-409-optimize-update upstream/issue-409-optimize-update

# Now, set up the conda environment
lcrc_conda # Activate conda. This is my function to do so. You may use a different method.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n profile-optimized-zstash-update
conda activate profile-optimized-zstash-update
python -m pip install . # Install the branch's changes into this conda environment.

# Now, we can actually profile, just as we did with the before-case
after_case_log_file=${LOG_PATH}update_after_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${after_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_after_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${after_case_log_file}
# This will show the real time spent

Comparing profiling results

To get the speedup, divide the real time spent: before_case_real_time / after_case_real_time = #x speedup. Assuming you used --dry-run, this will be measuring how much the pre-add-files steps improved. Otherwise, this will be measuring how much the total time improved.

To check the file list accuracy, you can just compare the logs directly (assuming you used --dry-run):

diff ${before_case_log_file} ${after_case_log_file}
# Should show no results (if --dry-run was used)

forsyth2 · 2026-01-14T02:02:21Z

@chengzhuzhang @golaz @wlin7 I've created this PR as a simplified code change since #414 had a lot of performance logging in it that won't be necessary on the main branch. I've confirmed the entire test suite passes and the file addition list is unchanged from main. That means this PR is now ready for review.

@chengzhuzhang @golaz if you'd like to do a visual inspection of the code, you can review the file diffs and my explanatory notes above. The implementation is what we've been discussing -- the replacing of os.lstat with os.scandir.

@wlin7 I think we are ready to have you do a production/large/real use case test. I've included directions above. Please let me know if you have any questions.

Thanks all!

chengzhuzhang

@forsyth2 I visually checked the code update and it looks good to me. I added to the PR summary, so that it provides more information for reviewers.

forsyth2 · 2026-01-16T18:48:20Z

Thanks @chengzhuzhang!

wlin7 · 2026-01-19T17:42:48Z

Hi @forsyth2 . This is the test results using a production simulation (v3.LR.amip_0101). The previously archived (zstash create) data size is 8.4 TB for a total of 39K files. The tests are done by adding the full archive/atm/hist from v3.LR.amip_0151, with data size of 810 GB and 11K files.

Test	Time spent on file gathering	Total zstash time
Current	556 seconds	78m 18s
New	152 seconds	72 m 28s

The Current test used zstash from released e3sm_unified. The New test using zstash installed from this PR branch. Time spent on gathering file was counted as the time before creating the first tar ball during zstash update. The total zstash time only counted the time until the final tar ball and the final push to globus. The new zstash reduces file gathering time by 63%.

wlin7

Test results verified the performance gain and the integrity of zstash update.

forsyth2 · 2026-01-21T02:56:36Z

Fantastic, thank you for testing, @wlin7!

forsyth2 · 2026-01-22T17:09:35Z

I think this should be ready to merge. After that, I can try to set up a dev environment others can access.

Completed:

Tests pass. My comment "I've confirmed the entire test suite passes and the file addition list is unchanged from main." was made after adding the last commit
Code review from Jill
Jill improved PR description
Wuyin ran a larger test case

forsyth2 · 2026-01-22T17:51:38Z

@wlin7 @xuezhengllnl I've created an environment zstash-user-env-20260122 which I think should be available by running:

conda env create -f /home/ac.forsyth2/user_envs/zstash-user-env-20260122.yml

I typically only create conda environments for myself, so let me know if that doesn't work. In any case, I think @chengzhuzhang has been maintaining a working conda environment for others to use.

Steps I used to set up this environment

Chrysalis:

cd ~/ez/zstash
git status
# nothing to commit, working tree clean
git fetch upstream main
git checkout main
git reset --hard upstream/main
git log --oneline | head -n 2
# 880b9fd Optimize zstash update by using scandir (#420)
# 01ffe34 Add test for tar deletion (#404)
# => Good, we've included the optimization.
lcrc_conda # My function to activate Conda.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-user-env-20260122
conda activate zstash-user-env-20260122
pre-commit run --all-files # Optional since we didn't add any code
python -m pip install .

mkdir ~/user_envs
conda env export > ~/user_envs/zstash-user-env-20260122.yml
ls -l ~/user_envs/zstash-user-env-20260122.yml
# -rw-r--r-- ...

Optimize zstash update by using scandir

9ac1fad

forsyth2 self-assigned this Jan 13, 2026

forsyth2 added the semver: small improvement Small improvement (will increment patch version) label Jan 13, 2026

All tests passing

a704282

Added unit tests Perlmutter tests passed Chrysalis tests passed File gathering list order matches that of the main branch

forsyth2 force-pushed the issue-409-optimize-update branch from 8234b99 to a704282 Compare January 14, 2026 00:55

forsyth2 commented Jan 14, 2026

View reviewed changes

forsyth2 marked this pull request as ready for review January 14, 2026 01:57

forsyth2 mentioned this pull request Jan 14, 2026

Performance profiling of zstash update #414

Closed

16 tasks

forsyth2 requested review from chengzhuzhang, golaz and wlin7 January 14, 2026 02:02

chengzhuzhang approved these changes Jan 15, 2026

View reviewed changes

wlin7 approved these changes Jan 19, 2026

View reviewed changes

forsyth2 merged commit 880b9fd into main Jan 22, 2026
5 checks passed

forsyth2 mentioned this pull request Apr 6, 2026

Bump to 1.6.0rc1 #438

Merged

forsyth2 mentioned this pull request Apr 15, 2026

Bump to 1.6.0rc3 #443

Merged

		@@ -0,0 +1,657 @@
		"""
		Unit tests for the optimized file scanning implementation in zstash.

		# Get the stat info we already collected during filesystem walk
		size_new, mdtime_new = file_stats[file_path]

		# Note: broken symlinks are caught by not os.path.isfile() but not by os.path.isdir(). We want the latter.
		if size == 0 and os.path.isdir(path):

		# Sort on directory and filename like original
		path_components.sort(key=lambda x: (x[0], x[1]))

Conversation

forsyth2 commented Jan 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Small Change

Uh oh!

forsyth2 commented Jan 13, 2026

Uh oh!

forsyth2 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

forsyth2 commented Jan 14, 2026

How to do a performance profiling

Extract an archive

Be sure to save the index.db as it exists initially

Profile the before-case

Profile the after-case

Comparing profiling results

Uh oh!

forsyth2 commented Jan 14, 2026

Uh oh!

chengzhuzhang left a comment

Choose a reason for hiding this comment

Uh oh!

forsyth2 commented Jan 16, 2026

Uh oh!

wlin7 commented Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wlin7 left a comment

Choose a reason for hiding this comment

Uh oh!

forsyth2 commented Jan 21, 2026

Uh oh!

forsyth2 commented Jan 22, 2026

Uh oh!

Uh oh!

forsyth2 commented Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

forsyth2 commented Jan 13, 2026 •

edited

Loading

wlin7 commented Jan 19, 2026 •

edited

Loading