Optimize zstash update by using scandir#420
Conversation
|
The list order issue found during performance profiling is now resolved. |
Added unit tests Perlmutter tests passed Chrysalis tests passed File gathering list order matches that of the main branch
8234b99 to
a704282
Compare
forsyth2
left a comment
There was a problem hiding this comment.
Self-review of code. I've confirmed the entire test suite passes and the file addition list is unchanged from main.
| @@ -0,0 +1,657 @@ | |||
| """ | |||
| Unit tests for the optimized file scanning implementation in zstash. | |||
There was a problem hiding this comment.
For this review, I'm mainly focused on the non-test changes. I did confirm these Claude-generated tests pass though.
| logger.debug("Keep local tar files : {}".format(keep)) | ||
|
|
||
| files: List[str] = get_files_to_archive(cache, args.include, args.exclude) | ||
| file_stats: Dict[str, Tuple[int, datetime]] = get_files_to_archive_with_stats( |
There was a problem hiding this comment.
Now, we call the optimized get_files_to_archive
| archived_files: Dict[str, Tuple[int, datetime]] = {} | ||
|
|
||
| cur.execute("SELECT name, size, mtime FROM files") | ||
| db_rows = cur.fetchall() |
There was a problem hiding this comment.
Fetch all the rows at once
| size: int = row[1] | ||
| mtime: datetime = row[2] | ||
|
|
||
| # If file appears multiple times, keep the one with latest mtime |
| # Get the stat info we already collected during filesystem walk | ||
| size_new, mdtime_new = file_stats[file_path] |
There was a problem hiding this comment.
Now we don't have to run os.lstat -- we already have the value in memory.
| # Normalize the path. | ||
| # By building from path + entry.name, | ||
| # we guarantee the argument to normpath is constructed | ||
| # identically to how os.walk did it in previous code iterations. | ||
| normalized_path = os.path.normpath(os.path.join(path, entry.name)) | ||
| self.file_stats[normalized_path] = (size, mtime) |
There was a problem hiding this comment.
Important for maintaining sort order
| logger.warning(f"Error accessing {entry.path}: {e}") | ||
| continue | ||
|
|
||
| # Handle empty directories BEFORE recursing into subdirs |
There was a problem hiding this comment.
Important for maintaining sort order
| # Note: broken symlinks are caught by not os.path.isfile() but not by os.path.isdir(). We want the latter. | ||
| if size == 0 and os.path.isdir(path): |
There was a problem hiding this comment.
Important for maintaining sort order
| # Sort on directory and filename like original | ||
| path_components.sort(key=lambda x: (x[0], x[1])) |
There was a problem hiding this comment.
Important for maintaining sort order
| LEGACY VERSION: Still used for `zstash create`. | ||
| Uses the optimized version but returns only the file list. | ||
| """ | ||
| file_stats = get_files_to_archive_with_stats(cache, include, exclude) |
There was a problem hiding this comment.
For zstash create, we just used reduced output from the optimized function
How to do a performance profilingI'm using Perlmutter throughout these directions. I'll use constants to refer to specific paths. Change these out to work on the machine you're on & to use your username. HPSS_PATH=/home/projects/e3sm/www/CoupledSystem/E3SMv3/LR/v3.LR.amip_0151
PROFILE_PATH=/pscratch/sd/f/forsyth/performance_profiling/
EXTRACTED_DIRECTORY_PATH=${PROFILE_PATH}extraction/
LOG_PATH=${PROFILE_PATH}logs/
CACHE_PATH=${PROFILE_PATH}cache/
INDEX_COPIES_PATH=${PROFILE_PATH}index_copies/Also note I'm using Extract an archiveSkip this step if you already have an archive ready to use. screen # This could run for a while, so let's use `screen`
screen -ls # We need to know what node this screen is on, so we can come back to it.
mkdir ${PROFILE_PATH}
mkdir ${CACHE_PATH}
mkdir ${EXTRACTED_DIRECTORY_PATH}
mkdir ${LOG_PATH}
cd ${EXTRACTED_DIRECTORY_PATH}
# We'll extract using the Unified environment's version of zstash
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
zstash extract --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v 2>&1 | tee ${LOG_PATH}$/extract.logBe sure to save the index.db as it exists initiallycp ${CACHE_PATH}/index.db ${INDEX_COPIES_PATH}/index_original.dbProfile the before-caseProfile using zstash as it exists in the latest E3SM Unified environment. source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh # Activate E3SM Unified
before_case_log_file=${LOG_PATH}update_before_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${before_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_before_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${before_case_log_file}
# This will show the real time spentProfile the after-caseProfile using zstash as it exists on this branch. # Skip this block if you already have a zstash repo:
cd ~
git clone git@github.com:E3SM-Project/zstash.git
cd zstash
git remote -v
# You should see:
# origin git@github.com:E3SM-Project/zstash.git (fetch)
# origin git@github.com:E3SM-Project/zstash.git (push)
# Now, get this branch:
git fetch upstream issue-409-optimize-update
git checkout -b issue-409-optimize-update upstream/issue-409-optimize-update
# Now, set up the conda environment
lcrc_conda # Activate conda. This is my function to do so. You may use a different method.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n profile-optimized-zstash-update
conda activate profile-optimized-zstash-update
python -m pip install . # Install the branch's changes into this conda environment.
# Now, we can actually profile, just as we did with the before-case
after_case_log_file=${LOG_PATH}update_after_changes.log
cd ${EXTRACTED_DIRECTORY_PATH} # Must run update from the archived directory!
zstash update --hpss=${HPSS_PATH} --cache=${CACHE_PATH} -v --dry-run 2>&1 | tee ${after_case_log_file}
mv ${CACHE_PATH}index.db ${INDEX_COPIES_PATH}index_after_changes.db # Optional, only if you want to keep the index
cp ${INDEX_COPIES_PATH}/index_original.db ${CACHE_PATH}index.db # Reset the index!!
grep -n real ${after_case_log_file}
# This will show the real time spentComparing profiling resultsTo get the speedup, divide the real time spent: To check the file list accuracy, you can just compare the logs directly (assuming you used diff ${before_case_log_file} ${after_case_log_file}
# Should show no results (if --dry-run was used) |
|
@chengzhuzhang @golaz @wlin7 I've created this PR as a simplified code change since #414 had a lot of performance logging in it that won't be necessary on the @chengzhuzhang @golaz if you'd like to do a visual inspection of the code, you can review the file diffs and my explanatory notes above. The implementation is what we've been discussing -- the replacing of @wlin7 I think we are ready to have you do a production/large/real use case test. I've included directions above. Please let me know if you have any questions. Thanks all! |
chengzhuzhang
left a comment
There was a problem hiding this comment.
@forsyth2 I visually checked the code update and it looks good to me. I added to the PR summary, so that it provides more information for reviewers.
|
Thanks @chengzhuzhang! |
|
Hi @forsyth2 . This is the test results using a production simulation (
The |
wlin7
left a comment
There was a problem hiding this comment.
Test results verified the performance gain and the integrity of zstash update.
|
Fantastic, thank you for testing, @wlin7! |
|
I think this should be ready to merge. After that, I can try to set up a dev environment others can access. Completed:
|
|
@wlin7 @xuezhengllnl I've created an environment conda env create -f /home/ac.forsyth2/user_envs/zstash-user-env-20260122.ymlI typically only create Steps I used to set up this environmentChrysalis: cd ~/ez/zstash
git status
# nothing to commit, working tree clean
git fetch upstream main
git checkout main
git reset --hard upstream/main
git log --oneline | head -n 2
# 880b9fd Optimize zstash update by using scandir (#420)
# 01ffe34 Add test for tar deletion (#404)
# => Good, we've included the optimization.
lcrc_conda # My function to activate Conda.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-user-env-20260122
conda activate zstash-user-env-20260122
pre-commit run --all-files # Optional since we didn't add any code
python -m pip install .
mkdir ~/user_envs
conda env export > ~/user_envs/zstash-user-env-20260122.yml
ls -l ~/user_envs/zstash-user-env-20260122.yml
# -rw-r--r-- ... |
Summary
Objectives:
What changed:
Issue resolution:
Select one: This pull request is...
Small Change