Commit a4f0b1e

Merge pull request #152 from JinZhou5042/jinzhou
update current status
2 parents dd8e52e + 4828055

1 file changed

Lines changed: 79 additions & 18 deletions


pages/postdocs/jinzhou.md

institution: University of Notre Dame
e-mail: jzhou24@nd.edu
project_title: Scalable Data Analysis Applications for High Energy Physics
project_goal: >
CMS analysis workflows that run on opportunistic clusters (e.g., HTCondor) often use node-local storage for intermediate data. At scale, worker failures and disk pressure make these runs unreliable. This project aimed to make such workflows manageable in practice—not to remove the trade-offs, but to control storage growth and recovery so that large campaigns finish predictably. We worked in the Coffea/Dask-on-TaskVine stack and developed SciWIND: mechanisms for storage minimization, efficient recovery, and resilience tuning. The goal was to reduce failed runs and turnaround time for HL-LHC-era analyses.
mentors:
- Douglas Thain (Cooperative Computing Lab, University of Notre Dame)
- Kevin Lannon (Department of Physics, University of Notre Dame)

presentations:
- title: "Improving the Scalability and Reliability of CMS Data Analysis Workflows in Python"
date: "May 25, 2025"
url: https://docs.google.com/presentation/d/1N6ZYhrThoTdHjhl35eenNdW-0hMPX89Yy3cOQiM3FSg/edit?usp=sharing
meeting: "HEP Weekly Meeting"
meetingurl: null
- title: "Effectively Exploiting Node-Local Storage for Data-Intensive Scientific Workflows"
date: "May 8, 2025"
url: https://docs.google.com/presentation/d/1pU2AKSimx8v5k0S4Ba28axxYclB_dXP63ZxRcSo-fP0/edit?usp=sharing
meeting: "12th Greater Chicago Area Systems Research Workshop (GCASR)"
meetingurl: https://gcasr.org/2025/
- title: "Effectively Exploiting Node-Local Storage for Data-Intensive Scientific Workflows"
date: "Feb 27, 2025"
url: https://docs.google.com/presentation/d/1pU2AKSimx8v5k0S4Ba28axxYclB_dXP63ZxRcSo-fP0/edit?usp=sharing
meeting: "17th Annual CSE Poster Presentation"
meetingurl: null
current_status: >
<br>
<b>2025 Q4 </b>
<br>
* Paper and final results
* Paper "Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows" accepted at IPDPS 2026 after revised submission and expanded evidence (following rejections at IPDPS 2025 and SC'25).
* Consolidated SciWIND framework (NLS minimization, efficient recovery, resilience reinforcement). Validated on TopEFT, RSTriPhoton, and DV5 under repeated failure conditions.
* Final results: up to 99.0% peak-NLS reduction, 70.1% makespan reduction, 99.8% recovery-task reduction. In the DV5 hero run, recovery tasks dropped from 2.45M to ~197K, makespan from ~37.9K s to ~10.9K s, total NLS footprint from ~8.1 TB to ~428 GB.
* Lightweight hybrid settings (e.g., PD2+RC2+CP10%) were the most reliable operating point in repeated runs. Heavier settings sometimes improved one axis but added noticeable pruning or metadata overhead.
* Lessons and recommendations
* Failure handling has to be treated as a normal execution path, not an exceptional case. Queue and storage traces during runtime are as important as final summary numbers.
* Keep figure-driven evaluation as standard practice: architecture, storage traces, concurrency traces, and full trade-off tables should be reported together.
* Next steps
* Build adaptive online tuning for PD/RC/CP based on observed DAG shape and failure patterns.
* Extend policy interfaces so similar mechanisms can be adopted by other workflow engines. Continue scaling studies on broader CMS workflows to validate transferability beyond the current three benchmarks.
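As a sanity check, the DV5 hero-run reductions quoted above can be recomputed from the raw figures. A quick sketch; the quoted values are rounded and the "up to" percentages are maxima across all three workflows, so the recomputed numbers differ slightly from the headline figures:

```python
def reduction(before, after):
    """Percent reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

# DV5 hero-run figures quoted in the report (rounded)
recovery = reduction(2_450_000, 197_000)  # recovery tasks: 2.45M -> ~197K
makespan = reduction(37_900, 10_900)      # seconds: ~37.9K -> ~10.9K
storage  = reduction(8_100, 428)          # GB: ~8.1 TB -> ~428 GB

print(f"recovery tasks: {recovery:.1f}% reduction")  # ~92.0%
print(f"makespan:       {makespan:.1f}% reduction")  # ~71.2%
print(f"peak NLS:       {storage:.1f}% reduction")   # ~94.7%
```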
<br>
<b>2025 Q3 </b>
<br>
* After the SC'25 rejection (the submission was in Q2)
* Continued figure-driven evaluation: storage trajectories per worker, task concurrency dynamics (waiting vs executing), full trade-off tables for PD, RC, CP and hybrid policies across TopEFT, RSTriPhoton, and DV5.
* Several early policy variants looked good in aggregate metrics but still produced long scheduler stalls mid-run. Queue separation plus prioritized recovery was the first version that consistently removed those stalls.
* TaskVine runtime and tooling
* Significant runtime fixes: queue behavior under mixed runnable and non-runnable tasks, cache-state races, and transfer-path issues that only surfaced under heavy worker-to-worker traffic. Some failures looked like scheduler policy problems at first but were actually runtime bugs. Fixing those changed stability more than any single tuning knob.
* Expanded the report toolchain so large logs are parsed once and reused for repeated analysis and plotting. That substantially improved iteration speed when runs were long and log volumes huge.
* Next steps
* Improve scheduling efficiency by better handling pending and ready tasks—an issue that has caused severe slowdowns on unreliable clusters.
* Finalize stable Conda release for TaskVine so all users have a reliable build.
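The queue-separation fix described above can be sketched as follows. This is an illustrative structure, not TaskVine's actual implementation, and all names are hypothetical; the point is that dispatch only ever touches runnable tasks, and recovery tasks drain first:

```python
import heapq

RECOVERY, NORMAL = 0, 1  # lower value = higher dispatch priority

class TwoQueueScheduler:
    """Sketch: keep runnable tasks in a priority heap and non-runnable
    tasks in a separate waiting set, so the dispatch loop never scans
    tasks that cannot run yet (the source of the mid-run stalls)."""

    def __init__(self):
        self._ready = []       # heap of (priority, seq, task)
        self._waiting = set()  # tasks with unmet dependencies
        self._seq = 0          # tie-breaker for stable FIFO ordering

    def submit(self, task, runnable, recovery=False):
        if runnable:
            prio = RECOVERY if recovery else NORMAL
            heapq.heappush(self._ready, (prio, self._seq, task))
            self._seq += 1
        else:
            self._waiting.add(task)

    def make_runnable(self, task, recovery=False):
        """Called when a task's dependencies become available."""
        self._waiting.discard(task)
        self.submit(task, runnable=True, recovery=recovery)

    def next_task(self):
        """Dispatch: recovery tasks drain before normal ones."""
        return heapq.heappop(self._ready)[2] if self._ready else None

sched = TwoQueueScheduler()
sched.submit("t1", runnable=True)
sched.submit("t2", runnable=False)               # parked, not scanned
sched.submit("r1", runnable=True, recovery=True)
assert sched.next_task() == "r1"                 # recovery first
```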
<br>
<b>2025 Q2 </b>
<br>
* Paper and resilience mechanisms
* Paper "Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows" submitted to SC'25.
* Implemented checkpointing and replication strategies in TaskVine. Both significantly improve workflow performance on unreliable clusters. Parameter sweeps over Prune Depth (PD), Replication Count (RC), and Checkpoint Percentage (CP), plus hybrid settings (e.g., PD2+RC2+CP10%), were used to evaluate practical operating points.
* NLS usage and storage control
* The shift from baseline to load shifting, then pruning, then LIF scheduling is clear in per-worker storage trajectories: peaks drop and worst-worker imbalance narrows. In practice, aggressive NLS usage is feasible only when cleanup and scheduling are coupled. Running either piece alone was usually not enough.
* TaskVine scale and tooling
* Resolved fundamental issues and inefficiencies in TaskVine. The scheduler now handles very large workflows efficiently. Our most recent success was completing an 8-million-task workflow in 20 hours.
* Developing a web-based visualization tool for TaskVine logs, optimized for fast log parsing, CSV generation, and displaying key statistics. Available on [GitHub](https://github.com/cooperative-computing-lab/taskvine-report-tool).
* Next steps
* Discussed with team members how to improve scheduling efficiency by better handling pending and ready tasks—an issue that has caused severe slowdowns on unreliable clusters and remained unresolved for over half a year.
* Finalize recent fixes and improvements in TaskVine and aim for a stable Conda release by the end of June so all users can rely on it.
* Study the implications and challenges when scheduling massive workflows with millions of tasks.
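The PD/RC/CP study above amounts to enumerating a grid of operating points and running each one. A minimal sketch; the specific sweep values here are illustrative, not the ones used in the experiments:

```python
from itertools import product

prune_depths   = [1, 2, 4]    # PD: how aggressively intermediates are pruned
replica_counts = [1, 2, 3]    # RC: copies of each temp file across workers
ckpt_percents  = [0, 10, 25]  # CP: % of temp files checkpointed to the PFS

def sweep():
    """Yield every (PD, RC, CP) operating point, labeled like PD2+RC2+CP10%."""
    for pd, rc, cp in product(prune_depths, replica_counts, ckpt_percents):
        yield {"PD": pd, "RC": rc, "CP": cp,
               "label": f"PD{pd}+RC{rc}+CP{cp}%"}

configs = list(sweep())
print(len(configs), "operating points")  # 3 * 3 * 3 = 27
```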
<br>
<b>2025 Q1 </b>
<br>
* NLS minimization and recovery
* Developed the large-input first (LIF) algorithm and the pruning algorithm, which effectively reduce storage consumption by over 90% while running hundreds of thousands of tasks.
* Enhanced resource allocation and temp file replication on the task scheduler side. Work moved from ad hoc tuning toward a repeatable engineering loop.
* Established configuration names for experiments: No SciWIND (baseline), SciWIND-Min (lightweight reference), SciWIND-Core (reference for resilience analysis with PD=1, RC=1, CP=0%), and PD/RC/CP sweeps and hybrid settings for practical operating points.
* Paper and next submission
* Submitted the paper to IPDPS 2025, though it was rejected; we plan to revise and resubmit to SC'25.
* Next steps
* Sketch a revised paper about effectively using limited storage to accomplish enormous computations, incorporating feedback from IPDPS 2025.
* Develop an algorithm that divides long-running tasks in DV5 into smaller ones to reduce rerun overhead on worker evictions, while balancing the added latency of scheduling many small tasks.
* Promptly checkpoint remote temp files to reduce the risk of losing critical files.
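The large-input-first (LIF) idea noted above can be sketched in a few lines: among ready tasks, prefer the one whose inputs occupy the most node-local storage, so those inputs can be pruned as soon as their last consumer finishes. An illustrative sketch, not the TaskVine scheduler code:

```python
from collections import namedtuple

Task = namedtuple("Task", ["name", "inputs"])

def pick_next(ready_tasks, input_size):
    """Large-input-first (LIF): pick the ready task whose inputs hold the
    most node-local storage, so big intermediates are consumed (and become
    prunable) as early as possible."""
    return max(ready_tasks, key=lambda t: sum(input_size[f] for f in t.inputs))

sizes = {"a": 900, "b": 50, "c": 40}          # MB held on node-local storage
ready = [Task("small", ["b", "c"]), Task("big", ["a"])]
assert pick_next(ready, sizes).name == "big"  # 900 MB beats 90 MB
```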
<br>
<b>2024 Q4 </b>
<br>
* Project start and operating model
* Project start (August 2024). Set up operating model with TaskVine manager/scheduler, HTCondor workers, node-local storage (NLS), and shared parallel file system (PFS).
* Experiments run with TaskVine on HTCondor. Chose three workloads—TopEFT, RSTriPhoton, and DV5—as benchmarks with different depth, intermediate volume, and failure sensitivity so that a setting that looks good on one can fail on another.
* Experimental setup and metrics
* Injected worker losses in a controlled way so that comparisons were interpretable. Decided to track four metrics throughout: recovery task count, makespan, storage peak, and pruning overhead. Using only makespan repeatedly hid important regressions during development, so all four are needed.
* Began baseline studies and failure-injection experiments to pin down baseline behavior and naming (e.g., No SciWIND / Initial) before adding new mechanisms.
* Next steps
* Implement first SciWIND mechanisms: NLS minimization and recovery scheduling. Move from ad hoc tuning to a repeatable engineering loop.
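The four-metric discipline described above can be made concrete with a small record type, so a win on makespan cannot hide a regression elsewhere. Field and method names here are hypothetical, not the project's:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """One failure-injection run, tracked on all four axes."""
    recovery_tasks: int       # tasks re-executed after worker loss
    makespan_s: float         # wall-clock end-to-end time
    peak_storage_gb: float    # worst per-worker NLS peak
    pruning_overhead_s: float # time spent deciding/doing cleanup

    def regressed_vs(self, base: "RunMetrics") -> list:
        """Name every axis that got worse relative to a baseline run."""
        axes = ("recovery_tasks", "makespan_s",
                "peak_storage_gb", "pruning_overhead_s")
        return [a for a in axes if getattr(self, a) > getattr(base, a)]

base = RunMetrics(1000, 5000.0, 2.0, 30.0)
run  = RunMetrics(800, 4000.0, 2.5, 30.0)
assert run.regressed_vs(base) == ["peak_storage_gb"]  # faster, but more storage
```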
---
