feat: Phase 3 DORA metrics and job lifecycle tracking #1136
GrammaTonic merged 1 commit into develop from
Conversation
- Add native runner hook scripts (job-started.sh, job-completed.sh) using ACTIONS_RUNNER_HOOK_JOB_STARTED/COMPLETED env vars
- CSV jobs.log format: timestamp,job_id,status,duration_seconds,queue_time_seconds
- Rewrite metrics-collector.sh with histogram, queue time, cache stubs
- Update entrypoints and all 3 Dockerfiles with hook integration
- Replace Grafana overview dashboard with 24-panel DORA version
- Add dora-metrics.json and job-analysis.json dashboards
- Add integration test (33 assertions passing)
- Add Phase 3 documentation and update plan

Closes #1061
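For illustration, a line in the jobs.log CSV format described above can be split into its five fields with a plain `read`; the sample entry below is invented, not taken from a real log:

```shell
#!/usr/bin/env bash
# Parse one jobs.log line (timestamp,job_id,status,duration_seconds,queue_time_seconds).
# The sample line is illustrative only.
line="2024-01-01T00:00:00Z,1234_build,success,42,7"
IFS=',' read -r ts job_id status duration queue_time <<< "$line"
echo "job=${job_id} status=${status} duration=${duration}s queue=${queue_time}s"
```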
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements Phase 3 of the DORA metrics and job lifecycle tracking initiative, significantly enhancing the monitoring capabilities for GitHub Actions self-hosted runners. It introduces granular tracking of job durations and queue times through new lifecycle hooks, enriches Prometheus metrics with detailed performance indicators, and provides a suite of new Grafana dashboards for in-depth visualization and analysis of software delivery performance.

Highlights
Activity
Code Review
This pull request introduces significant new functionality for DORA metrics and job lifecycle tracking, including new hook scripts, a rewritten metrics collector, and new Grafana dashboards. However, it contains critical security vulnerabilities, including a Python injection vulnerability in job-completed.sh that could allow arbitrary code execution, and insecure temporary file handling in /tmp that could lead to symlink attacks. Additionally, there are race conditions in jobs.log access, a bug in histogram calculation, and insufficient integration tests. Addressing these issues is crucial before merging.
if [[ ! -f "$JOBS_LOG" ]] || [[ ! -s "$JOBS_LOG" ]]; then
  echo "0"
  return
fi

# Read queue times from completed jobs (field 5 = queue_time_seconds)
while IFS=',' read -r _ts _id status _duration queue_time; do
  [[ "$status" == "running" ]] && continue
  [[ -z "$queue_time" ]] && continue
  if ! [[ "$queue_time" =~ ^[0-9]+$ ]]; then
    continue
  fi

  total_queue=$((total_queue + queue_time))
  queue_count=$((queue_count + 1))

  if [[ "$queue_count" -ge "$max_jobs" ]]; then
    break
  fi
done < <(tail -n "$max_jobs" "$JOBS_LOG" 2>/dev/null | grep -v '^$' || true)
Similar to calculate_histogram, this function reads jobs.log without a lock, creating a critical race condition. The collector could read the file while job-completed.sh is in the middle of a non-atomic update, leading to inconsistent data and incorrect queue time calculations.
You should wrap the file-reading logic in a flock block to ensure a consistent view of the data.
( flock 200
  if [[ ! -f "$JOBS_LOG" ]] || [[ ! -s "$JOBS_LOG" ]]; then
    echo "0"
    return
  fi
  # Read queue times from completed jobs (field 5 = queue_time_seconds)
  while IFS=',' read -r _ts _id status _duration queue_time; do
    [[ "$status" == "running" ]] && continue
    [[ -z "$queue_time" ]] && continue
    if ! [[ "$queue_time" =~ ^[0-9]+$ ]]; then
      continue
    fi
    total_queue=$((total_queue + queue_time))
    queue_count=$((queue_count + 1))
    if [[ "$queue_count" -ge "$max_jobs" ]]; then
      break
    fi
  done < <(tail -n "$max_jobs" "$JOBS_LOG" 2>/dev/null | grep -v '^$' || true)
) 200>"${JOBS_LOG}.lock"

if date -d "$ts" +%s 2>/dev/null; then
  return
fi
python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('${ts}'.replace('Z','+00:00')).timestamp()))" 2>/dev/null || echo "0"
The iso_to_epoch function is vulnerable to Python injection. The variable ${ts} is directly interpolated into a Python script string passed to python3 -c. If an attacker can control the input to this function (e.g., via the GITHUB_RUN_CREATED_AT environment variable), they can execute arbitrary Python code by breaking out of the single quotes.
To remediate this, pass the timestamp as a command-line argument to the Python script and access it via sys.argv.
Suggested change:
- python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('${ts}'.replace('Z','+00:00')).timestamp()))" 2>/dev/null || echo "0"
+ python3 -c "import sys; from datetime import datetime; print(int(datetime.fromisoformat(sys.argv[1].replace('Z','+00:00')).timestamp()))" "$ts" 2>/dev/null || echo "0"
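As a self-contained sketch of why the argv-based fix is safe (the function name is illustrative, not the PR's exact code): the timestamp reaches Python as `sys.argv[1]`, so quotes or code in the input can never become part of the Python source text being executed, and malformed input simply falls through to the `"0"` fallback:

```shell
#!/usr/bin/env bash
# Pass the timestamp as an argument instead of interpolating it into the -c string.
iso_to_epoch_safe() {
  local ts="$1"
  python3 -c '
import sys
from datetime import datetime
print(int(datetime.fromisoformat(sys.argv[1].replace("Z", "+00:00")).timestamp()))
' "$ts" 2>/dev/null || echo "0"
}

iso_to_epoch_safe "2024-01-01T00:00:00Z"    # prints 1704067200
iso_to_epoch_safe "'); __import__('os') #"  # injection attempt is just a bad timestamp: prints 0
```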
# Write a preliminary entry to jobs.log (status=running, duration/queue_time TBD)
# Final entry with duration and status is written by job-completed.sh
# Format: timestamp,job_id,status,duration_seconds,queue_time_seconds
echo "${timestamp},${job_id},running,0,0" >>"$JOBS_LOG"
This script has a race condition when writing to jobs.log. While echo >> is often atomic for single writes, job-completed.sh performs a non-atomic read-modify-write operation on the same file. To prevent data corruption, all accesses to jobs.log should be synchronized using a lock.
I recommend using flock here to ensure that this write operation does not conflict with operations in job-completed.sh or metrics-collector.sh.
Suggested change:
- # Write a preliminary entry to jobs.log (status=running, duration/queue_time TBD)
- # Final entry with duration and status is written by job-completed.sh
- # Format: timestamp,job_id,status,duration_seconds,queue_time_seconds
- echo "${timestamp},${job_id},running,0,0" >>"$JOBS_LOG"
+ # Use a lock file to ensure atomic updates to jobs.log
+ ( flock 200
+   # Write a preliminary entry to jobs.log (status=running, duration/queue_time TBD)
+   # Final entry with duration and status is written by job-completed.sh
+   # Format: timestamp,job_id,status,duration_seconds,queue_time_seconds
+   echo "${timestamp},${job_id},running,0,0" >>"$JOBS_LOG"
+ ) 200>"${JOBS_LOG}.lock"
if [[ ! -f "$JOBS_LOG" ]] || [[ ! -s "$JOBS_LOG" ]]; then
  return
fi

# Read completed job durations (field 4 = duration_seconds)
# Skip running entries and empty lines
while IFS=',' read -r _ts _id status duration _queue; do
  # Skip running/incomplete entries
  [[ "$status" == "running" ]] && continue
  [[ -z "$duration" ]] && continue

  # Validate duration is numeric
  if ! [[ "$duration" =~ ^[0-9]+$ ]]; then
    continue
  fi

  # Increment sum and count
  sum_ref=$((sum_ref + duration))
  count_ref=$((count_ref + 1))

  # Increment histogram buckets (cumulative)
  for i in "${!HISTOGRAM_BUCKETS[@]}"; do
    if [[ "$duration" -le "${HISTOGRAM_BUCKETS[$i]}" ]]; then
      bucket_counts_ref[$i]=$((bucket_counts_ref[$i] + 1))
    fi
  done
  # +Inf bucket always increments
  bucket_counts_ref[${#HISTOGRAM_BUCKETS[@]}]=$((bucket_counts_ref[${#HISTOGRAM_BUCKETS[@]}] + 1))
done < <(grep -v '^$' "$JOBS_LOG" 2>/dev/null || true)
This function reads from jobs.log without any locking. Since job-started.sh and job-completed.sh can modify this file at any time, there's a risk of reading an incomplete or inconsistent state, which would lead to incorrect metrics. This is a critical race condition.
To ensure data consistency, you should use flock to acquire a lock on the log file before reading it. This will synchronize access with the writer scripts (assuming they are also updated to use flock).
( flock 200
  if [[ ! -f "$JOBS_LOG" ]] || [[ ! -s "$JOBS_LOG" ]]; then
    return
  fi
  # Read completed job durations (field 4 = duration_seconds)
  # Skip running entries and empty lines
  while IFS=',' read -r _ts _id status duration _queue; do
    # Skip running/incomplete entries
    [[ "$status" == "running" ]] && continue
    [[ -z "$duration" ]] && continue
    # Validate duration is numeric
    if ! [[ "$duration" =~ ^[0-9]+$ ]]; then
      continue
    fi
    # Increment sum and count
    sum_ref=$((sum_ref + duration))
    count_ref=$((count_ref + 1))
    # Increment histogram buckets (cumulative)
    for i in "${!HISTOGRAM_BUCKETS[@]}"; do
      if [[ "$duration" -le "${HISTOGRAM_BUCKETS[$i]}" ]]; then
        bucket_counts_ref[$i]=$((bucket_counts_ref[$i] + 1))
      fi
    done
    # +Inf bucket always increments
    bucket_counts_ref[${#HISTOGRAM_BUCKETS[@]}]=$((bucket_counts_ref[${#HISTOGRAM_BUCKETS[@]}] + 1))
  done < <(grep -v '^$' "$JOBS_LOG" 2>/dev/null || true)
) 200>"${JOBS_LOG}.lock"

# Make buckets cumulative (each bucket includes all smaller buckets)
# The above loop already counts per-bucket, but Prometheus requires cumulative
# So we need to accumulate: bucket[i] += bucket[i-1]
for ((i = 1; i < ${#HISTOGRAM_BUCKETS[@]}; i++)); do
  bucket_counts_ref[$i]=$((bucket_counts_ref[$i] + bucket_counts_ref[$((i - 1))]))
done
There is a critical bug in the histogram calculation. The first loop (lines 121-125) already calculates the cumulative counts for the histogram buckets, which is what Prometheus expects. This second loop then incorrectly attempts to make the buckets cumulative again, leading to double-counting and incorrect metric values.
The comment on line 131 is misleading; the preceding loop is not just counting per-bucket. This block should be removed.
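The inflation is easy to demonstrate with a small self-contained sketch (the bucket bounds 60/300 and the durations are illustrative, not the PR's actual values). Three jobs of 30s, 120s, and 400s should yield cumulative counts of 1 (≤60s), 2 (≤300s), and 3 (+Inf); the redundant second pass pushes the ≤300s bucket to 3:

```shell
#!/usr/bin/env bash
BUCKETS=(60 300)
durations=(30 120 400)
counts=(0 0 0)  # le60, le300, +Inf

# First loop (mirrors the PR's per-job loop): increments EVERY bucket whose
# upper bound covers the duration, which is already a cumulative count.
for d in "${durations[@]}"; do
  for i in "${!BUCKETS[@]}"; do
    if [[ "$d" -le "${BUCKETS[$i]}" ]]; then
      counts[$i]=$((counts[$i] + 1))
    fi
  done
  counts[2]=$((counts[2] + 1))  # +Inf bucket always increments
done
echo "after first loop:     ${counts[*]}"  # 1 2 3 -- correct cumulative counts

# Redundant second pass (the buggy block): accumulates again and double-counts.
for ((i = 1; i < ${#BUCKETS[@]}; i++)); do
  counts[$i]=$((counts[$i] + counts[$((i - 1))]))
done
echo "after redundant pass: ${counts[*]}"  # 1 3 3 -- le300 now overstates the count
```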
cat > "$TMPDIR_TEST/run-completed.sh" << 'WRAPPER'
#!/usr/bin/env bash
set -euo pipefail

TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
JOB_ID="${GITHUB_RUN_ID}_${GITHUB_JOB}"
START_FILE="${JOB_STATE_DIR}/${JOB_ID}.start"

# Calculate duration
if [[ -f "$START_FILE" ]]; then
  START_EPOCH=$(cat "$START_FILE")
  NOW_EPOCH=$(date +%s)
  DURATION=$((NOW_EPOCH - START_EPOCH))
else
  DURATION=0
fi

# Get status
STATUS="${GITHUB_JOB_STATUS:-failed}"

# Calculate queue time (from run creation to job start)
QUEUE_TIME=0
if [[ -n "${GITHUB_RUN_CREATED_AT:-}" && -f "$START_FILE" ]]; then
  START_EPOCH=$(cat "$START_FILE")
  # Convert ISO timestamp to epoch
  if date --version >/dev/null 2>&1; then
    CREATED_EPOCH=$(date -u -d "${GITHUB_RUN_CREATED_AT}" +%s 2>/dev/null || echo "0")
  else
    CREATED_EPOCH=$(date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "${GITHUB_RUN_CREATED_AT}" +%s 2>/dev/null || echo "0")
  fi
  if [[ "$CREATED_EPOCH" -gt 0 ]]; then
    QUEUE_TIME=$((START_EPOCH - CREATED_EPOCH))
    [[ "$QUEUE_TIME" -lt 0 ]] && QUEUE_TIME=0
  fi
fi

# Remove running entry
if [[ -f "${JOBS_LOG}" ]]; then
  grep -v "${JOB_ID},running" "${JOBS_LOG}" > "${JOBS_LOG}.tmp" || true
  mv "${JOBS_LOG}.tmp" "${JOBS_LOG}"
fi

# Write final entry
echo "${TIMESTAMP},${JOB_ID},${STATUS},${DURATION},${QUEUE_TIME}" >> "${JOBS_LOG}"

# Cleanup state
rm -f "$START_FILE"

echo "Job completed: ${JOB_ID} status=${STATUS} duration=${DURATION}s queue=${QUEUE_TIME}s"
WRAPPER
chmod +x "$TMPDIR_TEST/run-completed.sh"

bash "$TMPDIR_TEST/run-completed.sh"
Similar to the test for job-started.sh, this test for job-completed.sh uses a wrapper that re-implements the script's logic instead of executing the actual script. This approach is fragile and doesn't provide true integration testing. For example, the wrapper expects an epoch timestamp in the .start file, but the real job-started.sh writes an ISO timestamp, which would break the real job-completed.sh.
The test should be updated to call the actual $JOB_COMPLETED script directly to ensure it is testing the production code.
Suggested change: delete the entire wrapper setup (the `cat > "$TMPDIR_TEST/run-completed.sh" << 'WRAPPER'` heredoc, the `chmod +x`, and the `bash "$TMPDIR_TEST/run-completed.sh"` invocation shown above) and invoke the real script instead:
+ bash "$JOB_COMPLETED"
COPY --chown=runner:runner job-started.sh /usr/local/bin/job-started.sh
COPY --chown=runner:runner job-completed.sh /usr/local/bin/job-completed.sh

COPY --chown=runner:runner job-started.sh /usr/local/bin/job-started.sh
COPY --chown=runner:runner job-completed.sh /usr/local/bin/job-completed.sh

COPY --chown=runner:runner job-started.sh /usr/local/bin/job-started.sh
COPY --chown=runner:runner job-completed.sh /usr/local/bin/job-completed.sh
**Notes:**

- `running` entries are preliminary (written by `job-started.sh`) and cleaned up by `job-completed.sh`
- If `job-completed.sh` cannot determine status, it defaults to `failed`
The documentation states that job-completed.sh defaults to a failed status. However, the implementation in docker/job-completed.sh (line 70) actually defaults to success. The documentation should be updated to reflect the actual behavior.
Suggested change:
- If `job-completed.sh` cannot determine status, it defaults to `failed`
+ If `job-completed.sh` cannot determine status, it defaults to `success`
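For reference, the fallback in question comes from bash's `${VAR:-default}` expansion, which substitutes the default when the variable is unset or empty; a minimal illustration (the variable name matches the one used in the scripts, the values are invented):

```shell
#!/usr/bin/env bash
# ${VAR:-default} yields the default when VAR is unset or empty...
unset GITHUB_JOB_STATUS
echo "${GITHUB_JOB_STATUS:-success}"  # prints: success

# ...and the variable's own value otherwise.
GITHUB_JOB_STATUS="failure"
echo "${GITHUB_JOB_STATUS:-success}"  # prints: failure
```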
Summary
Implements Phase 3: Enhanced Metrics & Job Tracking (DORA) from issue #1061.
Changes
Job Lifecycle Hooks (new)
- `docker/job-started.sh` - Native runner hook for job start events
- `docker/job-completed.sh` - Native runner hook for job completion events
- `ACTIONS_RUNNER_HOOK_JOB_STARTED/COMPLETED` env vars (Runner v2.300.0+)
- CSV jobs.log format: `timestamp,job_id,status,duration_seconds,queue_time_seconds`

Metrics Collector (rewritten)
- `runner_name` and `runner_type` labels

Entrypoints & Dockerfiles (modified)

Grafana Dashboards
- `github-runner.json` replaced with 24-panel DORA-enabled overview
- `dora-metrics.json` (new) - DORA deep dive with classification reference
- `job-analysis.json` (new) - Job analysis with histograms and status breakdown

Testing & Docs
- `docs/features/PHASE3_DORA_METRICS.md` - Complete Phase 3 documentation

Type of Change
Testing Performed
Closes #1061
/cc @copilot