Harden Linux Docker retries in CI #8358
Merged
54 commits
e7ea99d initial implementation (NachoEchevarria)
43c805c update snapshots (NachoEchevarria)
45a3914 update snapshots (NachoEchevarria)
fa1e92c fix duplicated (NachoEchevarria)
dfd5800 Avoid two calls to GetServiceName (NachoEchevarria)
891cd46 Add warning comments for enum sync (NachoEchevarria)
92b5de7 rename param (NachoEchevarria)
9495c55 Remove not needed (NachoEchevarria)
1603764 Remove not needed (NachoEchevarria)
83c03be Refactor: Use ServiceNameMetadata to decrease double method calls. (NachoEchevarria)
fb7e24a Fix unit tests (NachoEchevarria)
b487692 remove integration arrays (NachoEchevarria)
76423c9 Refactor schemas (NachoEchevarria)
2e0044c Use opt.service_mapping even if default (NachoEchevarria)
d76c65e Fix dbscopefactory (NachoEchevarria)
6ef7bb5 Final check serviceNameEqualsDefault (NachoEchevarria)
18105ba SpanMessagePackFormatter opt case (NachoEchevarria)
42401b0 Add unit tests (NachoEchevarria)
564ff19 Minor refactor (NachoEchevarria)
19abb0a remove hardcoded (NachoEchevarria)
3df9559 Initial implementation (NachoEchevarria)
571a6e8 Fix unit. Azure Aws add source. (NachoEchevarria)
cbf0b86 fix unit tests and integration tests (NachoEchevarria)
02ecf55 Update v1 rules (NachoEchevarria)
1721030 Update OTEL (NachoEchevarria)
5bc3666 Remove commas (NachoEchevarria)
337c5fe service source stats. Initial implementation. (NachoEchevarria)
c80fcaf protect against empty strings (NachoEchevarria)
d6e12c2 Merge branch 'master' into nacho/ServiceSourceStats (NachoEchevarria)
7add326 Add Linux docker cgroup retry wrapper (datadog-official[bot])
c2cdfd3 fixes (NachoEchevarria)
241ef8c Match Windows approach (NachoEchevarria)
5b73840 cover more cases (NachoEchevarria)
ffdf087 Add ensure docker ready (NachoEchevarria)
a0ae1f7 Temporary change to test the stat system test (NachoEchevarria)
9fd8d2e Merge branch 'master' into nacho/ServiceSourceStats (NachoEchevarria)
6415ecb Fix compilation errors from merge. (NachoEchevarria)
a78fcd8 Improve logging (NachoEchevarria)
134aa23 Update .azure-pipelines/steps/ensure-docker-ready-linux.sh (NachoEchevarria)
974b4fe Apply suggestion from @andrewlock (NachoEchevarria)
bf76fc4 Apply suggestion from @andrewlock (NachoEchevarria)
16f424f Merge branch 'dd/ci/linux-docker-cgroup-retries' of https://github.co… (NachoEchevarria)
d98d1f7 undo (NachoEchevarria)
f4f367f Merge branch 'master' into dd/ci/linux-docker-cgroup-retries (NachoEchevarria)
49bb57d Add log line (NachoEchevarria)
9cc81c3 Merge branch 'dd/ci/linux-docker-cgroup-retries' of https://github.co… (NachoEchevarria)
a66e824 Use POSIX shell (NachoEchevarria)
47d9cc7 Merge branch 'master' into dd/ci/linux-docker-cgroup-retries (NachoEchevarria)
88d5788 Early return (NachoEchevarria)
4e1cdcd Simplified try_restart_docker (NachoEchevarria)
ee16ba0 Make ensure-docker-ready-linux.sh executable (NachoEchevarria)
9388163 address nits (NachoEchevarria)
26d3048 Early return (NachoEchevarria)
eea7deb Avoid too verbose output (NachoEchevarria)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,134 @@ | ||
| #!/bin/sh | ||
| # Linux Docker readiness check — mirrors the Windows PowerShell logic in ensure-docker-ready.yml. | ||
| # Waits for the Docker daemon, attempts service restarts if needed, and fails fast | ||
| # to avoid wasting time on a broken agent. | ||
| | ||
| set -u | ||
| | ||
| DOCKER_READY_TIMEOUT_SECONDS="${DOCKER_READY_TIMEOUT_SECONDS:-300}" | ||
| DOCKER_READY_CHECK_INTERVAL_SECONDS="${DOCKER_READY_CHECK_INTERVAL_SECONDS:-10}" | ||
| DOCKER_MAX_RESTARTS="${DOCKER_MAX_RESTARTS:-3}" | ||
| | ||
| log() | ||
| { | ||
| echo "[ensure-docker-ready-linux] $*" | ||
| } | ||
| | ||
| log_diagnostics() | ||
| { | ||
| log "--- Diagnostics ---" | ||
| local cgroup_version="unknown" | ||
| if [ -f "/sys/fs/cgroup/cgroup.controllers" ]; then | ||
| cgroup_version="v2" | ||
| elif [ -d "/sys/fs/cgroup" ]; then | ||
| cgroup_version="v1" | ||
| fi | ||
| | ||
| log "cgroup version: ${cgroup_version}" | ||
| log "kernel: $(uname -a)" | ||
| | ||
| if command -v systemctl >/dev/null 2>&1; then | ||
| log "systemd state:" | ||
| systemctl is-system-running || true | ||
| log "docker service status:" | ||
| systemctl status docker --no-pager || true | ||
| log "dbus service status:" | ||
| systemctl status dbus --no-pager || true | ||
| fi | ||
| | ||
| if command -v journalctl >/dev/null 2>&1; then | ||
| log "docker journal logs (last 50 lines):" | ||
| journalctl -u docker --no-pager -n 50 || true | ||
| fi | ||
| | ||
| log "docker version:" | ||
| docker version || true | ||
| log "docker info:" | ||
| docker info || true | ||
| } | ||
| | ||
| try_restart_docker() | ||
| { | ||
| log "Attempting Docker service restart..." | ||
| local output | ||
| output=$(systemctl restart docker 2>&1) | ||
| if [ $? -eq 0 ]; then | ||
| log "systemctl restart docker completed" | ||
| return 0 | ||
| else | ||
| log "systemctl restart docker failed: ${output}" | ||
| return 1 | ||
| fi | ||
| } | ||
| | ||
| wait_for_docker() | ||
| { | ||
| local elapsed=0 | ||
| local restart_count=0 | ||
| | ||
| log "Waiting up to ${DOCKER_READY_TIMEOUT_SECONDS}s for Docker daemon (will attempt up to ${DOCKER_MAX_RESTARTS} service restarts)..." | ||
| | ||
| # Quick check — if Docker is already healthy, nothing to do | ||
| if docker info >/dev/null 2>&1; then | ||
| log "Docker daemon is ready" | ||
| return 0 | ||
| fi | ||
| | ||
| # If we can't restart Docker, there's no point looping | ||
| if ! command -v systemctl >/dev/null 2>&1; then | ||
| log "Docker is not responding and systemctl is not available — cannot recover" | ||
| log_diagnostics | ||
| return 1 | ||
| fi | ||
| | ||
| # Log initial service state | ||
| local initial_status | ||
| initial_status=$(systemctl is-active docker 2>&1 || true) | ||
| log "Docker service initial state: ${initial_status}" | ||
| | ||
| local consecutive_failures=0 | ||
| local DOCKER_READY_FORCE_RESTART_AFTER=3 | ||
| | ||
| while [ "${elapsed}" -lt "${DOCKER_READY_TIMEOUT_SECONDS}" ]; do | ||
| if docker info >/dev/null 2>&1; then | ||
| log "Docker daemon is ready (waited ${elapsed}s, ${restart_count} restart(s) performed)" | ||
| return 0 | ||
| fi | ||
| | ||
| consecutive_failures=$((consecutive_failures + 1)) | ||
| | ||
| # Try restarting if the service is down, or if it reports active but is unresponsive | ||
| local svc_status | ||
| svc_status=$(systemctl is-active docker 2>&1 || true) | ||
| local should_restart=false | ||
| if [ "${svc_status}" != "active" ]; then | ||
| should_restart=true | ||
| elif [ "${consecutive_failures}" -ge "${DOCKER_READY_FORCE_RESTART_AFTER}" ]; then | ||
| log "Docker service reports active but has been unresponsive for ${consecutive_failures} checks" | ||
| should_restart=true | ||
| fi | ||
| | ||
| if [ "${should_restart}" = true ] && [ "${restart_count}" -lt "${DOCKER_MAX_RESTARTS}" ]; then | ||
| restart_count=$((restart_count + 1)) | ||
| log "Docker service is ${svc_status}. Attempting restart ${restart_count}/${DOCKER_MAX_RESTARTS}..." | ||
| try_restart_docker | ||
| sleep 2 | ||
| consecutive_failures=0 | ||
| elif [ "${should_restart}" = true ] && [ "${restart_count}" -ge "${DOCKER_MAX_RESTARTS}" ]; then | ||
| log "Docker service is ${svc_status} but max restarts (${DOCKER_MAX_RESTARTS}) exhausted — giving up" | ||
| log_diagnostics | ||
| return 1 | ||
| fi | ||
| | ||
| log "Docker not ready yet (${elapsed}s elapsed), retrying in ${DOCKER_READY_CHECK_INTERVAL_SECONDS}s..." | ||
| sleep "${DOCKER_READY_CHECK_INTERVAL_SECONDS}" | ||
| elapsed=$((elapsed + DOCKER_READY_CHECK_INTERVAL_SECONDS)) | ||
| done | ||
| | ||
| log "Docker daemon did not become ready within ${DOCKER_READY_TIMEOUT_SECONDS}s after ${restart_count} restart(s)" | ||
| echo "##vso[task.logissue type=error]Docker daemon did not become ready within ${DOCKER_READY_TIMEOUT_SECONDS}s after ${restart_count} restart(s)" | ||
| log_diagnostics | ||
| return 1 | ||
| } | ||
| | ||
| wait_for_docker | ||
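
The restart decision inside `wait_for_docker` can be tried without Docker or systemd by isolating it in a pure function. This is a minimal sketch, not code from the PR: `next_action` and its argument order are hypothetical, and the two thresholds default to 3 to match `DOCKER_READY_FORCE_RESTART_AFTER` and the default `DOCKER_MAX_RESTARTS` in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): the restart decision from the polling
# loop, separated from the sleep and systemctl calls.
# Prints "wait", "restart", or "give-up".
#   $1 = output of `systemctl is-active docker`
#   $2 = consecutive failed `docker info` checks
#   $3 = restarts already performed
#   $4 = failed checks before forcing a restart (assumed default: 3)
#   $5 = maximum restarts (assumed default: 3)
next_action() {
    svc_status=$1
    failures=$2
    restarts=$3
    force_after=${4:-3}
    max_restarts=${5:-3}

    if [ "$svc_status" = "active" ] && [ "$failures" -lt "$force_after" ]; then
        # Service claims to be up and has not failed often enough yet.
        echo "wait"
    elif [ "$restarts" -lt "$max_restarts" ]; then
        # Service is down, or active but unresponsive, and budget remains.
        echo "restart"
    else
        # A restart is wanted but the budget is spent.
        echo "give-up"
    fi
}

next_action inactive 1 0   # prints: restart
next_action active 1 0     # prints: wait
next_action active 3 0     # prints: restart
next_action failed 1 3     # prints: give-up
```

The PR keeps this logic inline in the loop; pulling it out as above is only a way to check the branch table in a plain shell.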
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2143,6 +2143,8 @@ stages: | ||
| displayName: BuildWindowsIntegrationTests | ||
| retryCountOnTaskFailure: 3 | ||
| | ||
| - template: steps/ensure-docker-ready.yml | ||
| Review comment (Member): Good catch 👍 | ||
| | ||
| - powershell: | | ||
| mkdir -Force ./artifacts/build_data/snapshots | ||
| mkdir -Force ./artifacts/build_data/logs/LoaderOptimizationStartup | ||
| | ||
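
The three tunables at the top of ensure-docker-ready-linux.sh use the `${VAR:-default}` expansion, so a job could presumably tighten them by setting environment variables on the step that runs the script. The sketch below only demonstrates how that expansion resolves; `effective_timeout` is a hypothetical helper, not part of the PR.

```shell
#!/bin/sh
# Hypothetical helper: resolves the timeout the same way the script does.
effective_timeout() {
    # ${VAR:-default} falls back when VAR is unset or empty.
    echo "${DOCKER_READY_TIMEOUT_SECONDS:-300}"
}

unset DOCKER_READY_TIMEOUT_SECONDS
effective_timeout                      # prints: 300 (default)

DOCKER_READY_TIMEOUT_SECONDS=60
effective_timeout                      # prints: 60 (override)

DOCKER_READY_TIMEOUT_SECONDS=""
effective_timeout                      # prints: 300 (empty is treated like unset)
```

Under that assumption, a step could run something like `DOCKER_READY_TIMEOUT_SECONDS=60 sh .azure-pipelines/steps/ensure-docker-ready-linux.sh`, using the script path that appears in the commit list.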
Should probably exit on failing commands too? Or does that break things? Meh, maybe best to leave it 😅
Yes, we should avoid early exits here: some commands in the script can fail, and the script needs to keep going when they do.
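
One concrete way `set -e` could break things (an illustration, not a claim about what the reviewers had in mind): `try_restart_docker` captures a command's output and then inspects `$?`, and under errexit the shell would exit at the failed assignment before that check runs. The sketch uses `false` as a stand-in for a failing `systemctl restart docker`, and runs each variant in a child `sh` so the two modes do not interfere.

```shell
#!/bin/sh
# With errexit, the child shell exits at the failed assignment and never
# reaches the echo, so nothing is printed.
with_errexit=$(sh -c 'set -e; out=$(false); echo "reached"' || true)

# Without errexit, the failure status is available for an explicit check.
without_errexit=$(sh -c 'out=$(false); echo "status=$?"')

echo "with set -e:    '${with_errexit}'"      # prints: with set -e:    ''
echo "without set -e: '${without_errexit}'"   # prints: without set -e: 'status=1'
```

The script's `|| true` suffixes in `log_diagnostics` would already be safe under `set -e`; it is the capture-then-check pattern that would need rewriting, for example as `if output=$(...); then`.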