[core] (FreeObjects 2/n) Adding dead owner callback through raylet GCS listener#1
Draft
aaronscalene wants to merge 163 commits into
Signed-off-by: aaron.li <aaron.li@anyscale.com>
…e/ray into aaron/passive-owner-callback Signed-off-by: aaron.li <aaron.li@anyscale.com>
…e/ray into aaron/passive-owner-callback Signed-off-by: aaron.li <aaron.li@anyscale.com>
…allback Signed-off-by: aaron.li <aaron.li@anyscale.com>
…in (ray-project#63797) ## Why are these changes needed? `ray job submit` uses `subprocess.list2cmdline` to join the entrypoint arguments into a command string. This function wraps arguments containing spaces in **double quotes** (`"`), which causes POSIX shells (`/bin/sh`) on the job server to expand `$VAR` references — silently eating environment variables that the user intended to preserve as literal strings. For example: ```bash ray job submit -- echo 'python -m launcher --config $CONFIG_PATH' ``` **Before (bug):** CLI sends `echo "python -m launcher --config $CONFIG_PATH"` → server shell expands `$CONFIG_PATH` to empty → output: `python -m launcher --config` **After (fix):** CLI sends `echo 'python -m launcher --config $CONFIG_PATH'` → server shell preserves literal → output: `python -m launcher --config $CONFIG_PATH` The fix replaces `subprocess.list2cmdline` (designed for Windows `cmd.exe`) with `shlex.join` (designed for POSIX shells), which wraps arguments in **single quotes** (`'`) to prevent variable expansion — matching standard POSIX shell conventions. ## Related issue number Fixes ray-project#56232 ## Checks - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've made sure the tests are passing. - [x] Testing Strategy --------- Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
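The quoting difference described above can be reproduced with the standard library alone. This is a minimal sketch: the argument list mirrors the example in the description, and the variable names are illustrative rather than taken from the Ray CLI.

```python
import shlex
import subprocess

# Entrypoint arguments as the CLI receives them, after the user's local
# shell has already stripped the outer single quotes.
args = ["echo", "python -m launcher --config $CONFIG_PATH"]

# list2cmdline follows Windows cmd.exe conventions and uses double quotes,
# so a POSIX shell on the server side would still expand $CONFIG_PATH.
windows_style = subprocess.list2cmdline(args)

# shlex.join follows POSIX conventions and uses single quotes,
# so $CONFIG_PATH stays a literal string.
posix_style = shlex.join(args)

print(windows_style)  # echo "python -m launcher --config $CONFIG_PATH"
print(posix_style)    # echo 'python -m launcher --config $CONFIG_PATH'
```

Round-tripping `posix_style` through `shlex.split` returns the original argument list, which is the property a POSIX job server relies on.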
…atus (ray-project#62934) ## Description `ClusterStatus.stats` currently uses `field(default_factory=Stats)`, but `Stats` requires the positional argument `gcs_request_time_s`. As a result, `ClusterStatus()` cannot be default-constructed and raises: ```text TypeError: Stats.__init__() missing 1 required positional argument: 'gcs_request_time_s' ``` This PR fixes the invalid default by making ClusterStatus construct a valid default Stats object, and adds a regression test to cover the empty/default construction path. This is a small schema-level fix and does not change the normal populated paths where callers already pass an explicit Stats(...) instance. ## Related issues ray-project#62933 ## Additional information Implementation notes: update ClusterStatus.stats to use a valid default Stats value add a regression test for ClusterStatus() --------- Signed-off-by: weimingdiit <weimingdiit@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com>
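The failure mode can be illustrated with two stand-in dataclasses. These are simplified, hypothetical versions of `Stats` and `ClusterStatus`, and the `0.0` placeholder is an assumption; the PR may choose a different default value.

```python
from dataclasses import dataclass, field


@dataclass
class Stats:
    # Required field with no default, as in the description above.
    gcs_request_time_s: float


@dataclass
class BrokenClusterStatus:
    # default_factory=Stats calls Stats() with no arguments at construction
    # time, which raises TypeError because gcs_request_time_s is required.
    stats: Stats = field(default_factory=Stats)


@dataclass
class FixedClusterStatus:
    # Assumption: 0.0 is an acceptable placeholder for an empty status.
    stats: Stats = field(default_factory=lambda: Stats(gcs_request_time_s=0.0))


try:
    BrokenClusterStatus()
    broken_raised = False
except TypeError:
    broken_raised = True

fixed = FixedClusterStatus()
print(broken_raised, fixed.stats)
```

Note that the broken class definition itself succeeds; the error only appears when the class is instantiated without an explicit `stats`, which is why a regression test for the empty construction path is useful.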
## Description We noticed that some Redis/Valkey providers don't reject/reset plain TCP connections if the server requires TLS. They just hang the connections and do not respond. Here was one tcpdump we captured showing such server behavior: ``` (base) ray@ip-10-0-241-152:~/default$ sudo tcpdump -i any tcp and host clustercfg.internal-testing-with-tls-enabled.fjpqowie.memorydb.us-west-2.amazonaws.com tcpdump: data link type LINUX_SLL2 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes 22:59:20.106033 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [S], seq 379242323, win 62727, options [mss 8961,sackOK,TS val 3940517090 ecr 0,nop,wscale 7], length 0 22:59:20.106509 ens5 In IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [S.], seq 916631862, ack 379242324, win 26847, options [mss 8961,sackOK,TS val 2553398864 ecr 3940517090,nop,wscale 7], length 0 22:59:20.106526 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940517090 ecr 2553398864], length 0 22:59:20.106545 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [S], seq 1429463323, win 62727, options [mss 8961,sackOK,TS val 3940517090 ecr 0,nop,wscale 7], length 0 22:59:20.106582 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [P.], seq 1:29, ack 1, win 491, options [nop,nop,TS val 3940517090 ecr 2553398864], length 28: RESP "INFO" "SENTINEL" 22:59:20.106942 ens5 In IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48128: Flags [S.], seq 911009498, ack 1429463324, win 26847, options 
[mss 8961,sackOK,TS val 2553398864 ecr 3940517090,nop,wscale 7], length 0 22:59:20.106950 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940517091 ecr 2553398864], length 0 22:59:20.106953 ens5 In IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [.], ack 29, win 210, options [nop,nop,TS val 2553398864 ecr 3940517090], length 0 23:04:21.939073 ens5 In IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [.], ack 29, win 210, options [nop,nop,TS val 2553700699 ecr 3940517090], length 0 23:04:21.939073 ens5 In IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48128: Flags [.], ack 1, win 210, options [nop,nop,TS val 2553700699 ecr 3940517091], length 0 23:04:21.939081 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940818923 ecr 2553398864], length 0 23:04:21.939084 ens5 Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940818923 ecr 2553398864], length 0 ``` In such cases, we shouldn't just wait for the server to respond. This PR adds a `redis_db_probe_timeout_milliseconds` timeout to the first command, which is the `INFO SENTINEL` and `AUTH` if auth is needed. The timeout is 30 seconds by default. If the timeout is reached, we will see the following from the `ray start` stderr output: ``` $ RAY_redis_db_probe_timeout_milliseconds=3000 python -m ray.scripts.scripts start --head --address=127.0.0.1:6380 --redis-password="" --block [2026-05-05 15:23:57,514 E 93876 7759139] (ray_init) redis_context.cc:684: Timed out waiting for redis command reply (sync). 
[2026-05-05 15:23:57,514 E 93876 7759139] (ray_init) redis_context.cc:684: Timed out waiting for redis command reply (sync). [2026-05-05 15:23:57,524 C 93876 7759139] (ray_init) redis_context.cc:475: An unexpected system state has occurred. You have likely discovered a bug in Ray. Please report this issue at https://github.com/ray-project/ray/issues and we'll work with you to fix it. Check failed: reply Failed to get Redis info within 30000ms. ``` ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com>
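The idea of bounding the first command can be sketched in Python with a plain socket. This is not the C++ change in `redis_context.cc`; the helper name `probe_with_timeout` is hypothetical, and a connected socket pair stands in for a server that completes the TCP handshake but never replies.

```python
import socket
from typing import Optional


def probe_with_timeout(
    sock: socket.socket, payload: bytes, timeout_s: float
) -> Optional[bytes]:
    """Send one probe command and wait at most timeout_s for any reply."""
    sock.settimeout(timeout_s)
    sock.sendall(payload)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # The peer accepted the bytes at the TCP level but never answered.
        return None


# Simulate the silent server: the peer end of the pair never writes back.
client, silent_peer = socket.socketpair()
reply = probe_with_timeout(client, b"INFO SENTINEL\r\n", timeout_s=0.2)
print("timed out" if reply is None else reply)
client.close()
silent_peer.close()
```

Without the `settimeout` call the `recv` would block indefinitely, which matches the hang shown in the tcpdump capture.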
## Why are these changes needed? An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list **per replica engine-start** (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag: - grows **without bound** over a long-lived cluster's lifetime (risking silent truncation at the usage-tag value size limit), and - duplicates each model by its replica/restart count, inflating every per-model list and count. Two values were also wrong by construction: `num_replicas` was hardcoded to `1` for non-autoscaling deployments (and `num_replicas="auto"` was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcoded `use_json_mode=True` for a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical. ### Serve (`observability/usage_telemetry/usage.py`) - **Dedup by `model_id`** (last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model. `model_id` is identity only and is never recorded as a value. - **Report the configured replica count**: fixed `num_replicas` instead of hardcoded `1`, and resolve `num_replicas="auto"` to its autoscaling config (reported as autoscaling) rather than dropping it. - **Drop the JSON_MODE tags** (proto 602/603 `reserved`); they carried no signal. - **Telemetry can no longer break engine start**: `_multiple_apps` never raises, and per-model reporting is wrapped so a failure is logged, not propagated. - **Atomic agent creation** via `get_if_exists`; removed the shadowed retry args. 
### Batch (`batch/observability/usage_telemetry/usage.py` + processors) - Same dedup fix (key by telemetry identity) plus a `_reset` hook. - Guarded, atomic agent creation and a non-raising push path so telemetry cannot break processor construction. - HTTP processor now passes `batch_size`; vLLM processor now sources `data_parallel_size` (both previously reported 0). - Fixed the `LLM_BATCH_CONCURRENCY` proto comment. ### Tests Updated existing expectations and added regression tests: replica/restart dedup, fixed `num_replicas`, and `num_replicas="auto"` reported as autoscaling. > [!NOTE] > This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr. ## Checks - [x] Added regression tests. - [x] Signed off with DCO. --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
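The difference between the append-only list and the last-write-wins dict can be shown with a small sketch. The record shape and function names here are hypothetical; the real telemetry model carries more fields.

```python
from typing import Dict, List

# Hypothetical record shape used only for this illustration.
Record = Dict[str, str]


def append_only(events: List[Record]) -> List[Record]:
    """Old behavior: one entry per engine start, never reset."""
    recorded: List[Record] = []
    for event in events:
        recorded.append(event)
    return recorded


def dedup_by_model_id(events: List[Record]) -> List[Record]:
    """New behavior: last write wins per model_id; the id is not reported."""
    latest: Dict[str, Record] = {}
    for event in events:
        latest[event["model_id"]] = event
    return [
        {key: value for key, value in record.items() if key != "model_id"}
        for record in latest.values()
    ]


events = [
    {"model_id": "a", "num_replicas": "2"},
    {"model_id": "a", "num_replicas": "2"},  # second replica start
    {"model_id": "b", "num_replicas": "auto"},
    {"model_id": "a", "num_replicas": "4"},  # redeploy with a new config
]
print(len(append_only(events)))   # 4
print(dedup_by_model_id(events))  # [{'num_replicas': '4'}, {'num_replicas': 'auto'}]
```

The dict keeps memory bounded by the number of distinct models rather than the number of engine starts, which is what keeps the joined tag from growing over the cluster's lifetime.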
## Description Relax the OSS `test_collective` target timeout to 300s. Signed-off-by: Lehui Liu <lehui@anyscale.com>
…adcast instead (ray-project#63723) We continue to optimistically deduct resources when scheduling placement groups in the GCS, but no longer deduct those resources on placement group scheduling failure. Instead, we rely on the regular resource view broadcast to reconcile the resource view between the GCS and Raylet(s). Closes: ray-project#62858 --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…roject#63880) ## Why A 6-month GA4 audit of `docs.ray.io` found **2,017 out-of-support legacy-version paths receiving ~552K views with no redirect rule and no equivalent on current docs**. Under any legacy-cohort cutover, that traffic lands on 404s. This PR adds the redirect coverage for the bulk of that gap. ## What 21 `page` catch-all rules appended to `doc/redirects/current.yaml`, covering **~96% of the gap by views and by paths**. Each cluster lands on its nearest surviving section index: - **Autodoc / API refs** (RLlib `package_ref`, Serve, Data, Tune, Train, Ray Core, Observability) → that library's API reference index. - **Examples** (Ray AIR, Tune, Train, Ray Core) → the examples gallery. - **Restructured sections** (Kubernetes, Serve tutorials) → the section index. - **Deprecated, no current equivalent**: Ray AIR → AIR getting-started; Ray Workflows → Ray overview. - **Sphinx `_modules/` source views** → the Ray source on GitHub (a view-source request for a symbol that no longer exists is better served by GitHub search than a 404). - Three targeted `rllib-*` overrides precede the RLlib catch-all so high-traffic renamed pages land on their real equivalents. ## Design All rules are `type: page` with `force` omitted (defaults to `false`), so each **fires only when the path would 404**. Pages that still exist with the same name on any version resolve untouched; only moved/deleted/renamed paths hit the catch-all. This composes with the planned legacy-cohort cutover (`/en/releases-X.Y.Z/*` → `/en/latest/:splat`): the cutover preserves same-name paths, and these rules catch what no longer exists on current docs. ## Validation ``` rtd-redirects validate doc/redirects/current.yaml → 0 error, 49 warning ``` Zero ordering/shadow errors. The 49 chain warnings are benign: each is an existing rule whose `to` is a live page that happens to sit under one of the new section wildcards, which never fires on a live page given `force: false`. 
## Notes - Follows the source-of-truth workflow established in ray-project#63367. - Application to Read the Docs is a manual maintainer step post-merge per `doc/redirects/README.md`. Signed-off-by: Douglas Strodtman <douglas@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ay-project#63833) ## Why are these changes needed? A few small, independent Serve improvements bundled together: - **gRPC error-path tracing** (`replica.py`): the gRPC success path already records span attributes and exceptions via `_wrap_request`; route the direct-ingress unary error path through the same `_handle_errors_and_metrics` handling (by surfacing the status code and re-raising) so failed gRPC requests are traced with their status code and exception instead of silently dropping out of the trace. - **Timeout bump** for `test_serve_metrics_for_successful_connection` to reduce flakiness while waiting for metrics. No production behavior changes beyond the added error-path tracing instrumentation. ## Related issue number N/A ## Checks - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description

The ubsan tests were failing in release at compile time because `-Werror=shadow` turned shadowed-variable warnings into errors. The log repeats the same three errors three times; one copy is shown here:

```
[2026-06-04T19:43:48Z] In file included from src/ray/asio/io_context_monitor.cc:15:
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h: In constructor 'ray::IOContextMonitor::ProbeState::ProbeState(std::string, instrumented_io_context&, std::shared_ptr<ray::ClockInterface>)':
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:64:48: error: declaration of 'clock' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z] 64 | std::shared_ptr<ClockInterface> clock)
[2026-06-04T19:43:48Z] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:69:43: note: shadowed declaration is here
[2026-06-04T19:43:48Z] 69 | const std::shared_ptr<ClockInterface> clock;
[2026-06-04T19:43:48Z] | ^~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:63:41: error: declaration of 'io_context' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z] 63 | instrumented_io_context &io_context,
[2026-06-04T19:43:48Z] | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:68:30: note: shadowed declaration is here
[2026-06-04T19:43:48Z] 68 | instrumented_io_context &io_context;
[2026-06-04T19:43:48Z] | ^~~~~~~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:62:28: error: declaration of 'name' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z] 62 | ProbeState(std::string name,
[2026-06-04T19:43:48Z] | ~~~~~~~~~~~~^~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:67:23: note: shadowed declaration is here
[2026-06-04T19:43:48Z] 67 | const std::string name;
[2026-06-04T19:43:48Z] | ^~~~
[2026-06-04T19:43:48Z] cc1plus: all warnings being treated as errors
```

This PR fixes the build by renaming the constructor arguments, and adds a missing dependency for an include in `memory_monitor_test_fixture`.

## Related issues

## Additional information

Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com>
… KubeRay (ray-project#63465)

## Description

Terminate a cluster managed by the V2 autoscaler when no user driver is attached. Related to ray-project/kuberay#4815.

When `autoscalerOptions.noDriverTimeoutSeconds` is set, the V2 autoscaler evaluates a no-driver predicate every reconcile loop and, when it fires, patches a single annotation on the RayCluster CR:

```yaml
metadata:
  annotations:
    ray.io/no-driver-ttl-expired: "true"
```

The KubeRay operator observes the condition and decides the terminal action (delete RayCluster).

A cluster is eligible for termination only when both of the following hold:

1. No active user driver is attached.
2. Condition 1 has held continuously for at least `noDriverTimeoutSeconds`.

Note that **detached actors do not count as a driver; a cluster running only detached actors is still eligible for termination.**

## Changes

This PR adds `autoscalerOptions.noDriverTimeoutSeconds`. The decision lives on `KubeRayProvider`: it tracks how long the cluster has had no driver attached and, once the timeout is exceeded, dispatches an annotation for KubeRay to terminate the cluster, freeing the head pod and any reserved capacity that would otherwise linger.

1. New `autoscalerOptions.noDriverTimeoutSeconds` field, V2 + KubeRay only
   - Existing CRs and existing V1 / non-KubeRay deployments see no behavior change.
   - The field is read only by `KubeRayProvider`; leaving it unset disables the feature.
2. No-driver decision lives on `KubeRayProvider`
   - Evaluated against `gcs_client.get_all_job_info(...)`, filtering out Ray dashboard jobs. Fails closed.
   - The provider records when the cluster was first seen with no driver, and dispatches once that has held for `noDriverTimeoutSeconds`. The timer resets if a driver reappears.
3. Dispatch: single annotation on the RayCluster CR
   - The reconciler calls `evaluate_no_driver_termination`, which patches the RayCluster CR with `ray.io/no-driver-ttl-expired: "true"`.
The KubeRay Operator implementation is covered in ray-project/kuberay#4815.

## Related issues

Closes ray-project#63452

---------

Signed-off-by: win5923 <ken89@kimo.com>
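The timer behavior described in this change (record when the cluster was first seen without a driver, fire once the timeout has held, reset when a driver reappears) can be sketched as a small state machine. The class name `NoDriverTimer` and its method are hypothetical, not the actual `KubeRayProvider` code, and the clock is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


class NoDriverTimer:
    """Tracks how long a cluster has continuously had no driver attached."""

    def __init__(self, timeout_s: Optional[float]):
        # None models an unset noDriverTimeoutSeconds: the feature is disabled.
        self._timeout_s = timeout_s
        self._no_driver_since: Optional[float] = None

    def should_terminate(self, has_driver: bool, now: float) -> bool:
        if self._timeout_s is None:
            return False
        if has_driver:
            # A driver reappeared, so the continuous no-driver window resets.
            self._no_driver_since = None
            return False
        if self._no_driver_since is None:
            self._no_driver_since = now
        return now - self._no_driver_since >= self._timeout_s


timer = NoDriverTimer(timeout_s=60.0)
print(timer.should_terminate(has_driver=False, now=0.0))    # False
print(timer.should_terminate(has_driver=False, now=59.0))   # False
print(timer.should_terminate(has_driver=True, now=70.0))    # False
print(timer.should_terminate(has_driver=False, now=100.0))  # False
print(timer.should_terminate(has_driver=False, now=160.0))  # True
```

In the real reconciler a `True` result would correspond to patching the annotation; the sketch stops at the decision so it does not assume anything about the Kubernetes client.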
…export (ray-project#63744) ## Description This PR normalizes OpenTelemetry metric attribute sets before handing observations to the Prometheus exporter. Some Ray components can emit the same metric with heterogeneous attribute sets, for example when one data point includes `SessionName` and another data point for the same metric does not. With older `opentelemetry-exporter-prometheus` versions used by Ray's default compiled dependencies, metrics can reach Prometheus export with mixed label key sets. This can produce misaligned Prometheus label values, such as `dataset="core_worker"` or `operator="<node ip>"`, making Ray Data dashboards misleading. This change makes Ray enforce a stable label schema at the export boundary: - For observable gauge, counter, and sum callbacks, collect the union of attribute keys for each metric and fill missing values with `""`. - For reconstructed histogram batches in the dashboard reporter, normalize all batch data points to the union of tag keys before recording them. - Add regression coverage for mixed attribute sets in observable metric callbacks and histogram export. This does not depend on upgrading `opentelemetry-exporter-prometheus`. It is also compatible with newer exporter versions that perform similar normalization internally; in that case Ray provides already-normalized observations and the exporter-side normalization is effectively idempotent. ## Related issues Fixes ray-project#63499. ## Additional information This PR intentionally keeps the fix in the Python export path. The issue is caused by heterogeneous label key sets, not by nondeterministic tag ordering, so this avoids changing the metric record path or upgrading OpenTelemetry dependencies. 
Tests:

```bash
python -m py_compile python/ray/_private/telemetry/open_telemetry_metric_recorder.py python/ray/dashboard/modules/reporter/reporter_agent.py python/ray/dashboard/modules/reporter/tests/test_reporter.py python/ray/tests/test_open_telemetry_metric_recorder.py
git diff --check
```

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
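The normalization step (take the union of attribute keys and fill missing values with `""`) can be sketched without any OpenTelemetry dependency. Plain `(value, attributes)` tuples stand in for observations here, and `normalize_attributes` is a hypothetical helper name, not the function added by the PR.

```python
from typing import Dict, List, Tuple

Attributes = Dict[str, str]
Observation = Tuple[float, Attributes]


def normalize_attributes(observations: List[Observation]) -> List[Observation]:
    """Give every observation the same label keys, filling gaps with ""."""
    all_keys = sorted({key for _, attrs in observations for key in attrs})
    return [
        (value, {key: attrs.get(key, "") for key in all_keys})
        for value, attrs in observations
    ]


mixed = [
    (1.0, {"Component": "core_worker", "SessionName": "session_1"}),
    (2.0, {"Component": "raylet"}),  # no SessionName on this data point
]
normalized = normalize_attributes(mixed)
for value, attrs in normalized:
    print(value, attrs)
```

Applying the function to already-normalized observations returns them unchanged, which is the idempotence property the description relies on for newer exporter versions.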
…oject#63402) ## Description In Ray Client mode, `ray._private.worker._global_node` is `None` because the client driver is not a Ray worker process, even though `ray.is_initialized()` is `True` and the cluster is connected. `get_or_create_stats_actor` used `_global_node` as a proxy for "connected to Ray" and raised `RuntimeError` whenever Ray Data tried to register or query the stats actor, causing `ds.take_batch()`, `ds.iter_batches()`, etc. to crash on materialized datasets. Use `ray.is_initialized()` for the connection check and only emit the `cluster_id` debug log when `_global_node` is available, since `cluster_id` is not exposed via `ray.get_runtime_context()`. ## Related issues Closes ray-project#61162 ## Additional information --------- Signed-off-by: Yuang Gao <yg2315@nyu.edu>
What does this PR do? Adds a module docstring to cpp/example/_BUILD.bazel to fix the buildifier module-docstring warning. Related issue Part of ray-project#50875. Signed-off-by: HUY <vinhuytran0810@gmail.com>
## Description

- Five startup methods in `services.py` create output handles with `open(os.devnull, "w")` but never close them, causing `ResourceWarning: unclosed file` warnings. They now use `subprocess.DEVNULL` instead.
- The kill-process method in `Node` does not call `wait()` after `kill()` when its `wait` argument is `False`, causing `ResourceWarning: subprocess is still running` warnings. It now calls `wait()` after `kill()` to prevent zombie processes.

## Related issues

Fixes ray-project#9546, Fixes ray-project#59782

---------

Signed-off-by: Accurio <2671768169@qq.com>
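Both fixes follow standard `subprocess` usage and can be shown together in a short sketch. The child command is a placeholder that just sleeps; it is not one of the Ray service processes.

```python
import subprocess
import sys

# subprocess.DEVNULL needs no Python file object, so there is no handle
# that could be left unclosed.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

proc.kill()
# Reap the child so it does not linger as a zombie and so Popen does not
# warn that the subprocess is still running when it is garbage collected.
exit_code = proc.wait(timeout=10)
print(exit_code is not None)
```

After `kill()`, the `wait()` call returns promptly, so reaping the child adds little latency even for callers that previously skipped it.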
## Why are these changes needed? This PR hardens `runtime_env` zip package extraction by resolving candidate member paths before checking whether they remain inside the extraction target. The existing `unzip_package()` implementation builds a candidate path from the target directory and zip member name, then checks containment before resolving path components such as `..`. This makes the zip extraction path inconsistent with the safer resolved-path containment logic already used by `untar_package()`. This change updates `unzip_package()` to: - resolve the extraction target path - resolve each candidate zip member extraction path - skip zip entries whose resolved path is outside the target directory - use the resolved path for directory creation, file writes, and chmod This keeps zip extraction behavior aligned with the intended path containment invariant and with the existing tar extraction implementation. ## Related issue number N/A ## Checks - [x] I've run relevant checks for this change. ## Testing ```shell python3 -m py_compile python/ray/_private/runtime_env/packaging.py python/ray/tests/test_runtime_env_packaging.py git diff --check ``` I also ran the three new regression cases in a focused standalone pytest harness against the patched `unzip_package()` implementation: 3 passed. --------- Signed-off-by: wonyunkang <rojo.wk@gmail.com> Signed-off-by: H4ck2 <H4ck2@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: H4ck2 <H4ck2@users.noreply.github.com>
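The resolve-then-check pattern can be sketched with the standard library. `safe_unzip` is a hypothetical stand-in for `unzip_package()`, not Ray's implementation, and it omits the chmod handling mentioned above.

```python
import os
import tempfile
import zipfile
from pathlib import Path
from typing import List


def safe_unzip(zip_path: str, target_dir: str) -> List[str]:
    """Extract members, skipping any whose resolved path escapes target_dir."""
    target = Path(target_dir).resolve()
    extracted: List[str] = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            candidate = (target / info.filename).resolve()
            # Check containment on the resolved path, so ".." components
            # are collapsed before the comparison.
            if target not in candidate.parents:
                continue
            if info.is_dir():
                candidate.mkdir(parents=True, exist_ok=True)
                continue
            candidate.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, open(candidate, "wb") as dst:
                dst.write(src.read())
            extracted.append(info.filename)
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "pkg.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("ok.txt", "fine")
        zf.writestr("../escape.txt", "should be skipped")
    out = os.path.join(tmp, "out")
    os.makedirs(out)
    result = safe_unzip(archive, out)
    escaped_exists = os.path.exists(os.path.join(tmp, "escape.txt"))

print(result, escaped_exists)
```

Writing through the resolved `candidate` path, rather than re-joining the raw member name, keeps the path that was checked and the path that is written identical.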
…nt (ray-project#63632) ray-project#52974 added pydoclint to pre-commit without fixing any existing issues, by adding all the problematic docstrings to an ignore list. However, this means that the issues in those docstrings are never raised or fixed, even though accurate docstrings help agents understand the codebase. This PR removes all of the Ray Serve and LLM ignores, then uses Claude to fix all the docstrings / type hints (reviewed by me to confirm the implementations). --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
…oject#63541) ray-project#52974 added pydoclint to pre-commit without fixing any existing issues, by adding all the problematic docstrings to an ignore list. However, this means that the issues in those docstrings are never raised or fixed, even though accurate docstrings help agents understand the codebase. This PR removes all of the `ray._private` ignores, then uses Claude to fix all the docstrings / type hints (reviewed by me to confirm the implementations). --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The Windows CI driver binaries (`//ci/ray_ci:test_in_docker`, `//ci/ray_ci:build_in_docker_windows`) are non-hermetic `py_binary`s that run on the Windows agent's Python 3.8 conda base, but their third-party deps were bundled by bazel from `py_deps_py310` (`release/requirements_py310.txt`).

ray-project#63722 (Starlette security upgrade) bumped `typing-extensions` 4.11.0 -> 4.15.0 in that lock. typing-extensions 4.14.0 dropped Python 3.8, so the driver started dying at import on every Windows job before any test ran:

```
typing_extensions.py:526
class _SpecialGenericAlias(typing._SpecialGenericAlias, ...)
AttributeError: module 'typing' has no attribute '_SpecialGenericAlias'
```

## What

Resolve and install the driver's deps for **Python 3.8 / win32** on the agent, mirroring the macOS CI driver (`ci/ray_ci/macos/macos_ci.sh`) -- no bazel bundling.

- **`release/requirements_windows.in` + `ci/raydepsets/configs/ci_windows.depsets.yaml`** resolve the driver's closure for py3.8/windows into `python/deplocks/ci/ci_windows_depset.lock`. Same package set as `requirements_py310.in`, minus the docs toolchain (pydantic forces `typing-extensions>=4.14.1`) and twine (`requires-python >=3.9`, only used by `//ci/ray_ci/automation`).
- **`ci/ray_ci/windows/install_tools.sh`** `pip install`s that lock into the agent's Python 3.8 env (strip hashes, `--no-deps`), so the correct py3.8 versions (incl. a current `pyOpenSSL` that shadows the agent's stale conda copy) are present at runtime.
- **`bazel/ci_require.bzl` + `//ci/ray_ci/deps`** route the driver's third-party deps via per-package aliases: bundled from `@py_deps_py310` off Windows, and **unbundled on Windows** (resolve to an empty library) so bundled py3.10 wheels don't shadow the agent's py3.8 versions. `ci_require`/`bk_require` is applied in `//ci/ray_ci` and `//release`.

Postmerge build: https://buildkite.com/ray-project/postmerge/builds/17909/canvas

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ject#62486)

## Summary

The `air_example_gptj_deepspeed_fine_tuning` release test fails due to dependency incompatibilities introduced when the base ray-ml image was upgraded to torch 2.7.0+cu128 (triton 3.3.0) in ray-project#61328. This PR fixes the dependency issues so the test passes.

Failing build: https://buildkite.com/ray-project/release/builds/87834#_

## Changes

- Update notebook dependency pins: `deepspeed` 0.12.3 -> 0.17.2, `accelerate` 0.18.0 -> 0.33.0, `transformers` 4.26.0 -> 4.36.2
- Remove `reduce_bucket_size`, `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold` from DeepSpeed config (HF Trainer resolved these "auto" values to floats, which DeepSpeed 0.17.2's pydantic validation rejects)
- Install `accelerate==0.33.0` and uninstall `peft` in BYOD script (base image's `peft==0.11.1` is incompatible with `transformers==4.36.2`, and the notebook's pip install cell is commented out by jupytext during conversion)
- Remove `torch>=1.12.0` from runtime_env and stale TODO comment

Signed-off-by: JasonLi1909 <jasonli@anyscale.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
…#63406)

## Description

Adds Azure Blob Storage and Azure Files to the Ray Train persistent storage guide. The page already enumerates AWS S3 / GCS for cloud and AWS EFS / Google Filestore / HDFS for shared filesystems; Azure's equivalent offerings (commonly used by Ray Train deployments on Azure / AKS) are simply missing from the list.

This PR makes the smallest change consistent with the surrounding prose:

- Adds `Azure Blob Storage` and `Azure Files` to the one-line summary.
- Updates the **Shared filesystem (NFS, HDFS)** section header to also list Azure Files, and threads it through the body sentence.
- Adds an Azure Files example as a commented-out `storage_path` next to the existing HDFS example, with a one-line note that the share needs to be mounted on every node first.

No new sections, no example code that has to be tested; just registry-level changes to surface an already-supported option.

## Related issues

Closes ray-project#54054

This picks up the work from the now-stale ray-project#54055 (which received a maintainer LGTM before being auto-closed by the stale bot) and ray-project#55862, rebased against current master. The diff in this PR is functionally the same as the LGTM'd ray-project#54055, with the body wording polished for the current document text.

## Additional information

- Single doc file touched: `doc/source/train/user-guides/persistent-storage.rst`
- +7 / -5 lines
- No code changes, no test changes required

---------

Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
Signed-off-by: Shubhankar Tripathy <reach2shubhankar@gmail.com>
…y-project#63633)

ray-project#52974 added pydoclint to pre-commit without fixing any issues, by adding all the problematic docstrings to an ignore list. However, this means that docstrings with real problems are never flagged or fixed (fixing them also helps agents understand the codebase).

This PR removes all of the ray dashboard ignores, then uses Claude to fix the docstrings and type hints (reviewed by me to confirm they match the implementations).

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
ray-project#63882)

## Description

For release tests, we wrap each test in `_anyscale_job_wrapper.py`, which polls Prometheus after the job finishes to collect several metrics used to check whether OOMs or spilling occurred. However, this mechanism is flaky: when Prometheus times out, no metric is returned and the checks fail. We also have limited logs for understanding what went wrong.

This PR aims to improve the logging for the jobs that fail in `anyscale_job_wrapper.py`.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Why are these changes needed?

Ray Train already works with any S3-compatible object store through pyarrow's `S3FileSystem` (via `endpoint_override` in the `storage_path` URI, or the standard `AWS_*` environment variables). This PR documents that path in the Train persistent-storage guide and adds the Backblaze B2 specifics.

**Docs-only, no code changes.** (An earlier revision added an env-var aliasing helper; per review feedback it was removed in favor of documenting the setup users perform themselves.)

Changes to `doc/source/train/user-guides/persistent-storage.rst`:

- Retitles the section to "S3-compatible storage (Backblaze B2, MinIO, etc.)".
- Shows the `endpoint_override` query-parameter form for Backblaze B2 and MinIO (local).
- Notes that the standard AWS environment variables (`AWS_ENDPOINT_URL_S3`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) work with a plain `s3://bucket/path`.
- Documents that Backblaze B2 publishes credentials as `B2_APPLICATION_KEY_ID` / `B2_APPLICATION_KEY`; users set `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` to those values, since pyarrow reads only the AWS-named variables.
- Links a complete end-to-end Backblaze B2 notebook example.

## Related issue number

Related to ray-project#63104

## Checks

- [x] Change is contained to `doc/source/train/user-guides/persistent-storage.rst`.
- [x] No code paths changed; existing tests unaffected.

Signed-off-by: Gonzalo Peña-Castellanos <goanpeca@gmail.com>
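For readers who want to see the `endpoint_override` form from the commit above in concrete terms, here is a minimal sketch. The helper name `s3_compatible_storage_path`, the bucket, the prefix, and the endpoint host are all illustrative assumptions and are not taken from the guide.

```python
import os


def s3_compatible_storage_path(bucket: str, prefix: str, endpoint: str) -> str:
    """Build an s3:// URI that carries the endpoint_override query parameter.

    Hypothetical helper for illustration only; not part of Ray or pyarrow.
    """
    return f"s3://{bucket}/{prefix.strip('/')}?endpoint_override={endpoint}"


# pyarrow reads only the AWS-named variables, so copy the B2-named values
# across when they are present in the environment.
if "B2_APPLICATION_KEY_ID" in os.environ and "B2_APPLICATION_KEY" in os.environ:
    os.environ["AWS_ACCESS_KEY_ID"] = os.environ["B2_APPLICATION_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = os.environ["B2_APPLICATION_KEY"]

# Placeholder bucket and endpoint; substitute the values for your own account.
storage_path = s3_compatible_storage_path(
    "my-bucket", "train-results", "s3.example-region.backblazeb2.com"
)
print(storage_path)
```

The resulting string is what would be passed as `storage_path` in `ray.train.RunConfig`; the exact endpoint host depends on the provider and region.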
…nsferring checkpoints to storage (ray-project#62027)

## Description

How long does a checkpoint take to upload to storage? How long do the workers spend in the `ray.train.report` sync barrier? Currently, Ray Train measures the total time spent in `ray.train.report` rather than these sub-values. The breakdown is important for measuring the difference that options such as async checkpointing make.

This PR adds recording of the total time spent uploading the checkpoint and the time spent synchronizing between workers.

Users can see these metrics on `metrics -> train`

<img width="2529" height="1023" alt="image" src="https://github.com/user-attachments/assets/48740e31-555f-4ae4-9349-a27bd5902cf8" />

and `ray_workloads -> train -> selecting an actor-id`

<img width="2521" height="685" alt="image" src="https://github.com/user-attachments/assets/da3a3680-774f-4d59-9259-71b904781179" />

This is the graph for the `test_async_checkpoint_validation` release test, showing these metrics for the five different `ray.train.report` options.

<img width="2247" height="623" alt="image" src="https://github.com/user-attachments/assets/69d5c5e1-4c49-4800-8593-23bcdf56bd5f" />

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
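The idea of splitting one total duration into an upload phase and a barrier phase can be sketched generically as follows. This is not the Ray Train implementation; the `timed` context manager, the `phase_seconds` dictionary, the phase names, and the `report_like_call` stand-in are all assumptions made for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase; phase names are illustrative only.
phase_seconds = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the wall-clock time spent inside the block to phase_seconds[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


def report_like_call():
    """Stand-in for a report call that uploads a checkpoint, then waits at a barrier."""
    with timed("checkpoint_upload"):
        time.sleep(0.01)  # placeholder for the upload work
    with timed("sync_barrier"):
        time.sleep(0.01)  # placeholder for waiting on the other workers


report_like_call()
print(dict(phase_seconds))
```

Recording each phase separately is what makes it possible to compare, for example, a run where uploads overlap with training against one where they do not.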
## Why this change is needed
A missing comma between `"pyarrow.Table"` and `"pandas.DataFrame"` in
the `DataBatchType` Union causes Python to concatenate adjacent string
literals into `"pyarrow.Tablepandas.DataFrame"`, which is not a valid
type.
This makes `DataBatchType` incomplete — it should include both
`pyarrow.Table` and `pandas.DataFrame` as separate Union members,
matching the equivalent `DataBatch` type in `ray.data.block`.
### Before
```python
DataBatchType = Union[
    "numpy.ndarray", "pyarrow.Table" "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]
```
↑ "pyarrow.Table" "pandas.DataFrame" concatenated into
"pyarrow.Tablepandas.DataFrame"
### After
```python
DataBatchType = Union[
    "numpy.ndarray", "pyarrow.Table", "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]
```
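The effect of the missing comma can be checked with the standard library alone. The sketch below is a standalone reproduction, not code from the Ray repository: `buggy`, `fixed`, and the `forward_names` helper are names introduced here for illustration, and none of the referenced packages need to be installed because the members are only string forward references.

```python
from typing import Dict, Union, get_args

# Adjacent string literals are joined by the parser before Union ever sees them.
joined = "pyarrow.Table" "pandas.DataFrame"

buggy = Union[
    "numpy.ndarray", "pyarrow.Table" "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]
fixed = Union[
    "numpy.ndarray", "pyarrow.Table", "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]


def forward_names(union_type):
    """Return the string names of the forward-reference members of a Union."""
    names = []
    for arg in get_args(union_type):
        name = getattr(arg, "__forward_arg__", None)
        if name is not None:
            names.append(name)
    return names


print(joined)
print(len(get_args(buggy)), forward_names(buggy))
print(len(get_args(fixed)), forward_names(fixed))
```

Because the concatenation happens at parse time, nothing fails at import; the mistake only shows up when the Union members are inspected or rendered in a message.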
## Related Issues
`DataBatchType` is referenced extensively in
`ray.air.util.data_batch_conversion`, `ray.data.preprocessor`, and
`ray.data.util.data_batch_conversion`. The incorrect type string could
appear in user-facing error messages and type checking.
## Checks
- [x] I have signed the commits with Developer Certificate of Origin
(DCO)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: awen11123 <awen11123@users.noreply.github.com>
Signed-off-by: awen <444014092@qq.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
…bility (ray-project#63593)

## Description

- **Refine Qwen3-VL VLM video example and documentation.**

  <img width="1293" height="408" alt="Screenshot 2026-05-21 at 8 44 05 PM" src="https://github.com/user-attachments/assets/62bba88f-1b2b-44f6-950c-4386c0fa26fe" />
- **Remove C++ incompatibility guide (only applicable for Ray 2.55).**
- **Add a Ray ↔ vLLM compatibility table.**

  <img width="1300" height="460" alt="Screenshot 2026-05-21 at 8 44 35 PM" src="https://github.com/user-attachments/assets/d50c8d81-a44e-49e8-be2b-f3d5d4762b95" />
- **Update office hour link.**
- **Fix a doc rendering bug**

  **Before**

  <img width="564" height="194" alt="Screenshot 2026-05-21 at 8 57 20 PM" src="https://github.com/user-attachments/assets/292f0b9d-86d6-4552-976f-851efa4ddb8a" />

  **After**

  <img width="533" height="198" alt="Screenshot 2026-05-21 at 9 00 37 PM" src="https://github.com/user-attachments/assets/74243bf6-cda8-45c7-b504-ee8514a06ec4" />
- **Fix `vLLMEngineProcessorConfig` inaccurate docstring**

## Related issues

N/A

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>