
[core] (FreeObjects 2/n) Adding dead owner callback through raylet GCS listener#1

Draft
aaronscalene wants to merge 163 commits into aaron/free-objects-objm-fix from aaron/passive-owner-callback

Conversation

@aaronscalene

Owner

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@aaronscalene aaronscalene changed the title Aaron/passive owner callback [2/n] May 6, 2026
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 36d98a3 to b1f06f1 Compare May 6, 2026 19:27
@aaronscalene aaronscalene force-pushed the aaron/free-objects-objm-fix branch from 9a872b8 to 0408656 Compare May 8, 2026 03:06
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch 2 times, most recently from 483b867 to 9482fe8 Compare May 8, 2026 23:21
@aaronscalene aaronscalene force-pushed the aaron/free-objects-objm-fix branch from 0408656 to 4f1ef12 Compare May 15, 2026 21:17
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 3c5129a to 01dd4d9 Compare May 15, 2026 21:20
@aaronscalene aaronscalene changed the title [2/n] [core] (FreeObjects) Adding GCS listener object freeing May 20, 2026
@aaronscalene aaronscalene changed the title [core] (FreeObjects) Adding GCS listener object freeing [core] (FreeObjects) Adding object freeing for GCS listener May 20, 2026
@aaronscalene aaronscalene changed the title [core] (FreeObjects) Adding object freeing for GCS listener [core] (FreeObjects) Adding dead owner callback through raylet GCS listener May 20, 2026
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 01dd4d9 to a8c096f Compare May 20, 2026 22:22
@aaronscalene aaronscalene changed the title [core] (FreeObjects) Adding dead owner callback through raylet GCS listener [core] (FreeObjects 2/n) Adding dead owner callback through raylet GCS listener May 21, 2026
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from a8c096f to cc91473 Compare May 22, 2026 21:24

@aaronscalene aaronscalene left a comment


reviewed

…e/ray into aaron/passive-owner-callback

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
…e/ray into aaron/passive-owner-callback

Signed-off-by: aaron.li <aaron.li@anyscale.com>
…allback

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Phucvt123 and others added 26 commits June 5, 2026 11:33
…in (ray-project#63797)

## Why are these changes needed?

`ray job submit` uses `subprocess.list2cmdline` to join the entrypoint
arguments into a command string. This function wraps arguments
containing spaces in **double quotes** (`"`), which causes POSIX shells
(`/bin/sh`) on the job server to expand `$VAR` references — silently
eating environment variables that the user intended to preserve as
literal strings.

For example:
```bash
ray job submit -- echo 'python -m launcher --config $CONFIG_PATH'
```

**Before (bug):** CLI sends `echo "python -m launcher --config
$CONFIG_PATH"` → server shell expands `$CONFIG_PATH` to empty → output:
`python -m launcher --config`

**After (fix):** CLI sends `echo 'python -m launcher --config
$CONFIG_PATH'` → server shell preserves literal → output: `python -m
launcher --config $CONFIG_PATH`

The fix replaces `subprocess.list2cmdline` (designed for Windows
`cmd.exe`) with `shlex.join` (designed for POSIX shells), which wraps
arguments in **single quotes** (`'`) to prevent variable expansion —
matching standard POSIX shell conventions.
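
The quoting difference can be reproduced with the standard library alone. This is a minimal sketch of the two joining functions applied to the example entrypoint above, not the actual `ray job submit` code path:

```python
import shlex
import subprocess

# Entrypoint as the CLI receives it, after the local shell strips the outer quotes.
entrypoint = ["echo", "python -m launcher --config $CONFIG_PATH"]

# Windows-style joining: double quotes, so a POSIX shell still expands $CONFIG_PATH.
windows_style = subprocess.list2cmdline(entrypoint)

# POSIX-style joining: single quotes, so the shell keeps $CONFIG_PATH literal.
posix_style = shlex.join(entrypoint)

if __name__ == "__main__":
    print(windows_style)
    print(posix_style)
```

Running the script prints the "before" and "after" command strings described above.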

## Related issue number

Fixes ray-project#56232

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've made sure the tests are passing.
- [x] Testing Strategy

---------

Signed-off-by: Vũ Trần Phúc <Vuphuccc@gmail.com>
…atus (ray-project#62934)

## Description

`ClusterStatus.stats` currently uses `field(default_factory=Stats)`, but
`Stats` requires the positional argument `gcs_request_time_s`.

As a result, `ClusterStatus()` cannot be default-constructed and raises:

```text
TypeError: Stats.__init__() missing 1 required positional argument: 'gcs_request_time_s'
```

This PR fixes the invalid default by making ClusterStatus construct a
valid default Stats object, and adds a regression test to cover the
empty/default construction path.

This is a small schema-level fix and does not change the normal
populated paths where callers already pass an explicit Stats(...)
instance.
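
The failure mode can be sketched with simplified stand-in dataclasses. The class bodies below are assumptions for illustration (the real `Stats` and `ClusterStatus` have more fields), and the placeholder value `0.0` is hypothetical, not necessarily the default chosen in this PR:

```python
from dataclasses import dataclass, field


@dataclass
class Stats:
    # Required positional argument, as in the error message above.
    gcs_request_time_s: float


@dataclass
class BrokenClusterStatus:
    # default_factory=Stats calls Stats() with no arguments, which raises TypeError.
    stats: Stats = field(default_factory=Stats)


@dataclass
class FixedClusterStatus:
    # Assumed fix: the factory builds a Stats with an explicit placeholder value.
    stats: Stats = field(default_factory=lambda: Stats(gcs_request_time_s=0.0))


def can_default_construct(cls) -> bool:
    """Return True if cls() succeeds, False if it raises TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Note that the broken class definition itself is accepted; the error only appears when the class is instantiated without arguments, which is why a regression test for the empty construction path is useful.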


## Related issues
ray-project#62933 

## Additional information

Implementation notes:
- update ClusterStatus.stats to use a valid default Stats value
- add a regression test for ClusterStatus()

---------

Signed-off-by: weimingdiit <weimingdiit@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
## Description

We noticed that some Redis/Valkey providers don't reject or reset plain
TCP connections when the server requires TLS. They leave the connections
hanging and never respond. Here is a tcpdump capture showing this server
behavior:

```
(base) ray@ip-10-0-241-152:~/default$ sudo tcpdump -i any tcp and host clustercfg.internal-testing-with-tls-enabled.fjpqowie.memorydb.us-west-2.amazonaws.com 
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
22:59:20.106033 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [S], seq 379242323, win 62727, options [mss 8961,sackOK,TS val 3940517090 ecr 0,nop,wscale 7], length 0
22:59:20.106509 ens5  In  IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [S.], seq 916631862, ack 379242324, win 26847, options [mss 8961,sackOK,TS val 2553398864 ecr 3940517090,nop,wscale 7], length 0
22:59:20.106526 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940517090 ecr 2553398864], length 0
22:59:20.106545 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [S], seq 1429463323, win 62727, options [mss 8961,sackOK,TS val 3940517090 ecr 0,nop,wscale 7], length 0
22:59:20.106582 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [P.], seq 1:29, ack 1, win 491, options [nop,nop,TS val 3940517090 ecr 2553398864], length 28: RESP "INFO" "SENTINEL"
22:59:20.106942 ens5  In  IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48128: Flags [S.], seq 911009498, ack 1429463324, win 26847, options [mss 8961,sackOK,TS val 2553398864 ecr 3940517090,nop,wscale 7], length 0
22:59:20.106950 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940517091 ecr 2553398864], length 0
22:59:20.106953 ens5  In  IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [.], ack 29, win 210, options [nop,nop,TS val 2553398864 ecr 3940517090], length 0
23:04:21.939073 ens5  In  IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48120: Flags [.], ack 29, win 210, options [nop,nop,TS val 2553700699 ecr 3940517090], length 0
23:04:21.939073 ens5  In  IP ip-10-0-33-223.us-west-2.compute.internal.redis > ip-10-0-241-152.us-west-2.compute.internal.48128: Flags [.], ack 1, win 210, options [nop,nop,TS val 2553700699 ecr 3940517091], length 0
23:04:21.939081 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48120 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940818923 ecr 2553398864], length 0
23:04:21.939084 ens5  Out IP ip-10-0-241-152.us-west-2.compute.internal.48128 > ip-10-0-33-223.us-west-2.compute.internal.redis: Flags [.], ack 1, win 491, options [nop,nop,TS val 3940818923 ecr 2553398864], length 0
```

In such cases, we shouldn't wait indefinitely for the server to respond.
This PR adds a `redis_db_probe_timeout_milliseconds` timeout to the
first commands sent: `INFO SENTINEL`, plus `AUTH` if authentication is
required. The timeout defaults to 30 seconds.

If the timeout is reached, we will see the following from the `ray
start` stderr output:
```
$ RAY_redis_db_probe_timeout_milliseconds=3000 python -m ray.scripts.scripts start --head --address=127.0.0.1:6380 --redis-password="" --block
[2026-05-05 15:23:57,514 E 93876 7759139] (ray_init) redis_context.cc:684: Timed out waiting for redis command reply (sync).
[2026-05-05 15:23:57,514 E 93876 7759139] (ray_init) redis_context.cc:684: Timed out waiting for redis command reply (sync).
[2026-05-05 15:23:57,524 C 93876 7759139] (ray_init) redis_context.cc:475:  An unexpected system state has occurred. You have likely discovered a bug in Ray. Please report this issue at https://github.com/ray-project/ray/issues and we'll work with you to fix it. Check failed: reply Failed to get Redis info within 30000ms.
```
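
The idea of bounding the first probe can be sketched in Python with a plain socket. This is an illustration only: the PR implements the timeout in C++ (`redis_context.cc`), and `probe_reply` and `make_silent_server` are hypothetical helpers, with the silent server standing in for a TLS-only endpoint that accepts plain TCP but never answers:

```python
import socket


def probe_reply(host: str, port: int, payload: bytes, timeout_s: float) -> bool:
    """Return True if the peer sends any reply within timeout_s seconds."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.settimeout(timeout_s)
        conn.sendall(payload)
        try:
            # An empty read means the peer closed the connection without replying.
            return len(conn.recv(1)) > 0
        except socket.timeout:
            # No bytes arrived before the deadline: treat the probe as failed.
            return False


def make_silent_server() -> socket.socket:
    """Listen on a free local port but never read or reply (hypothetical stand-in)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server
```

Without the timeout, the `recv` call would block for as long as the server keeps the connection open, which matches the five-minute gap visible in the tcpdump capture.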

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
## Why are these changes needed?

An audit of the LLM usage telemetry found the per-model statistics are
structurally wrong, independent of any particular deployment. The
detached telemetry actor appends one entry to its model list **per
replica engine-start** (and again on every restart / autoscale /
redeploy) and is never reset, so each comma-joined tag:

- grows **without bound** over a long-lived cluster's lifetime (risking
silent truncation at the usage-tag value size limit), and
- duplicates each model by its replica/restart count, inflating every
per-model list and count.

Two values were also wrong by construction: `num_replicas` was hardcoded
to `1` for non-autoscaling deployments (and `num_replicas="auto"` was
dropped entirely, since the stored config keeps the literal string), and
the JSON_MODE tags were built from a hardcoded `use_json_mode=True` for
a dimension that has no deployment-time config, making them duplicates
of the MODELS / NUM_REPLICAS tags. These are observable failure modes,
not hypothetical.

### Serve (`observability/usage_telemetry/usage.py`)
- **Dedup by `model_id`** (last-write-wins dict) instead of an
append-only list. Fixes the double-counting across replicas/restarts and
the unbounded memory growth on the head-node actor; tags now carry one
entry per distinct model. `model_id` is identity only and is never
recorded as a value.
- **Report the configured replica count**: fixed `num_replicas` instead
of hardcoded `1`, and resolve `num_replicas="auto"` to its autoscaling
config (reported as autoscaling) rather than dropping it.
- **Drop the JSON_MODE tags** (proto 602/603 `reserved`); they carried
no signal.
- **Telemetry can no longer break engine start**: `_multiple_apps` never
raises, and per-model reporting is wrapped so a failure is logged, not
propagated.
- **Atomic agent creation** via `get_if_exists`; removed the shadowed
retry args.

### Batch (`batch/observability/usage_telemetry/usage.py` + processors)
- Same dedup fix (key by telemetry identity) plus a `_reset` hook.
- Guarded, atomic agent creation and a non-raising push path so
telemetry cannot break processor construction.
- HTTP processor now passes `batch_size`; vLLM processor now sources
`data_parallel_size` (both previously reported 0).
- Fixed the `LLM_BATCH_CONCURRENCY` proto comment.

### Tests
Updated existing expectations and added regression tests:
replica/restart dedup, fixed `num_replicas`, and `num_replicas="auto"`
reported as autoscaling.

> [!NOTE]
> This reserves two released usage tags (602/603) and stops recording
them. Per the proto header, data-collection changes need sign-off from
@pcmoritz / @thomasdesr.

## Checks
- [x] Added regression tests.
- [x] Signed off with DCO.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Description
1. relax oss test_collective target to 300s timeout

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…adcast instead (ray-project#63723)

We continue to optimistically deduct resources when scheduling placement
groups in the GCS, but no longer deduct those resources on placement
group scheduling failure. Instead, we rely on the regular resource view
broadcast to reconcile the resource view between the GCS and Raylet(s).

Closes: ray-project#62858

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…roject#63880)

## Why

A 6-month GA4 audit of `docs.ray.io` found **2,017 out-of-support
legacy-version paths receiving ~552K views with no redirect rule and no
equivalent on current docs**. Under any legacy-cohort cutover, that
traffic lands on 404s. This PR adds the redirect coverage for the bulk
of that gap.

## What

21 `page` catch-all rules appended to `doc/redirects/current.yaml`,
covering **~96% of the gap by views and by paths**. Each cluster lands
on its nearest surviving section index:

- **Autodoc / API refs** (RLlib `package_ref`, Serve, Data, Tune, Train,
Ray Core, Observability) → that library's API reference index.
- **Examples** (Ray AIR, Tune, Train, Ray Core) → the examples gallery.
- **Restructured sections** (Kubernetes, Serve tutorials) → the section
index.
- **Deprecated, no current equivalent**: Ray AIR → AIR getting-started;
Ray Workflows → Ray overview.
- **Sphinx `_modules/` source views** → the Ray source on GitHub (a
view-source request for a symbol that no longer exists is better served
by GitHub search than a 404).
- Three targeted `rllib-*` overrides precede the RLlib catch-all so
high-traffic renamed pages land on their real equivalents.

## Design

All rules are `type: page` with `force` omitted (defaults to `false`),
so each **fires only when the path would 404**. Pages that still exist
with the same name on any version resolve untouched; only
moved/deleted/renamed paths hit the catch-all. This composes with the
planned legacy-cohort cutover (`/en/releases-X.Y.Z/*` →
`/en/latest/:splat`): the cutover preserves same-name paths, and these
rules catch what no longer exists on current docs.
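
The resolution order described here can be modeled in a few lines. This is a simplified sketch of the intended semantics, not Read the Docs' implementation; the prefix matching and every path below are hypothetical:

```python
from typing import List, Optional, Set, Tuple


def resolve(path: str, live_pages: Set[str], rules: List[Tuple[str, str]]) -> Optional[str]:
    """Model a non-forced page redirect: rules fire only when the path would 404."""
    if path in live_pages:
        # With force omitted, a page that still exists is served untouched.
        return path
    for prefix, target in rules:
        # Rules are checked in order, so targeted overrides must precede catch-alls.
        if path.startswith(prefix):
            return target
    return None  # No rule matched: the request still ends in a 404.


# Hypothetical rules: one targeted override followed by a section catch-all.
RULES = [
    ("/rllib-training", "/rllib/getting-started.html"),
    ("/rllib", "/rllib/package_ref/index.html"),
]
LIVE = {"/rllib/index.html"}
```

Under this model, a live page resolves to itself, a renamed high-traffic page hits its override, and any other missing path under the section falls through to the section index.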

## Validation

```
rtd-redirects validate doc/redirects/current.yaml
→ 0 error, 49 warning
```

Zero ordering/shadow errors. The 49 chain warnings are benign: each is
an existing rule whose `to` is a live page that happens to sit under one
of the new section wildcards, which never fires on a live page given
`force: false`.

## Notes

- Follows the source-of-truth workflow established in ray-project#63367.
- Application to Read the Docs is a manual maintainer step post-merge
per `doc/redirects/README.md`.

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ay-project#63833)

## Why are these changes needed?

A few small, independent Serve improvements bundled together:

- **gRPC error-path tracing** (`replica.py`): the gRPC success path
already records span attributes and exceptions via `_wrap_request`;
route the direct-ingress unary error path through the same
`_handle_errors_and_metrics` handling (by surfacing the status code and
re-raising) so failed gRPC requests are traced with their status code
and exception instead of silently dropping out of the trace.
- **Timeout bump** for `test_serve_metrics_for_successful_connection` to
reduce flakiness while waiting for metrics.

No production behavior changes beyond the added error-path tracing
instrumentation.

## Related issue number

N/A

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing.
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
The ubsan tests were failing in release at compile time because
shadowed-variable warnings are treated as errors:
```
[2026-06-04T19:43:48Z] In file included from src/ray/asio/io_context_monitor.cc:15:
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h: In constructor 'ray::IOContextMonitor::ProbeState::ProbeState(std::string, instrumented_io_context&, std::shared_ptr<ray::ClockInterface>)':
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:64:48: error: declaration of 'clock' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z]    64 |                std::shared_ptr<ClockInterface> clock)
[2026-06-04T19:43:48Z]       |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:69:43: note: shadowed declaration is here
[2026-06-04T19:43:48Z]    69 |     const std::shared_ptr<ClockInterface> clock;
[2026-06-04T19:43:48Z]       |                                           ^~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:63:41: error: declaration of 'io_context' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z]    63 |                instrumented_io_context &io_context,
[2026-06-04T19:43:48Z]       |                ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:68:30: note: shadowed declaration is here
[2026-06-04T19:43:48Z]    68 |     instrumented_io_context &io_context;
[2026-06-04T19:43:48Z]       |                              ^~~~~~~~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:62:28: error: declaration of 'name' shadows a member of 'ray::IOContextMonitor::ProbeState' [-Werror=shadow]
[2026-06-04T19:43:48Z]    62 |     ProbeState(std::string name,
[2026-06-04T19:43:48Z]       |                ~~~~~~~~~~~~^~~~
[2026-06-04T19:43:48Z] bazel-out/k8-opt/bin/src/ray/asio/_virtual_includes/io_context_monitor/ray/asio/io_context_monitor.h:67:23: note: shadowed declaration is here
[2026-06-04T19:43:48Z]    67 |     const std::string name;
[2026-06-04T19:43:48Z]       |                       ^~~~
[2026-06-04T19:43:48Z] cc1plus: all warnings being treated as errors
```

This PR fixes the build by renaming the constructor arguments so they no
longer shadow the members, and adds a missing dependency for an include
in `memory_monitor_test_fixture`.


## Related issues

## Additional information

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
… KubeRay (ray-project#63465)

## Description
Terminate a cluster managed by the V2 autoscaler when no user driver is
attached. Related to ray-project/kuberay#4815

When `autoscalerOptions.noDriverTimeoutSeconds` is set, the V2
autoscaler evaluates a no-driver predicate every reconcile loop and,
when it fires, patches a single annotation on the RayCluster CR:

```yaml
metadata:
  annotations:
    ray.io/no-driver-ttl-expired: "true"
```

The KubeRay operator observes the annotation and decides the terminal
action (deleting the RayCluster).

A cluster is eligible for termination only when both of the following
hold:

  1. No active user driver is attached.
  2. Condition 1 has held continuously for at least
     `noDriverTimeoutSeconds`.

Note that **detached actors do not count as a driver; a cluster running
only detached actors is still eligible for termination.**

### Changes

This PR adds `autoscalerOptions.noDriverTimeoutSeconds`. The decision
lives on `KubeRayProvider`: it tracks how long the cluster has had no
driver attached and, once the timeout is exceeded, dispatches an
annotation for KubeRay to
terminate the cluster, freeing the head pod and any reserved capacity
that would otherwise linger.

1. New autoscalerOptions.noDriverTimeoutSeconds field, V2 + KubeRay only

- Existing CRs and existing V1 / non-KubeRay deployments see no behavior
change.
- The field is read only by KubeRayProvider; unset disables the feature.

2. No-driver decision lives on KubeRayProvider

- Evaluated against `gcs_client.get_all_job_info(...)`, filtering out
Ray dashboard jobs. Fails closed.
- The provider records when the cluster was first seen with no driver,
and dispatches once that has held for `noDriverTimeoutSeconds`. The
timer resets if a driver reappears.

3. Dispatch: single annotation on the RayCluster CR

The reconciler calls `evaluate_no_driver_termination`, which patches the
RayCluster CR with `ray.io/no-driver-ttl-expired: "true"`. The KubeRay
Operator implementation is covered in
ray-project/kuberay#4815.
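
The timer behavior described above (start on the first no-driver observation, reset when a driver reappears, fire once the timeout has elapsed) can be sketched as a small state holder. `NoDriverTimer` and its method are hypothetical names; the real logic lives on `KubeRayProvider` and reads job info from the GCS:

```python
from typing import Optional


class NoDriverTimer:
    """Track how long a cluster has continuously had no driver attached."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        # Timestamp of the first reconcile loop that saw no driver, if any.
        self._no_driver_since: Optional[float] = None

    def should_terminate(self, has_driver: bool, now: float) -> bool:
        """Return True once the no-driver state has held for timeout_s seconds."""
        if has_driver:
            # A driver reappeared: reset the timer.
            self._no_driver_since = None
            return False
        if self._no_driver_since is None:
            self._no_driver_since = now
        return now - self._no_driver_since >= self.timeout_s
```

Passing `now` in explicitly keeps the sketch deterministic under test; a caller would invoke `should_terminate` once per reconcile loop and patch the annotation when it returns `True`.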

## Related issues
Closes ray-project#63452

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: win5923 <ken89@kimo.com>
…export (ray-project#63744)

## Description

This PR normalizes OpenTelemetry metric attribute sets before handing
observations to the Prometheus exporter.

Some Ray components can emit the same metric with heterogeneous
attribute sets, for example when one data point includes `SessionName`
and another data point for the same metric does not. With older
`opentelemetry-exporter-prometheus` versions used by Ray's default
compiled dependencies, metrics can reach Prometheus export with mixed
label key sets. This can produce misaligned Prometheus label values,
such as `dataset="core_worker"` or `operator="<node ip>"`, making Ray
Data dashboards misleading.

This change makes Ray enforce a stable label schema at the export
boundary:

- For observable gauge, counter, and sum callbacks, collect the union of
attribute keys for each metric and fill missing values with `""`.
- For reconstructed histogram batches in the dashboard reporter,
normalize all batch data points to the union of tag keys before
recording them.
- Add regression coverage for mixed attribute sets in observable metric
callbacks and histogram export.

This does not depend on upgrading `opentelemetry-exporter-prometheus`.
It is also compatible with newer exporter versions that perform similar
normalization internally; in that case Ray provides already-normalized
observations and the exporter-side normalization is effectively
idempotent.
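
The union-and-fill normalization can be sketched as below. The data-point shape (an attribute dict paired with a value) and the function name are assumptions for illustration; the real code operates on OpenTelemetry observations:

```python
from typing import Dict, List, Tuple

Point = Tuple[Dict[str, str], float]


def normalize_attributes(points: List[Point]) -> List[Point]:
    """Give every data point of one metric the same attribute keys."""
    # Union of all attribute keys seen across the metric's data points.
    keys = sorted({key for attrs, _ in points for key in attrs})
    # Fill keys a point is missing with the empty string.
    return [({key: attrs.get(key, "") for key in keys}, value) for attrs, value in points]
```

Applying the function to already-normalized points returns them unchanged, which is why it composes safely with exporters that normalize internally.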

## Related issues

Fixes ray-project#63499.

## Additional information

This PR intentionally keeps the fix in the Python export path. The issue
is caused by heterogeneous label key sets, not by nondeterministic tag
ordering, so this avoids changing the metric record path or upgrading
OpenTelemetry dependencies.

Tests:

```bash
python -m py_compile python/ray/_private/telemetry/open_telemetry_metric_recorder.py python/ray/dashboard/modules/reporter/reporter_agent.py python/ray/dashboard/modules/reporter/tests/test_reporter.py python/ray/tests/test_open_telemetry_metric_recorder.py
git diff --check
```

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
…oject#63402)

## Description

In Ray Client mode, `ray._private.worker._global_node` is `None` because
the client driver is not a Ray worker process, even though
`ray.is_initialized()` is `True` and the cluster is connected.
`get_or_create_stats_actor` used
`_global_node` as a proxy for "connected to Ray" and raised
`RuntimeError` whenever Ray Data tried to register or query the stats
actor, causing `ds.take_batch()`, `ds.iter_batches()`, etc. to crash on
materialized datasets.

Use `ray.is_initialized()` for the connection check and only emit the
`cluster_id` debug log when `_global_node` is available, since
`cluster_id` is not exposed via `ray.get_runtime_context()`.

## Related issues

Closes ray-project#61162

## Additional information

---------

Signed-off-by: Yuang Gao <yg2315@nyu.edu>
What does this PR do?

Adds a module docstring to cpp/example/_BUILD.bazel to fix the
buildifier module-docstring warning.

Related issue

Part of ray-project#50875.

Signed-off-by: HUY <vinhuytran0810@gmail.com>
## Description

- Five startup methods in services.py create output handles using
`open(os.devnull, "w")` but never close them, causing `ResourceWarning:
unclosed file` warnings. Replaced them with `subprocess.DEVNULL`.
- The kill-process method in Node does not call `wait()` after `kill()`
when the argument `wait` is False, causing `ResourceWarning: subprocess
is still running` warnings. Changed it to call `wait()` after `kill()`
to prevent zombie processes.
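
Both patterns can be shown with the standard library. This is a minimal sketch with hypothetical helper names, not the code in services.py or Node:

```python
import subprocess
import sys


def start_quiet_sleeper() -> subprocess.Popen:
    """Start a child whose output is discarded without opening os.devnull by hand."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdout=subprocess.DEVNULL,  # No file object to leak or close.
        stderr=subprocess.DEVNULL,
    )


def kill_and_reap(proc: subprocess.Popen, timeout_s: float = 10.0) -> int:
    """Kill the child, then wait so the OS can release the process entry."""
    proc.kill()
    # Without wait(), the killed child lingers as a zombie until it is reaped.
    return proc.wait(timeout=timeout_s)
```

After `kill_and_reap` returns, `proc.returncode` is set, so the `Popen` object can be garbage-collected without a `ResourceWarning`.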

## Related issues

Fixes ray-project#9546, Fixes ray-project#59782

---------

Signed-off-by: Accurio <2671768169@qq.com>
## Why are these changes needed?

This PR hardens `runtime_env` zip package extraction by resolving
candidate member paths before checking whether they remain inside the
extraction target.

The existing `unzip_package()` implementation builds a candidate path
from the target directory and zip member name, then checks containment
before resolving path components such as `..`. This makes the zip
extraction path inconsistent with the safer resolved-path containment
logic already used by `untar_package()`.

This change updates `unzip_package()` to:

- resolve the extraction target path
- resolve each candidate zip member extraction path
- skip zip entries whose resolved path is outside the target directory
- use the resolved path for directory creation, file writes, and chmod

This keeps zip extraction behavior aligned with the intended path
containment
invariant and with the existing tar extraction implementation.
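
The resolve-then-check approach described above can be sketched as follows. This is an illustration under assumptions, not the patched `unzip_package()`: the function name `safe_unzip` and the in-memory archive are hypothetical, and the chmod step is omitted for brevity.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def safe_unzip(zip_bytes, target_dir):
    """Extract a zip archive, skipping members that escape target_dir."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # Resolve first so ".." components are collapsed before the check.
            candidate = (target / info.filename).resolve()
            if target not in candidate.parents:
                continue  # resolved path is not inside the target: skip it
            if info.is_dir():
                candidate.mkdir(parents=True, exist_ok=True)
                continue
            candidate.parent.mkdir(parents=True, exist_ok=True)
            candidate.write_bytes(zf.read(info))
            extracted.append(info.filename)
    return extracted


# Build a small archive with one well-behaved member and one traversal attempt.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ok/data.txt", "hello")
    zf.writestr("../evil.txt", "should be skipped")

with tempfile.TemporaryDirectory() as tmp:
    extracted = safe_unzip(buf.getvalue(), str(Path(tmp) / "extract"))
    escaped_exists = (Path(tmp) / "evil.txt").exists()
```

Because the containment check runs on the resolved path, the `../evil.txt` member is skipped instead of being written next to the extraction directory.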

## Related issue number

N/A

## Checks

- [x] I've run relevant checks for this change.

## Testing

```shell
python3 -m py_compile python/ray/_private/runtime_env/packaging.py python/ray/tests/test_runtime_env_packaging.py
git diff --check
```

I also ran the three new regression cases in a focused standalone pytest
harness against the patched `unzip_package()` implementation: 3 passed.

---------

Signed-off-by: wonyunkang <rojo.wk@gmail.com>
Signed-off-by: H4ck2 <H4ck2@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: H4ck2 <H4ck2@users.noreply.github.com>
…nt (ray-project#63632)

ray-project#52974 added pydoclint to pre-commit but, instead of fixing the
issues it found, added all the problematic docstrings to an ignore list.
As a result, docstrings with problems are never flagged or fixed, even
though accurate docstrings help agents understand the codebase.

This PR removes all of the ray serve and llm ignores, then uses Claude to
fix the docstrings / type hints (reviewed by me to confirm they match the
implementations).

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…oject#63541)

ray-project#52974 added pydoclint to pre-commit but, instead of fixing the
issues it found, added all the problematic docstrings to an ignore list.
As a result, docstrings with problems are never flagged or fixed, even
though accurate docstrings help agents understand the codebase.

This PR removes all of the `ray._private` ignores, then uses Claude to
fix the docstrings / type hints (reviewed by me to confirm they match the
implementations).

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The Windows CI driver binaries (`//ci/ray_ci:test_in_docker`,
`//ci/ray_ci:build_in_docker_windows`) are non-hermetic `py_binary`s that run on
the Windows agent's Python 3.8 conda base, but their third-party deps were
bundled by bazel from `py_deps_py310` (`release/requirements_py310.txt`).

ray-project#63722 (Starlette security upgrade) bumped `typing-extensions` 4.11.0 ->
4.15.0 in that lock. typing-extensions 4.14.0 dropped Python 3.8, so the driver
started dying at import on every Windows job before any test ran:

```
typing_extensions.py:526 class _SpecialGenericAlias(typing._SpecialGenericAlias, ...)
AttributeError: module 'typing' has no attribute '_SpecialGenericAlias'
```

## What

Resolve and install the driver's deps for **Python 3.8 / win32** on the agent,
mirroring the macOS CI driver (`ci/ray_ci/macos/macos_ci.sh`) -- no bazel
bundling.

- **`release/requirements_windows.in` +
  `ci/raydepsets/configs/ci_windows.depsets.yaml`**
  resolve the driver's closure for py3.8/windows into
  `python/deplocks/ci/ci_windows_depset.lock`. Same package set as
  `requirements_py310.in`, minus the docs toolchain (pydantic forces
  `typing-extensions>=4.14.1`) and twine (`requires-python >=3.9`, only used by
  `//ci/ray_ci/automation`).
- **`ci/ray_ci/windows/install_tools.sh`** `pip install`s that lock into the
  agent's Python 3.8 env (strip hashes, `--no-deps`), so the correct py3.8
  versions (incl. a current `pyOpenSSL` that shadows the agent's stale conda
  copy) are present at runtime.
- **`bazel/ci_require.bzl` + `//ci/ray_ci/deps`** route the driver's third-party
  deps via per-package aliases: bundled from `@py_deps_py310` off Windows, and
  **unbundled on Windows** (resolve to an empty library) so bundled py3.10
  wheels don't shadow the agent's py3.8 versions. `ci_require`/`bk_require` is
  applied in `//ci/ray_ci` and `//release`.


Postmerge build:
https://buildkite.com/ray-project/postmerge/builds/17909/canvas

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ject#62486)

## Summary

The `air_example_gptj_deepspeed_fine_tuning` release test fails due to
dependency incompatibilities introduced when the base ray-ml image was
upgraded to torch 2.7.0+cu128 (triton 3.3.0) in
ray-project#61328.

This PR fixes the dependency issues so the test passes.

Failing build: https://buildkite.com/ray-project/release/builds/87834#_

## Changes

- Update notebook dependency pins: `deepspeed` 0.12.3 -> 0.17.2,
`accelerate` 0.18.0 -> 0.33.0, `transformers` 4.26.0 -> 4.36.2
- Remove `reduce_bucket_size`, `stage3_prefetch_bucket_size`,
`stage3_param_persistence_threshold` from DeepSpeed config (HF Trainer
resolved these "auto" values to floats, which DeepSpeed 0.17.2's
pydantic validation rejects)
- Install `accelerate==0.33.0` and uninstall `peft` in BYOD script (base
image's `peft==0.11.1` is incompatible with `transformers==4.36.2`, and
the notebook's pip install cell is commented out by jupytext during
conversion)
- Remove `torch>=1.12.0` from runtime_env and stale TODO comment

Signed-off-by: JasonLi1909 <jasonli@anyscale.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
…#63406)

## Description

Adds Azure Blob Storage and Azure Files to the Ray Train persistent
storage guide. The page already enumerates AWS S3 / GCS for cloud and
AWS EFS / Google Filestore / HDFS for shared filesystems — Azure's
equivalent offerings (commonly used by Ray Train deployments on Azure /
AKS) are simply missing from the list.

This PR makes the smallest change consistent with the surrounding prose:

- Adds `Azure Blob Storage` and `Azure Files` to the one-line summary.
- Updates the **Shared filesystem (NFS, HDFS)** section header to also
list Azure Files, and threads it through the body sentence.
- Adds an Azure Files example as a commented-out `storage_path` next to
the existing HDFS example, with a one-line note that the share needs to
be mounted on every node first.

No new sections, no example code that has to be tested — just
registry-level changes to surface an already-supported option.

## Related issues

Closes ray-project#54054

This picks up the work from the now-stale ray-project#54055 (which received a
maintainer LGTM before being auto-closed by the stale bot) and ray-project#55862,
rebased against current master. The diff in this PR is functionally the
same as the LGTM'd ray-project#54055, with the body wording polished for the
current document text.

## Additional information

- Single doc file touched:
`doc/source/train/user-guides/persistent-storage.rst`
- +7 / -5 lines
- No code changes, no test changes required

---------

Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
Signed-off-by: Shubhankar Tripathy <reach2shubhankar@gmail.com>
…y-project#63633)

ray-project#52974 added pydoclint to pre-commit but, instead of fixing the
issues it found, added all the problematic docstrings to an ignore list.
As a result, docstrings with problems are never flagged or fixed, even
though accurate docstrings help agents understand the codebase.

This PR removes all of the ray dashboard ignores, then uses Claude to fix
the docstrings / type hints (reviewed by me to confirm they match the
implementations).

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
ray-project#63882)

## Description
For release tests, we wrap each job in `_anyscale_job_wrapper.py`, which,
after the job has finished, polls Prometheus for several metrics to
check whether OOMs or spilling occurred.
However, this mechanism is flaky: Prometheus can time out, in which case
no metric is returned and the checks fail. The problem is that
we have limited logs for understanding what's gone wrong.

This PR improves the logging for these failing jobs in
`anyscale_job_wrapper.py`.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Why are these changes needed?

Ray Train already works with any S3-compatible object store through
pyarrow's `S3FileSystem` (via `endpoint_override` in the `storage_path`
URI, or the standard `AWS_*` environment variables). This PR documents
that path in the Train persistent-storage guide and adds the Backblaze
B2 specifics.

**Docs-only, no code changes.** (An earlier revision added an env-var
aliasing helper; per review feedback it was removed in favor of
documenting the setup users perform themselves.)

Changes to `doc/source/train/user-guides/persistent-storage.rst`:

- Retitles the section to "S3-compatible storage (Backblaze B2, MinIO,
etc.)".
- Shows the `endpoint_override` query-parameter form for Backblaze B2
and MinIO (local).
- Notes that the standard AWS environment variables
(`AWS_ENDPOINT_URL_S3`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
work with a plain `s3://bucket/path`.
- Documents that Backblaze B2 publishes credentials as
`B2_APPLICATION_KEY_ID` / `B2_APPLICATION_KEY`; users set
`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` to those values, since
pyarrow reads only the AWS-named variables.
- Links a complete end-to-end Backblaze B2 notebook example.
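
The credential mapping described above is something users perform themselves. A minimal sketch of that step, assuming a hypothetical helper name `aws_env_from_b2` and a placeholder endpoint URL (neither is part of Ray or of the guide):

```python
import os


def aws_env_from_b2(environ, endpoint_url):
    """Return the AWS-named variables derived from B2-named credentials.

    pyarrow reads only the AWS-named variables, so the B2 values are
    copied across under those names.
    """
    return {
        "AWS_ACCESS_KEY_ID": environ["B2_APPLICATION_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": environ["B2_APPLICATION_KEY"],
        "AWS_ENDPOINT_URL_S3": endpoint_url,
    }


# Placeholder credentials and endpoint, for illustration only.
example_environ = {
    "B2_APPLICATION_KEY_ID": "example-key-id",
    "B2_APPLICATION_KEY": "example-key",
}
aws_env = aws_env_from_b2(example_environ, "https://s3.example.invalid")

# In a real setup the values would come from os.environ and be exported
# before training starts, for example: os.environ.update(aws_env)
```

With these variables set, a plain `s3://bucket/path` storage path is enough, as noted in the list above.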

## Related issue number

Related to ray-project#63104

## Checks

- [x] Change is contained to
`doc/source/train/user-guides/persistent-storage.rst`.
- [x] No code paths changed; existing tests unaffected.

Signed-off-by: Gonzalo Peña-Castellanos <goanpeca@gmail.com>
…nsferring checkpoints to storage (ray-project#62027)

## Description
How long does a checkpoint take to upload to storage? How long do the
workers spend in the `ray.train.report` sync barrier?
Currently, Ray Train measures only the total time spent in `ray.train.report`,
not these sub-values. The breakdown is important for measuring the
difference that options such as async checkpointing make.

This PR records the total time spent uploading the
checkpoint and the time spent synchronizing between workers.
Users can see these metrics on `metrics -> train`
<img width="2529" height="1023" alt="image"
src="https://github.com/user-attachments/assets/48740e31-555f-4ae4-9349-a27bd5902cf8"
/>
and `ray_workloads -> train -> selecting an actor-id`
<img width="2521" height="685" alt="image"
src="https://github.com/user-attachments/assets/da3a3680-774f-4d59-9259-71b904781179"
/>

This is the graph for the `test_async_checkpoint_validation` release
test, showing these timings for the five different `ray.train.report` options.
<img width="2247" height="623" alt="image"
src="https://github.com/user-attachments/assets/69d5c5e1-4c49-4800-8593-23bcdf56bd5f"
/>
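
The idea of splitting one total into per-phase timings can be sketched generically. This is not Ray Train's implementation: the phase names, the `timed` helper, and the `report` stand-in are hypothetical, and the sleeps only simulate an upload and a barrier.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per phase.
totals = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the wall-clock time of the enclosed block to totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] += time.perf_counter() - start


def report(upload, barrier):
    """Stand-in for a report call that uploads, then synchronizes."""
    with timed("report_total"):
        with timed("checkpoint_upload"):
            upload()
        with timed("sync_barrier"):
            barrier()


# Simulated work, for illustration only.
report(lambda: time.sleep(0.02), lambda: time.sleep(0.01))
```

Recording the sub-phases alongside the total makes it possible to tell whether time is going to the upload or to waiting for other workers.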

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
## Why this change is needed

A missing comma between `"pyarrow.Table"` and `"pandas.DataFrame"` in
the `DataBatchType` Union causes Python to concatenate adjacent string
literals into `"pyarrow.Tablepandas.DataFrame"`, which is not a valid
type.

This makes `DataBatchType` incomplete — it should include both
`pyarrow.Table` and `pandas.DataFrame` as separate Union members,
matching the equivalent `DataBatch` type in `ray.data.block`.

### Before
```python
DataBatchType = Union[
    "numpy.ndarray", "pyarrow.Table" "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]
```
↑ "pyarrow.Table" "pandas.DataFrame" concatenated into
"pyarrow.Tablepandas.DataFrame"

### After
```python
DataBatchType = Union[
    "numpy.ndarray", "pyarrow.Table", "pandas.DataFrame", Dict[str, "numpy.ndarray"]
]
```
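
The concatenation behavior can be checked directly. A small sketch using a plain tuple instead of `Union`; the variable names are illustrative only:

```python
# Adjacent string literals are joined at compile time, so this is ONE string.
joined = "pyarrow.Table" "pandas.DataFrame"

# Without the comma the tuple has two members; with it, three.
without_comma = ("numpy.ndarray", "pyarrow.Table" "pandas.DataFrame")
with_comma = ("numpy.ndarray", "pyarrow.Table", "pandas.DataFrame")
```

The same silent merge happens inside any bracketed sequence of literals, which is why the missing comma did not raise a syntax error.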

## Related Issues

`DataBatchType` is referenced extensively in
`ray.air.util.data_batch_conversion`, `ray.data.preprocessor`, and
`ray.data.util.data_batch_conversion`. The incorrect type string could
appear in user-facing error messages and type checking.

## Checks

- [x] I have signed the commits with Developer Certificate of Origin
(DCO)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: awen11123 <awen11123@users.noreply.github.com>
Signed-off-by: awen <444014092@qq.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 8f4c9cd to fdf18e3 Compare June 8, 2026 19:13
jeffreywang-anyscale and others added 3 commits June 8, 2026 12:19
…bility (ray-project#63593)

## Description
- **Refine Qwen3-VL VLM video example and documentation.**
<img width="1293" height="408" alt="Screenshot 2026-05-21 at 8 44 05 PM"
src="https://github.com/user-attachments/assets/62bba88f-1b2b-44f6-950c-4386c0fa26fe"
/>

- **Remove C++ incompatibility guide (only applicable for Ray 2.55).**
- **Add a Ray ↔ vLLM compatibility table.**
<img width="1300" height="460" alt="Screenshot 2026-05-21 at 8 44 35 PM"
src="https://github.com/user-attachments/assets/d50c8d81-a44e-49e8-be2b-f3d5d4762b95"
/>

- **Update office hour link.**



- **Fix a doc rendering bug**

**Before**
<img width="564" height="194" alt="Screenshot 2026-05-21 at 8 57 20 PM"
src="https://github.com/user-attachments/assets/292f0b9d-86d6-4552-976f-851efa4ddb8a"
/>

**After**
<img width="533" height="198" alt="Screenshot 2026-05-21 at 9 00 37 PM"
src="https://github.com/user-attachments/assets/74243bf6-cda8-45c7-b504-ee8514a06ec4"
/>

- **Fix `vLLMEngineProcessorConfig` inaccurate docstring**

## Related issues
N/A


---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>