NOWEB engine: false STARTING/STOPPED status during restart/mutation windows + intermittently-empty timestamps.activity

### Summary

We've observed two reproducible patterns on the WAHA NOWEB engine where `/api/sessions/{name}` GET returns a transient `STARTING` or `STOPPED` status for a session that is actively receiving and processing WhatsApp traffic. These false-status reads cause downstream consumers (health-checks, admin UIs, alerting) to react to a fault that doesn't exist.

Additionally, while investigating remediation we discovered that the `timestamps.activity` field — which would be the obvious diagnostic to distinguish false-status reads from real failures — is intermittently absent from session-info responses. It was populated during initial investigation and absent during a follow-up smoke check 6 hours later, on the same WAHA container instance, with no engine restart between the two observations.

Both issues appear to be NOWEB-specific (we have not validated WEBJS or GOWS).

**Severity:** MEDIUM. No data loss. Causes false recovery actions and operator confusion in downstream consumers.

**Environment:**

- WAHA image: `devlikeapro/waha-plus:noweb` (sha `bc9689e1c93c777cee7d37973e22106a5f813d546851a4f8a7cf15dd008e3e7a`)
- Engine: NOWEB
- 4 long-lived sessions, all paired
- WAHA exposed on `127.0.0.1:3004` to one downstream consumer
- Container uptime at observation: 2+ days, stable
- Webhook delivery flowing normally (`WebhookSender` POST → 200 OK on continuous events)

### Pattern A — false `STOPPED` post-restart of webhook consumer

When the downstream webhook consumer restarts, the `/api/sessions/{name}` GET response transiently reports `status: "STOPPED"` for a session that is, by every other measurable signal, alive. The flap typically resolves within 0-15 seconds. The session never actually drops — WAHA continues sending webhook events for it throughout the window.

**Reproducer (high level):**

```bash
# Terminal 1 — high-frequency poller on /api/sessions/{name}
for i in $(seq 1 60); do
  ts=$(date -Iseconds)
  r=$(curl -s "http://127.0.0.1:3004/api/sessions/omer" \
      -H "X-Api-Key: $WAHA_API_KEY")
  # Quote "$r" so the JSON body is not word-split or glob-expanded by the shell.
  echo "$ts $(echo "$r" | jq -r .status) activity=$(echo "$r" | jq -r .timestamps.activity)"
  sleep 1
done > /tmp/waha-poll.log &

# Terminal 2 — webhook-flow tail (proves session is alive)
docker logs waha-noweb --since "30s" --follow 2>&1 \
  | grep "session\":\"omer.*WebhookSender" > /tmp/waha-webhooks.log &

# Terminal 3 — restart the downstream webhook consumer 3x ~60s apart
for cycle in 1 2 3; do
  systemctl --user stop my-webhook-consumer
  systemctl --user start my-webhook-consumer
  sleep 60
done
```

**Expected output in `/tmp/waha-poll.log`:**
1-3 ticks during the +0..+15s window after each consumer-start show `status: "STOPPED"` while `/tmp/waha-webhooks.log` shows continuous `WebhookSender ... status code: 200` for the same session.
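
To make the suspect ticks easier to spot, the poll log can be summarized with a small helper. This is a minimal sketch, assuming each log line has the shape `<iso-timestamp> <status> activity=<value>` written by the Terminal 1 loop above; the function name `summarize_poll_log` is ours and is not part of WAHA.

```bash
# summarize_poll_log FILE
# Prints every non-WORKING tick, then a per-status tick count.
# Assumes lines shaped like: "<iso-timestamp> <status> activity=<value>"
summarize_poll_log() {
  awk '
    NF < 2 { next }                                  # skip blank or truncated lines
    { count[$2]++ }                                  # tally ticks per status
    $2 != "WORKING" { print "suspect tick: " $0 }    # surface candidate false reads
    END { for (s in count) printf "%s=%d\n", s, count[s] }
  ' "$1"
}
```

Running `summarize_poll_log /tmp/waha-poll.log` after the three restart cycles lists the candidate false reads; their timestamps can then be compared by hand against `/tmp/waha-webhooks.log`.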

**Live capture from our production environment (2026-05-12):**

```
consumer restarted at 18:58:42 IDT (PID 2923629)
18:58:50 webhook HMAC ok session=session_a
18:58:51 webhook HMAC ok session=session_a
18:58:52 webhook HMAC ok session=session_a
18:58:52 GET /api/sessions/session_a → status="STARTING"   ← false-status read
18:58:52 GET ...               (webhook for same session still firing)
18:58:53 webhook HMAC ok session=session_a
18:58:53 webhook HMAC ok session=session_a
18:58:53 webhook HMAC ok session=session_a
```

(In this particular capture the flap surfaced as `STARTING` rather than `STOPPED`; see Pattern B below. We assume the mechanism is the same, but have not confirmed it.)

**Historical evidence:** Our prior investigation captured 6 false-`STOPPED` events in a 7-minute window (14:25-14:32 IDT 2026-05-08) during which the affected session had zero `session.status` webhook events and continuous `WebhookSender 200 OK` deliveries.

### Pattern B — false `STARTING` post-session-config-mutation

When the downstream consumer issues a PUT that causes WAHA to re-read session config (in our case via a consumer-side admin endpoint that touches our local SessionDb), the same `/api/sessions/{name}` GET briefly returns `status: "STARTING"`. Self-heals within seconds.

**Reproducer (high level):**

```bash
# 1. Trigger a session-config mutation (the exact endpoint that triggers re-read
#    is implementation-dependent — any path that causes WAHA to re-init NOWEB
#    session config should reproduce).

# 2. High-frequency poll for 30s:
for i in $(seq 1 60); do
  r=$(curl -s "http://127.0.0.1:3004/api/sessions/omer" \
      -H "X-Api-Key: $WAHA_API_KEY")
  # Two quoted arguments keep the output on one line with a single space between fields.
  echo "$(date -Iseconds) status=$(echo "$r" | jq -r .status)" \
       "activity=$(echo "$r" | jq -r .timestamps.activity)"
  sleep 0.5
done
```

**Expected:** At least one tick shows `status: "STARTING"` while webhook traffic for the same session continues uninterrupted in `docker logs waha-noweb`.

### Pattern C — intermittently-empty `timestamps.activity`

This is the more puzzling one. On the morning of 2026-05-12, all 4 of our sessions returned `timestamps.activity: <recent-millis-epoch>` in `/api/sessions/{name}` GET responses. That signal allowed us to draft a defensive cross-check (the consumer-side workaround described below). That evening (~6 hours later, same WAHA container, no engine restart), all 4 sessions returned `timestamps: {}`: the field was absent.

**Live capture (2026-05-12 ~19:00 IDT, production):**

```bash
$ for s in session1 session2 session3 session4; do
    curl -s "http://127.0.0.1:3004/api/sessions/$s" \
      -H "X-Api-Key: $WAHA_API_KEY" | jq "{status, timestamps}"
  done

{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
```

5-tick stability poll (2s interval, 10s span), same response shape on every tick:

```bash
$ for i in 1 2 3 4 5; do
    curl -s "http://127.0.0.1:3004/api/sessions/session1" \
      -H "X-Api-Key: $WAHA_API_KEY" | jq -c "{status, timestamps}"
    sleep 2
  done

{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
```

WAHA is otherwise healthy in this state — `WebhookSender` is firing 200 OKs for `engine.event`, `group.v2.participants`, `presence.update` on every session, and the container has been running 2+ days without restart.
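
To narrow down when the field disappears, a low-frequency watcher could log only the transitions between "present" and "absent". The sketch below assumes the same base URL and `WAHA_API_KEY` variable as the captures above; the helper names `activity_state` and `watch_activity` and the 60-second default interval are illustrative choices, not anything WAHA provides.

```bash
# activity_state: reads one session-info JSON document on stdin and prints
# "present <millis>" when .timestamps.activity is set, otherwise "absent".
activity_state() {
  jq -r 'if (.timestamps.activity? // null) == null
         then "absent"
         else "present \(.timestamps.activity)"
         end'
}

# watch_activity SESSION [INTERVAL_SECONDS]
# Polls the session and prints a line only when the field's state changes.
watch_activity() {
  local session=$1 interval=${2:-60} prev="" cur
  while :; do
    cur=$(curl -s "http://127.0.0.1:3004/api/sessions/$session" \
            -H "X-Api-Key: $WAHA_API_KEY" | activity_state | cut -d' ' -f1)
    cur=${cur:-unreachable}   # empty output means curl or jq produced nothing
    if [ "$cur" != "$prev" ]; then
      echo "$(date -Iseconds) $session timestamps.activity: ${prev:-unknown} -> $cur"
      prev=$cur
    fi
    sleep "$interval"
  done
}
```

Leaving `watch_activity session1` running next to `docker logs waha-noweb` would give a timestamp for each flip that can be matched against whatever WAHA was logging at that moment.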

**Question for WAHA maintainers:** Under what conditions is `timestamps.activity` populated vs. absent on the NOWEB engine? Is the field intended to be authoritative for "session has recent WhatsApp activity"? If yes, why does it disappear from the response while the session continues processing events? If no, what is the intended diagnostic field consumers should use to distinguish a real disconnect from a transient status flap?

### Expected behavior

1. `/api/sessions/{name}` GET should NOT report `status: "STOPPED"` or `"STARTING"` for a session that is actively receiving webhook traffic (verifiable in `docker logs waha-noweb` at the time of the GET).
2. If `timestamps.activity` is the intended liveness signal, it should be populated consistently — at least whenever the session has produced a webhook event in the last few seconds.
3. Ideally, the worker-state-machine transitions should be invisible to the external HTTP surface — `/api/sessions/{name}` should return `WORKING` continuously for a session that is connected, regardless of internal worker reassignment / re-init activity.

### Workaround (consumer-side, shipped 2026-05-12)

Because we can't gate downstream consumers on a field that intermittently disappears, we built a **defensive 2-consecutive-read state machine** in our health-check and session-state-poller layers. It treats `STARTING` and `STOPPED` reads as suspect (rather than actionable) when `timestamps.activity` IS present and fresh, and falls through to existing fail-loud behavior when the field is absent. The workaround is documented in our v3.28 phase 249 (private repo; commit refs withheld).

The workaround **does not** address the root cause; it only prevents our consumers from reacting to the false reads when the diagnostic field happens to be populated. With Pattern C (intermittent absence of `timestamps.activity`) being more pervasive than we initially modeled, the workaround currently provides no suppression on the live container — the false-status reads still fall through to fail-loud, just as they did pre-workaround.
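
For readers who want to reproduce the idea, the decision logic can be sketched roughly as follows. This is an illustration of the rule described above, not our production code: the verdict names (`ok`, `suspect`, `act`), the function name `classify_read`, and the 30-second `FRESH_MS` window are all hypothetical, and "2 consecutive reads" is interpreted here as "act only on the second consecutive non-`WORKING` read while activity is fresh".

```bash
# Hypothetical freshness window in milliseconds; the real value is not stated here.
FRESH_MS=${FRESH_MS:-30000}

# classify_read STATUS ACTIVITY_MS NOW_MS PREV
#   STATUS       .status from GET /api/sessions/{name}
#   ACTIVITY_MS  .timestamps.activity (epoch millis), or "" / "null" when absent
#   NOW_MS       current time in epoch millis
#   PREV         verdict of the previous read: ok | suspect | act
# Prints one verdict: ok | suspect | act
classify_read() {
  local status=$1 activity=$2 now=$3 prev=$4
  if [ "$status" = "WORKING" ]; then
    echo ok; return
  fi
  # Field absent or non-numeric: no diagnostic, keep the fail-loud behavior.
  case "$activity" in
    ''|null|*[!0-9]*) echo act; return ;;
  esac
  # Activity is stale: treat the non-WORKING status as real.
  if [ $(( now - activity )) -gt "$FRESH_MS" ]; then
    echo act; return
  fi
  # Activity is fresh: the first bad read is only suspect, the second one acts.
  if [ "$prev" = "suspect" ]; then echo act; else echo suspect; fi
}
```

A caller would keep the previous verdict per session and feed it back in on the next poll. Note that with `timestamps: {}` every non-`WORKING` read lands in the `act` branch immediately, which is why the workaround gives no suppression in the state captured under Pattern C.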

### Suggested fixes (in priority order)

1. **Stabilize the `/api/sessions/{name}` status field** during worker-state-machine transitions. Cache the most recent `WORKING` status for ~5-10s if NOWEB internal state is in flux but the WhatsApp connection itself hasn't dropped (verifiable by `lastSeen` on the underlying socket).
2. **Make `timestamps.activity` consistently populated** on the NOWEB engine. The field is documented in the OpenAPI schema; consumers reasonably assume it's always present once a session is past `STARTING`. If there's a maintenance cycle that empties the timestamps object (e.g., NOWEB store cleanup, internal worker reassignment), preserve the activity timestamp across that cycle.
3. **Document the intended liveness signal.** If `timestamps.activity` is not the authoritative field, document which field downstream consumers should use to distinguish a real disconnect from a worker-state-machine flap. We considered `assignedWorker` and `engine.engine` — `assignedWorker` was empty across 40/40 samples in our investigation; `engine.engine` is static and doesn't carry liveness.

### Additional context

- We've maintained a private patch system for unrelated firewall behavior since 2025-Q4. We deliberately chose **not** to patch the NOWEB engine for this issue, because the diagnostic-field-disappearing pattern (Pattern C) suggests the symptom space is larger than a single state-machine boundary and would require a broader fix that belongs upstream.
- Happy to provide more capture data, longer reproducer transcripts, or test against a candidate fix if helpful.

### Reproducer environment metadata

```
WAHA image:  devlikeapro/waha-plus:noweb
WAHA sha:    bc9689e1c93c777cee7d37973e22106a5f813d546851a4f8a7cf15dd008e3e7a
Engine:      NOWEB
Container:   running, 2+ days uptime at time of capture
Host OS:     Ubuntu (consumer host)
Sessions:    4 long-lived, paired
Webhook:     Continuous activity ≥ 1 event/sec on the affected session
```

### Severity rationale

**MEDIUM**, not HIGH:

- No data loss — webhooks still deliver, messages still send.
- No session drop — the affected sessions are paired and connected throughout.
- Downstream consumers can defend — though the Pattern C field-disappearance reduces the defense's effectiveness in practice.

**Not LOW**:

- Causes false recovery actions (spurious `POST /sessions/{name}/restart`) in fail-loud consumers.
- Produces user-visible status flicker in admin UIs (e.g., "session showing STOPPED" alerts that resolve on their own).
- Creates noise in alerting pipelines (`Health check DEGRADED` warnings that fire and then resolve), training operators to ignore alerts.
- Pattern C (intermittently-empty `timestamps.activity`) means even consumers willing to invest in defensive code have no reliable diagnostic to lean on.


[![patron:PLUS](https://img.shields.io/badge/patron-PLUS-a0e6ba)](https://waha.devlike.pro/docs/how-to/plus-version/#tiers)
