NOWEB engine: false STARTING/STOPPED status during restart/mutation windows + intermittently-empty timestamps.activity #2073

@omernesh

Summary

We've observed two reproducible patterns on the WAHA NOWEB engine where /api/sessions/{name} GET returns a transient STARTING or STOPPED status for a session that is actively receiving and processing WhatsApp traffic. These false-status reads cause downstream consumers (health-checks, admin UIs, alerting) to react to a fault that doesn't exist.

Additionally, while investigating remediation we discovered that the timestamps.activity field — which would be the obvious diagnostic to distinguish false-status reads from real failures — is intermittently absent from session-info responses. It was populated during initial investigation and absent during a follow-up smoke check 6 hours later, on the same WAHA container instance, with no engine restart between the two observations.

Both issues appear to be NOWEB-specific (we have not validated WEBJS or GOWS).

Severity: MEDIUM. No data loss. Causes false recovery actions and operator confusion in downstream consumers.

Environment:

  • WAHA image: devlikeapro/waha-plus:noweb (sha bc9689e1c93c777cee7d37973e22106a5f813d546851a4f8a7cf15dd008e3e7a)
  • Engine: NOWEB
  • 4 long-lived sessions, all paired
  • WAHA exposed on 127.0.0.1:3004 to one downstream consumer
  • Container uptime at observation: 2+ days, stable
  • Webhook delivery flowing normally (WebhookSender POST → 200 OK on continuous events)

Pattern A — false STOPPED post-restart of webhook consumer

When the downstream webhook consumer restarts, the /api/sessions/{name} GET response transiently reports status: "STOPPED" for a session that is, by every other measurable signal, alive. The flap typically resolves within 0-15 seconds. The session never actually drops — WAHA continues sending webhook events for it throughout the window.

Reproducer (high level):

# Terminal 1 — high-frequency poller on /api/sessions/{name}
for i in $(seq 1 60); do
  ts=$(date -Iseconds)
  r=$(curl -s "http://127.0.0.1:3004/api/sessions/omer" \
      -H "X-Api-Key: $WAHA_API_KEY")
  # Quote "$r" so the JSON reaches jq unmodified (no word splitting or globbing)
  echo "$ts $(echo "$r" | jq -r .status) activity=$(echo "$r" | jq -r .timestamps.activity)"
  sleep 1
done > /tmp/waha-poll.log &

# Terminal 2 — webhook-flow tail (proves session is alive)
docker logs waha-noweb --since "30s" --follow 2>&1 \
  | grep "session\":\"omer.*WebhookSender" > /tmp/waha-webhooks.log &

# Terminal 3 — restart the downstream webhook consumer 3x ~60s apart
for cycle in 1 2 3; do
  systemctl --user stop my-webhook-consumer
  systemctl --user start my-webhook-consumer
  sleep 60
done

Expected output in /tmp/waha-poll.log:
1-3 ticks during the +0..+15s window after each consumer-start show status: "STOPPED" while /tmp/waha-webhooks.log shows continuous WebhookSender ... status code: 200 for the same session.
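To check a finished run against this expectation, the poll log can be summarized with a small helper. This is a minimal sketch, not part of the original reproducer: the function names `count_flap_ticks` and `list_flap_ticks` are made up for illustration, and the parsing assumes every line has the shape written by the Terminal 1 loop (`<timestamp> <STATUS> activity=<value>`).

```shell
# Count poll ticks whose status is anything other than WORKING.
# Assumes lines of the form: "<iso-timestamp> <STATUS> activity=<value>"
#   $1 = path to the poll log (e.g. /tmp/waha-poll.log)
count_flap_ticks() {
  awk '$2 != "WORKING" { n++ } END { print n + 0 }' "$1"
}

# Print the timestamp and status of each non-WORKING tick, for
# side-by-side comparison with the webhook log.
#   $1 = path to the poll log
list_flap_ticks() {
  awk '$2 != "WORKING" { print $1, $2 }' "$1"
}
```

Running `list_flap_ticks /tmp/waha-poll.log` and looking up the same timestamps in /tmp/waha-webhooks.log shows whether webhooks kept flowing during each flap. Note that a failed curl leaves the status field empty, so it is also counted as a non-WORKING tick.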

Live capture from our production environment (2026-05-12):

consumer restarted at 18:58:42 IDT (PID 2923629)
18:58:50 webhook HMAC ok session=session_a
18:58:51 webhook HMAC ok session=session_a
18:58:52 webhook HMAC ok session=session_a
18:58:52 GET /api/sessions/session_a → status="STARTING"   ← false-status read
18:58:52 GET ...               (webhook for same session still firing)
18:58:53 webhook HMAC ok session=session_a
18:58:53 webhook HMAC ok session=session_a
18:58:53 webhook HMAC ok session=session_a

(In this particular capture the flap shape was STARTING rather than STOPPED; see Pattern B below. We believe the mechanism is the same.)

Historical evidence: Our prior investigation captured 6 false-STOPPED events in a 7-minute window (14:25-14:32 IDT 2026-05-08) during which the affected session had zero session.status webhook events and continuous WebhookSender 200 OK deliveries.

Pattern B — false STARTING post-session-config-mutation

When the downstream consumer issues a PUT that causes WAHA to re-read session config (in our case via a consumer-side admin endpoint that touches our local SessionDb), the same /api/sessions/{name} GET briefly returns status: "STARTING". Self-heals within seconds.

Reproducer (high level):

# 1. Trigger a session-config mutation (the exact endpoint that triggers re-read
#    is implementation-dependent — any path that causes WAHA to re-init NOWEB
#    session config should reproduce).

# 2. High-frequency poll for 30s:
for i in $(seq 1 60); do
  r=$(curl -s "http://127.0.0.1:3004/api/sessions/omer" \
      -H "X-Api-Key: $WAHA_API_KEY")
  echo "$(date -Iseconds) status=$(echo "$r" | jq -r .status) activity=$(echo "$r" | jq -r .timestamps.activity)"
  sleep 0.5
done

Expected: At least one tick shows status: "STARTING" while webhook traffic for the same session continues uninterrupted in docker logs waha-noweb.
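If the poll output above is saved to a file, the length of each flap can be estimated from the longest run of consecutive non-WORKING ticks. The helper below is a sketch under that assumption; `longest_flap_run` is a hypothetical name, and it relies on the `status=<VALUE>` token printed by the loop above.

```shell
# Print the longest run of consecutive ticks whose status is not WORKING.
# Assumes each line contains a whitespace-separated "status=<VALUE>" token.
# Lines without such a token end the current run.
#   $1 = path to the saved poll output
longest_flap_run() {
  awk '
    {
      status = ""
      for (i = 1; i <= NF; i++)
        if ($i ~ /^status=/) status = substr($i, 8)   # text after "status="
      if (status != "" && status != "WORKING") {
        run++
        if (run > max) max = run
      } else {
        run = 0
      }
    }
    END { print max + 0 }
  ' "$1"
}
```

With the 0.5s sleep used above, a result of N corresponds to a flap of roughly N × 0.5s plus request latency.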

Pattern C — intermittently-empty timestamps.activity

This is the more puzzling one. On the morning of 2026-05-12, all 4 of our sessions returned timestamps.activity: <recent-millis-epoch> in /api/sessions/{name} GET responses. That signal allowed us to draft a defensive cross-check (the consumer-side workaround described below). That evening (~6 hours later, same WAHA container, no engine restart), all 4 sessions returned timestamps: {}, with the field absent.

Live capture (2026-05-12 ~19:00 IDT, production):

$ for s in session1 session2 session3 session4; do
    curl -s "http://127.0.0.1:3004/api/sessions/$s" \
      -H "X-Api-Key: $WAHA_API_KEY" | jq "{status, timestamps}"
  done

{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}

5-tick stability poll (2s interval, 10s span), same response shape on every tick:

$ for i in 1 2 3 4 5; do
    curl -s "http://127.0.0.1:3004/api/sessions/session1" \
      -H "X-Api-Key: $WAHA_API_KEY" | jq -c "{status, timestamps}"
    sleep 2
  done

{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}
{"status":"WORKING","timestamps":{}}

WAHA is otherwise healthy in this state — WebhookSender is firing 200 OKs for engine.event, group.v2.participants, presence.update on every session, and the container has been running 2+ days without restart.
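For scripted checks, the two observed shapes (populated epoch-milliseconds value vs. absent field) can be told apart from the string that `jq -r .timestamps.activity` prints, which is `null` when the field is missing. The following is a minimal sketch; the function name `activity_state` and the freshness threshold are our own assumptions, not anything defined by WAHA.

```shell
# Classify the value printed by `jq -r .timestamps.activity`.
# Prints one of: absent, stale, fresh.
#   $1 = activity value (epoch milliseconds; "null" or empty when missing)
#   $2 = current time in epoch milliseconds
#   $3 = maximum age in milliseconds that still counts as fresh (assumed threshold)
activity_state() {
  local activity=$1 now_ms=$2 max_age_ms=$3
  case $activity in
    ''|null|*[!0-9]*)
      # Missing, or not a plain non-negative integer: treat as absent.
      echo absent
      return ;;
  esac
  if [ $(( now_ms - activity )) -le "$max_age_ms" ]; then
    echo fresh
  else
    echo stale
  fi
}
```

A caller would pass the current time as `$(( $(date +%s) * 1000 ))`. In the capture above, every session would classify as `absent`.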

Question for WAHA maintainers: Under what conditions is timestamps.activity populated vs. absent on the NOWEB engine? Is the field intended to be authoritative for "session has recent WhatsApp activity"? If yes, why does it disappear from the response while the session continues processing events? If no, what is the intended diagnostic field consumers should use to distinguish a real disconnect from a transient status flap?

Expected behavior

  1. /api/sessions/{name} GET should NOT report status: "STOPPED" or "STARTING" for a session that is actively receiving webhook traffic (verifiable in docker logs waha-noweb at the time of the GET).
  2. If timestamps.activity is the intended liveness signal, it should be populated consistently — at least whenever the session has produced a webhook event in the last few seconds.
  3. Ideally, the worker-state-machine transitions should be invisible to the external HTTP surface — /api/sessions/{name} should return WORKING continuously for a session that is connected, regardless of internal worker reassignment / re-init activity.

Workaround (consumer-side, shipped 2026-05-12)

Because we can't gate downstream consumers on a field that intermittently disappears, we built a defensive 2-consecutive-read state machine in our health-check and session-state-poller layers. It treats STARTING and STOPPED reads as suspect (rather than actionable) when timestamps.activity IS present and fresh, and falls through to existing fail-loud behavior when the field is absent. The workaround is documented in our v3.28 phase 249 (private repo; commit refs withheld).

The workaround does not address the root cause; it only prevents our consumers from reacting to the false reads when the diagnostic field happens to be populated. With Pattern C (intermittent absence of timestamps.activity) being more pervasive than we initially modeled, the workaround currently provides no suppression on the live container — the false-status reads still fall through to fail-loud, just as they did pre-workaround.
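The decision logic can be illustrated roughly as follows. This is a simplified sketch rather than our shipped code (which is private): the function name `judge_read`, the verdict names, and the exact "act on the second consecutive suspect read" rule are assumptions made for illustration. The caller is assumed to have already classified timestamps.activity as fresh, stale, or absent.

```shell
# Decide how to treat one poll result.
#   $1 = status from GET /api/sessions/{name}
#   $2 = activity state decided by the caller: fresh, stale, or absent
#   $3 = number of consecutive suspect reads seen so far
# Prints "<verdict> <new-suspect-count>"; verdict is ok, suspect, or fail.
judge_read() {
  local status=$1 activity=$2 suspects=$3
  case $status in
    WORKING)
      echo "ok 0" ;;
    STARTING|STOPPED)
      if [ "$activity" = fresh ] && [ "$suspects" -lt 1 ]; then
        # First non-WORKING read while activity is fresh: hold off and re-poll.
        echo "suspect $(( suspects + 1 ))"
      else
        # Second consecutive read, or no usable activity signal: fail loud.
        echo "fail 0"
      fi ;;
    *)
      # Any other status keeps the existing fail-loud behavior.
      echo "fail 0" ;;
  esac
}
```

The `absent` branch is why Pattern C defeats the workaround: `judge_read STARTING absent 0` yields `fail 0` on the very first read, exactly as it did before the state machine existed.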

Suggested fixes (in priority order)

  1. Stabilize the /api/sessions/{name} status field during worker-state-machine transitions. Cache the most recent WORKING status for ~5-10s if NOWEB internal state is in flux but the WhatsApp connection itself hasn't dropped (verifiable by lastSeen on the underlying socket).
  2. Make timestamps.activity consistently populated on the NOWEB engine. The field is documented in the OpenAPI schema; consumers reasonably assume it's always present once a session is past STARTING. If there's a maintenance cycle that empties the timestamps object (e.g., NOWEB store cleanup, internal worker reassignment), preserve the activity timestamp across that cycle.
  3. Document the intended liveness signal. If timestamps.activity is not the authoritative field, document which field downstream consumers should use to distinguish a real disconnect from a worker-state-machine flap. We considered assignedWorker and engine.engine: assignedWorker was empty across 40/40 samples in our investigation, and engine.engine is static and doesn't carry liveness.
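To make fix 1 concrete, the hold-down idea can be expressed as a pure function. This is only an illustration of the intended behavior, written in shell to match the reproducers; it does not reflect how WAHA is structured internally. The function name `stabilized_status` and the `socket_up` input are hypothetical, the latter standing in for whatever check confirms the WhatsApp connection has not dropped.

```shell
# Print the status that would be exposed over HTTP.
#   $1 = raw internal status
#   $2 = epoch ms of the most recent raw WORKING observation
#   $3 = current epoch ms
#   $4 = hold window in ms (fix 1 suggests roughly 5000-10000)
#   $5 = "yes" if the underlying WhatsApp socket is still connected (hypothetical input)
stabilized_status() {
  local raw=$1 last_working_ms=$2 now_ms=$3 hold_ms=$4 socket_up=$5
  case $raw in
    STARTING|STOPPED)
      if [ "$socket_up" = yes ] &&
         [ $(( now_ms - last_working_ms )) -le "$hold_ms" ]; then
        # Internal state is in flux but the connection is up: keep reporting WORKING.
        echo WORKING
        return
      fi ;;
  esac
  # Outside the hold window, socket down, or any other status: report as-is.
  echo "$raw"
}
```

The hold window bounds how long a real stop could be masked, so a genuine disconnect would still surface once the socket check fails or the window expires.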

Additional context

  • We've maintained a private patch system for unrelated firewall behavior since 2025-Q4. We deliberately chose not to patch the NOWEB engine for this issue, because the diagnostic-field-disappearing pattern (Pattern C) suggests the symptom space is larger than a single state-machine boundary and would require a broader fix that belongs upstream.
  • Happy to provide more capture data, longer reproducer transcripts, or test against a candidate fix if helpful.

Reproducer environment metadata

WAHA image:  devlikeapro/waha-plus:noweb
WAHA sha:    bc9689e1c93c777cee7d37973e22106a5f813d546851a4f8a7cf15dd008e3e7a
Engine:      NOWEB
Container:   running, 2+ days uptime at time of capture
Host OS:     Ubuntu (consumer host)
Sessions:    4 long-lived, paired
Webhook:     Continuous activity ≥ 1 event/sec on the affected session

Severity rationale

MEDIUM, not HIGH:

  • No data loss — webhooks still deliver, messages still send.
  • No session drop — the affected sessions are paired and connected throughout.
  • Downstream consumers can defend — though the Pattern C field-disappearance reduces the defense's effectiveness in practice.

Not LOW:

  • Causes false recovery actions (spurious POST /api/sessions/{name}/restart) in fail-loud consumers.
  • Produces user-visible status flicker in admin UIs (e.g., "session showing STOPPED" alerts that resolve on their own).
  • Creates noise in alerting pipelines (Health check DEGRADED warnings that fire and then resolve), training operators to ignore alerts.
  • Pattern C (intermittently-empty timestamps.activity) means even consumers willing to invest in defensive code have no reliable diagnostic to lean on.

patron:PLUS
