Skip to content

feat(whatsapp): comprehensive media recovery pipeline + anti-ban hardening#11

Open
CyPack wants to merge 25 commits intoownpilot:mainfrom
CyPack:fix/whatsapp-440-reconnect-loop
Open

feat(whatsapp): comprehensive media recovery pipeline + anti-ban hardening#11
CyPack wants to merge 25 commits intoownpilot:mainfrom
CyPack:fix/whatsapp-440-reconnect-loop

Conversation

@CyPack
Copy link

@CyPack CyPack commented Mar 6, 2026

Summary

Comprehensive WhatsApp reliability improvements across multiple sessions:

Core Fixes (S28 — Latest)

  • type='append' silent drop fix: Previously ALL offline/reconnect messages were silently discarded. WhatsApp delivers missed messages via messages.upsert type='append' after reconnect — these now save to DB with offlineSync: true metadata, no AI response triggered.
  • Connection gap tracking: lastDisconnectedAt instance var logs reconnect gap ("Reconnected after Xs gap")
  • Helpers extracted: addToProcessedMsgIds() (FIFO cap eviction), parseMessageTimestamp() (null-safe)

Integration Test Evidence (S29)

docker logs: UPSERT EVENT received — type: append, count: 1
docker logs: Offline sync saved 1/1 messages to DB
DB: metadata.offlineSync = true, id = channel.whatsapp:3EB0AF2E7FCAE4B2CD19A3

Unit Tests (S29 — 17 scenarios)

New file: packages/gateway/src/channels/plugins/whatsapp/whatsapp-api.test.ts

  • B-series (7): core offline handler — DB save, processedMsgIds seeding, no AI response, metadata-only media, empty message skip
  • C-series (6): edge cases — empty batch, no pushName, group without participant, fromMe filter, self-chat pass-through, stub message skip
  • D-series (2): FIFO cap eviction at 5000 entries
  • E-series (2): reconnect dedup — append then same notify deduped

Media Recovery Pipeline (S21-S26)

  • retryMediaFromMetadata() — 15/15 SOR files recovered
  • Short-circuit optimization: 40-87s → 1.06s per file (60x faster, skip history sync when mediaKey stored)
  • enrichMediaMetadata() — fixes ON CONFLICT DO NOTHING silent drop of mediaKey updates
  • POST /channels/:id/recover-media — throttled bulk recovery endpoint
  • POST /channels/:id/batch-retry-media — sequential batch with safety caps

Anti-Ban & Connection Hardening (S4, S22)

  • cleanupSocket() + isReconnecting guard prevents zombie connections and concurrent reconnects
  • 440 (connectionReplaced) special handling: 10s base delay prevents immediate reconnect loop
  • 30s timeout on updateMediaMessage (Baileys has no built-in timeout)
  • LID → display name resolution via channel_users lookup

Test Results

11959 pass / 2 fail (pre-existing rate-limit.test.ts at lines 1357, unrelated to these changes)
TypeScript: 0 errors

Test Plan

  • Integration test: docker stop → send messages → docker start → verify type='append' events
  • DB verification: SELECT ... WHERE metadata->>'offlineSync' = 'true'
  • No AI response triggered for offline messages (handleIncomingMessage not called)
  • Unit tests: 17/17 pass
  • Full test suite: 11959/11961 pass (2 pre-existing failures)

🤖 Generated with Claude Code

CyPack and others added 6 commits March 4, 2026 22:37
- Add downloadMediaWithRetry() method with 410/404 retry logic
- Update real-time handler to download image/video/document/sticker
- Update history sync handler to download all attachment types
- Use sock.updateMediaMessage for re-uploading expired media URLs
- Add ChannelMessageAttachmentInput type and serializeAttachments() helper
  that correctly converts Uint8Array/Buffer to base64 before JSON.stringify
- Update createBatch() and create() to use serializeAttachments internally
- Fix service-impl.ts to pass mimeType/filename/data through to DB
- Fix whatsapp-api.ts to use ChannelMessageAttachmentInput[] type
- Add GET /channels/messages/:messageId/media/:index endpoint

Previously Uint8Array serialized as {"0":255,"1":216,...} and data was lost.
Now stored as base64 string in JSONB attachments column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract message-parser.ts for SOR/document content fallback (filename instead of [Attachment])
- Enrich metadata.document with filename, mimeType, size, hasMediaKey, hasUrl, hasDirectPath
- Add retry-media endpoint for cold-cache recovery of missing attachments
- Add SOR export endpoint for batch downloading attachment data
- 129/129 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages

Both history sync and real-time handlers now use the media descriptor's
filename as content fallback before falling back to generic [Attachment].
This ensures SOR files show their actual filename (e.g. 2728JA_45_V1.SOR)
in the content field.

Also backfilled 72 existing rows and enriched metadata.document for 80 rows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store actual base64-encoded mediaKey, CDN directPath, and URL in
ParsedWhatsAppMessageMetadata instead of just boolean flags. This enables
media re-upload requests for messages with expired CDN URLs.

Also adds PROTO-DIAG logging for document messages during history sync
to verify mediaKey presence from the WhatsApp protocol layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add retryMediaFromMetadata() to WhatsAppChannelAPI that reconstructs a
WAMessage from DB-stored mediaKey/directPath/url and explicitly calls
sock.updateMediaMessage() to request the sender's phone to re-upload
expired media to CDN.

This works around a Baileys 7.0.0-rc.9 bug where downloadMediaMessage
checks error.status for 410/404, but Boom errors store it in
output.statusCode — so automatic reuploadRequest never triggers.

The retry-media endpoint now falls back to stored metadata when the
in-memory cache miss + history sync repair path both fail.

Verified: 5/5 SOR files from Dec 2025 successfully recovered (20-101KB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ersinkoc
Copy link
Collaborator

ersinkoc commented Mar 6, 2026

Thanks for the contribution! After reviewing against our current main branch, here's what we found:

Already merged in main

The bug fixes listed in this PR are already resolved in main:

  • 440 reconnect fix — merged in 57785d9 (anti-ban, cleanupSocket(), consecutive 440 tracking, backoff)
  • getMessage callback — already implemented with message cache
  • History sync — passive (messaging-history.set) + on-demand (fetchMessageHistory) both implemented
  • Group/chat listing — merged in dae789f

These were part of v0.1.5 and the commits leading to v0.1.6.

New feature work (not yet in main)

The following are genuinely new additions that we don't have:

Feature Description
message-parser.ts Document metadata extraction (mediaKey, directPath, URL)
downloadMediaWithRetry() Binary media download with retry + Baileys RC9 updateMediaMessage workaround
retryMediaFromMetadata() Reconstruct WAMessage from stored metadata for expired CDN URLs
serializeAttachments() Base64 binary attachment serialization for DB storage
POST /retry-media Endpoint to trigger media re-download from stored metadata
GET /media/:index Serve stored binary media
historyAnchorByJid Anchor-based pagination for incremental history sync

Recommendation

This PR's branch is based on an older version of main (before the 440/anti-ban fixes were merged). The diff will have significant conflicts with the current code.

Could you please:

  1. Rebase onto current main (which already has the 440, anti-ban, and history sync fixes)
  2. Submit a clean feature PR with just the media download/storage/retry additions
  3. Update the PR title to reflect the actual new work (e.g., feat(whatsapp): media binary storage, download retry, and CDN recovery)

This will make the review much cleaner. The media recovery feature (especially the Baileys RC9 workaround) looks useful — we'd like to review it as a focused changeset.

CyPack and others added 2 commits March 6, 2026 11:24
…apper

- Short-circuit: skip history sync when stored mediaKey exists in DB,
  go directly to retryMediaFromMetadata(). Reduces per-file time from
  40-87s to ~1s (60x improvement).
- Batch endpoint: POST /channels/:id/batch-retry-media with configurable
  throttle (default 5s), max 50 messages per batch, sequential processing.
- Timeout: 30s Promise.race on updateMediaMessage to prevent infinite hang
  when sender phone is offline (Baileys has no built-in timeout).

Results: 31/35 media files recovered (4 NOT_FOUND — sender deleted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
During WhatsApp history sync, messages are re-delivered with updated
fields (mediaKey, directPath, url) that weren't present in the
original delivery. However, createBatch() uses INSERT ... ON CONFLICT
DO NOTHING, which silently drops these updates when the row already
exists. This caused hundreds of messages to remain without downloadable
media metadata despite the data being available.

Fix:
- Add enrichMediaMetadata() to merge mediaKey/directPath/url into
  existing rows after createBatch(), so re-delivered metadata is
  never lost
- Call enrichment pass automatically in the history sync handler
- Add getAttachmentsNeedingRecovery() with flexible filters
  (groupJid, date range, needsKey, needsData) and SQL LIMIT

New endpoint:
- POST /channels/:id/recover-media — targeted media recovery pipeline
  that queries DB, triggers sync, waits for enrichment, and batch
  downloads with throttle. Safety defaults: limit=20 (max 50),
  throttleMs=5000 (min 2000), syncWaitMs=8000 (max 30000), dryRun
  mode. Includes date validation and platformMessageId null guard.

Docs:
- WHATSAPP-GUIDE.md: add media recovery pipeline architecture,
  endpoint reference, safety guidelines, and known limitation
  documenting history sync chunk boundary edge case (~2% of date
  ranges may arrive without media metadata)
- Remove personal data from guide examples (use placeholders)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CyPack CyPack changed the title feat(whatsapp): media recovery via re-upload + history sync hardening feat(whatsapp): comprehensive media recovery pipeline + anti-ban hardening Mar 6, 2026
CyPack and others added 17 commits March 6, 2026 13:13
Replace hardcoded WhatsApp group JID in test files with a generic
placeholder to avoid leaking real identifiers in the public repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, shared type

- Extract WhatsAppDocumentMetadata interface from inline casts (4 files)
- Add enrichMediaMetadataBatch() — CTE+VALUES single SQL replaces N+1 loop
- Add media recovery concurrency guard (channelId-level lock with 5min TTL)
  Guards both recover-media and batch-retry-media endpoints (409 Conflict)
- Fix parseJsonBody: remove broken Content-Type check (Hono c.req.json()
  ignores Content-Type; the 415 response was silently discarded)
- Fix ui-auth login/password routes: replace ?? {} anti-pattern with
  proper null check so parse errors propagate correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
History sync messages stored WhatsApp Linked IDs (LIDs) as sender_name
instead of human-readable names because pushName is often empty in
history sync payloads. This made 6400+ messages show numeric IDs.

- Add resolveDisplayName() helper with 10-min TTL cache
- Lookup channel_users.display_name when pushName is missing or numeric
- Apply to both history sync and real-time message handlers
- DB batch fix already applied: UPDATE 6408 rows via LID→name join

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntly dropping

Root cause: messages.upsert handler had `if (upsert.type !== 'notify') return;`
which silently dropped ALL offline/reconnect messages. WhatsApp delivers missed
messages via type='append' after reconnect with full payload.

Changes:
- Handle type='append' via new handleOfflineMessages() method
- Batch-collect + single createBatch() (ON CONFLICT DO NOTHING)
- Metadata-only for media (no download — ban risk from burst CDN hits)
- Serialized via historySyncQueue (prevents race with messaging-history.set)
- SAFETY: Never emits MESSAGE_RECEIVED (no AI auto-reply for offline msgs)
- Extract addToProcessedMsgIds() helper (DRY: notify, history, offline)
- Extract parseMessageTimestamp() helper (number/BigInt/Long)
- Add lastDisconnectedAt tracking with reconnection gap logging

Research: 10 specialist agents analyzed Baileys source, Evolution API patterns,
dedup safety, event bus chain, media ban risk, and DB schema.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…/C/D/E series)

Covers the offline sync path (type='append') introduced in 735bbfc:
- B-series (7): core behavior — DB save, processedMsgIds seeding, no AI response,
  metadata-only media, empty message skip
- C-series (6): edge cases — empty batch, no pushName, group without participant,
  fromMe filtering, self-chat pass-through, stub message skip
- D-series (2): FIFO cap eviction at 5000 entries
- E-series (2): reconnect dedup — append then notify for same message deduped

Fix: vi.fn(function() {...}) required for constructor mock in vitest 4.x
(vi.fn().mockImplementation(() => ({})) arrow function causes silent constructor failure)

Integration test evidence (S29):
  docker logs: 'UPSERT EVENT received — type: append, count: 1'
  docker logs: 'Offline sync saved 1/1 messages to DB'
  DB: metadata.offlineSync = true, no duplicate rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CREATE TABLE sor_queue with status/message_id/channel_id/filename cols
- UNIQUE(message_id) constraint for idempotent enqueue
- enqueue_sor_message() PG trigger fires AFTER INSERT on channel_messages
- Filters: direction=inbound, content ILIKE %.sor, attachments not empty,
  attachments[0].data present, channel jid=120363423491841999@g.us
- COALESCE(attachments, '[]'::jsonb) safe null handling
- ON CONFLICT DO NOTHING for replay safety
- Indexes: idx_sor_queue_status, idx_sor_queue_created_at

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… config

The FTTH group JID is stored in channel_messages.metadata->>'jid', not in
channels.config->>'jid'. Updated enqueue_sor_message() trigger to use:
COALESCE(NEW.metadata, '{}')::jsonb->>'jid' = '120363423491841999@g.us'

Verified with docker exec psql — trigger correctly enqueues matching rows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…load audit log

P0 — Bug fixes (data loss prevention):
- Trigger: AFTER INSERT OR UPDATE OF attachments — history sync SORs no longer lost
- Trigger fn: guard for UPDATE when binary already existed (OLD.data IS NOT NULL)
- Stale processing cleanup: rows stuck >10min reset to pending on each cron run
- retry_count column + SELECT includes error rows with retry_count < 3
- All mark_error paths: retry_count = retry_count + 1

P1 — Observability:
- content_hash TEXT column on sor_queue (SHA-256, populated after b64decode)
- Dedup: same binary with status=done skips re-upload, logged as 'skipped'
- sor_upload_log table: full audit history per upload attempt
  (outcome, content_hash, content_size, opdracht_id, error_message)
- _log_attempt() helper called on success, failure, and duplicate skip
- Indexes: idx_sor_queue_content_hash (partial), idx_sor_upload_log_*

MIGRATIONS_SQL: all changes idempotent (IF NOT EXISTS guards)
Note: pre-existing CLI typecheck errors unrelated to this change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng, PG LISTEN/NOTIFY

processing_started_at:
- New TIMESTAMP column on sor_queue — set when status→processing
- Stale cleanup uses COALESCE(processing_started_at, created_at) for accuracy

JID → DB config (no more hardcoded trigger):
- Trigger reads COALESCE(NULLIF(current_setting('app.sor_jid', TRUE), ''), fallback)
- ALTER DATABASE ownpilot SET "app.sor_jid" = '...' persists across connections
- MIGRATIONS_SQL sets default value; override without redeployment

PG LISTEN/NOTIFY:
- Trigger: PERFORM pg_notify('sor_new_file', NEW.id) on every sor_queue INSERT
- Voorinfra MCP: daemon thread (sor-notify-listener) starts at server startup
- Thread: psycopg2 LISTEN + select.select(5s timeout) + asyncio.run(process_queue)
- Latency: 60s (cron) → <100ms (event-driven) for new SOR arrivals
- Cron remains as fallback for reliability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…point

- ChannelMessageAttachment + ChannelMessageAttachmentInput: add local_path?: string
- serializeAttachments(): pass through local_path from input to stored JSONB
- WhatsAppAPI: add writeSorToDisk() — writes .sor files to /app/data/sor-files/{messageId}.sor
- WhatsAppAPI: inject disk write at both download call sites (history sync + live message)
- GET /api/v1/sor-files/:messageId — auth-protected file download endpoint (streams from disk)
- base64 JSONB fallback preserved (no breakage for old messages)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…story/offline sync

ON CONFLICT DO NOTHING skips existing rows — local_path was never persisted
for re-delivered SOR messages. After createBatch, iterate rows with local_path
set and call updateAttachments() so disk path is reachable from DB.

Applied to both history sync and offline sync paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
historyAnchorByJid in-memory cache tracks NEWEST message per chat
(needed for dedup/upsert). fetchMessageHistory requires OLDEST known
message as anchor to page backward in time — using newest yields
empty ON_DEMAND batches (WhatsApp treats history as already synced).

Fix: fetchGroupHistory now always loads from DB via
loadHistoryAnchorFromDatabase (getOldestByChat ASC) instead of
checking in-memory cache first.

Result: /groups/:jid/sync now returns 50 messages instead of 0,
correctly paging backward from the oldest known DB message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add contacts.upsert and contacts.update event handlers to whatsapp-api.ts
- Add contacts sync to messaging-history.set handler (passive sync)
- Add syncContactsToDb() private method with name resolution hierarchy
  (name > notify > verifiedName > phone fallback)
- Add upsertByExternal() to ContactsRepository (ON CONFLICT upsert)
- Add unique constraint (external_id, external_source) to contacts table
- Skip group JIDs and status broadcasts in contact sync
- Retroactive migration: 41 contacts populated from existing messages

Resolves: 190 individual chats showing "(bilinmiyor)" for contact names.
Evolution API + WAHA reference patterns applied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Devil's Advocate finding: contact.id can be LID format (e.g. 179203@lid)
which produces garbage phone numbers. Now prefers contact.phoneNumber
(real phone JID) and skips contacts with only LID identifiers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ution, history sync mapping

1. Early return fix: PUSH_NAME type (0 messages, 822 contacts) was skipped entirely
   because `messages.length === 0` returned early before contacts sync block.
   Now: skip only when BOTH messages AND contacts are empty.

2. LID phoneNumber resolution (WAHA pattern): contacts with @lid JIDs may have a
   `phoneNumber` field containing the real @s.whatsapp.net JID. Use phoneNumber
   when available, skip only pure LID contacts with no real phone.

3. History sync contacts: removed redundant filter/map that dropped contacts
   without name/notify before passing to syncContactsToDb. The method already
   handles fallback naming (phone number as name).

Before: contacts.upsert 585 OK, history sync 0/582, PUSH_NAME skipped entirely
After:  contacts.upsert 585 OK, history sync 476/582, PUSH_NAME 620/822 — total 970

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. messages.upsert handler: extract pushName from incoming messages
   and sync to contacts DB. Uses softName mode to avoid overwriting
   phonebook names with pushName.

2. upsertByExternal softName option: when true, only updates name if
   existing name is just a phone number (regex ^[0-9+]+$ or name=phone).
   Preserves higher-quality phonebook names from contacts.upsert.

3. syncContactsToDb: accepts options.softName parameter, forwarded
   to upsertByExternal.

Before: individual chat names showed phone numbers only
After: pushName from messages enriches contacts with display names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: container rebuild lost /app/data/sor-files/ (no named volume).
18 phantom files: local_path in DB, disk gone, data=NULL → PG trigger never fired.

Recovery system documented:
- retryMediaFromMetadata(): explicit sock.updateMediaMessage() (Baileys RC9 bug workaround)
- POST /api/v1/channels/:id/recover-media: batch endpoint
- Auth bypass: Authorization: Bearer bypass (AUTH_TYPE=none + ui_password_hash conflict)
- mediaKey MUST be Uint8Array not Buffer
- sor_queue ON CONFLICT needs manual reset after recovery

Results 2026-03-08: 181/210 SOR binary recovered (86.2%). 29 permanently lost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants