Skip to content

fix(fbmessenger): don't collapse multiple attachments to one row#406

Open
njt wants to merge 2 commits into
kenn-io:mainfrom
njt:fix/fbmessenger-attachment-dedup
Open

fix(fbmessenger): don't collapse multiple attachments to one row#406
njt wants to merge 2 commits into
kenn-io:mainfrom
njt:fix/fbmessenger-attachment-dedup

Conversation

@njt

@njt njt commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Facebook Messenger messages that carry multiple attachments without downloaded bytes (files missing from the DYI export, or no attachments dir configured) were recording only one attachment row — the rest were silently dropped.

UpsertAttachment dedupes rows with an empty content_hash to one per message. These link/missing attachments now get a stable synthetic content_hash derived from the export-relative URI, so siblings coexist and re-imports stay idempotent. storage_path stays empty, so no bytes are implied and file-cleanup paths are unaffected.

🤖 Generated with Claude Code

…e row

UpsertAttachment dedupes attachments with an empty content_hash to a single
row per message (SELECT-then-insert on message_id). The fbmessenger importer
stored an empty content_hash whenever an attachment file was missing or no
attachments dir was configured, so a Messenger message with several photos
whose files were absent recorded only ONE attachment row — the rest were
silently dropped.

Give hashless attachments a stable synthetic content_hash (sha256 of the
export-relative URI, not file bytes) so siblings coexist and re-imports stay
idempotent. storage_path stays empty, so no stored content is implied and the
file-cleanup paths (which filter on non-empty storage_path) are unaffected.

Audit of the other UpsertAttachment callers found no further instances: gmail
and the generic ingest path always carry a real MIME content hash, synctechsms
always hashes the part bytes, whatsapp stores at most one attachment per
message, and teams already uses a synthetic link hash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (7cab66b)

Summary verdict: One medium issue remains; no high or critical findings were reported.

Medium

  • internal/fbmessenger/importer.go:701: Missing, rejected, or no-directory attachments now store a 64-character hex value in content_hash even though it is not a real byte hash and storage_path is empty. Existing consumers treat valid-looking SHA-256 values as exportable content, so show-message --json, export-attachment, export-attachments, and MCP attachment reads may advertise or attempt to fetch files that were never stored.
    • Fix: Keep content_hash reserved for real content hashes and use a distinguishable synthetic key for idempotency, such as a prefixed non-SHA value, or add a separate store-level identity for hashless attachments.

Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 2m24s), codex_security (codex/security, done, 56s) | Total: 3m31s

@njt

njt commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Split out of #398 (Teams ingestion) to keep that PR scoped. Found via an audit of UpsertAttachment's hashless-attachment dedup path. Independent of #398 — branches off current main.

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

looking at this

Messenger imports still need a stable per-attachment identity when files are missing, rejected, or skipped because no attachments directory is configured. Without that identity, the store's empty-hash fallback collapses multiple hashless attachments on the same message into one row.

The previous fallback identity looked exactly like a SHA-256 content hash even though no bytes were stored. Prefix the synthetic key so JSON output and export flows cannot confuse it with content-addressed attachment data, while preserving idempotent re-import behavior.

Validation: focused Messenger attachment tests were run red/green; the new assertions failed against bare 64-hex fallback keys before the importer change and passed after prefixing them.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (0da95ca)

Medium issue found; security review found no additional issues.

Medium

  • internal/fbmessenger/importer.go:701: Re-importing a Facebook export previously imported by older code can leave the legacy empty-content_hash attachment row in place while inserting the new synthetic-key row beside it. Because the new non-empty key no longer hits UpsertAttachment’s empty-hash dedupe path, upgrade idempotency breaks and attachment rows/counts can be inflated for missing, rejected, or no-attachment-dir imports.

    Suggested fix: When writing synthetic Facebook attachment keys, migrate or delete existing empty-hash attachment rows for that message before inserting synthetic rows. Add a regression test that seeds a legacy empty-hash row before re-import.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 2m52s), codex_security (codex/security, done, 13s) | Total: 3m13s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants