
feat: Microsoft Teams ingestion (delegated Graph sync) #398

Open
njt wants to merge 40 commits into kenn-io:main from njt:feat/teams-ingestion

Conversation

@njt

njt commented Jun 20, 2026

Contributor

What

Sync your own Microsoft Teams 1:1/group/meeting chats and channel messages into msgvault via delegated Microsoft Graph, searchable alongside mail through the existing TUI / FTS / Parquet analytics.

Highlights

  • New add-teams (delegated Graph OAuth) and sync-teams (full + incremental, with streamed per-conversation progress) commands; Teams also runs under serve scheduled syncs — and the daemon now syncs all source types on an identifier (so Teams + Outlook/IMAP on one address both run).
  • Reuses the existing chat schema — no new core tables: chats → direct/group conversations, channels → channel conversations with root+reply threading, plus reactions, sender + recipient (to) + @mention rows, identity resolution (AAD object id → email dedup, unifying with mail identities), inline images downloaded to content-addressed storage, and shared-file links recorded.
  • Incremental sync: chats via lastModifiedDateTime list filtering (no delegated per-chat delta endpoint exists), channels via /messages/delta; per-conversation cursors persisted in sync_runs.cursor_after, flushed after each conversation so an interrupted long backfill resumes mid-stream.
  • Microsoft OAuth kept independent from the existing Outlook/IMAP manager — separate teams_<email>.json token with Graph scopes only, so IMAP and Teams can each be used alone or together.

Use

  1. Register an Entra app with delegated Graph permissions (Chat.Read, ChannelMessage.Read.All, Team.ReadBasic.All, Channel.ReadBasic.All, User.Read) and grant admin consent.
  2. Add to config.toml:
    [microsoft]
    client_id = "<app-client-id>"
    tenant_id = "<directory-tenant-id>"
  3. msgvault add-teams you@tenant.com then msgvault sync-teams you@tenant.com (--no-channels / --limit for scoped runs). Press a inside msgvault tui to filter to the Teams account.

Notes

  • Validated on a live tenant: ~313k messages spanning 2017–2026 (chats, channels, reactions, threaded replies) — confirms full-history backfill beyond Graph's 8-month delta window.
  • Meeting transcripts are intentionally out of scope (separate spec; delegated transcript access is effectively organizer-only). The only remaining follow-up is preserving a departing user's shared SharePoint/OneDrive files before account removal.

🤖 Generated with Claude Code

Nat Torkington and others added 29 commits June 18, 2026 15:54
Delegated Graph delta-sync of the user's own Teams chats, channel
messages, meeting chats, and transcripts into the existing generic
chat schema. IMAP and Teams OAuth kept independent via separate token
files and incremental consent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Validated Graph contracts against Microsoft Learn docs. Corrected the
spec where the original architecture was falsified:
- no delegated per-chat delta -> chats use list + lastModifiedDateTime filter
- channel delta caps at ~8 months -> backfill via list + replies
- no chat->meeting nav -> resolve transcripts via joinWebUrl filter
- identities carry AAD id, not email -> /users/{id} resolution step

Recorded 3 residual risks (LB-1/2/3) that need a live-tenant spike, plus
kata t56j (spike) and 11te (build).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Transcripts moved out of the Teams ingestion build (organizer-only
content + expired-meeting gaps make delegated coverage thin). New stub
spec 2026-06-18-teams-transcripts-design.md (kata hsww). Main spec now
covers chats + channels + meeting chats only; LB-3 deferred with it.
Residual risks for this build narrowed to LB-1/LB-2, probed live before
planning per decision 2026-06-18.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Graph Explorer probe (delegated, 2026-06-19) confirmed: per-chat list
serves history beyond the 8-month delta window (LB-1), and channel
/messages/delta returns a deltaLink under delegated ChannelMessage.Read.All
(LB-2). Channel enumeration needs Channel.ReadBasic.All. Closes kata t56j.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…ing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
15-task TDD plan: delegated Graph OAuth manager, REST client with
paging + Retry-After, participant resolution, message mapping, sync-state
cursors, chats + channels importer, inline images, add-teams/sync-teams
commands, scheduler integration. Built on load-bearing-verified contracts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add GraphManager, a sibling of the IMAP Manager that runs the same
interactive browser auth-code flow but requests Microsoft Graph scopes
and persists tokens under a teams_<email>.json prefix. No IMAP scope
validation or IMAP-host logic. Reuses Manager's browser-flow and
ID-token verification via an internal delegate; token storage uses the
same tokenFile JSON format. IMAP Manager behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Adds internal/teams/client.go: a minimal Microsoft Graph REST client
with injectable base URL + token function (for httptest), token-bucket
rate limiting via golang.org/x/time/rate, @odata.nextLink paging, and
Retry-After / exponential back-off on 429 and 5xx responses.

Tests cover paged responses (two pages → deltaLink) and Retry-After
retry (429 → success in exactly 2 calls).
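
A minimal sketch of the paging and retry behaviour this commit describes, not the code in internal/teams/client.go. The `page` type, the `fetch` hook (standing in for the authenticated GET), and `retryDelay` are hypothetical names introduced for illustration. Only the `@odata.nextLink` / `@odata.deltaLink` envelope and the Retry-After header come from the description above.

```go
package main

import (
	"encoding/json"
	"strconv"
	"time"
)

// page mirrors the envelope Graph wraps around list and delta responses.
type page[T any] struct {
	Value     []T    `json:"value"`
	NextLink  string `json:"@odata.nextLink"`
	DeltaLink string `json:"@odata.deltaLink"`
}

// pageThrough follows @odata.nextLink until the collection is exhausted and
// returns every item plus the final @odata.deltaLink, if the server sent one.
// fetch is a hypothetical hook that performs the authenticated GET.
func pageThrough[T any](first string, fetch func(url string) ([]byte, error)) ([]T, string, error) {
	var all []T
	for url := first; url != ""; {
		body, err := fetch(url)
		if err != nil {
			return all, "", err
		}
		var p page[T]
		if err := json.Unmarshal(body, &p); err != nil {
			return all, "", err
		}
		all = append(all, p.Value...)
		if p.NextLink == "" {
			return all, p.DeltaLink, nil
		}
		url = p.NextLink
	}
	return all, "", nil
}

// retryDelay picks the wait before retrying a 429 or 5xx response: the
// Retry-After value (in seconds) when present, otherwise exponential back-off.
func retryDelay(retryAfter string, attempt int) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	return time.Second << attempt // 1s, 2s, 4s, ...
}
```

Passing the fetch step in as a function lets the paging loop be exercised without a network, in the same spirit as the injectable base URL and token function mentioned above.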

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…ncel test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add mapping.go with mapMessage, htmlToText (delegates to mime.StripHTML),
snippet, and conversationType helpers. Add mapping_test.go with full
testify coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Replace non-generic getAllPages with generic pageThrough[T], giving a
single paging implementation. Add ListChats, ListJoinedTeams,
ListChannels, ListChatMessages, ChannelMessagesDelta,
ListChannelMessages, and ListReplies on top of the shared helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Implements Task 9: orchestration core for the CHATS path.

- NewImporter/Import: source creation, sync state loading, sync lifecycle (StartSync/CompleteSync/FailSync)
- syncChats: lists chats, ensures conversations, fetches messages, advances per-chat cursor in SyncState
- persistMessage: upserts message+body+raw+FTS, handles reactions and reference attachments, marks deleted messages
- identityOf: extracts user or application identity from IdentitySet
- syncChannels stub (returns nil; filled by Task 10)
- TestImportChatsEndToEnd: fake HTTP server, full import run, asserts ChatsProcessed=1 MessagesAdded=1 and DB COUNT=1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
- Add store.SetReplyTo to link channel reply messages to their parent
  by resolving source_message_id → internal id within the same source
- Add syncChannels: first-run backfill (ListChannelMessages + ListReplies)
  followed by delta-cursor priming; subsequent runs use stored deltaLink
  with 400/410 fallback to full re-page + re-prime
- Add persistChannelMessages helper that delegates to persistMessage and
  calls SetReplyTo when ReplyToID is set
- Add TestImportChannelsEndToEnd end-to-end test with fake Graph server

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Detect Graph hostedContents $value URLs in Teams HTML message bodies,
download each image via the client, and store in content-addressed storage.
InlineImagesCopied is incremented per image stored.
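
As an illustration of the detection and storage steps, here is a rough sketch under stated assumptions. The regular expression, the function names, and the choice of SHA-256 as the content address are all assumptions; the commit only says that hostedContents `$value` URLs are detected and that images land in content-addressed storage.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// hostedContentRe matches absolute hostedContents "$value" URLs as they might
// appear in src attributes of Teams HTML bodies (the pattern is an assumption).
var hostedContentRe = regexp.MustCompile(`https://[^"'\s<>]+/hostedContents/[^"'\s<>/]+/\$value`)

// hostedContentURLs returns the distinct hostedContents URLs found in html,
// in first-seen order.
func hostedContentURLs(html string) []string {
	seen := map[string]bool{}
	var out []string
	for _, u := range hostedContentRe.FindAllString(html, -1) {
		if !seen[u] {
			seen[u] = true
			out = append(out, u)
		}
	}
	return out
}

// contentAddress derives the storage key from the bytes themselves, so the
// same image downloaded twice maps to a single stored object.
func contentAddress(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}
```

Keying storage on the content hash is what makes repeated downloads of the same image harmless: the second write resolves to the same key.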

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Adds `add-teams <email>` CLI command to authorize Microsoft Teams (delegated
Graph API) for an account. Mirrors addo365 patterns: EffectiveTenantID,
NewGraphManager, GetOrCreateSource(sourceTypeTeams), UpdateSourceDisplayName,
confirmDefaultIdentity, and both startup/post-source-create migrations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add Teams source dispatch to runScheduledSync (case sourceTypeTeams),
runScheduledTeamsSync helper, and sourceTypeTeams recognition in
findScheduledSyncSource. Guard the post-switch summary log so a nil
summary (Teams path) doesn't deref.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
… typo

Rename misspelled teamsTenatID -> teamsTenantID in add_teams.go (all
three sites: declaration, flag binding, usage). Add Teams quick-command
block to CLAUDE.md. Mark the design spec implemented on
feat/teams-ingestion. Fix testify-helper-check lint in Teams test files
by introducing local assert/require helpers where call counts exceeded
the threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…ision

- opts.Limit is now enforced per-conversation in syncChats (break after
  convCount reaches the cap) and across all channel fetch paths in
  syncChannels (roots + replies + delta priming + fallback re-prime).
- Chat cursor now compares lastModifiedDateTime values as time.Time to
  pick the true max, then formats with RFC3339Nano so the Graph
  $filter=lastModifiedDateTime gt boundary is millisecond-exact rather
  than second-exact.
- Added TestImportChatsLimit: fake server returns 3 chat messages,
  Import with Limit:2 asserts only 2 are persisted.
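
The cursor rule in the second bullet can be sketched roughly as follows. `nextChatCursor` and `chatFilter` are hypothetical helpers, not functions from this PR; the sketch only assumes that cursors are stored as RFC3339Nano strings and that the filter has the `lastModifiedDateTime gt <cursor>` shape quoted above.

```go
package main

import "time"

// nextChatCursor returns the latest lastModifiedDateTime among prev and the
// newly seen values. Values are compared as time.Time, not as strings, and
// the result is formatted with RFC3339Nano to keep sub-second precision.
func nextChatCursor(prev string, modified []string) (string, error) {
	var latest time.Time
	if prev != "" {
		t, err := time.Parse(time.RFC3339Nano, prev)
		if err != nil {
			return "", err
		}
		latest = t
	}
	for _, s := range modified {
		t, err := time.Parse(time.RFC3339Nano, s)
		if err != nil {
			return "", err
		}
		if t.After(latest) {
			latest = t
		}
	}
	if latest.IsZero() {
		return "", nil // nothing seen yet: the next run is a full listing
	}
	return latest.UTC().Format(time.RFC3339Nano), nil
}

// chatFilter builds the $filter expression for the next incremental listing.
func chatFilter(cursor string) string {
	if cursor == "" {
		return ""
	}
	return "lastModifiedDateTime gt " + cursor
}
```

Note that Go's RFC3339Nano layout drops trailing zeros, so a value ending in `.250Z` is written back as `.25Z`; the instant is unchanged.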

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
The tui command has no --account flag; account filtering is done inside
the TUI via the 'a' keybinding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…rrelevant (m5yj)

Refactor syncChannels to collect all messages for a channel first (Phase 0,
deduped by ID), persist them in order up to the per-conversation limit (Phase 1),
then call SetReplyTo for every persisted reply after all messages are in the
store (Phase 2). This prevents the previous inline-SetReplyTo approach from
silently dropping reply links when a delta page returns a reply before its root.

Adds TestReplyBeforeRoot which verified the bug (reply_to_message_id was NULL)
before the fix and confirms it is set correctly after.
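
The three phases can be sketched against an in-memory stand-in for the store. Everything here (`channelMsg`, `memStore`, `syncChannel`) is hypothetical and much simpler than the real importer; the point is only that linking runs after every message has an internal id.

```go
package main

// channelMsg is a hypothetical, trimmed-down channel message.
type channelMsg struct {
	ID        string
	ReplyToID string // empty for root messages
}

// memStore is an in-memory stand-in for the message table.
type memStore struct {
	ids     map[string]int64 // source message id -> internal id
	replyTo map[int64]int64  // internal child id -> internal parent id
}

func newMemStore() *memStore {
	return &memStore{ids: map[string]int64{}, replyTo: map[int64]int64{}}
}

func (s *memStore) upsert(m channelMsg) {
	if _, ok := s.ids[m.ID]; !ok {
		s.ids[m.ID] = int64(len(s.ids) + 1)
	}
}

// setReplyTo resolves both source ids and links child to parent. It reports
// false when either side has not been stored yet.
func (s *memStore) setReplyTo(child, parent string) bool {
	c, okC := s.ids[child]
	p, okP := s.ids[parent]
	if !okC || !okP {
		return false
	}
	s.replyTo[c] = p
	return true
}

// syncChannel returns how many reply links were set.
func syncChannel(s *memStore, pages [][]channelMsg, limit int) int {
	// Phase 0: collect across pages, dedupe by id, keep first-seen order.
	seen := map[string]bool{}
	var all []channelMsg
	for _, pg := range pages {
		for _, m := range pg {
			if !seen[m.ID] {
				seen[m.ID] = true
				all = append(all, m)
			}
		}
	}
	// Phase 1: persist up to the limit (0 means no limit).
	if limit > 0 && len(all) > limit {
		all = all[:limit]
	}
	for _, m := range all {
		s.upsert(m)
	}
	// Phase 2: link replies only after every message is in the store.
	linked := 0
	for _, m := range all {
		if m.ReplyToID != "" && s.setReplyTo(m.ID, m.ReplyToID) {
			linked++
		}
	}
	return linked
}
```

With linking done inline during Phase 1, a reply arriving on an earlier page than its root would find no parent id to resolve, which is the failure the commit describes.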

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
@roborev-ci

roborev-ci Bot commented Jun 20, 2026

roborev: Review Unavailable (e58e9fc)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

Nat Torkington and others added 5 commits June 20, 2026 16:27
Resolves chat members via the Graph /members endpoint and writes
message_recipients rows for 'to' (all members except sender) and
'mention' (explicit @-mentions in the message body). The participant
cache ensures that aadUser-typed mention identities resolve to the
same participant already inserted from the email-keyed members list.

Member-fetch failure is non-fatal: the import continues with empty
toRecips so existing inline-image and other tests are unaffected.
Channel messages pass nil toRecips (no members API for channels).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…low-up)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…s for resume (paa1)

Part A: Add ImportOptions.Progress (optional func(string)) called after each chat
and channel conversation with a concise line ("chat N/M (type): K messages" /
"channel Team / Chan: K messages"). CLI sync-teams wires Progress: fmt.Println
so the user sees live output; serve.go is unchanged (nil = silent).

Part B: Thread syncID into syncChats/syncChannels and call UpdateSyncCheckpoint
after finishing each conversation (flushing the current SyncState JSON into
cursor_before). On the next Import, load state from both GetLastSuccessfulSync
(cursor_after baseline) and GetLatestCheckpointedSync (cursor_before checkpoint
from the interrupted run) then call SyncState.Merge to keep the more-advanced
cursor per conversation. Chat cursors are RFC3339Nano strings compared
lexicographically; channel deltaLinks prefer the checkpoint value when non-empty.
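
A hedged sketch of the merge rule described in Part B. The `SyncState` fields shown here are assumptions, not the PR's actual struct. One deliberate difference: this sketch parses chat cursors before comparing them instead of comparing the strings, because Go's RFC3339Nano layout trims trailing zeros and string order can then disagree with time order (`…00Z` sorts after `…00.5Z`).

```go
package main

import "time"

// SyncState is a hypothetical shape for the per-conversation cursors.
type SyncState struct {
	ChatCursors   map[string]string // chat id -> RFC3339Nano lastModifiedDateTime
	ChannelDeltas map[string]string // channel id -> deltaLink
}

// laterCursor returns whichever timestamp is more recent. An unparseable or
// empty value loses to a parseable one.
func laterCursor(a, b string) string {
	ta, errA := time.Parse(time.RFC3339Nano, a)
	tb, errB := time.Parse(time.RFC3339Nano, b)
	switch {
	case errB != nil:
		return a
	case errA != nil || tb.After(ta):
		return b
	}
	return a
}

// Merge folds a checkpoint from an interrupted run into the baseline state,
// keeping the more advanced chat cursor per conversation and preferring the
// checkpoint's deltaLink whenever it is non-empty.
func (s *SyncState) Merge(other *SyncState) {
	if other == nil {
		return
	}
	if s.ChatCursors == nil {
		s.ChatCursors = map[string]string{}
	}
	if s.ChannelDeltas == nil {
		s.ChannelDeltas = map[string]string{}
	}
	for id, c := range other.ChatCursors {
		s.ChatCursors[id] = laterCursor(s.ChatCursors[id], c)
	}
	for id, link := range other.ChannelDeltas {
		if link != "" {
			s.ChannelDeltas[id] = link
		}
	}
}
```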

Adds 7 new tests (TDD): TestSyncStateMerge, TestSyncStateMergeNilOther,
TestSyncStateMergeBaselineWins, TestImportProgressCallback,
TestImportChannelProgressCallback, TestCheckpointFlushedAfterEachConversation,
TestResumeFromCheckpoint. All 27 package tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add findScheduledSyncSources (plural) that returns ALL syncable sources
(gmail, imap, teams) for a scheduler identifier — at most one per type,
in stable order. Rewrite runScheduledSync to iterate and dispatch each
source in turn, collecting errors with errors.Join, so an address with
both IMAP and Teams sources gets both synced rather than silently
dropping Teams. Cache rebuild runs once after all sources. Add
sourceTypeTeams constant and runScheduledTeamsSync stub (returns
unimplemented error until the teams package is wired up). Keep the
singular findScheduledSyncSource for backward compatibility with
existing tests and callers.
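
The iterate-and-collect shape can be sketched as below. `source`, `syncAll`, and `errUnimplemented` are hypothetical names for illustration; the only detail taken from the commit is that per-source failures are combined with `errors.Join` so one failing source does not stop the others.

```go
package main

import (
	"errors"
	"fmt"
)

// source is a hypothetical, trimmed-down scheduled-sync source.
type source struct {
	Type       string // e.g. "gmail", "imap", "teams"
	Identifier string // the scheduler identifier, typically an address
}

// errUnimplemented stands in for the error a stub sync path might return.
var errUnimplemented = errors.New("teams sync not wired up")

// syncAll runs every source for an identifier and joins the failures, so one
// failing source does not prevent the others from syncing.
func syncAll(sources []source, run func(source) error) error {
	var errs []error
	for _, src := range sources {
		if err := run(src); err != nil {
			errs = append(errs, fmt.Errorf("%s sync for %s: %w", src.Type, src.Identifier, err))
		}
	}
	return errors.Join(errs...) // nil when no source failed
}
```

Because each error is wrapped with `%w` before joining, callers can still match individual causes with `errors.Is` on the combined error.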

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…s sync (rhrn)

Mirrors the Gmail/IMAP scheduled-sync paths: confirmDefaultIdentity then
runPostSourceCreateMigrations before syncing. No-op in the normal
add-teams-then-daemon flow; ensures a Teams source created by another
path still gets its default identity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
@roborev-ci

roborev-ci Bot commented Jun 20, 2026

roborev: Review Unavailable (7d3cff4)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

Nat Torkington and others added 5 commits June 22, 2026 14:04
Teams messages dropped Graph fields our struct didn't model. Three fixes:

- eventDetail: parse callRecordingEventMessageDetail (callRecordingUrl +
  callRecordingDisplayName) from systemEventMessage items, whose body is an
  empty <systemEventMessage/>. The recording is surfaced as an attachment row
  and as a "📹 recording: …" line in the body text so it shows in the detail
  view and is indexed by FTS search.
- Lossless raw blob: the teams_json archive was json.Marshal of the typed
  struct, silently discarding every unmodeled field (eventDetail, webUrl,
  summary, channelIdentity). ChatMessage.UnmarshalJSON now captures the exact
  original bytes and we archive those verbatim.
- Attachments: store all attachments[] carrying a contentUrl (file/card refs),
  not just contentType=="reference". Link attachments get a URL-derived
  content_hash so multiple coexist per message (UpsertAttachment dedupes
  hashless rows to one-per-message).

HTML body hrefs and hostedContents inline media were already captured on this
branch; verified, no change needed.
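
The lossless-raw technique can be illustrated with a custom `UnmarshalJSON` that decodes the modeled fields and also keeps the original bytes. This is a sketch only: the single `ID` field, the `Raw` field name, and `decodeMessage` are assumptions, not the PR's struct.

```go
package main

import "encoding/json"

// ChatMessage models only the fields the importer needs; Raw keeps the exact
// bytes Graph returned so unmodeled fields survive in the archive.
type ChatMessage struct {
	ID  string          `json:"id"`
	Raw json.RawMessage `json:"-"`
}

// UnmarshalJSON decodes the modeled fields and captures the original bytes.
func (m *ChatMessage) UnmarshalJSON(b []byte) error {
	type modeled ChatMessage // same fields, no methods, so no recursion
	var tmp modeled
	if err := json.Unmarshal(b, &tmp); err != nil {
		return err
	}
	*m = ChatMessage(tmp)
	m.Raw = append(json.RawMessage(nil), b...) // copy: b must not be retained
	return nil
}

// decodeMessage is a small helper for trying the type out.
func decodeMessage(data []byte) (ChatMessage, error) {
	var m ChatMessage
	err := json.Unmarshal(data, &m)
	return m, err
}
```

Archiving `Raw` instead of re-marshalling the struct means a field such as `eventDetail` is preserved even though nothing in the struct models it.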

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
sync-teams was always incremental: it loads the prior sync cursor and skips
messages it already saw, so an importer upgrade (e.g. eventDetail recording
links, lossless raw, attachments) could not backfill existing messages without
hand-editing sync_runs.

Add --full, which sets ImportOptions.Full so Import ignores the stored cursor
and any interrupted checkpoint and re-fetches every chat/channel message.
Messages upsert on (source_id, source_message_id), so this repairs existing
rows in place without duplicates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
downloadInlineImages passed the absolute hostedContents URL's full path
(".../v1.0/chats/.../hostedContents/.../$value") to the client, which prepends
its baseURL (".../v1.0"). The version segment was duplicated
(".../v1.0/v1.0/..."), so every inline-media fetch 404'd: 0 inline images were
ever stored and each attempt was counted as an error (~23.8k errors observed on
a full re-sync of a real tenant).

Add hostedFetchPath, which strips baseURL's path prefix before fetching, so the
version segment is not doubled. The prior integration test missed this because
its httptest baseURL has no path; a direct unit test now pins the production
(/v1.0) and httptest (no-path) cases and asserts the segment is not doubled.
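
A sketch of the prefix-stripping idea, assuming the client later prepends its base URL to whatever path it is given. The function name comes from the commit message, but the body below is illustrative and may differ from the real `hostedFetchPath`; for instance, it ignores any query string.

```go
package main

import (
	"net/url"
	"strings"
)

// hostedFetchPath converts an absolute hostedContents URL into a path relative
// to baseURL by removing baseURL's own path prefix (for example "/v1.0").
func hostedFetchPath(baseURL, absolute string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	abs, err := url.Parse(absolute)
	if err != nil {
		return "", err
	}
	prefix := strings.TrimSuffix(base.EscapedPath(), "/")
	path := abs.EscapedPath()
	// Strip only on a whole-segment match, and only when the base has a path.
	if prefix != "" && strings.HasPrefix(path, prefix+"/") {
		path = strings.TrimPrefix(path, prefix)
	}
	return path, nil
}
```

With a production-style base of `https://graph.microsoft.com/v1.0` the leading `/v1.0` is removed. With a path-less test server base the path passes through unchanged, which covers the two cases the commit says its unit test pins.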

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
After the hostedContents path fix, inline media for already-imported messages
still needs re-fetching. A --full re-sync re-walks every message; this
command targets only messages whose stored HTML body contains a hostedContents
URL.

Adds Store.ForEachTeamsHostedContentBody (streams message_id + body_html for
the source's messages matching hostedContents) and Importer.BackfillInlineMedia
(iterates those and runs the inline-media download). Idempotent via
content-addressed storage, so it is safe to re-run/resume.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
The iterator streamed rows while the caller wrote (UpsertAttachment) inside the
callback. Holding the read cursor open pinned a second pooled connection, so the
writes contended for SQLite's single WAL writer — "database is locked" with
30s busy-timeout stalls and dropped inserts during the inline-media backfill.

Read all matching rows and close the cursor before invoking any callback, so
writes run on a single free connection with no contention. Added a regression
test that writes inside the callback across several rows.
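
The buffer-then-callback pattern can be sketched independently of SQLite. `rowSource` is a hypothetical interface listing the four methods used here, which `*sql.Rows` from database/sql also provides; `hostedBody`, `forEachBuffered`, and the `fakeRows` stand-in are illustrative names, not code from this PR.

```go
package main

// rowSource lists the cursor methods the iterator needs. *sql.Rows from
// database/sql has methods with these signatures.
type rowSource interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// hostedBody is one buffered row: a message id and its stored HTML body.
type hostedBody struct {
	MessageID int64
	BodyHTML  string
}

// forEachBuffered drains and closes the cursor before invoking fn, so any
// writes fn performs never run while the read cursor is still open.
func forEachBuffered(rs rowSource, fn func(hostedBody) error) error {
	var buf []hostedBody
	for rs.Next() {
		var b hostedBody
		if err := rs.Scan(&b.MessageID, &b.BodyHTML); err != nil {
			rs.Close()
			return err
		}
		buf = append(buf, b)
	}
	if err := rs.Err(); err != nil {
		rs.Close()
		return err
	}
	if err := rs.Close(); err != nil {
		return err
	}
	for _, b := range buf {
		if err := fn(b); err != nil {
			return err
		}
	}
	return nil
}

// fakeRows is an in-memory cursor used only to exercise the sketch.
type fakeRows struct {
	data   []hostedBody
	i      int
	closed bool
}

func (f *fakeRows) Next() bool {
	if f.i >= len(f.data) {
		return false
	}
	f.i++
	return true
}

func (f *fakeRows) Scan(dest ...any) error {
	*dest[0].(*int64) = f.data[f.i-1].MessageID
	*dest[1].(*string) = f.data[f.i-1].BodyHTML
	return nil
}

func (f *fakeRows) Err() error   { return nil }
func (f *fakeRows) Close() error { f.closed = true; return nil }
```

The trade-off is memory: every matching row is held at once.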

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

roborev: Combined Review (d4652cb)

Summary verdict: changes are close, but there is one high-severity data-loss issue and several medium-severity Teams integration correctness gaps.

High

  • internal/teams/importer.go:266: If fetching replies for a channel root fails during the initial backfill or delta-token fallback, the importer only increments sum.Errors, continues, primes a new delta cursor, and later stores it. The missed historical replies will not be retried on the next sync.
    • Fix: Treat reply-fetch failures during backfill/fallback as channel failure, or avoid advancing that channel’s delta cursor until all roots and replies were fetched successfully.
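
One way to read the suggested fix, as a sketch rather than a patch for importer.go: if any reply fetch fails during backfill, report the channel as incomplete and return no delta link, so the caller has nothing new to store. `backfillChannel`, its callback parameters, and `errRepliesFailed` are hypothetical names.

```go
package main

import "errors"

// errRepliesFailed stands in for a failed replies request.
var errRepliesFailed = errors.New("replies fetch failed")

// backfillResult reports what a channel backfill produced.
type backfillResult struct {
	Messages  int
	DeltaLink string // left empty when the backfill was incomplete
}

// backfillChannel walks every root, fetches its replies, and primes a delta
// cursor only if no reply fetch failed. fetchReplies returns the number of
// replies stored for one root; primeDelta returns a fresh deltaLink.
func backfillChannel(
	roots []string,
	fetchReplies func(root string) (int, error),
	primeDelta func() (string, error),
) (backfillResult, error) {
	var res backfillResult
	var errs []error
	for _, root := range roots {
		res.Messages++ // the root itself
		n, err := fetchReplies(root)
		if err != nil {
			errs = append(errs, err)
			continue // keep going so the other roots are still stored
		}
		res.Messages += n
	}
	if len(errs) > 0 {
		// Incomplete: do not prime or return a cursor, so the next run
		// repeats the full roots+replies pass for this channel.
		return res, errors.Join(errs...)
	}
	link, err := primeDelta()
	if err != nil {
		return res, err
	}
	res.DeltaLink = link
	return res, nil
}
```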

Medium

  • internal/teams/mapping.go:42: Inline hosted-content images are downloaded as attachment rows, but has_attachments and attachment_count only count Graph attachments and recordings. Messages with only inline images will appear as having no attachments in APIs/search/UI.

    • Fix: Count hosted-content URLs before UpsertMessage, or update the message attachment fields after successful inline downloads/backfill.
  • internal/teams/importer.go:454: Mention rows are only replaced when the current Graph message has at least one resolvable mention. If a message is edited to remove mentions, or a --full repair re-imports it with no mentions, stale recipient_type='mention' rows remain.

    • Fix: Always call ReplaceMessageRecipients(messageID, "mention", mentionIDs, mentionNames) after processing the current mentions, even when the resulting slice is empty.
  • cmd/msgvault/cmd/remove_account.go:155: remove-account has no sourceTypeTeams credential cleanup case, so deleting a Teams account removes database rows but leaves the teams_<email>.json Graph OAuth token on disk.

    • Fix: Add a Teams case that uses microsoft.NewGraphManager(...).DeleteToken(source.Identifier).
  • cmd/msgvault/cmd/serve.go:72: serve still requires Google [oauth] config before starting. A Teams-only setup with [microsoft] client_id and a Teams token cannot run scheduled Teams syncs through the daemon.

    • Fix: Defer Google OAuth validation to Gmail sync paths, or allow startup when Microsoft OAuth is configured and scheduled sources are Teams-only.

Panel: ci_default_security | Synthesis: codex, 16s | Members: codex_default (codex/default, done, 7m17s), codex_security (codex/security, done, 5m24s) | Total: 12m57s

@njt

njt commented Jun 22, 2026

Contributor Author

Two unrelated bug fixes that surfaced while building this were split out into their own PRs to keep this one scoped to Teams ingestion:

Neither depends on this PR; both branch off current main.

Retry only messages whose inline media is still missing (hostedContents
reference count exceeds stored on-disk images), instead of re-fetching every
message — useful after transient fetch failures. Adds
Store.ForEachTeamsIncompleteHostedContentBody (sharing the buffered iterator
with the existing full variant) and an OnlyIncomplete import option.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

roborev: Combined Review (13c4591)

High-level verdict: changes need fixes before merge due to one high-risk Teams backfill data-loss path and several medium issues around auth cleanup, startup validation, embeddings, and stale child data.

High

  • internal/teams/importer.go:270
    A channel backfill can fail to fetch replies for a root message, count the error, and still prime/save the channel delta cursor later. Older replies missed during that first backfill can then be skipped permanently because future runs resume from the delta link instead of retrying the full replies walk.
    Fix: Treat reply-fetch failures during full channel backfill as channel-incomplete: do not save the channel delta/checkpoint for that channel, or return an error so the next run retries the full roots+replies pass.

Medium

  • cmd/msgvault/cmd/serve.go:72
    serve still requires Google OAuth config via cfg.OAuth.HasAnyConfig(). A Teams-only setup with only [microsoft].client_id configured cannot start the daemon, even though this change adds scheduled Teams sync support.
    Fix: Allow the daemon to start when either Google OAuth or Microsoft Graph OAuth is configured, e.g. include cfg.Microsoft.ClientID != "" in the validation.

  • cmd/msgvault/cmd/remove_account.go:155
    Removing a Teams source does not remove the new teams_<email>.json Graph token. The database source is deleted, but the credential remains and can silently re-authorize a future Teams sync. This also leaves a still-valid Graph refresh token behind after an operator believes the account was scrubbed.
    Fix: Add a sourceTypeTeams case that calls microsoft.NewGraphManager(...).DeleteToken(source.Identifier), and add a regression test confirming remove-account --type teams deletes teams_<email>.json.

  • cmd/msgvault/cmd/serve.go:677
    Teams sync ignores vectorFeatures, and sync-teams does not set up vector enqueueing either. New Teams messages are written to the store but never added to pending_embeddings, so semantic/vector search misses them until a full embedding rebuild.
    Fix: Add optional embedding enqueue support to the Teams importer, return/collect persisted message IDs, and wire it from both sync-teams and scheduled Teams sync.

  • internal/teams/importer.go:458
    Re-imported Teams messages do not clear child metadata when Graph returns an empty collection. Mentions are only replaced when len(gm.Mentions) > 0, and reactions/attachments are append-only, so edits that remove mentions/reactions/attachments leave stale rows.
    Fix: Use replace/delete-then-insert semantics for Teams-managed child collections, including calling ReplaceMessageRecipients with empty slices for mentions and adding equivalent replacement paths for reactions and attachments.
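
The replace semantics the reviewer asks for can be illustrated with an in-memory stand-in. `recipientStore`, `replace`, and `persistMentions` are hypothetical; the real `ReplaceMessageRecipients` signature is the one quoted in the review, and this sketch does not reproduce it.

```go
package main

// recipientStore is an in-memory stand-in for the message_recipients table:
// message id -> recipient type -> participant ids.
type recipientStore struct {
	rows map[int64]map[string][]string
}

func newRecipientStore() *recipientStore {
	return &recipientStore{rows: map[int64]map[string][]string{}}
}

// replace swaps the full set of rows for one message and recipient type.
// An empty slice clears the set, which is what removes stale rows.
func (s *recipientStore) replace(messageID int64, kind string, participants []string) {
	if s.rows[messageID] == nil {
		s.rows[messageID] = map[string][]string{}
	}
	if len(participants) == 0 {
		delete(s.rows[messageID], kind)
		return
	}
	s.rows[messageID][kind] = append([]string(nil), participants...)
}

// persistMentions always replaces, with no guard on the number of mentions,
// so an edit that removes every mention also removes the stored rows.
func persistMentions(s *recipientStore, messageID int64, mentions []string) {
	s.replace(messageID, "mention", mentions)
}
```

The same shape would apply to reactions and attachments if they were treated as Teams-managed child collections, as the review suggests.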


Panel: ci_default_security | Synthesis: codex, 14s | Members: codex_default (codex/default, done, 8m21s), codex_security (codex/security, done, 5m45s) | Total: 14m20s

wesm pushed a commit that referenced this pull request Jun 22, 2026
## What

Several importers built `time.Time` values from epoch timestamps with `time.Unix`/`time.UnixMilli` but **without `.UTC()`**, leaving them in the runner's local zone — while the rest of each importer stores dates in UTC. Any code reading the calendar day (or the Parquet year partition) is then off by one in zones east of UTC.

Fixes:
- `internal/sync/sync.go` — `processBatch` oldest-message date (progress tracking).
- `internal/whatsapp/mapping.go` — message `SentAt`.
- `internal/whatsapp/importer.go` — reaction `createdAt`.

## Why it matters

`TestProcessBatch_OldestDatePropagation` fails on any machine east of UTC (e.g. NZ): the fixture `2024-01-10T12:00:00Z` reads back as Jan 11 local. The tests are correct; the production code was the bug. Adds `TestMapMessageSentAtIsUTC` (asserts the stored zone is UTC, machine-independent).

## Possible later fixes (out of scope here)

The same `time.Unix(...)`-without-`.UTC()` pattern also appears in the embedding-generation status timestamps, but these are **operator-facing status values** round-tripped from unix-int columns (not message dates), so they don't affect partitioning/dedup/cross-system date semantics. Local-time display is arguably fine; normalizing them to UTC would be a consistency-only follow-up. Sites:
- `cmd/msgvault/cmd/embeddings_manage.go` — `StartedAt`, `SeededAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/pgvector/backend.go` — `StartedAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/sqlitevec/backend.go` — `StartedAt`, `CompletedAt`, `ActivatedAt`.

Left unchanged here to avoid churning working code on a style call; documented so a future pass can decide.

## Scope

Independent of the Teams PR (#398) — branched from `main`, touches only `internal/sync` and `internal/whatsapp`.

Co-authored-by: Nat Torkington <njt@users.noreply.github.com>
@wesm

wesm commented Jun 25, 2026

Member

looking at this


2 participants