feat: Microsoft Teams ingestion (delegated Graph sync) #398
Delegated Graph delta-sync of the user's own Teams chats, channel messages, meeting chats, and transcripts into the existing generic chat schema. IMAP and Teams OAuth kept independent via separate token files and incremental consent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Validated Graph contracts against Microsoft Learn docs. Corrected the
spec where the original architecture was falsified:
- no delegated per-chat delta -> chats use list + lastModifiedDateTime filter
- channel delta caps at ~8 months -> backfill via list + replies
- no chat->meeting nav -> resolve transcripts via joinWebUrl filter
- identities carry AAD id, not email -> /users/{id} resolution step
Recorded 3 residual risks (LB-1/2/3) that need a live-tenant spike, plus
kata t56j (spike) and 11te (build).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Transcripts moved out of the Teams ingestion build (organizer-only content + expired-meeting gaps make delegated coverage thin). New stub spec 2026-06-18-teams-transcripts-design.md (kata hsww). Main spec now covers chats + channels + meeting chats only; LB-3 deferred with it. Residual risks for this build narrowed to LB-1/LB-2, probed live before planning per decision 2026-06-18. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Graph Explorer probe (delegated, 2026-06-19) confirmed: per-chat list serves history beyond the 8-month delta window (LB-1), and channel /messages/delta returns a deltaLink under delegated ChannelMessage.Read.All (LB-2). Channel enumeration needs Channel.ReadBasic.All. Closes kata t56j. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…ing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
15-task TDD plan: delegated Graph OAuth manager, REST client with paging + Retry-After, participant resolution, message mapping, sync-state cursors, chats + channels importer, inline images, add-teams/sync-teams commands, scheduler integration. Built on load-bearing-verified contracts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add GraphManager, a sibling of the IMAP Manager that runs the same interactive browser auth-code flow but requests Microsoft Graph scopes and persists tokens under a teams_<email>.json prefix. No IMAP scope validation or IMAP-host logic. Reuses Manager's browser-flow and ID-token verification via an internal delegate; token storage uses the same tokenFile JSON format. IMAP Manager behavior is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Adds internal/teams/client.go: a minimal Microsoft Graph REST client with injectable base URL + token function (for httptest), token-bucket rate limiting via golang.org/x/time/rate, @odata.nextLink paging, and Retry-After / exponential back-off on 429 and 5xx responses. Tests cover paged responses (two pages → deltaLink) and Retry-After retry (429 → success in exactly 2 calls). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
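A minimal sketch of the retry decision this commit describes (Retry-After first, exponential back-off otherwise). The helper names `shouldRetry` and `retryDelay`, the 1s base, and the 30s cap are assumptions for illustration, not taken from internal/teams/client.go. The sketch only handles the delta-seconds form of Retry-After; an HTTP-date value falls through to back-off.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const (
	baseDelay = time.Second      // assumed starting back-off
	maxDelay  = 30 * time.Second // assumed upper bound
)

// shouldRetry reports whether a status is worth retrying: 429 or any 5xx.
func shouldRetry(status int) bool {
	return status == 429 || (status >= 500 && status <= 599)
}

// retryDelay picks the wait before the next attempt (attempt is 0-based).
// A numeric Retry-After header value wins; otherwise the delay doubles per
// attempt and is capped at maxDelay.
func retryDelay(retryAfter string, attempt int) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	if attempt < 0 || attempt > 30 {
		return maxDelay // guard against shift overflow
	}
	d := baseDelay << uint(attempt)
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	fmt.Println(shouldRetry(429), shouldRetry(404))
	fmt.Println(retryDelay("7", 0)) // server-provided wait
	fmt.Println(retryDelay("", 3))  // back-off for the fourth attempt
}
```

Keeping the decision in a pure function makes the "429 then success in exactly 2 calls" style of test easy to write without sleeping.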
…ncel test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add mapping.go with mapMessage, htmlToText (delegates to mime.StripHTML), snippet, and conversationType helpers. Add mapping_test.go with full testify coverage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Replace non-generic getAllPages with generic pageThrough[T], giving a single paging implementation. Add ListChats, ListJoinedTeams, ListChannels, ListChatMessages, ChannelMessagesDelta, ListChannelMessages, and ListReplies on top of the shared helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
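A sketch of what a single generic paging helper could look like. Only the name `pageThrough[T]` and the `@odata.nextLink` / deltaLink behaviour come from the commits above; the `page` envelope type, the `fetchFunc` seam, the `item` type, and the fake pages are hypothetical stand-ins for the real client and Graph responses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page mirrors the OData collection envelope returned by Graph list calls.
type page[T any] struct {
	Value     []T    `json:"value"`
	NextLink  string `json:"@odata.nextLink"`
	DeltaLink string `json:"@odata.deltaLink"`
}

// fetchFunc returns the raw JSON body for a URL (assumed seam for the HTTP client).
type fetchFunc func(url string) ([]byte, error)

// pageThrough follows nextLink until it is empty, collecting every item and
// remembering the deltaLink if the final page carries one.
func pageThrough[T any](fetch fetchFunc, url string) ([]T, string, error) {
	var all []T
	var delta string
	for url != "" {
		body, err := fetch(url)
		if err != nil {
			return nil, "", err
		}
		var p page[T]
		if err := json.Unmarshal(body, &p); err != nil {
			return nil, "", fmt.Errorf("decode page %s: %w", url, err)
		}
		all = append(all, p.Value...)
		if p.DeltaLink != "" {
			delta = p.DeltaLink
		}
		url = p.NextLink
	}
	return all, delta, nil
}

// item is a hypothetical message shape used only by this example.
type item struct {
	ID string `json:"id"`
}

// fakePages stands in for a Graph server: two pages, the second ending in a deltaLink.
var fakePages = map[string]string{
	"/chats/c1/messages": `{"value":[{"id":"m1"},{"id":"m2"}],"@odata.nextLink":"/page2"}`,
	"/page2":             `{"value":[{"id":"m3"}],"@odata.deltaLink":"/delta?token=abc"}`,
}

func fakeFetch(url string) ([]byte, error) {
	body, ok := fakePages[url]
	if !ok {
		return nil, fmt.Errorf("no page for %s", url)
	}
	return []byte(body), nil
}

func main() {
	items, delta, err := pageThrough[item](fakeFetch, "/chats/c1/messages")
	fmt.Println(len(items), delta, err)
}
```

With this shape, each list method reduces to choosing `T` and the starting URL.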
Implements Task 9: orchestration core for the CHATS path.
- NewImporter/Import: source creation, sync state loading, sync lifecycle (StartSync/CompleteSync/FailSync)
- syncChats: lists chats, ensures conversations, fetches messages, advances per-chat cursor in SyncState
- persistMessage: upserts message+body+raw+FTS, handles reactions and reference attachments, marks deleted messages
- identityOf: extracts user or application identity from IdentitySet
- syncChannels stub (returns nil; filled by Task 10)
- TestImportChatsEndToEnd: fake HTTP server, full import run, asserts ChatsProcessed=1 MessagesAdded=1 and DB COUNT=1
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
- Add store.SetReplyTo to link channel reply messages to their parent by resolving source_message_id → internal id within the same source
- Add syncChannels: first-run backfill (ListChannelMessages + ListReplies) followed by delta-cursor priming; subsequent runs use stored deltaLink with 400/410 fallback to full re-page + re-prime
- Add persistChannelMessages helper that delegates to persistMessage and calls SetReplyTo when ReplyToID is set
- Add TestImportChannelsEndToEnd end-to-end test with fake Graph server
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Detect Graph hostedContents $value URLs in Teams HTML message bodies, download each image via the client, and store in content-addressed storage. InlineImagesCopied is incremented per image stored. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
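One way the detection step might look, as a sketch only: a regular expression that pulls hostedContents `$value` URLs out of an HTML body and dedupes them. The function name `hostedContentURLs` and the exact pattern are assumptions; the real importer may parse the HTML differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostedRe matches absolute URLs that end in /hostedContents/<id>/$value.
// The character classes stop at quotes, whitespace, and angle brackets so the
// match stays inside a single HTML attribute value.
var hostedRe = regexp.MustCompile(`https://[^"'\s<>]+/hostedContents/[^"'\s<>/]+/\$value`)

// hostedContentURLs returns each distinct hostedContents URL in body,
// in order of first appearance.
func hostedContentURLs(body string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, u := range hostedRe.FindAllString(body, -1) {
		if !seen[u] {
			seen[u] = true
			out = append(out, u)
		}
	}
	return out
}

func main() {
	body := `<p>see</p><img src="https://graph.microsoft.com/v1.0/chats/c1/messages/m1/hostedContents/h1/$value">`
	for _, u := range hostedContentURLs(body) {
		fmt.Println(u)
	}
}
```

Deduping before download keeps a per-image counter such as InlineImagesCopied from counting the same image twice within one message.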
Adds `add-teams <email>` CLI command to authorize Microsoft Teams (delegated Graph API) for an account. Mirrors addo365 patterns: EffectiveTenantID, NewGraphManager, GetOrCreateSource(sourceTypeTeams), UpdateSourceDisplayName, confirmDefaultIdentity, and both startup/post-source-create migrations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Add Teams source dispatch to runScheduledSync (case sourceTypeTeams), runScheduledTeamsSync helper, and sourceTypeTeams recognition in findScheduledSyncSource. Guard the post-switch summary log so a nil summary (Teams path) doesn't deref. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
… typo Rename misspelled teamsTenatID -> teamsTenantID in add_teams.go (all three sites: declaration, flag binding, usage). Add Teams quick-command block to CLAUDE.md. Mark the design spec implemented on feat/teams-ingestion. Fix testify-helper-check lint in Teams test files by introducing local assert/require helpers where call counts exceeded the threshold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…ision
- opts.Limit is now enforced per-conversation in syncChats (break after convCount reaches the cap) and across all channel fetch paths in syncChannels (roots + replies + delta priming + fallback re-prime).
- Chat cursor now compares lastModifiedDateTime values as time.Time to pick the true max, then formats with RFC3339Nano so the Graph $filter=lastModifiedDateTime gt boundary is millisecond-exact rather than second-exact.
- Added TestImportChatsLimit: fake server returns 3 chat messages, Import with Limit:2 asserts only 2 are persisted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
The tui command has no --account flag; account filtering is done inside the TUI via the 'a' keybinding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…rrelevant (m5yj) Refactor syncChannels to collect all messages for a channel first (Phase 0, deduped by ID), persist them in order up to the per-conversation limit (Phase 1), then call SetReplyTo for every persisted reply after all messages are in the store (Phase 2). This prevents the previous inline-SetReplyTo approach from silently dropping reply links when a delta page returns a reply before its root. Adds TestReplyBeforeRoot which verified the bug (reply_to_message_id was NULL) before the fix and confirms it is set correctly after. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
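The three phases can be illustrated with an in-memory stand-in for the store. Everything here (`message`, `memStore`, `persistChannel`) is hypothetical and only models the ordering argument: links are resolved after every message exists, so a reply that arrives before its root still gets linked.

```go
package main

import "fmt"

// message is a pared-down channel message: its Graph id and optional parent id.
type message struct {
	ID        string
	ReplyToID string
}

// memStore is an in-memory stand-in for the real store.
type memStore struct {
	ids     map[string]int64 // source_message_id -> internal id
	replyTo map[int64]int64  // child internal id -> parent internal id
	next    int64
}

func newMemStore() *memStore {
	return &memStore{ids: map[string]int64{}, replyTo: map[int64]int64{}}
}

// upsert inserts the message if unseen and returns its internal id.
func (s *memStore) upsert(m message) int64 {
	if id, ok := s.ids[m.ID]; ok {
		return id
	}
	s.next++
	s.ids[m.ID] = s.next
	return s.next
}

// setReplyTo links child to parent; it fails if either is not stored yet.
func (s *memStore) setReplyTo(child, parent string) bool {
	c, okC := s.ids[child]
	p, okP := s.ids[parent]
	if !okC || !okP {
		return false
	}
	s.replyTo[c] = p
	return true
}

// persistChannel collects and dedupes (phase 0), persists (phase 1), and only
// then links replies (phase 2). It returns the number of links written.
func persistChannel(s *memStore, batches ...[]message) int {
	seen := map[string]bool{}
	var ordered []message
	for _, b := range batches {
		for _, m := range b {
			if !seen[m.ID] {
				seen[m.ID] = true
				ordered = append(ordered, m)
			}
		}
	}
	for _, m := range ordered {
		s.upsert(m)
	}
	linked := 0
	for _, m := range ordered {
		if m.ReplyToID != "" && s.setReplyTo(m.ID, m.ReplyToID) {
			linked++
		}
	}
	return linked
}

func main() {
	s := newMemStore()
	// The reply is delivered before its root, as a delta page might do.
	n := persistChannel(s, []message{{ID: "reply", ReplyToID: "root"}, {ID: "root"}})
	fmt.Println("links written:", n)
}
```

The per-conversation limit from the commit is left out of this sketch to keep the ordering logic visible.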
Resolves chat members via the Graph /members endpoint and writes message_recipients rows for 'to' (all members except sender) and 'mention' (explicit @-mentions in the message body). The participant cache ensures that aadUser-typed mention identities resolve to the same participant already inserted from the email-keyed members list. Member-fetch failure is non-fatal: the import continues with empty toRecips so existing inline-image and other tests are unaffected. Channel messages pass nil toRecips (no members API for channels). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…low-up) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
…s for resume (paa1)
Part A: Add ImportOptions.Progress (optional func(string)) called after each chat
and channel conversation with a concise line ("chat N/M (type): K messages" /
"channel Team / Chan: K messages"). CLI sync-teams wires Progress: fmt.Println
so the user sees live output; serve.go is unchanged (nil = silent).
Part B: Thread syncID into syncChats/syncChannels and call UpdateSyncCheckpoint
after finishing each conversation (flushing the current SyncState JSON into
cursor_before). On the next Import, load state from both GetLastSuccessfulSync
(cursor_after baseline) and GetLatestCheckpointedSync (cursor_before checkpoint
from the interrupted run) then call SyncState.Merge to keep the more-advanced
cursor per conversation. Chat cursors are RFC3339Nano strings compared
lexicographically; channel deltaLinks prefer the checkpoint value when non-empty.
Adds 7 new tests (TDD): TestSyncStateMerge, TestSyncStateMergeNilOther,
TestSyncStateMergeBaselineWins, TestImportProgressCallback,
TestImportChannelProgressCallback, TestCheckpointFlushedAfterEachConversation,
TestResumeFromCheckpoint. All 27 package tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
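A note on comparing RFC3339Nano cursors, with a sketch. Go's `time.RFC3339Nano` layout trims trailing zeros from the fractional part, so a timestamp on a whole second has no fraction at all, and plain string order can then disagree with time order. Whether that case can occur with the cursors stored here is not shown in the commit message. The helper `laterCursor` below is hypothetical and parses before comparing, in the spirit of the earlier commit that compares lastModifiedDateTime values as time.Time.

```go
package main

import (
	"fmt"
	"time"
)

// laterCursor returns whichever cursor denotes the later instant.
// An empty cursor always loses to a non-empty one.
func laterCursor(a, b string) (string, error) {
	if a == "" {
		return b, nil
	}
	if b == "" {
		return a, nil
	}
	ta, err := time.Parse(time.RFC3339Nano, a)
	if err != nil {
		return "", fmt.Errorf("parse cursor %q: %w", a, err)
	}
	tb, err := time.Parse(time.RFC3339Nano, b)
	if err != nil {
		return "", fmt.Errorf("parse cursor %q: %w", b, err)
	}
	if tb.After(ta) {
		return b, nil
	}
	return a, nil
}

func main() {
	whole := "2024-01-01T00:00:00Z"
	frac := "2024-01-01T00:00:00.5Z" // half a second later
	// '.' sorts before 'Z', so string order ranks the later instant first.
	fmt.Println("frac < whole as strings:", frac < whole)
	got, err := laterCursor(whole, frac)
	fmt.Println(got, err)
}
```

Formatting every cursor with a fixed-width fraction would be another way to make string comparison safe; that is a design choice the source does not discuss.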
Add findScheduledSyncSources (plural) that returns ALL syncable sources (gmail, imap, teams) for a scheduler identifier — at most one per type, in stable order. Rewrite runScheduledSync to iterate and dispatch each source in turn, collecting errors with errors.Join, so an address with both IMAP and Teams sources gets both synced rather than silently dropping Teams. Cache rebuild runs once after all sources. Add sourceTypeTeams constant and runScheduledTeamsSync stub (returns unimplemented error until the teams package is wired up). Keep the singular findScheduledSyncSource for backward compatibility with existing tests and callers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
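The dispatch-and-collect pattern can be sketched as follows. `errors.Join` is the standard-library call (Go 1.20+); the `source` type, `syncAll`, and `errUnimplemented` are hypothetical names for illustration and do not mirror the real scheduler code.

```go
package main

import (
	"errors"
	"fmt"
)

// source is a pared-down sync source for this example.
type source struct {
	Type string // e.g. "gmail", "imap", "teams"
	ID   string
}

// errUnimplemented models a stubbed-out sync path.
var errUnimplemented = errors.New("not implemented")

// syncAll runs every source in order and keeps going after a failure,
// returning all failures joined into one error (nil if none failed).
func syncAll(sources []source, run func(source) error) error {
	var errs []error
	for _, s := range sources {
		if err := run(s); err != nil {
			errs = append(errs, fmt.Errorf("%s sync for %s: %w", s.Type, s.ID, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	srcs := []source{
		{Type: "imap", ID: "you@tenant.com"},
		{Type: "teams", ID: "you@tenant.com"},
	}
	err := syncAll(srcs, func(s source) error {
		if s.Type == "teams" {
			return errUnimplemented
		}
		return nil
	})
	fmt.Println(err)
	fmt.Println(errors.Is(err, errUnimplemented))
}
```

Because the joined error still satisfies `errors.Is` for each wrapped cause, callers can keep checking for specific failures.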
…s sync (rhrn) Mirrors the Gmail/IMAP scheduled-sync paths: confirmDefaultIdentity then runPostSourceCreateMigrations before syncing. No-op in the normal add-teams-then-daemon flow; ensures a Teams source created by another path still gets its default identity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UVTtwc4MNS8L4ztnJ87QMj
Teams messages dropped Graph fields our struct didn't model. Three fixes:
- eventDetail: parse callRecordingEventMessageDetail (callRecordingUrl + callRecordingDisplayName) from systemEventMessage items, whose body is an empty <systemEventMessage/>. The recording is surfaced as an attachment row and as a "📹 recording: …" line in the body text so it shows in the detail view and is indexed by FTS search.
- Lossless raw blob: the teams_json archive was json.Marshal of the typed struct, silently discarding every unmodeled field (eventDetail, webUrl, summary, channelIdentity). ChatMessage.UnmarshalJSON now captures the exact original bytes and we archive those verbatim.
- Attachments: store all attachments[] carrying a contentUrl (file/card refs), not just contentType=="reference". Link attachments get a URL-derived content_hash so multiple coexist per message (UpsertAttachment dedupes hashless rows to one-per-message).
HTML body hrefs and hostedContents inline media were already captured on this branch; verified, no change needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
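A sketch of the "capture the original bytes" technique, assuming a much smaller struct than the real one. Only the type name `ChatMessage` and the use of a custom `UnmarshalJSON` come from the commit; the `Raw` field, the `plain` alias, and the `decode` helper are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChatMessage models only the fields the importer reads; Raw keeps the rest.
type ChatMessage struct {
	ID  string          `json:"id"`
	Raw json.RawMessage `json:"-"` // exact bytes as received (assumed field name)
}

// UnmarshalJSON decodes the typed fields and also stores the untouched input,
// so unmodeled fields survive in the archived blob.
func (m *ChatMessage) UnmarshalJSON(data []byte) error {
	// plain has the same fields but no methods, which avoids infinite recursion.
	type plain ChatMessage
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	*m = ChatMessage(p)
	// Copy the bytes: the decoder may reuse its buffer after this call returns.
	m.Raw = append(json.RawMessage(nil), data...)
	return nil
}

// decode is a small convenience wrapper used by the example.
func decode(s string) (ChatMessage, error) {
	var m ChatMessage
	err := json.Unmarshal([]byte(s), &m)
	return m, err
}

func main() {
	m, err := decode(`{"id":"1","webUrl":"https://example.invalid/x"}`)
	fmt.Println(m.ID, string(m.Raw), err)
}
```

Archiving `Raw` instead of re-marshalling the struct is what makes a later importer upgrade able to recover fields that were not modeled at import time.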
sync-teams was always incremental: it loads the prior sync cursor and skips messages it already saw, so an importer upgrade (e.g. eventDetail recording links, lossless raw, attachments) could not backfill existing messages without hand-editing sync_runs. Add --full, which sets ImportOptions.Full so Import ignores the stored cursor and any interrupted checkpoint and re-fetches every chat/channel message. Messages upsert on (source_id, source_message_id), so this repairs existing rows in place without duplicates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
downloadInlineImages passed the absolute hostedContents URL's full path
(".../v1.0/chats/.../hostedContents/.../$value") to the client, which prepends
its baseURL (".../v1.0"). The version segment was duplicated
(".../v1.0/v1.0/..."), so every inline-media fetch 404'd: 0 inline images were
ever stored and each attempt was counted as an error (~23.8k errors observed on
a full re-sync of a real tenant).
Add hostedFetchPath, which strips baseURL's path prefix before fetching, so the
version segment is not doubled. The prior integration test missed this because
its httptest baseURL has no path; a direct unit test now pins the production
(/v1.0) and httptest (no-path) cases and asserts the segment is not doubled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
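A sketch of the prefix-stripping idea. The function name `hostedFetchPath` appears in the commit, but this signature and body are assumptions written for illustration; the real helper may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostedFetchPath turns an absolute hostedContents URL into a path relative to
// baseURL, so a client that prepends baseURL does not repeat the version segment.
func hostedFetchPath(baseURL, absURL string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u, err := url.Parse(absURL)
	if err != nil {
		return "", fmt.Errorf("parse hostedContents URL: %w", err)
	}
	p := u.EscapedPath()
	// Strip the base path (for example "/v1.0") only when it is a whole-segment prefix.
	prefix := strings.TrimSuffix(base.EscapedPath(), "/")
	if prefix != "" && strings.HasPrefix(p, prefix+"/") {
		p = strings.TrimPrefix(p, prefix)
	}
	if u.RawQuery != "" {
		p += "?" + u.RawQuery
	}
	return p, nil
}

func main() {
	abs := "https://graph.microsoft.com/v1.0/chats/c1/messages/m1/hostedContents/h1/$value"
	p, err := hostedFetchPath("https://graph.microsoft.com/v1.0", abs)
	fmt.Println(p, err)
}
```

Covering both a base URL with a path and one without, as the commit's unit test does, guards against the blind spot that let the original bug through.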
After the hostedContents path fix, inline media for already-imported messages still needs re-fetching. A full --full re-sync re-walks all messages; this command targets only messages whose stored HTML body contains a hostedContents URL. Adds Store.ForEachTeamsHostedContentBody (streams message_id + body_html for the source's messages matching hostedContents) and Importer.BackfillInlineMedia (iterates those and runs the inline-media download). Idempotent via content-addressed storage, so it is safe to re-run/resume. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
The iterator streamed rows while the caller wrote (UpsertAttachment) inside the callback. Holding the read cursor open pinned a second pooled connection, so the writes contended for SQLite's single WAL writer — "database is locked" with 30s busy-timeout stalls and dropped inserts during the inline-media backfill. Read all matching rows and close the cursor before invoking any callback, so writes run on a single free connection with no contention. Added a regression test that writes inside the callback across several rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
Two unrelated bug fixes that surfaced while building this were split out into their own PRs to keep this one scoped to Teams ingestion:
Neither depends on this PR; both branch off current `main`.
Retry only messages whose inline media is still missing (hostedContents reference count exceeds stored on-disk images), instead of re-fetching every message — useful after transient fetch failures. Adds Store.ForEachTeamsIncompleteHostedContentBody (sharing the buffered iterator with the existing full variant) and an OnlyIncomplete import option. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
## What
Several importers built `time.Time` values from epoch timestamps with `time.Unix`/`time.UnixMilli` but **without `.UTC()`**, leaving them in the runner's local zone, while the rest of each importer stores dates in UTC. Any code reading the calendar day (or the Parquet year partition) is then off by one in zones east of UTC.

Fixes:
- `internal/sync/sync.go`: `processBatch` oldest-message date (progress tracking).
- `internal/whatsapp/mapping.go`: message `SentAt`.
- `internal/whatsapp/importer.go`: reaction `createdAt`.

## Why it matters
`TestProcessBatch_OldestDatePropagation` fails on any machine east of UTC (e.g. NZ): the fixture `2024-01-10T12:00:00Z` reads back as Jan 11 local. The tests are correct; the production code was the bug. Adds `TestMapMessageSentAtIsUTC` (asserts the stored zone is UTC, machine-independent).

## Possible later fixes (out of scope here)
The same `time.Unix(...)`-without-`.UTC()` pattern also appears in the embedding-generation status timestamps, but these are **operator-facing status values** round-tripped from unix-int columns (not message dates), so they don't affect partitioning/dedup/cross-system date semantics. Local-time display is arguably fine; normalizing them to UTC would be a consistency-only follow-up.

Sites:
- `cmd/msgvault/cmd/embeddings_manage.go`: `StartedAt`, `SeededAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/pgvector/backend.go`: `StartedAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/sqlitevec/backend.go`: `StartedAt`, `CompletedAt`, `ActivatedAt`.

Left unchanged here to avoid churning working code on a style call; documented so a future pass can decide.

## Scope
Independent of the Teams PR (#398): branched from `main`, touches only `internal/sync` and `internal/whatsapp`.

Co-authored-by: Nat Torkington <njt@users.noreply.github.com>
looking at this
What
Sync your own Microsoft Teams 1:1/group/meeting chats and channel messages into msgvault via delegated Microsoft Graph, searchable alongside mail through the existing TUI / FTS / Parquet analytics.
Highlights
- `add-teams` (delegated Graph OAuth) and `sync-teams` (full + incremental, with streamed per-conversation progress) commands; Teams also runs under `serve` scheduled syncs, and the daemon now syncs all source types on an identifier (so Teams + Outlook/IMAP on one address both run).
- … (`to`) + `@mention` rows, identity resolution (AAD object id → email dedup, unifying with mail identities), inline images downloaded to content-addressed storage, and shared-file links recorded.
- … `lastModifiedDateTime` list filtering (no delegated per-chat delta endpoint exists), channels via `/messages/delta`; per-conversation cursors persisted in `sync_runs.cursor_after`, flushed after each conversation so an interrupted long backfill resumes mid-stream.
- … `teams_<email>.json` token with Graph scopes only, so IMAP and Teams can each be used alone or together.
Use
1. … (`Chat.Read`, `ChannelMessage.Read.All`, `Team.ReadBasic.All`, `Channel.ReadBasic.All`, `User.Read`) and grant admin consent.
2. … `config.toml`: …
3. `msgvault add-teams you@tenant.com` then `msgvault sync-teams you@tenant.com` (`--no-channels` / `--limit` for scoped runs). Press `a` inside `msgvault tui` to filter to the Teams account.
Notes
🤖 Generated with Claude Code