Conversation
Pull request overview
This PR focuses on reducing latency in Azure Resources tree/commands by parallelizing independent async calls, removing redundant telemetry wrappers, batching subscription telemetry, and adding a short-lived cache for ARM resource listings.
Changes:
- Batch subscription telemetry events to avoid per-subscription telemetry calls during tree load.
- Remove an extra `callWithTelemetryAndErrorHandling` layer from `ResourceTreeDataProviderBase.getTreeItem` to avoid double-wrapping overhead.
- Introduce a TTL cache for `listResources` (ARM `resources.list`), plus cache invalidation on common refresh/sign-in commands; parallelize a few independent awaits.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/tree/azure/AzureResourceTreeDataProvider.ts | Batches subscription telemetry during subscription list processing. |
| src/tree/ResourceTreeDataProviderBase.ts | Skips telemetry wrapping in getTreeItem to avoid double-wrapping overhead. |
| src/services/AzureResourcesService.ts | Adds TTL caching for listResources, exposes invalidation, and memoizes the default service instance. |
| src/managedIdentity/ManagedIdentityItem.ts | Parallelizes MSI identity fetch with subscription resource listing. |
| src/commands/registerCommands.ts | Invalidates the new resource-list cache on top-level refresh/sign-in flows. |
| src/cloudConsole/cloudConsole.ts | Parallelizes tenant enumeration across accounts. |
| src/api/DefaultAzureResourceProvider.ts | Parallelizes resource + resource group listing per subscription. |
```ts
registerCommand('azureResourceGroups.refreshTree', () => {
    ext.setClearCacheOnNextLoad();
    invalidateResourceListCache();
    ext.actions.refreshAzureTree();
});
registerCommand('azureWorkspace.refreshTree', () => ext.actions.refreshWorkspaceTree());
registerCommand('azureFocusView.refreshTree', () => {
    ext.setClearCacheOnNextLoad();
    invalidateResourceListCache();
    ext.actions.refreshFocusTree();
});
registerCommand('azureTenantsView.refreshTree', () => {
    ext.setClearCacheOnNextLoad();
    invalidateResourceListCache();
    ext.actions.refreshTenantTree();
```
invalidateResourceListCache() is wired to the title-bar refresh commands, but the more common per-node refresh command (azureResourceGroups.refresh) can still re-render using cached ARM resources for up to the TTL window. Consider also invalidating (globally or per subscription) when handling node refresh so user-initiated refresh reliably forces a refetch.
I'm also skeptical about this change. It seems way too simplistic, and as the comment mentions, there are a bunch of ways to refresh resources outside of the ribbon, which would leave this list cache stale.
```ts
});
} else {
    // batch telemetry into a single event instead of one per subscription
    void callWithTelemetryAndErrorHandling('azureResourceGroups.getSubscriptions', async (telemetryContext: IActionContext) => {
```
This introduces a new telemetry event ID (azureResourceGroups.getSubscriptions) where previously the code emitted one event per subscription with a different callbackId. If any telemetry dashboards depend on the prior event name/property shape, consider a compatibility strategy (e.g., dual-emitting for a while, or keeping the original callbackId).
Suggested change:

```diff
-void callWithTelemetryAndErrorHandling('azureResourceGroups.getSubscriptions', async (telemetryContext: IActionContext) => {
+void callWithTelemetryAndErrorHandling('updateSubscriptionsAndTenants', async (telemetryContext: IActionContext) => {
```
```ts
// Branch data provider items already have telemetry wrapping via
// wrapFunctionsInTelemetry ('branchDataProvider.getTreeItem'), so we skip
// the extra callWithTelemetryAndErrorHandling layer here to avoid
// double-wrapping overhead (~5-10ms per call x hundreds of items).
return await element.getTreeItem();
```
getTreeItem no longer wraps non-branch items in callWithTelemetryAndErrorHandling, which drops the existing getTreeItem telemetry event and changes error-handling settings for items that aren't wrapped by wrapFunctionsInTelemetry (e.g., built-in tree items like GenericItem, SubscriptionItem, etc.). Consider keeping telemetry wrapping for non-BranchDataItemWrapper items and only skipping the extra wrapping for branch-provider items to avoid double-wrapping while preserving the previous behavior for the rest of the tree.
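That split could look something like the sketch below. The `BranchDataItemWrapper` name comes from this codebase, but the minimal types and the `callWithTelemetryAndErrorHandling` stand-in are illustrative stubs, not the real implementations:

```typescript
// Minimal stand-ins for illustration; the real types live in the extension
// and in @microsoft/vscode-azext-utils.
interface TreeItem { label: string }
interface ResourceGroupsItem { getTreeItem(): Promise<TreeItem> }

class BranchDataItemWrapper implements ResourceGroupsItem {
    constructor(private readonly label: string) {}
    // Already wrapped via wrapFunctionsInTelemetry in the real code.
    async getTreeItem(): Promise<TreeItem> { return { label: this.label }; }
}

// Stub of callWithTelemetryAndErrorHandling: runs the callback and records
// the event ID so we can observe which path was taken.
const emittedEvents: string[] = [];
async function callWithTelemetryAndErrorHandling<T>(
    eventId: string,
    callback: () => Promise<T>
): Promise<T | undefined> {
    emittedEvents.push(eventId);
    try { return await callback(); } catch { return undefined; }
}

async function getTreeItem(element: ResourceGroupsItem): Promise<TreeItem | undefined> {
    if (element instanceof BranchDataItemWrapper) {
        // Branch provider items are already wrapped; skip the extra layer.
        return await element.getTreeItem();
    }
    // Keep the original telemetry/error handling for everything else
    // (GenericItem, SubscriptionItem, ...).
    return await callWithTelemetryAndErrorHandling('getTreeItem', () => element.getTreeItem());
}
```

This keeps the `getTreeItem` event and its error-handling settings for built-in items while still removing the redundant layer for branch-provider items.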
```ts
// batch telemetry into a single event instead of one per subscription
void callWithTelemetryAndErrorHandling('azureResourceGroups.getTenantFilteredSubscriptions', async (telemetryContext: IActionContext) => {
    telemetryContext.telemetry.measurements.subscriptionCount = tenantFilteredSubcriptions.length;
    telemetryContext.telemetry.properties.subscriptionIds = tenantFilteredSubcriptions.map(s => s.subscriptionId).join(',');
});
```
This introduces a new telemetry event ID (azureResourceGroups.getTenantFilteredSubscriptions) where previously the code emitted one event per subscription with a different callbackId. If dashboards/queries depend on the prior event name/property shape, this will be a breaking telemetry change; consider preserving compatibility (e.g., emit both old + new telemetry during a transition or keep the old callbackId and add a subscriptionCount measurement).
Suggested change:

```diff
-// batch telemetry into a single event instead of one per subscription
+// batch telemetry into a single event
 void callWithTelemetryAndErrorHandling('azureResourceGroups.getTenantFilteredSubscriptions', async (telemetryContext: IActionContext) => {
     telemetryContext.telemetry.measurements.subscriptionCount = tenantFilteredSubcriptions.length;
     telemetryContext.telemetry.properties.subscriptionIds = tenantFilteredSubcriptions.map(s => s.subscriptionId).join(',');
 });
+// For backward compatibility, also emit one telemetry event per subscription
+for (const subscription of tenantFilteredSubcriptions) {
+    void callWithTelemetryAndErrorHandling('azureResourceGroups.getTenantFilteredSubscriptions', async (telemetryContext: IActionContext) => {
+        telemetryContext.telemetry.properties.subscriptionId = subscription.subscriptionId;
+    });
+}
```
```ts
// batch telemetry into a single event instead of one per subscription
void callWithTelemetryAndErrorHandling('azureResourceGroups.getTenantFilteredSubscriptions', async (telemetryContext: IActionContext) => {
    telemetryContext.telemetry.measurements.subscriptionCount = tenantFilteredSubcriptions.length;
    telemetryContext.telemetry.properties.subscriptionIds = tenantFilteredSubcriptions.map(s => s.subscriptionId).join(',');
```
telemetryContext.telemetry.properties.subscriptionIds concatenates all subscription IDs into a single string. Unlike getMetricsForTelemetry() (used later in this file), there’s no truncation/incomplete flag here, so accounts with many subscriptions can easily exceed telemetry property size limits and cause the event to be dropped. Consider truncating the list (and/or sending only a count + a bounded sample) and adding an isIncomplete flag similar to subscriptionidlistisincomplete.
Suggested change:

```diff
-telemetryContext.telemetry.properties.subscriptionIds = tenantFilteredSubcriptions.map(s => s.subscriptionId).join(',');
+// Limit the number of subscription IDs sent in telemetry to avoid exceeding
+// telemetry property size limits. Also indicate if the list was truncated.
+const allSubscriptionIds: string[] = tenantFilteredSubcriptions.map(s => s.subscriptionId);
+const maxSubscriptionIdsForTelemetry = 50;
+const sampledSubscriptionIds = allSubscriptionIds.slice(0, maxSubscriptionIdsForTelemetry);
+telemetryContext.telemetry.properties.subscriptionIds = sampledSubscriptionIds.join(',');
+telemetryContext.telemetry.properties.subscriptionidlistisincomplete = String(allSubscriptionIds.length > maxSubscriptionIdsForTelemetry);
```
```ts
function getCachedResources(subscriptionId: string): GenericResource[] | undefined {
    const entry = resourceListCache.get(subscriptionId);
    if (entry && Date.now() < entry.expiresAt) {
        return entry.data;
    }
    // Expired or missing — clean up
    resourceListCache.delete(subscriptionId);
    return undefined;
}

function setCachedResources(subscriptionId: string, data: GenericResource[]): void {
```
The TTL cache never proactively purges expired entries; if a user loads many subscriptions once, the map can retain lots of expired resource arrays until explicitly invalidated, which can lead to unbounded memory growth over a long session. Consider adding a lightweight periodic sweep, a max-size/LRU policy, or purging a few expired entries on every setCachedResources/getCachedResources call.
Suggested change:

```diff
+/**
+ * Opportunistically purge a limited number of expired entries from the cache.
+ * This keeps memory usage bounded over long sessions without a background timer.
+ */
+function purgeExpiredCacheEntries(maxToPurge: number): void {
+    if (maxToPurge <= 0 || resourceListCache.size === 0) {
+        return;
+    }
+    const now = Date.now();
+    let purged = 0;
+    for (const [subscriptionId, entry] of resourceListCache) {
+        if (entry.expiresAt <= now) {
+            resourceListCache.delete(subscriptionId);
+            purged++;
+            if (purged >= maxToPurge) {
+                break;
+            }
+        }
+    }
+}
+
 function getCachedResources(subscriptionId: string): GenericResource[] | undefined {
+    // Opportunistically clean up a few expired entries on every read.
+    purgeExpiredCacheEntries(10);
     const entry = resourceListCache.get(subscriptionId);
     if (entry && Date.now() < entry.expiresAt) {
         return entry.data;
     }
-    // Expired or missing — clean up
+    // Expired or missing — clean up this specific key.
     resourceListCache.delete(subscriptionId);
     return undefined;
 }

 function setCachedResources(subscriptionId: string, data: GenericResource[]): void {
+    // Opportunistically clean up a few expired entries on every write.
+    purgeExpiredCacheEntries(10);
```
```diff
 async listResources(context: IActionContext, subscription: AzureSubscription): Promise<GenericResource[]> {
     // Check TTL cache first to avoid duplicate ARM calls across callers
     const cached = getCachedResources(subscription.subscriptionId);
     if (cached) {
         return cached;
     }

     const client = await createClient(context, subscription);
-    return uiUtils.listAllIterator(client.resources.list());
+    const result = await uiUtils.listAllIterator(client.resources.list());
+    setCachedResources(subscription.subscriptionId, result);
+    return result;
```
This cache only helps after the first request completes; concurrent listResources calls for the same subscription will still issue duplicate ARM requests because there's no in-flight deduplication. If the goal is to avoid redundant calls across multiple callers, consider caching a promise (or a {promise,data} entry) so simultaneous requests share the same underlying ARM call.
```ts
telemetryContext.telemetry.measurements.subscriptionCount = tenantFilteredSubcriptions.length;
telemetryContext.telemetry.properties.subscriptionIds = tenantFilteredSubcriptions.map(s => s.subscriptionId).join(',');
```
This will potentially impact Kusto queries.
Additionally--how badly do we need this? We have telemetry already for subscription info (updateSubscriptionsAndTenants)
Right, I think somebody asked us for subscriptionIds at some point, but I'm not really sure how important it is that we report it exactly like this. The difference in perf is probably not huge, especially for most users, so I'm not super gung-ho about introducing this change.
```ts
telemetryContext.telemetry.measurements.subscriptionCount = subscriptions.length;
telemetryContext.telemetry.properties.subscriptionIds = subscriptions.map(s => s.subscriptionId).join(',');
```
Additionally--how badly do we need this? We have telemetry already for subscription info (updateSubscriptionsAndTenants)
Sorry, this is a little bit of AI slop in here. I'll probably keep the parallel promises, but I mostly just wanted some visual aids for when we have to present our topics on Monday. I'm changing this to a draft PR.
I did an experiment: I implemented a telemetry profiler to see which commands were taking a long time to run, then used it to create prompts to get Copilot to improve perf.

Most of it involved parallelizing awaited async code, but the double telemetry wrapping was an interesting one. I'm not 100% sure it's correct that we don't need to wrap it again, but it made sense to me.

It also made a rudimentary cache, since Managed Identity nodes have to leverage the getResource call as well.

Also, I wasn't sure about batching all of the tenant subscription calls into one telemetry event. It could possibly break someone's dashboard.
Here's some time differences between fixes:
