Introduces new V3 API endpoints for capturing point-in-time usage data:

App Usage Snapshots (/v3/app_usage/snapshots):
- Creates snapshots of all running processes across the platform
- Captures instance counts, memory allocation, and buildpack info
- Data organized by organization and space in paginated chunks

Service Usage Snapshots (/v3/service_usage/snapshots):
- Creates snapshots of all service instances across the platform
- Captures service plan, offering, and broker information
- Supports both managed and user-provided service instances

Both snapshot types:
- Are admin-only operations that run asynchronously via pollable jobs
- Include a checkpoint reference (GUID) to the most recent usage event
- Support automatic cleanup of old and stale snapshots via daily jobs
- Expose Prometheus metrics for generation duration and failure tracking
Usage Snapshots
Summary
This PR introduces new V3 API endpoints for capturing point-in-time usage data as a non-destructive alternative to destructively_purge_all_and_reseed. Along with #4646 it is meant to address #4182.
New Endpoints:
- POST /v3/app_usage/snapshots - Capture all running processes
- POST /v3/service_usage/snapshots - Capture all service instances
- GET endpoints for listing snapshots and retrieving chunk details
Key Benefits:
Problem Statement: Why This Feature Exists
The Current Problem with destructively_purge_all_and_reseed
When a new billing consumer wants to start tracking usage events, the START/CREATE events for currently running apps have often been pruned (31-day retention). The current "solution" is destructive:
POST /v3/app_usage_events/actions/destructively_purge_all_and_reseed
This breaks existing consumers:
purge_and_reseed
The Solution
Usage snapshots provide a non-destructive alternative:
checkpoint_event_guid)
Data Model
App Usage Snapshot
Service Usage Snapshot
Chunking Strategy
Each chunk contains up to 50 items for a single space:
This ensures bounded memory during generation and bounded API response sizes.
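The chunk boundary rule can be illustrated with a minimal sketch. It assumes rows arrive already ordered by space (for example from a query sorted by space GUID); the row shape and the names `chunk_by_space`, `space_guid`, and `chunk_index` are hypothetical and not taken from this PR. Only the limit of 50 items per chunk and the one-space-per-chunk rule come from the description above.

```python
from itertools import groupby, islice
from typing import Dict, Iterable, Iterator

CHUNK_SIZE = 50  # per-chunk item limit stated in this PR


def chunk_by_space(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield chunks of at most CHUNK_SIZE rows, never mixing spaces.

    Assumes `rows` is ordered by "space_guid" (hypothetical field name),
    so only one chunk's worth of rows is held in memory at a time.
    """
    for space_guid, group in groupby(rows, key=lambda row: row["space_guid"]):
        index = 0
        while True:
            # Pull the next (up to) CHUNK_SIZE rows for this space.
            chunk = list(islice(group, CHUNK_SIZE))
            if not chunk:
                break
            yield {"space_guid": space_guid, "chunk_index": index, "items": chunk}
            index += 1


# Example: 120 processes in one space and 10 in another.
rows = [{"space_guid": "space-a", "process_guid": f"a{i}"} for i in range(120)]
rows += [{"space_guid": "space-b", "process_guid": f"b{i}"} for i in range(10)]
chunks = list(chunk_by_space(rows))
```

A space with more than 50 items simply spans several consecutive chunks, and a small space still gets a chunk of its own.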
Consumer Onboarding Workflow
Step 1: Request a Snapshot
Response: 202 Accepted with Location: /v3/jobs/{guid}
Step 2: Poll for Job Completion
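A minimal polling sketch for this step, assuming the job resource behind the `Location` header exposes a `state` field with the usual V3 job values (`PROCESSING`, `COMPLETE`, `FAILED`). The `fetch_job` callable is a hypothetical stand-in for an authenticated GET on that URL; the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"COMPLETE", "FAILED"}


def wait_for_job(
    fetch_job: Callable[[], Dict],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll until the job reaches a terminal state or the timeout elapses.

    `fetch_job` is a hypothetical callable that returns the parsed job JSON.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError("snapshot job did not finish before the timeout")
        sleep(interval_seconds)
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library; callers should still check whether the returned state is `COMPLETE` or `FAILED` before fetching the snapshot.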
Step 3: Retrieve the Snapshot
Response:
{
  "guid": "snapshot-guid-123",
  "created_at": "2026-01-14T10:00:00Z",
  "completed_at": "2026-01-14T10:00:15Z",
  "checkpoint_event_guid": "abc123de-f456-7890-abcd-ef1234567890",
  "checkpoint_event_created_at": "2026-01-14T09:59:58Z",
  "summary": {
    "instance_count": 15234,
    "app_count": 2500,
    "organization_count": 42,
    "space_count": 156,
    "chunk_count": 200
  },
  "links": {
    "self": { "href": "/v3/app_usage/snapshots/snapshot-guid-123" },
    "checkpoint_event": { "href": "/v3/app_usage_events/abc123de-f456-7890-abcd-ef1234567890" },
    "chunks": { "href": "/v3/app_usage/snapshots/snapshot-guid-123/chunks" }
  }
}
Step 4: Retrieve Chunks (for per-item details)
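A sketch of walking the chunk pages, assuming the chunks endpoint follows the standard V3 list shape (`resources` plus `pagination.next.href`). The chunk payload itself is not shown in this description, so the sketch treats each resource as an opaque dict; `fetch_page` is a hypothetical callable wrapping an authenticated GET.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_chunks(fetch_page: Callable[[str], Dict], first_url: str) -> Iterator[Dict]:
    """Yield every chunk resource, following pagination links until exhausted.

    Assumes a V3-style list response: {"pagination": {"next": {...}}, "resources": [...]}.
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("resources", [])
        # "next" is null (None) on the last page of a V3 list response.
        next_link = (page.get("pagination") or {}).get("next")
        url = next_link["href"] if next_link else None
```

The starting URL would be the `links.chunks.href` value from the snapshot response in Step 3.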
Step 5: Start Processing Events from Checkpoint
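For this step, a minimal sketch of deriving the events request from a snapshot, using the `checkpoint_event_guid` field from the Step 3 response and the `after_guid` filter on `/v3/app_usage_events`. The helper name and the `per_page` default are illustrative assumptions.

```python
from typing import Dict
from urllib.parse import urlencode


def events_url_after_checkpoint(snapshot: Dict, per_page: int = 100) -> str:
    """Build the usage-events path that resumes right after the snapshot checkpoint."""
    query = urlencode(
        {
            "after_guid": snapshot["checkpoint_event_guid"],
            "per_page": per_page,  # illustrative page size, not mandated by this PR
        }
    )
    return f"/v3/app_usage_events?{query}"


snapshot = {
    "guid": "snapshot-guid-123",
    "checkpoint_event_guid": "abc123de-f456-7890-abcd-ef1234567890",
}
url = events_url_after_checkpoint(snapshot)
```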
The after_guid filter returns all events created after the checkpoint event, ensuring no gap or overlap between the snapshot baseline and the event stream. The billing system now has a complete picture: a baseline of all running processes plus all subsequent events.
API Reference
App Usage Snapshot Endpoints
- POST /v3/app_usage/snapshots
- GET /v3/app_usage/snapshots
- GET /v3/app_usage/snapshots/:guid
- GET /v3/app_usage/snapshots/:guid/chunks
Service Usage Snapshot Endpoints
- POST /v3/service_usage/snapshots
- GET /v3/service_usage/snapshots
- GET /v3/service_usage/snapshots/:guid
- GET /v3/service_usage/snapshots/:guid/chunks
Required Permissions
Error Responses
- CF-AppUsageSnapshotGenerationInProgress: Snapshot is still processing
- App usage snapshot not found
Automatic Cleanup
Daily cleanup jobs run automatically:
Cleanup removes:
cutoff_age_in_days)
Observability
Prometheus Metrics
App Usage:
- cc_app_usage_snapshot_generation_duration_seconds (histogram)
- cc_app_usage_snapshot_generation_failures_total (counter)
Service Usage:
- cc_service_usage_snapshot_generation_duration_seconds (histogram)
- cc_service_usage_snapshot_generation_failures_total (counter)
Log Sources
- cc.app_usage_snapshot_repository
- cc.service_usage_snapshot_repository
Performance Characteristics
Expected generation times:
Scale characteristics:
Design Decisions
Why Fixed-Size Chunking?
Each chunk = up to 50 items for one space. We considered adaptive chunking but rejected it because:
Why 1-Hour Stale Timeout?
purge_and_reseed
Atomic Generation
Snapshot generation is all-or-nothing. If interrupted, it rolls back completely. No partial snapshots can exist.
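The all-or-nothing behavior can be illustrated with a single database transaction. This is only a sketch of the idea, not the implementation in this PR: it uses Python's `sqlite3` as a stand-in for the Cloud Controller database, and the table and column names are hypothetical.

```python
import sqlite3
from typing import Dict, List


def generate_snapshot(conn: sqlite3.Connection, snapshot_guid: str, rows: List[Dict]) -> None:
    """Write the snapshot row and all chunk rows in one transaction.

    Using the connection as a context manager commits on success and
    rolls back everything if any statement (or row lookup) raises.
    """
    with conn:
        conn.execute("INSERT INTO snapshots (guid) VALUES (?)", (snapshot_guid,))
        for row in rows:
            conn.execute(
                "INSERT INTO chunks (snapshot_guid, space_guid) VALUES (?, ?)",
                (snapshot_guid, row["space_guid"]),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (guid TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE chunks (snapshot_guid TEXT, space_guid TEXT)")
conn.commit()

# A malformed row (missing "space_guid") interrupts generation midway.
try:
    generate_snapshot(conn, "snap-1", [{"space_guid": "space-a"}, {}])
except KeyError:
    pass  # the transaction was rolled back; nothing from "snap-1" remains
```

After the failed call, neither the snapshot row nor the first chunk row is present, which matches the "no partial snapshots" guarantee described above.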