Create Usage Snapshots by joyvuu-dave · Pull Request #4858 · cloudfoundry/cloud_controller_ng

joyvuu-dave · 2026-02-18T19:47:13Z

Usage Snapshots

Summary

This PR introduces new V3 API endpoints for capturing point-in-time usage data as a non-destructive alternative to destructively_purge_all_and_reseed. Together with #4646, it is meant to address #4182.

New Endpoints:

POST /v3/app_usage/snapshots - Capture all running processes
POST /v3/service_usage/snapshots - Capture all service instances
GET endpoints for listing snapshots and retrieving chunk details

Key Benefits:

Existing billing consumers are unaffected (event stream preserved)
Multiple consumers can onboard independently
Scales to very large datasets via chunked storage
Follows established V3 async job pattern

Problem Statement: Why This Feature Exists

The Current Problem with `destructively_purge_all_and_reseed`

When a new billing consumer wants to start tracking usage events, the START/CREATE events for currently running apps have often been pruned (31-day retention). The current "solution" is destructive:

Admin calls POST /v3/app_usage_events/actions/destructively_purge_all_and_reseed
System TRUNCATES the entire events table
System creates synthetic events for all currently running apps
New billing system starts consuming from event ID 1

This breaks existing consumers:

Billing System A has been tracking since day 1 with perfect accuracy
Billing System B joins and needs a baseline
Admin runs destructively_purge_all_and_reseed
Billing System A's event stream is now corrupted - all checkpoint IDs are invalid

The Solution

Usage snapshots provide a non-destructive alternative:

Captures a point-in-time baseline of all running apps/services
Ties the snapshot to the most recent usage event at the time of generation (stored as checkpoint_event_guid)
Preserves the event stream for all existing consumers
Enables multiple independent consumers without coordination

Data Model

App Usage Snapshot

┌─────────────────────────────┐
│ AppUsageSnapshot (parent)   │
├─────────────────────────────┤
│ guid                        │
│ checkpoint_event_guid       │
│ checkpoint_event_created_at │
│ created_at, completed_at    │
│ instance_count (total)      │
│ app_count (total)           │
│ organization_count          │
│ space_count                 │
│ chunk_count                 │
└─────────────────────────────┘
            │ 1:N
            ▼
┌───────────────────────────────────────┐
│ AppUsageSnapshotChunk                 │
├───────────────────────────────────────┤
│ organization_guid/name                │
│ space_guid/name                       │
│ chunk_index (0, 1, 2...)              │
│ processes (JSON array)                │
│   - app_guid, app_name                │
│   - process_guid, process_type        │
│   - instance_count                    │
│   - memory_in_mb_per_instance         │
│   - buildpack_guid, buildpack_name    │
└───────────────────────────────────────┘
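As a rough illustration of how the parent totals relate to the chunk contents, the sketch below recomputes a summary from a list of chunk dictionaries. It assumes the field names shown in the diagram, that instance_count is the sum over all processes, and that app_count counts distinct app GUIDs; the PR description does not spell out these rules, and `summarize_chunks` is a hypothetical helper, not part of the PR.

```python
from typing import Any, Dict, List


def summarize_chunks(chunks: List[Dict[str, Any]]) -> Dict[str, int]:
    """Recompute parent-level totals from chunk records (assumed semantics)."""
    instance_count = 0
    app_guids = set()
    organization_guids = set()
    space_guids = set()

    for chunk in chunks:
        organization_guids.add(chunk["organization_guid"])
        space_guids.add(chunk["space_guid"])
        for process in chunk["processes"]:
            instance_count += process["instance_count"]
            # Assumption: an app with several process types counts once.
            app_guids.add(process["app_guid"])

    return {
        "instance_count": instance_count,
        "app_count": len(app_guids),
        "organization_count": len(organization_guids),
        "space_count": len(space_guids),
        "chunk_count": len(chunks),
    }
```

A consumer could use a check like this to sanity-test downloaded chunks against the summary returned with the snapshot.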

Service Usage Snapshot

┌─────────────────────────────────┐
│ ServiceUsageSnapshot (parent)   │
├─────────────────────────────────┤
│ guid                            │
│ checkpoint_event_guid           │
│ checkpoint_event_created_at     │
│ created_at, completed_at        │
│ service_instance_count (total)  │
│ organization_count              │
│ space_count                     │
│ chunk_count                     │
└─────────────────────────────────┘
            │ 1:N
            ▼
┌───────────────────────────────────────────┐
│ ServiceUsageSnapshotChunk                 │
├───────────────────────────────────────────┤
│ organization_guid/name                    │
│ space_guid/name                           │
│ chunk_index (0, 1, 2...)                  │
│ service_instances (JSON array)            │
│   - service_instance_guid/name/type       │
│   - service_plan_guid/name                │
│   - service_offering_guid/name            │
│   - service_broker_guid/name              │
└───────────────────────────────────────────┘

Chunking Strategy

Each chunk contains up to 50 items for a single space:

Space with 25 items → 1 chunk
Space with 125 items → 3 chunks (50 + 50 + 25)

This ensures bounded memory during generation and bounded API response sizes.
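The chunk arithmetic above can be sketched in a few lines. This is only an illustration of the stated rule (at most 50 items per chunk, per space); the function names are hypothetical and the PR's actual implementation is not shown here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

CHUNK_SIZE = 50  # per-chunk item limit stated in the PR description


def chunk_count(item_count: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed for one space (ceiling division)."""
    return -(-item_count // chunk_size)


def split_into_chunks(
    items: Sequence[T], chunk_size: int = CHUNK_SIZE
) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size items for one space."""
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])
```

For a space with 125 items, `split_into_chunks` yields slices of 50, 50, and 25 items, matching the example above.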

Consumer Onboarding Workflow

Step 1: Request a Snapshot

# App usage
curl "https://api.example.org/v3/app_usage/snapshots" -X POST -H "Authorization: bearer [token]"

# Service usage
curl "https://api.example.org/v3/service_usage/snapshots" -X POST -H "Authorization: bearer [token]"

Response: 202 Accepted with a Location header pointing to /v3/jobs/{guid}

Step 2: Poll for Job Completion

curl "https://api.example.org/v3/jobs/{job_guid}" -H "Authorization: bearer [token]"
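A consumer would typically repeat this request until the job finishes. The sketch below is one possible polling loop. It assumes the job resource exposes a `state` field that reaches `COMPLETE` or `FAILED`, as other V3 jobs do, and it takes the HTTP call as an injected `fetch_job` callable (a hypothetical parameter) so that no particular HTTP client is implied.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll a job until it completes, fails, or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job()  # e.g. GET /v3/jobs/{job_guid}, parsed as JSON
        state = job.get("state")
        if state == "COMPLETE":
            return job
        if state == "FAILED":
            raise RuntimeError(f"snapshot job failed: {job.get('errors')}")
        if clock() >= deadline:
            raise TimeoutError("snapshot job did not finish in time")
        sleep(interval_seconds)
```

The default interval and timeout are arbitrary; given the generation times quoted under Performance Characteristics, a short interval is likely sufficient.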

Step 3: Retrieve the Snapshot

curl "https://api.example.org/v3/app_usage/snapshots/{snapshot_guid}" -H "Authorization: bearer [token]"

Response:

{
  "guid": "snapshot-guid-123",
  "created_at": "2026-01-14T10:00:00Z",
  "completed_at": "2026-01-14T10:00:15Z",
  "checkpoint_event_guid": "abc123de-f456-7890-abcd-ef1234567890",
  "checkpoint_event_created_at": "2026-01-14T09:59:58Z",
  "summary": {
    "instance_count": 15234,
    "app_count": 2500,
    "organization_count": 42,
    "space_count": 156,
    "chunk_count": 200
  },
  "links": {
    "self": { "href": "/v3/app_usage/snapshots/snapshot-guid-123" },
    "checkpoint_event": { "href": "/v3/app_usage_events/abc123de-f456-7890-abcd-ef1234567890" },
    "chunks": { "href": "/v3/app_usage/snapshots/snapshot-guid-123/chunks" }
  }
}

Step 4: Retrieve Chunks (for per-item details)

curl "https://api.example.org/v3/app_usage/snapshots/{guid}/chunks" -H "Authorization: bearer [token]"
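The PR description does not show the chunk list response. Assuming it follows the usual V3 list shape (a `resources` array plus a `pagination.next.href` link, which is null on the last page), a consumer could walk all pages as sketched below. `fetch_json` is a hypothetical callable that performs the authenticated GET and returns parsed JSON.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_chunks(
    first_url: str,
    fetch_json: Callable[[str], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield every chunk resource, following pagination links page by page."""
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("resources", [])
        # Assumed V3 pagination shape: {"pagination": {"next": {"href": ...}}}
        next_link = (page.get("pagination") or {}).get("next")
        url = next_link["href"] if next_link else None
```

Because the function is a generator, only one page of chunks needs to be held in memory at a time.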

Step 5: Start Processing Events from Checkpoint

curl "https://api.example.org/v3/app_usage_events?after_guid=abc123de-f456-7890-abcd-ef1234567890" \
  -H "Authorization: bearer [token]"

The after_guid filter returns all events created after the checkpoint event, ensuring no gap or overlap between the snapshot baseline and the event stream. The billing system now has a complete picture: a baseline of all running processes plus all subsequent events.
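To connect the snapshot to the event stream, a consumer only needs the checkpoint_event_guid from the Step 3 response. The helper below is a minimal sketch that builds the Step 5 URL from a parsed snapshot; the function name and the `api_base` parameter are illustrative, and it does not cover the case where no checkpoint event exists, since the PR description does not say how that is represented.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def events_url_after_checkpoint(api_base: str, snapshot: Dict[str, Any]) -> str:
    """Build the app usage events URL that resumes after the snapshot checkpoint."""
    checkpoint_guid = snapshot["checkpoint_event_guid"]
    query = urlencode({"after_guid": checkpoint_guid})
    return f"{api_base.rstrip('/')}/v3/app_usage_events?{query}"
```

Storing the checkpoint GUID alongside the imported baseline lets the consumer restart event processing from the same point if the import has to be repeated.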

API Reference

App Usage Snapshot Endpoints

Method	Endpoint	Description
`POST`	`/v3/app_usage/snapshots`	Create a new snapshot (async)
`GET`	`/v3/app_usage/snapshots`	List all snapshots
`GET`	`/v3/app_usage/snapshots/:guid`	Get a specific snapshot
`GET`	`/v3/app_usage/snapshots/:guid/chunks`	Get chunk details

Service Usage Snapshot Endpoints

Method	Endpoint	Description
`POST`	`/v3/service_usage/snapshots`	Create a new snapshot (async)
`GET`	`/v3/service_usage/snapshots`	List all snapshots
`GET`	`/v3/service_usage/snapshots/:guid`	Get a specific snapshot
`GET`	`/v3/service_usage/snapshots/:guid/chunks`	Get chunk details

Required Permissions

Create snapshot: Admin (global write access)
Read snapshot/chunks: Admin, Admin Read-Only, Global Auditor

Error Responses

Code	When	Response
409	Snapshot already in progress	`CF-AppUsageSnapshotGenerationInProgress`
422	Chunks requested for incomplete snapshot	`Snapshot is still processing`
404	Snapshot not found	`App usage snapshot not found`
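One way a consumer might react to these status codes is sketched below. The mapping is a suggested client-side policy, not behavior defined by the PR, and `NextStep` and `next_step_for` are hypothetical names.

```python
from enum import Enum


class NextStep(Enum):
    CONTINUE = "continue"                    # request accepted or succeeded
    REUSE_IN_PROGRESS = "reuse_in_progress"  # another snapshot is being generated
    RETRY_LATER = "retry_later"              # snapshot exists but is not complete
    ABANDON = "abandon"                      # snapshot is gone (or never existed)


def next_step_for(status_code: int) -> NextStep:
    """Map an HTTP status from the snapshot endpoints to a consumer action."""
    if status_code in (200, 202):
        return NextStep.CONTINUE
    if status_code == 409:
        # Snapshot already in progress: wait for it rather than creating another.
        return NextStep.REUSE_IN_PROGRESS
    if status_code == 422:
        # Chunks requested before the snapshot finished processing.
        return NextStep.RETRY_LATER
    if status_code == 404:
        return NextStep.ABANDON
    raise ValueError(f"unexpected status code: {status_code}")
```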

Automatic Cleanup

Daily cleanup jobs run automatically:

App usage snapshots: 4:00 AM UTC
Service usage snapshots: 4:30 AM UTC

Cleanup removes:

Completed snapshots older than 31 days (configurable via cutoff_age_in_days)
Stale in-progress snapshots (stuck for more than 1 hour)
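The two cleanup rules can be expressed as a single predicate. The sketch below assumes that a snapshot's age is measured from created_at and that an in-progress snapshot is one whose completed_at is unset; neither detail is stated explicitly in the PR description, and `should_remove` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_remove(
    created_at: datetime,
    completed_at: Optional[datetime],
    now: datetime,
    cutoff_age_in_days: int = 31,
    stale_after: timedelta = timedelta(hours=1),
) -> bool:
    """Return True if the daily cleanup job would delete this snapshot."""
    age = now - created_at
    if completed_at is not None:
        # Completed snapshot: removed once older than the retention cutoff.
        return age > timedelta(days=cutoff_age_in_days)
    # In-progress snapshot: removed once it has been stuck past the stale timeout.
    return age > stale_after
```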

Observability

Prometheus Metrics

App Usage:

cc_app_usage_snapshot_generation_duration_seconds (histogram)
cc_app_usage_snapshot_generation_failures_total (counter)

Service Usage:

cc_service_usage_snapshot_generation_duration_seconds (histogram)
cc_service_usage_snapshot_generation_failures_total (counter)

Log Sources

cc.app_usage_snapshot_repository
cc.service_usage_snapshot_repository

Performance Characteristics

Expected generation times:

Foundation Size	Instances	Estimated Time
Small	1,000	< 0.2 seconds
Medium	10,000	~1-1.5 seconds
Large	50,000	~5-7 seconds
Very Large	100,000	~10-15 seconds

Scale characteristics:

Memory usage: Bounded (streaming + chunking)
Database load: Non-blocking (keyset pagination)
API responses: Bounded (≤50 items per chunk)

Design Decisions

Why Fixed-Size Chunking?

Each chunk holds up to 50 items for one space. We considered adaptive chunking but rejected it because:

Same storage requirements
Simpler to understand and debug
Consumer code is simpler (one format)

Why 1-Hour Stale Timeout?

Normal generation completes in seconds
Aligns with the HTTP timeout for the synchronous destructively_purge_all_and_reseed call
Quickly unblocks new requests if something fails

Atomic Generation

Snapshot generation is all-or-nothing. If interrupted, it rolls back completely. No partial snapshots can exist.
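The all-or-nothing behavior can be illustrated with a single database transaction. The sketch below uses Python's sqlite3 module and a made-up two-table schema purely for demonstration; it is not the PR's implementation, and the `fail_after` parameter exists only to simulate an interruption.

```python
import sqlite3
from typing import List, Optional


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical snapshot schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshots (guid TEXT PRIMARY KEY, chunk_count INTEGER)")
    conn.execute(
        "CREATE TABLE chunks (snapshot_guid TEXT, chunk_index INTEGER, payload TEXT)"
    )
    conn.commit()
    return conn


def generate_snapshot(
    conn: sqlite3.Connection,
    guid: str,
    chunks: List[str],
    fail_after: Optional[int] = None,
) -> None:
    """Write the parent row and all chunks in one transaction."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "INSERT INTO snapshots (guid, chunk_count) VALUES (?, ?)",
            (guid, len(chunks)),
        )
        for index, payload in enumerate(chunks):
            if fail_after is not None and index == fail_after:
                raise RuntimeError("simulated interruption")
            conn.execute(
                "INSERT INTO chunks (snapshot_guid, chunk_index, payload) "
                "VALUES (?, ?, ?)",
                (guid, index, payload),
            )
```

If the simulated interruption fires, neither the parent row nor any chunk rows remain, which mirrors the guarantee that no partial snapshot can exist.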

Introduces new V3 API endpoints for capturing point-in-time usage data:

App Usage Snapshots (/v3/app_usage/snapshots):
- Creates snapshots of all running processes across the platform
- Captures instance counts, memory allocation, and buildpack info
- Data organized by organization and space in paginated chunks

Service Usage Snapshots (/v3/service_usage/snapshots):
- Creates snapshots of all service instances across the platform
- Captures service plan, offering, and broker information
- Supports both managed and user-provided service instances

Both snapshot types:
- Are admin-only operations that run asynchronously via pollable jobs
- Include a checkpoint reference (GUID) to the most recent usage event
- Support automatic cleanup of old and stale snapshots via daily jobs
- Expose Prometheus metrics for generation duration and failure tracking

joyvuu-dave wants to merge 1 commit into cloudfoundry:main from joyvuu-dave:usage_snapshots
