Create Usage Snapshots #4858

Draft
joyvuu-dave wants to merge 1 commit into cloudfoundry:main from joyvuu-dave:usage_snapshots

@joyvuu-dave joyvuu-dave commented Feb 18, 2026

Usage Snapshots

Summary

This PR introduces new V3 API endpoints for capturing point-in-time usage data as a non-destructive alternative to destructively_purge_all_and_reseed. Together with #4646, it is meant to address #4182.

New Endpoints:

  • POST /v3/app_usage/snapshots - Capture all running processes
  • POST /v3/service_usage/snapshots - Capture all service instances
  • GET endpoints for listing snapshots and retrieving chunk details

Key Benefits:

  • Existing billing consumers are unaffected (event stream preserved)
  • Multiple consumers can onboard independently
  • Scales to very large datasets via chunked storage
  • Follows established V3 async job pattern

Problem Statement: Why This Feature Exists

The Current Problem with destructively_purge_all_and_reseed

When a new billing consumer wants to start tracking usage events, the START/CREATE events for currently running apps have often been pruned (31-day retention). The current "solution" is destructive:

  1. Admin calls POST /v3/app_usage_events/actions/destructively_purge_all_and_reseed
  2. System TRUNCATES the entire events table
  3. System creates synthetic events for all currently running apps
  4. New billing system starts consuming from event ID 1

This breaks existing consumers:

  • Billing System A has been tracking since day 1 with perfect accuracy
  • Billing System B joins and needs a baseline
  • Admin runs destructively_purge_all_and_reseed
  • Billing System A's event stream is now corrupted - all checkpoint IDs are invalid

The Solution

Usage snapshots provide a non-destructive alternative. Each snapshot:

  1. Captures a point-in-time baseline of all running apps/services
  2. Ties the snapshot to the most recent usage event at the time of generation (stored as checkpoint_event_guid)
  3. Preserves the event stream for all existing consumers
  4. Enables multiple independent consumers without coordination

Data Model

App Usage Snapshot

┌─────────────────────────────┐
│ AppUsageSnapshot (parent)   │
├─────────────────────────────┤
│ guid                        │
│ checkpoint_event_guid       │
│ checkpoint_event_created_at │
│ created_at, completed_at    │
│ instance_count (total)      │
│ app_count (total)           │
│ organization_count          │
│ space_count                 │
│ chunk_count                 │
└─────────────────────────────┘
            │ 1:N
            ▼
┌───────────────────────────────────────┐
│ AppUsageSnapshotChunk                 │
├───────────────────────────────────────┤
│ organization_guid/name                │
│ space_guid/name                       │
│ chunk_index (0, 1, 2...)              │
│ processes (JSON array)                │
│   - app_guid, app_name                │
│   - process_guid, process_type        │
│   - instance_count                    │
│   - memory_in_mb_per_instance         │
│   - buildpack_guid, buildpack_name    │
└───────────────────────────────────────┘

Service Usage Snapshot

┌─────────────────────────────────┐
│ ServiceUsageSnapshot (parent)   │
├─────────────────────────────────┤
│ guid                            │
│ checkpoint_event_guid           │
│ checkpoint_event_created_at     │
│ created_at, completed_at        │
│ service_instance_count (total)  │
│ organization_count              │
│ space_count                     │
│ chunk_count                     │
└─────────────────────────────────┘
            │ 1:N
            ▼
┌───────────────────────────────────────────┐
│ ServiceUsageSnapshotChunk                 │
├───────────────────────────────────────────┤
│ organization_guid/name                    │
│ space_guid/name                           │
│ chunk_index (0, 1, 2...)                  │
│ service_instances (JSON array)            │
│   - service_instance_guid/name/type       │
│   - service_plan_guid/name                │
│   - service_offering_guid/name            │
│   - service_broker_guid/name              │
└───────────────────────────────────────────┘

Chunking Strategy

Each chunk contains up to 50 items for a single space:

  • Space with 25 items → 1 chunk
  • Space with 125 items → 3 chunks (50 + 50 + 25)

This ensures bounded memory during generation and bounded API response sizes.
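
The chunking arithmetic above can be sketched in a few lines of Python. This is an illustration only, not the Cloud Controller implementation; the function names are hypothetical and the only value taken from this PR is the 50-item limit.

```python
from math import ceil

CHUNK_SIZE = 50  # maximum items per chunk, as stated above


def chunk_space_items(items, chunk_size=CHUNK_SIZE):
    """Split one space's items into consecutive chunks of at most chunk_size."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def expected_chunk_count(item_count, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold item_count items."""
    return ceil(item_count / chunk_size)


# A space with 125 items splits into chunks of 50, 50, and 25.
sizes = [len(chunk) for chunk in chunk_space_items(list(range(125)))]
print(sizes)  # [50, 50, 25]
```

Because chunks never span spaces, the total chunk count for a snapshot is the sum of the per-space counts.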


Consumer Onboarding Workflow

Step 1: Request a Snapshot

# App usage
curl "https://api.example.org/v3/app_usage/snapshots" -X POST -H "Authorization: bearer [token]"

# Service usage
curl "https://api.example.org/v3/service_usage/snapshots" -X POST -H "Authorization: bearer [token]"

Response: 202 Accepted with Location: /v3/jobs/{guid}

Step 2: Poll for Job Completion

curl "https://api.example.org/v3/jobs/{job_guid}" -H "Authorization: bearer [token]"
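
A consumer would typically wrap Step 2 in a polling loop. The sketch below assumes the job body carries a "state" field with the values PROCESSING, COMPLETE, and FAILED, which this PR description does not spell out. fetch_job is a hypothetical callable that performs the GET request above and returns the decoded JSON body.

```python
import time


def wait_for_job(fetch_job, interval_seconds=2.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return the job body.

    fetch_job is a hypothetical callable returning the job JSON as a dict.
    The state names checked here are assumptions about the V3 jobs resource.
    """
    waited = 0.0
    while True:
        job = fetch_job()
        state = job.get("state")
        if state == "COMPLETE":
            return job
        if state == "FAILED":
            raise RuntimeError(f"snapshot job failed: {job.get('errors')}")
        if waited >= timeout_seconds:
            raise TimeoutError("snapshot job did not finish in time")
        sleep(interval_seconds)
        waited += interval_seconds


# Stubbed responses stand in for real HTTP calls.
responses = iter([{"state": "PROCESSING"}, {"state": "PROCESSING"}, {"state": "COMPLETE"}])
job = wait_for_job(lambda: next(responses), sleep=lambda _: None)
print(job["state"])
```

Injecting sleep keeps the loop testable without real delays.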

Step 3: Retrieve the Snapshot

curl "https://api.example.org/v3/app_usage/snapshots/{snapshot_guid}" -H "Authorization: bearer [token]"

Response:

{
  "guid": "snapshot-guid-123",
  "created_at": "2026-01-14T10:00:00Z",
  "completed_at": "2026-01-14T10:00:15Z",
  "checkpoint_event_guid": "abc123de-f456-7890-abcd-ef1234567890",
  "checkpoint_event_created_at": "2026-01-14T09:59:58Z",
  "summary": {
    "instance_count": 15234,
    "app_count": 2500,
    "organization_count": 42,
    "space_count": 156,
    "chunk_count": 200
  },
  "links": {
    "self": { "href": "/v3/app_usage/snapshots/snapshot-guid-123" },
    "checkpoint_event": { "href": "/v3/app_usage_events/abc123de-f456-7890-abcd-ef1234567890" },
    "chunks": { "href": "/v3/app_usage/snapshots/snapshot-guid-123/chunks" }
  }
}
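
For onboarding, a consumer only needs a few fields from this response. A minimal sketch of extracting them follows; the SnapshotInfo type and parse_snapshot function are hypothetical names, while the field names come from the sample response above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotInfo:
    guid: str
    checkpoint_event_guid: str
    chunk_count: int
    chunks_href: str


def parse_snapshot(body: str) -> SnapshotInfo:
    """Pull out the fields needed for Steps 4 and 5 from a snapshot response."""
    doc = json.loads(body)
    return SnapshotInfo(
        guid=doc["guid"],
        checkpoint_event_guid=doc["checkpoint_event_guid"],
        chunk_count=doc["summary"]["chunk_count"],
        chunks_href=doc["links"]["chunks"]["href"],
    )


# Trimmed copy of the sample response shown above.
sample = """
{
  "guid": "snapshot-guid-123",
  "checkpoint_event_guid": "abc123de-f456-7890-abcd-ef1234567890",
  "summary": { "chunk_count": 200 },
  "links": { "chunks": { "href": "/v3/app_usage/snapshots/snapshot-guid-123/chunks" } }
}
"""
info = parse_snapshot(sample)
print(info.checkpoint_event_guid)
```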

Step 4: Retrieve Chunks (for per-item details)

curl "https://api.example.org/v3/app_usage/snapshots/{guid}/chunks" -H "Authorization: bearer [token]"
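
The chunk listing response is not shown in this description. The sketch below assumes it uses a paginated envelope with "resources" and "pagination.next.href" keys, as other V3 list endpoints do; treat that shape as an assumption. fetch_json is a hypothetical callable that performs an authenticated GET and returns the decoded JSON.

```python
def iter_chunks(fetch_json, first_url):
    """Yield every chunk, following next links until the last page.

    Assumes each page looks like:
    {"pagination": {"next": {"href": ...} or None}, "resources": [...]}
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("resources", [])
        next_link = (page.get("pagination") or {}).get("next")
        url = next_link["href"] if next_link else None


# Two stubbed pages stand in for real HTTP responses.
pages = {
    "/chunks?page=1": {
        "pagination": {"next": {"href": "/chunks?page=2"}},
        "resources": [{"chunk_index": 0}, {"chunk_index": 1}],
    },
    "/chunks?page=2": {
        "pagination": {"next": None},
        "resources": [{"chunk_index": 2}],
    },
}
indexes = [chunk["chunk_index"] for chunk in iter_chunks(pages.__getitem__, "/chunks?page=1")]
print(indexes)  # [0, 1, 2]
```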

Step 5: Start Processing Events from Checkpoint

curl "https://api.example.org/v3/app_usage_events?after_guid=abc123de-f456-7890-abcd-ef1234567890" \
  -H "Authorization: bearer [token]"

The after_guid filter returns all events created after the checkpoint event, ensuring no gap or overlap between the snapshot baseline and the event stream. The billing system now has a complete picture: a baseline of all running processes plus all subsequent events.
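
How a consumer combines the baseline with later events depends on its billing model. As a rough sketch: seed a map of running processes from the snapshot chunks, then fold in events that follow the checkpoint. The chunk fields (processes, process_guid, instance_count) come from the data model above. The event shape used here is a simplified, hypothetical one and not the actual app usage event format.

```python
def build_baseline(chunks):
    """Map process_guid -> instance_count from snapshot chunks."""
    baseline = {}
    for chunk in chunks:
        for process in chunk["processes"]:
            baseline[process["process_guid"]] = process["instance_count"]
    return baseline


def apply_events(running, events):
    """Fold post-checkpoint events into the running-process map.

    Each event is a simplified dict with hypothetical keys:
    process_guid, state ("STARTED" or "STOPPED"), instance_count.
    """
    for event in events:
        if event["state"] == "STOPPED":
            running.pop(event["process_guid"], None)
        else:
            running[event["process_guid"]] = event["instance_count"]
    return running


chunks = [
    {"processes": [{"process_guid": "p1", "instance_count": 2}]},
    {"processes": [{"process_guid": "p2", "instance_count": 1}]},
]
events = [
    {"process_guid": "p2", "state": "STOPPED", "instance_count": 1},
    {"process_guid": "p3", "state": "STARTED", "instance_count": 4},
]
current = apply_events(build_baseline(chunks), events)
print(current)  # {'p1': 2, 'p3': 4}
```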


API Reference

App Usage Snapshot Endpoints

Method  Endpoint                                  Description
POST    /v3/app_usage/snapshots                   Create a new snapshot (async)
GET     /v3/app_usage/snapshots                   List all snapshots
GET     /v3/app_usage/snapshots/:guid             Get a specific snapshot
GET     /v3/app_usage/snapshots/:guid/chunks      Get chunk details

Service Usage Snapshot Endpoints

Method  Endpoint                                  Description
POST    /v3/service_usage/snapshots               Create a new snapshot (async)
GET     /v3/service_usage/snapshots               List all snapshots
GET     /v3/service_usage/snapshots/:guid         Get a specific snapshot
GET     /v3/service_usage/snapshots/:guid/chunks  Get chunk details

Required Permissions

  • Create snapshot: Admin (global write access)
  • Read snapshot/chunks: Admin, Admin Read-Only, Global Auditor

Error Responses

Code  When                                      Response
409   Snapshot already in progress              CF-AppUsageSnapshotGenerationInProgress
422   Chunks requested for incomplete snapshot  Snapshot is still processing
404   Snapshot not found                        App usage snapshot not found
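
A client might translate these status codes into follow-up actions. In the sketch below, the status codes come from the table above and from the 202 response in Step 1. The action names are hypothetical labels for what a consumer could do next.

```python
# Status codes are from this PR; the action names are illustrative only.
ACTIONS = {
    202: "poll_job",                       # snapshot accepted, job created
    409: "wait_for_in_progress_snapshot",  # another snapshot is being generated
    422: "wait_then_retry_chunks",         # snapshot exists but is not complete yet
    404: "snapshot_missing",               # unknown guid
}


def next_action(status_code: int) -> str:
    """Return the follow-up action for a snapshot API response code."""
    return ACTIONS.get(status_code, "raise_unexpected_error")


print(next_action(409))  # wait_for_in_progress_snapshot
```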

Automatic Cleanup

Daily cleanup jobs run automatically:

  • App usage snapshots: 4:00 AM UTC
  • Service usage snapshots: 4:30 AM UTC

Cleanup removes:

  • Completed snapshots older than 31 days (configurable via cutoff_age_in_days)
  • Stale in-progress snapshots (stuck for more than 1 hour)
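
The two cleanup rules can be expressed as a single predicate. This sketch assumes age is measured from created_at in both cases, which the description does not state; only cutoff_age_in_days and the 1-hour stale threshold are taken from this PR.

```python
from datetime import datetime, timedelta, timezone


def should_clean_up(created_at, completed_at, now,
                    cutoff_age_in_days=31, stale_after=timedelta(hours=1)):
    """Return True if a snapshot matches either cleanup rule.

    Assumption: both rules measure age from created_at.
    """
    age = now - created_at
    if completed_at is not None:
        # Completed snapshot: remove once older than the retention cutoff.
        return age > timedelta(days=cutoff_age_in_days)
    # In-progress snapshot: remove once it has been stuck too long.
    return age > stale_after


now = datetime(2026, 2, 18, 4, 0, tzinfo=timezone.utc)
old_completed = should_clean_up(now - timedelta(days=40), now - timedelta(days=40), now)
stuck = should_clean_up(now - timedelta(hours=2), None, now)
print(old_completed, stuck)  # True True
```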

Observability

Prometheus Metrics

App Usage:

  • cc_app_usage_snapshot_generation_duration_seconds (histogram)
  • cc_app_usage_snapshot_generation_failures_total (counter)

Service Usage:

  • cc_service_usage_snapshot_generation_duration_seconds (histogram)
  • cc_service_usage_snapshot_generation_failures_total (counter)

Log Sources

  • cc.app_usage_snapshot_repository
  • cc.service_usage_snapshot_repository

Performance Characteristics

Expected generation times:

Foundation Size  Instances  Estimated Time
Small            1,000      < 0.2 seconds
Medium           10,000     ~1-1.5 seconds
Large            50,000     ~5-7 seconds
Very Large       100,000    ~10-15 seconds

Scale characteristics:

  • Memory usage: Bounded (streaming + chunking)
  • Database load: Non-blocking (keyset pagination)
  • API responses: Bounded (≤50 items per chunk)
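
Keyset pagination reads rows in batches by filtering on the last seen key instead of using OFFSET, so each query stays cheap and only one batch is held in memory at a time. The sketch below illustrates the technique with an in-memory SQLite database. The table and column names are hypothetical and this is not the query the PR uses.

```python
import sqlite3


def iter_rows_keyset(conn, batch_size=1000):
    """Yield (id, guid) rows in id order, one bounded batch at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, guid FROM processes WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # resume after the last key seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processes (id INTEGER PRIMARY KEY, guid TEXT)")
conn.executemany("INSERT INTO processes (guid) VALUES (?)", [(f"p{i}",) for i in range(7)])
conn.commit()

ids = [row[0] for row in iter_rows_keyset(conn, batch_size=3)]
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```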

Design Decisions

Why Fixed-Size Chunking?

Each chunk = up to 50 items for one space. We considered adaptive chunking but rejected it because:

  • Same storage requirements
  • Simpler to understand and debug
  • Consumer code is simpler (one format)

Why 1-Hour Stale Timeout?

  • Normal generation completes in seconds
  • Aligns with the HTTP timeout for the synchronous destructively_purge_all_and_reseed call
  • Quickly unblocks new requests if something fails

Atomic Generation

Snapshot generation is all-or-nothing. If interrupted, it rolls back completely. No partial snapshots can exist.
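
The all-or-nothing behavior can be illustrated with a database transaction. The sketch below uses SQLite and a hypothetical snapshot_chunks table to show a failed write leaving no rows behind. It demonstrates the general technique, not how this PR implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot_chunks (snapshot_guid TEXT, chunk_index INTEGER)")
conn.commit()


def write_chunks(conn, snapshot_guid, chunk_indexes, fail_at=None):
    """Insert all chunks in one transaction; fail_at simulates an interruption."""
    with conn:  # commits on success, rolls back if an exception escapes
        for index in chunk_indexes:
            if index == fail_at:
                raise RuntimeError("simulated interruption")
            conn.execute(
                "INSERT INTO snapshot_chunks VALUES (?, ?)", (snapshot_guid, index)
            )


try:
    write_chunks(conn, "snap-1", range(3), fail_at=2)
except RuntimeError:
    pass
count_after_failure = conn.execute("SELECT COUNT(*) FROM snapshot_chunks").fetchone()[0]

write_chunks(conn, "snap-2", range(3))
count_after_success = conn.execute("SELECT COUNT(*) FROM snapshot_chunks").fetchone()[0]
print(count_after_failure, count_after_success)  # 0 3
```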

Introduces new V3 API endpoints for capturing point-in-time usage data:

App Usage Snapshots (/v3/app_usage/snapshots):

  • Creates snapshots of all running processes across the platform
  • Captures instance counts, memory allocation, and buildpack info
  • Data organized by organization and space in paginated chunks

Service Usage Snapshots (/v3/service_usage/snapshots):

  • Creates snapshots of all service instances across the platform
  • Captures service plan, offering, and broker information
  • Supports both managed and user-provided service instances

Both snapshot types:

  • Are admin-only operations that run asynchronously via pollable jobs
  • Include a checkpoint reference (GUID) to the most recent usage event
  • Support automatic cleanup of old and stale snapshots via daily jobs
  • Expose Prometheus metrics for generation duration and failure tracking