111 changes: 56 additions & 55 deletions docs/sentinel-operator-guide.md
@@ -9,8 +9,10 @@ This comprehensive guide teaches operators how to deploy, configure, and operate
- [When to Use Sentinel](#12-when-to-use-sentinel)
2. [Core Concepts](#2-core-concepts)
- [Decision Engine](#21-decision-engine)
- [State-Based Reconciliation](#211-state-based-reconciliation)
- [Time-Based Reconciliation (Max Age Intervals)](#212-time-based-reconciliation-max-age-intervals)
- [Never-Processed Reconciliation](#211-never-processed-reconciliation)
- [State-Based Reconciliation](#212-state-based-reconciliation)
- [Time-Based Reconciliation (Max Age Intervals)](#213-time-based-reconciliation-max-age-intervals)
- [Complete Decision Flow](#214-complete-decision-flow)
- [Resource Filtering](#22-resource-filtering)
3. [Configuration Reference](#3-configuration-reference)
- [Configuration File Structure](#31-configuration-file-structure)
@@ -59,7 +61,8 @@ graph LR
E -->|Update Status| B
```

Sentinel publishes events to a message broker, which fans out messages to downstream adapters. It uses a **dual-trigger reconciliation strategy**:
Sentinel publishes events to a message broker, which fans out messages to downstream adapters. It uses a **three-part decision strategy**:
- **Never-processed**: Publish immediately for new resources that have never been processed
- **State-based**: Publish immediately when resource state indicates unprocessed spec changes
- **Time-based**: Publish periodically based on max age intervals to ensure eventual consistency

@@ -79,19 +82,36 @@ Deploy Sentinel when you need:

### 2.1 Decision Engine

Sentinel's decision engine evaluates resources during each poll cycle to determine when to publish events. It uses a **dual-trigger strategy** that combines two complementary mechanisms to ensure both immediate response to changes and eventual consistency over time:
Sentinel's decision engine evaluates resources during each poll cycle to determine when to publish events. It uses a **three-part decision strategy** that ensures both immediate response to changes and eventual consistency over time:

1. **State-Based Reconciliation** — Immediate event publishing when resource state indicates unprocessed spec changes, which is checked first
2. **Time-Based Reconciliation** — Periodic event publishing to handle drift and failures when state is in sync
1. **Never-Processed Reconciliation** — Immediate event publishing for new resources that have never been processed by any adapter
2. **State-Based Reconciliation** — Immediate event publishing when resource state indicates unprocessed spec changes
3. **Time-Based Reconciliation** — Periodic event publishing to handle drift and failures when state is in sync

**How Sentinel Reads Resource State:**

When Sentinel polls the HyperFleet API, it retrieves cluster or nodepool resources with their current state.
When Sentinel polls the HyperFleet API, it retrieves cluster or nodepool resources with their current state.

1. **`resource.Generation`** — Retrieved from the API resource. The HyperFleet API increments this value every time the resource spec is updated.
2. **`resource.status`** — Extracted from the API resource's `type=Ready` condition.

#### 2.1.1 State-Based Reconciliation
#### 2.1.1 Never-Processed Reconciliation

Never-processed reconciliation is the **first and highest-priority check** that ensures new resources are published immediately on the first poll cycle after creation.

**How It Works:**

Sentinel checks if `status.last_updated` is zero (meaning no adapter has ever updated the resource status):

- If zero → **Publish immediately** with reason `"never processed"`
- If non-zero → Proceed to state-based check

**Why This Matters:**

- New clusters are published within one poll interval (~5s)
- Ensures consistent handling of all new resources regardless of generation initialization

#### 2.1.2 State-Based Reconciliation

State-based reconciliation is a **spec-change detection mechanism** where Sentinel immediately publishes events when resource state indicates the spec has changed but hasn't been fully processed yet.

@@ -105,35 +125,14 @@ Sentinel detects unprocessed spec changes by comparing the resource's `generatio

**Note**: Sentinel uses the Ready condition's `ObservedGeneration` field as a proxy signal for spec changes. While the Ready condition can also be False for other reasons (e.g., adapter-reported infrastructure failures), the `ObservedGeneration` field specifically tracks spec processing, making this an effective spec-change detection mechanism.

**Flow Diagram:**

```mermaid
sequenceDiagram
participant User
participant API
participant Sentinel
participant Broker
participant Adapter

User->>API: Update cluster spec (generation: 1 → 2)
API->>API: Increment generation
Sentinel->>API: Poll resources
API-->>Sentinel: cluster (gen: 2, observed_gen: 1)
Sentinel->>Sentinel: Evaluate: 2 > 1 → PUBLISH
Sentinel->>Broker: CloudEvent (reason: state change detected)
Broker->>Adapter: Consume event
Adapter->>API: Reconcile cluster
Adapter->>API: Update status (observed_generation: 2)
```

**Key Properties:**

- **Immediate Response**: No need to wait for max age interval when state indicates unprocessed changes
- **Idempotent**: Adapters can safely process the same generation multiple times
- **Race Prevention**: Ensures spec changes are never missed due to timing
- **Condition-Based**: Uses Ready condition data as a reliable proxy for tracking spec processing status

#### 2.1.2 Time-Based Reconciliation (Max Age Intervals)
#### 2.1.3 Time-Based Reconciliation (Max Age Intervals)

Time-based reconciliation ensures **eventual consistency** by publishing events periodically, even when specs haven't changed. This handles external state drift and transient failures.

@@ -148,42 +147,44 @@ Sentinel uses two configurable max age intervals based on the resource's status

**Decision Logic:**

When the resource's `generation` matches the `Ready` condition's `ObservedGeneration` (indicating the condition reflects the current state), Sentinel checks if enough time has elapsed:
At this point, we know `status.last_updated` is not zero, so we can use it as the reference timestamp:

1. Calculate reference timestamp:
- If `status.last_updated` exists → use it (adapter has processed resource)
- Otherwise → use `created_time` (new resource never processed)

2. Determine max age interval:
1. Determine max age interval based on ready status:
- If resource is ready (`Ready` condition status == True) → use `max_age_ready` (default: 30m)
- If resource is not ready (`Ready` condition status == False) → use `max_age_not_ready` (default: 10s)

3. Calculate next event time:
```text
next_event = reference_time + max_age_interval
```
2. Calculate next event time:
- `next_event = last_updated + max_age_interval`

4. Compare with current time:
3. Compare with current time:
- If `now >= next_event` → **Publish event** (reason: "max age exceeded")
- Otherwise → **Skip** (reason: "max age not exceeded")

**Flow Diagram:**
#### 2.1.4 Complete Decision Flow

The three reconciliation checks work together in priority order to determine when to publish events:

```mermaid
graph TD
A[Determine Reference Time] --> B{last_updated exists?}
B -->|Yes| C[Use last_updated]
B -->|No| D[Use created_time]
C --> E{Resource Ready?}
D --> E
E -->|Yes| F[Max Age = 30m]
E -->|No| G[Max Age = 10s]
F --> H{now >= reference + max_age?}
G --> H
H -->|Yes| I[Publish: max age exceeded]
H -->|No| J[Skip: within max age]
START[Poll Resource] --> CHECK1{last_updated == zero?}
CHECK1 -->|Yes| PUB1[Publish: never processed]
CHECK1 -->|No| CHECK2{Generation > ObservedGeneration?}
CHECK2 -->|Yes| PUB2[Publish: generation changed]
CHECK2 -->|No| CHECK3{Resource Ready?}
CHECK3 -->|Yes| AGE1[Max Age = 30m]
CHECK3 -->|No| AGE2[Max Age = 10s]
AGE1 --> CHECK4{now >= last_updated + max_age?}
AGE2 --> CHECK4
CHECK4 -->|Yes| PUB3[Publish: max age exceeded]
CHECK4 -->|No| SKIP[Skip: within max age]
```

**Key Takeaways:**

- **Never-processed takes absolute priority** - New resources always publish immediately
- **State changes override max age** - Spec changes don't wait for intervals
- **Max age is the fallback** - Ensures eventual consistency when nothing else triggers

### 2.2 Resource Filtering

Resource filtering enables **horizontal scaling** by allowing operators to distribute resources across multiple Sentinel instances using label-based selectors.
@@ -362,7 +363,7 @@ The `message_data` field defines the CloudEvents payload structure using **Commo
| Variable | Type | Description | Example Fields |
|----------|------|-------------|----------------|
| `resource` | Resource | The HyperFleet resource | `id`, `kind`, `href`, `generation`, `status`, `labels`, `created_time` |
| `reason` | string | Decision engine reason | `"max age exceeded"`, `"generation changed"` |
| `reason` | string | Decision engine reason | `"generation changed"`, `"never processed"`, `"max age exceeded"` |

**CEL Expression Syntax:**

@@ -652,7 +653,7 @@ Follow this checklist to ensure successful Sentinel deployment and operation.
Trigger cycle completed total=15 published=3 skipped=12 duration=0.125s topic=hyperfleet-dev-clusters subset=clusters
```
- `total` - Number of resources fetched from the API matching the resource selector
- `published` - Number of events published (generation changed or max age exceeded)
- `published` - Number of events published (generation changed, never processed, or max age exceeded)
- `skipped` - Number of resources skipped (no reconciliation needed)

For detailed deployment guidance, see [docs/running-sentinel.md](running-sentinel.md)
25 changes: 17 additions & 8 deletions internal/engine/decision.go
@@ -11,6 +11,7 @@ import (
const (
ReasonMaxAgeExceeded = "max age exceeded"
ReasonGenerationChanged = "generation changed"
ReasonNeverProcessed = "never processed"
ReasonNilResource = "resource is nil"
ReasonZeroNow = "now time is zero"
)
@@ -38,11 +39,12 @@ type Decision struct {
// Evaluate determines if an event should be published for the resource.
//
// Decision Logic (in priority order):
// 1. Generation-based reconciliation: If resource.Generation > status.ObservedGeneration,
// 1. Never-processed reconciliation: If status.LastUpdated is zero, publish immediately
// (resource has never been processed by any adapter)
// 2. Generation-based reconciliation: If resource.Generation > status.ObservedGeneration,
// publish immediately (spec has changed, adapter needs to reconcile)
// 2. Time-based reconciliation: If max age exceeded since last update, publish
// 3. Time-based reconciliation: If max age exceeded since last update, publish
// - Uses status.LastUpdated as reference timestamp
// - If LastUpdated is zero (never processed), falls back to created_time
//
// Max Age Intervals:
// - Resources with Ready=true: maxAgeReady (default 30m)
@@ -71,6 +73,16 @@ func (e *DecisionEngine) Evaluate(resource *client.Resource, now time.Time) Deci
}
}

// Check if resource has never been processed by an adapter
// A zero LastUpdated means no adapter has updated the status yet
// This ensures first-time resources are published immediately
if resource.Status.LastUpdated.IsZero() {
return Decision{
ShouldPublish: true,
Reason: ReasonNeverProcessed,
}
}

// Check for generation mismatch
// This triggers immediate reconciliation regardless of max age
if resource.Generation > resource.Status.ObservedGeneration {
@@ -81,12 +93,9 @@ func (e *DecisionEngine) Evaluate(resource *client.Resource, now time.Time) Deci
}

// Determine the reference timestamp for max age calculation
// Use LastUpdated if available (adapter has processed the resource)
// Otherwise fall back to created_time (resource is newly created)
// At this point, we know LastUpdated is not zero (checked above)
// so we can use it directly for the max age calculation
referenceTime := resource.Status.LastUpdated
if referenceTime.IsZero() {
referenceTime = resource.CreatedTime
}

// Determine the appropriate max age based on resource ready status
var maxAge time.Duration