Commit baafeb8

HYPERFLEET-733 - fix: sentinel delays first event publish for newly created clusters
1 parent 3f193c7 commit baafeb8

3 files changed

Lines changed: 97 additions & 105 deletions

docs/sentinel-operator-guide.md

Lines changed: 56 additions & 55 deletions
@@ -9,8 +9,10 @@ This comprehensive guide teaches operators how to deploy, configure, and operate
99
- [When to Use Sentinel](#12-when-to-use-sentinel)
1010
2. [Core Concepts](#2-core-concepts)
1111
- [Decision Engine](#21-decision-engine)
12-
- [State-Based Reconciliation](#211-state-based-reconciliation)
13-
- [Time-Based Reconciliation (Max Age Intervals)](#212-time-based-reconciliation-max-age-intervals)
12+
- [Never-Processed Reconciliation](#211-never-processed-reconciliation)
13+
- [State-Based Reconciliation](#212-state-based-reconciliation)
14+
- [Time-Based Reconciliation (Max Age Intervals)](#213-time-based-reconciliation-max-age-intervals)
15+
- [Complete Decision Flow](#214-complete-decision-flow)
1416
- [Resource Filtering](#22-resource-filtering)
1517
3. [Configuration Reference](#3-configuration-reference)
1618
- [Configuration File Structure](#31-configuration-file-structure)
@@ -59,7 +61,8 @@ graph LR
5961
E -->|Update Status| B
6062
```
6163

62-
Sentinel publishes events to a message broker, which fans out messages to downstream adapters. It uses a **dual-trigger reconciliation strategy**:
64+
Sentinel publishes events to a message broker, which fans out messages to downstream adapters. It uses a **three-part decision strategy**:
65+
- **Never-processed**: Publish immediately for new resources that have never been processed
6366
- **State-based**: Publish immediately when resource state indicates unprocessed spec changes
6467
- **Time-based**: Publish periodically based on max age intervals to ensure eventual consistency
6568

@@ -79,19 +82,36 @@ Deploy Sentinel when you need:
7982

8083
### 2.1 Decision Engine
8184

82-
Sentinel's decision engine evaluates resources during each poll cycle to determine when to publish events. It uses a **dual-trigger strategy** that combines two complementary mechanisms to ensure both immediate response to changes and eventual consistency over time:
85+
Sentinel's decision engine evaluates resources during each poll cycle to determine when to publish events. It uses a **three-part decision strategy** that ensures both immediate response to changes and eventual consistency over time:
8386

84-
1. **State-Based Reconciliation** — Immediate event publishing when resource state indicates unprocessed spec changes, which is checked first
85-
2. **Time-Based Reconciliation** — Periodic event publishing to handle drift and failures when state is in sync
87+
1. **Never-Processed Reconciliation** — Immediate event publishing for new resources that have never been processed by any adapter
88+
2. **State-Based Reconciliation** — Immediate event publishing when resource state indicates unprocessed spec changes
89+
3. **Time-Based Reconciliation** — Periodic event publishing to handle drift and failures when state is in sync
8690

8791
**How Sentinel Reads Resource State:**
8892

89-
When Sentinel polls the HyperFleet API, it retrieves cluster or nodepool resources with their current state.
93+
When Sentinel polls the HyperFleet API, it retrieves cluster or nodepool resources with their current state.
9094

9195
1. **`resource.Generation`** — Retrieved from the API resource. The HyperFleet API increments this value every time the resource spec is updated.
9296
2. **`resource.status`** — Extracted from the API resource's `type=Ready` condition.
9397

94-
#### 2.1.1 State-Based Reconciliation
98+
#### 2.1.1 Never-Processed Reconciliation
99+
100+
Never-processed reconciliation is the **first and highest-priority check** that ensures new resources are published immediately on the first poll cycle after creation.
101+
102+
**How It Works:**
103+
104+
Sentinel checks if `status.LastUpdated` is zero (meaning no adapter has ever updated the resource status):
105+
106+
- If zero → **Publish immediately** with reason `"never processed"`
107+
- If non-zero → Proceed to state-based check
108+
109+
**Why This Matters:**
110+
111+
- New clusters are published within one poll interval (~5s)
112+
- Ensures consistent handling of all new resources regardless of generation initialization
113+
114+
#### 2.1.2 State-Based Reconciliation
95115

96116
State-based reconciliation is a **spec-change detection mechanism** where Sentinel immediately publishes events when resource state indicates the spec has changed but hasn't been fully processed yet.
97117

@@ -105,35 +125,14 @@ Sentinel detects unprocessed spec changes by comparing the resource's `generatio
105125

106126
**Note**: Sentinel uses the Ready condition's `ObservedGeneration` field as a proxy signal for spec changes. While the Ready condition can also be False for other reasons (e.g., adapter-reported infrastructure failures), the `ObservedGeneration` field specifically tracks spec processing, making this an effective spec-change detection mechanism.
107127

108-
**Flow Diagram:**
109-
110-
```mermaid
111-
sequenceDiagram
112-
participant User
113-
participant API
114-
participant Sentinel
115-
participant Broker
116-
participant Adapter
117-
118-
User->>API: Update cluster spec (generation: 1 → 2)
119-
API->>API: Increment generation
120-
Sentinel->>API: Poll resources
121-
API-->>Sentinel: cluster (gen: 2, observed_gen: 1)
122-
Sentinel->>Sentinel: Evaluate: 2 > 1 → PUBLISH
123-
Sentinel->>Broker: CloudEvent (reason: state change detected)
124-
Broker->>Adapter: Consume event
125-
Adapter->>API: Reconcile cluster
126-
Adapter->>API: Update status (observed_generation: 2)
127-
```
128-
129128
**Key Properties:**
130129

131130
- **Immediate Response**: No need to wait for max age interval when state indicates unprocessed changes
132131
- **Idempotent**: Adapters can safely process the same generation multiple times
133132
- **Race Prevention**: Ensures spec changes are never missed due to timing
134133
- **Condition-Based**: Uses Ready condition data as a reliable proxy for tracking spec processing status
135134

136-
#### 2.1.2 Time-Based Reconciliation (Max Age Intervals)
135+
#### 2.1.3 Time-Based Reconciliation (Max Age Intervals)
137136

138137
Time-based reconciliation ensures **eventual consistency** by publishing events periodically, even when specs haven't changed. This handles external state drift and transient failures.
139138

@@ -148,42 +147,44 @@ Sentinel uses two configurable max age intervals based on the resource's status
148147

149148
**Decision Logic:**
150149

151-
When the resource's `generation` matches the `Ready` condition's `ObservedGeneration` (indicating the condition reflects the current state), Sentinel checks if enough time has elapsed:
150+
At this point, we know `status.last_updated` is not zero, so we can use it as the reference timestamp:
152151

153-
1. Calculate reference timestamp:
154-
- If `status.last_updated` exists → use it (adapter has processed resource)
155-
- Otherwise → use `created_time` (new resource never processed)
156-
157-
2. Determine max age interval:
152+
1. Determine max age interval based on ready status:
158153
- If resource is ready (`Ready` condition status == True) → use `max_age_ready` (default: 30m)
159154
- If resource is not ready (`Ready` condition status == False) → use `max_age_not_ready` (default: 10s)
160155

161-
3. Calculate next event time:
162-
```text
163-
next_event = reference_time + max_age_interval
164-
```
156+
2. Calculate next event time:
157+
- `next_event = last_updated + max_age_interval`
165158

166-
4. Compare with current time:
159+
3. Compare with current time:
167160
- If `now >= next_event` → **Publish event** (reason: "max age exceeded")
168161
- Otherwise → **Skip** (reason: "max age not exceeded")
169162

170-
**Flow Diagram:**
163+
#### 2.1.4 Complete Decision Flow
164+
165+
The three reconciliation checks work together in priority order to determine when to publish events:
171166

172167
```mermaid
173168
graph TD
174-
A[Determine Reference Time] --> B{last_updated exists?}
175-
B -->|Yes| C[Use last_updated]
176-
B -->|No| D[Use created_time]
177-
C --> E{Resource Ready?}
178-
D --> E
179-
E -->|Yes| F[Max Age = 30m]
180-
E -->|No| G[Max Age = 10s]
181-
F --> H{now >= reference + max_age?}
182-
G --> H
183-
H -->|Yes| I[Publish: max age exceeded]
184-
H -->|No| J[Skip: within max age]
169+
START[Poll Resource] --> CHECK1{LastUpdated == zero?}
170+
CHECK1 -->|Yes| PUB1[Publish: never processed]
171+
CHECK1 -->|No| CHECK2{Generation > ObservedGeneration?}
172+
CHECK2 -->|Yes| PUB2[Publish: generation changed]
173+
CHECK2 -->|No| CHECK3{Resource Ready?}
174+
CHECK3 -->|Yes| AGE1[Max Age = 30m]
175+
CHECK3 -->|No| AGE2[Max Age = 10s]
176+
AGE1 --> CHECK4{now >= last_updated + max_age?}
177+
AGE2 --> CHECK4
178+
CHECK4 -->|Yes| PUB3[Publish: max age exceeded]
179+
CHECK4 -->|No| SKIP[Skip: within max age]
185180
```
186181

182+
**Key Takeaways:**
183+
184+
- **Never-processed takes absolute priority** - New resources always publish immediately
185+
- **State changes override max age** - Spec changes don't wait for intervals
186+
- **Max age is the fallback** - Ensures eventual consistency when nothing else triggers
187+
187188
### 2.2 Resource Filtering
188189

189190
Resource filtering enables **horizontal scaling** by allowing operators to distribute resources across multiple Sentinel instances using label-based selectors.
@@ -362,7 +363,7 @@ The `message_data` field defines the CloudEvents payload structure using **Commo
362363
| Variable | Type | Description | Example Fields |
363364
|----------|------|-------------|----------------|
364365
| `resource` | Resource | The HyperFleet resource | `id`, `kind`, `href`, `generation`, `status`, `labels`, `created_time` |
365-
| `reason` | string | Decision engine reason | `"max age exceeded"`, `"generation changed"` |
366+
| `reason` | string | Decision engine reason | `"generation changed"`, `"never processed"`, `"max age exceeded"` |
366367

367368
**CEL Expression Syntax:**
368369

@@ -652,7 +653,7 @@ Follow this checklist to ensure successful Sentinel deployment and operation.
652653
Trigger cycle completed total=15 published=3 skipped=12 duration=0.125s topic=hyperfleet-dev-clusters subset=clusters
653654
```
654655
- `total` - Number of resources fetched from the API matching the resource selector
655-
- `published` - Number of events published (generation changed or max age exceeded)
656+
- `published` - Number of events published (generation changed, never processed, or max age exceeded)
656657
- `skipped` - Number of resources skipped (no reconciliation needed)
657658

658659
For detailed deployment guidance, see [docs/running-sentinel.md](running-sentinel.md)

internal/engine/decision.go

Lines changed: 17 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ import (
1111
const (
1212
ReasonMaxAgeExceeded = "max age exceeded"
1313
ReasonGenerationChanged = "generation changed"
14+
ReasonNeverProcessed = "never processed"
1415
ReasonNilResource = "resource is nil"
1516
ReasonZeroNow = "now time is zero"
1617
)
@@ -38,11 +39,12 @@ type Decision struct {
3839
// Evaluate determines if an event should be published for the resource.
3940
//
4041
// Decision Logic (in priority order):
41-
// 1. Generation-based reconciliation: If resource.Generation > status.ObservedGeneration,
42+
// 1. Never-processed reconciliation: If status.LastUpdated is zero, publish immediately
43+
// (resource has never been processed by any adapter)
44+
// 2. Generation-based reconciliation: If resource.Generation > status.ObservedGeneration,
4245
// publish immediately (spec has changed, adapter needs to reconcile)
43-
// 2. Time-based reconciliation: If max age exceeded since last update, publish
46+
// 3. Time-based reconciliation: If max age exceeded since last update, publish
4447
// - Uses status.LastUpdated as reference timestamp
45-
// - If LastUpdated is zero (never processed), falls back to created_time
4648
//
4749
// Max Age Intervals:
4850
// - Resources with Ready=true: maxAgeReady (default 30m)
@@ -71,6 +73,16 @@ func (e *DecisionEngine) Evaluate(resource *client.Resource, now time.Time) Deci
7173
}
7274
}
7375

76+
// Check if resource has never been processed by an adapter
77+
// LastUpdated is zero means no adapter has updated the status yet
78+
// This ensures first-time resources are published immediately
79+
if resource.Status.LastUpdated.IsZero() {
80+
return Decision{
81+
ShouldPublish: true,
82+
Reason: ReasonNeverProcessed,
83+
}
84+
}
85+
7486
// Check for generation mismatch
7587
// This triggers immediate reconciliation regardless of max age
7688
if resource.Generation > resource.Status.ObservedGeneration {
@@ -81,12 +93,9 @@ func (e *DecisionEngine) Evaluate(resource *client.Resource, now time.Time) Deci
8193
}
8294

8395
// Determine the reference timestamp for max age calculation
84-
// Use LastUpdated if available (adapter has processed the resource)
85-
// Otherwise fall back to created_time (resource is newly created)
96+
// At this point, we know LastUpdated is not zero (checked above)
97+
// so we can use it directly for the max age calculation
8698
referenceTime := resource.Status.LastUpdated
87-
if referenceTime.IsZero() {
88-
referenceTime = resource.CreatedTime
89-
}
9099

91100
// Determine the appropriate max age based on resource ready status
92101
var maxAge time.Duration

internal/engine/decision_test.go

Lines changed: 24 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -121,25 +121,24 @@ func TestDecisionEngine_Evaluate(t *testing.T) {
121121
ready bool
122122
wantShouldPublish bool
123123
}{
124-
// Zero LastUpdated tests - should fall back to created_time
125-
// These tests use the test factory default (created 1 hour ago)
124+
// Zero LastUpdated tests - should publish immediately as never processed
126125
{
127126
name: "zero LastUpdated - ready",
128127
ready: true,
129-
lastUpdated: time.Time{}, // Zero time - will use created_time
128+
lastUpdated: time.Time{}, // Zero time - never processed
130129
now: now,
131130
wantShouldPublish: true,
132-
wantReasonContains: "max age exceeded",
133-
description: "Resources with zero LastUpdated should use created_time and publish (created > 30m ago)",
131+
wantReasonContains: ReasonNeverProcessed,
132+
description: "Resources with zero LastUpdated should publish immediately (never processed)",
134133
},
135134
{
136135
name: "zero LastUpdated - not ready",
137136
ready: false,
138-
lastUpdated: time.Time{}, // Zero time - will use created_time
137+
lastUpdated: time.Time{}, // Zero time - never processed
139138
now: now,
140139
wantShouldPublish: true,
141-
wantReasonContains: "max age exceeded",
142-
description: "Resources with zero LastUpdated should use created_time and publish (created > 10s ago)",
140+
wantReasonContains: ReasonNeverProcessed,
141+
description: "Resources with zero LastUpdated should publish immediately (never processed)",
143142
},
144143

145144
// Not-ready resources (10s max age)
@@ -413,8 +412,9 @@ func TestDecisionEngine_Evaluate_InvalidInputs(t *testing.T) {
413412
}
414413
}
415414

416-
// TestDecisionEngine_Evaluate_CreatedTimeFallback tests that created_time is used when lastUpdated is zero
417-
func TestDecisionEngine_Evaluate_CreatedTimeFallback(t *testing.T) {
415+
// TestDecisionEngine_Evaluate_NeverProcessedResources tests the never-processed reconciliation
416+
// When lastUpdated is zero, resources are published immediately with "never processed" reason
417+
func TestDecisionEngine_Evaluate_NeverProcessedResources(t *testing.T) {
418418
engine := newTestEngine()
419419
now := time.Now()
420420

@@ -428,49 +428,31 @@ func TestDecisionEngine_Evaluate_CreatedTimeFallback(t *testing.T) {
428428
wantShouldPublish bool
429429
}{
430430
{
431-
name: "zero lastUpdated - created 11s ago - not ready",
432-
createdTime: now.Add(-11 * time.Second),
433-
lastUpdated: time.Time{}, // Zero - should use created_time
431+
name: "never processed - not ready",
432+
createdTime: now.Add(-1 * time.Hour), // Doesn't affect decision
433+
lastUpdated: time.Time{}, // Zero - never processed
434434
ready: false,
435435
wantShouldPublish: true,
436-
wantReasonContains: "max age exceeded",
437-
description: "Should use created_time and publish (11s > 10s max age)",
438-
},
439-
{
440-
name: "zero lastUpdated - created 5s ago - not ready",
441-
createdTime: now.Add(-5 * time.Second),
442-
lastUpdated: time.Time{}, // Zero - should use created_time
443-
ready: false,
444-
wantShouldPublish: false,
445-
wantReasonContains: "max age not exceeded",
446-
description: "Should use created_time and not publish (5s < 10s max age)",
436+
wantReasonContains: ReasonNeverProcessed,
437+
description: "Should publish immediately (never processed)",
447438
},
448439
{
449-
name: "zero lastUpdated - created 31m ago - ready",
450-
createdTime: now.Add(-31 * time.Minute),
451-
lastUpdated: time.Time{}, // Zero - should use created_time
440+
name: "never processed - ready",
441+
createdTime: now.Add(-1 * time.Hour), // Doesn't affect decision
442+
lastUpdated: time.Time{}, // Zero - never processed
452443
ready: true,
453444
wantShouldPublish: true,
454-
wantReasonContains: "max age exceeded",
455-
description: "Should use created_time and publish (31m > 30m max age)",
456-
},
457-
{
458-
name: "zero lastUpdated - created 15m ago - ready",
459-
createdTime: now.Add(-15 * time.Minute),
460-
lastUpdated: time.Time{}, // Zero - should use created_time
461-
ready: true,
462-
wantShouldPublish: false,
463-
wantReasonContains: "max age not exceeded",
464-
description: "Should use created_time and not publish (15m < 30m max age)",
445+
wantReasonContains: ReasonNeverProcessed,
446+
description: "Should publish immediately (never processed)",
465447
},
466448
{
467-
name: "non-zero lastUpdated - should ignore created_time",
468-
createdTime: now.Add(-1 * time.Hour), // Created long ago
449+
name: "processed resource - use lastUpdated for max age",
450+
createdTime: now.Add(-1 * time.Hour), // Irrelevant once processed
469451
lastUpdated: now.Add(-5 * time.Second), // Updated recently
470452
ready: false,
471453
wantShouldPublish: false,
472454
wantReasonContains: "max age not exceeded",
473-
description: "Should use lastUpdated, not created_time (5s < 10s max age)",
455+
description: "Should use lastUpdated for max age calculation (5s < 10s max age)",
474456
},
475457
}
476458

@@ -551,7 +533,7 @@ func TestDecisionEngine_Evaluate_GenerationBasedReconciliation(t *testing.T) {
551533
lastUpdated: now,
552534
wantShouldPublish: true,
553535
wantReasonContains: ReasonGenerationChanged,
554-
description: "First reconciliation should be triggered by generation (never processed)",
536+
description: "Generation mismatch triggers immediate reconciliation",
555537
},
556538

557539
// Generation in sync tests - should follow normal max age logic
