Cacti Syslog v2 — Design Principles (discussion / tracking) #317

@somethingwithproof

Mission

Build a reliable, scalable syslog pipeline that:

  • never loses data
  • degrades gracefully under failure
  • supports fast operational queries
  • evolves beyond MySQL constraints over time

Core principle

Ingest correctness > storage cleverness. If anything breaks, data must still land safely. Optimization can catch up later.

System model

Collector -> Ingest -> Buffer/Overflow -> Normalize -> Store -> Rollups -> Query
                          |
                      Reconcile

A side path for invalid input:

late / bad / out-of-range events -> quarantine or overflow -> bounded reconciler

Design principles

1. Append-first ingest (non-negotiable). Writes must always succeed. Minimal indexing on the ingest path. Use either an ingest table or the dMaxValue overflow partition as a safety net. Never block writes because partitions are not perfect.

2. UTC is law. All partitioning and storage in UTC. No date() logic for partition math. Local time is a presentation concern. This eliminates DST and timezone bugs entirely.

3. Partitioning is an optimization, not correctness. A missing partition is a performance issue, not a failure. dMaxValue absorbs mistakes safely. Stop treating partitions as the system backbone.

4. Pre-create, never react. Always maintain partitions ahead (3-7 days). Rotate partitions early (T-2h). Avoid midnight races completely.

5. Bounded reconciliation only. Move overflow data in batches. Never large blocking reorgs. Always resumable. Repair slowly, never crash the database.

6. Design for outages. Multi-day downtime must be survivable. On startup: validate partitions, assess overflow, schedule recovery. Recovery is a feature, not an afterthought.

7. Late data is normal. Accept out-of-order and delayed events. Define an explicit policy: move if cheap, otherwise tolerate slight misplacement. Do not fight the reality of distributed systems.

8. Retention is lifecycle, not cleanup. Partition drop is the fast path. Time-based delete is the fallback. Both must account for late-arriving data so retention does not corrupt correctness.

9. Observability is required. Track partition coverage, overflow size (dMaxValue), reconcile progress, last maintenance run. If you cannot see it, it will break silently.

10. Async everywhere (future direction). Move toward ingest -> queue/buffer, background processors, decoupled storage. Aligns with Splunk, ClickHouse, and Grafana Loki.
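
Principles 2 and 4 can be made concrete in a few lines. The sketch below computes the pre-create window purely in UTC; the partition naming scheme (`d_YYYYMMDD`) and the 5-day look-ahead are illustrative assumptions, not the plugin's actual schema.

```python
"""Sketch of principles 2 and 4: UTC-only partition math with a
pre-create window. Names and constants are illustrative."""
from datetime import datetime, timedelta, timezone

PRECREATE_DAYS = 5  # stay 3-7 days ahead, per principle 4

def partition_plan(now=None, days_ahead=PRECREATE_DAYS):
    """Return (name, upper_bound) pairs for today plus the look-ahead
    window. All math is done in UTC; local time never enters here."""
    now = now or datetime.now(timezone.utc)
    day0 = now.date()
    plan = []
    for offset in range(days_ahead + 1):
        day = day0 + timedelta(days=offset)
        upper = datetime(day.year, day.month, day.day,
                         tzinfo=timezone.utc) + timedelta(days=1)
        plan.append((f"d_{day:%Y%m%d}", upper))
    return plan
```

Because the boundaries are derived from a single UTC clock, DST transitions and local-time offsets simply never appear in partition math, which is the point of principle 2.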

Anti-patterns

  • "Partitions must always exist or system fails"
  • Large ALTER TABLE during peak hours
  • Local-time partition logic
  • Unbounded reconciliation
  • Assuming perfect scheduler execution
  • Treating MySQL as a true time-series or log engine

Operational guarantees

The system must always guarantee:

  1. Writes never fail due to partition issues
  2. Data is never lost, only delayed in organization
  3. Recovery is deterministic and bounded
  4. Performance degrades gradually, not catastrophically

Evolution path

Now (Stabilize) — partition manager, startup validation, overflow monitoring.

Next (Decouple) — ingest table, async worker, rollups.

Later (Scale) — storage abstraction, external backend (TSDB or log system).

Corner cases to design for

Grouped by category. Each must answer the test: can ingest continue safely even if optimization is broken?

Time and partition boundary

  1. Midnight rollover race — data arrives as the date flips before the next partition exists. Pre-create ahead, never just-in-time.
  2. DST transitions — local-time logic produces 23- or 25-hour days. UTC for storage and partitioning, local time only in display.
  3. Host clock drift — app server, DB, and scheduler disagree on time. One source of truth for partition math.
  4. Late-arriving events — yesterday's message arrives today. Explicit late-data policy with staging or bounded reconcile.
  5. Far-future timestamps — bad device sends year 2099. Reject, clamp, or quarantine.
  6. Ancient timestamps — bad parse sends 1970 or 1900. Minimum acceptable timestamp policy.
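
Cases 4-6 above amount to one explicit acceptance policy. A minimal sketch, assuming illustrative windows (5 minutes of skew tolerance, 30 days of late acceptance) that are not values from the actual plugin:

```python
"""Sketch of an explicit timestamp policy: late events go to the
bounded-reconcile path, bogus ones (2099, 1970) are quarantined."""
from datetime import datetime, timedelta, timezone

MAX_FUTURE = timedelta(minutes=5)  # tolerate small clock skew
MAX_AGE = timedelta(days=30)       # oldest timestamp filed normally

def classify_timestamp(ts, now=None):
    """Return 'accept', 'late', or 'quarantine' for an event time."""
    now = now or datetime.now(timezone.utc)
    if ts > now + MAX_FUTURE or ts < now - MAX_AGE:
        return "quarantine"
    if ts.date() < now.date():
        return "late"
    return "accept"
```

The key property is that every input gets a defined destination; nothing is silently dropped and nothing blocks the ingest path.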

Outage and recovery

  1. Multi-day outage — system down 2-10 days, then resumes. Startup reconciliation, bounded recovery.
  2. Scheduler missed runs — cron dies, plugin disabled, hook never fires. Startup check plus periodic check, not only cron dependence.
  3. Crash during reconcile — process dies mid-batch. Idempotent movement, transactional batches, post-batch verification.
  4. DB restart during ALTER — partition reorg interrupted. Detect incomplete maintenance state, fail safe and report loudly.
  5. Backlog flood after resume — devices dump backlog immediately. Backpressure strategy, separate ingest from heavy repair work.
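
Cases 1 and 3 both hinge on reconciliation being bounded and crash-safe. A sketch of the batch step, using SQLite stand-ins for the real tables (names and columns are illustrative, not the plugin's schema):

```python
"""Sketch of bounded, idempotent reconciliation: move overflow rows in
small transactions, copying and deleting together so a re-run after a
mid-batch crash never duplicates or loses rows."""
import sqlite3

BATCH = 500  # bounded: never one large blocking move

def reconcile_step(conn):
    """Move up to BATCH rows from overflow to main in one transaction.
    Returns rows moved; call repeatedly until it returns 0."""
    cur = conn.execute(
        "SELECT id, logtime, message FROM overflow ORDER BY id LIMIT ?",
        (BATCH,))
    rows = cur.fetchall()
    if not rows:
        return 0
    with conn:  # copy + delete commit atomically
        conn.executemany(
            "INSERT INTO main(logtime, message) VALUES (?, ?)",
            [(t, m) for _, t, m in rows])
        conn.executemany(
            "DELETE FROM overflow WHERE id = ?",
            [(i,) for i, _, _ in rows])
    return len(rows)
```

Because each batch commits copy and delete together, the loop is resumable from any interruption, and its per-step cost stays constant no matter how large the backlog grew during an outage.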

Data correctness

  1. Duplicate messages — collectors resend after retry. Decide whether dedupe matters; if so, define key and window.
  2. Out-of-order arrival — events for the same host arrive non-sequentially. No correctness logic should depend on arrival order.
  3. Parse failures — malformed lines, bad encoding, truncated payloads. Dead-letter or quarantine path; never let one bad message poison the pipeline.
  4. Null or missing key fields — no host, facility, severity, timestamp. Fallback defaults; quarantine for truly unusable records.
  5. Conflicting timezones in message vs collector — explicit precedence rule, documented.
  6. Retention deleting data still needed for reconcile — retention must understand overflow and late-arrival windows.
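
If dedupe does matter (case 1), the policy needs exactly two decisions: the key and the window. A sketch, where the key fields (host, program, message) and the 60-second window are assumptions chosen for illustration:

```python
"""Sketch of a dedupe policy: suppress a repeated content key within a
sliding time window, without depending on arrival order guarantees."""

DEDUP_WINDOW = 60.0  # seconds; illustrative default

class Deduper:
    def __init__(self, window=DEDUP_WINDOW):
        self.window = window
        self.last_seen = {}  # key -> last arrival time (epoch seconds)

    def is_duplicate(self, host, program, message, now):
        """True if this key was already seen inside the window."""
        key = (host, program, message)
        prev = self.last_seen.get(key)
        self.last_seen[key] = now
        return prev is not None and (now - prev) < self.window
```

The same event outside the window is deliberately treated as new; a bounded window keeps the state small and avoids turning dedupe into another unbounded correctness dependency.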

Partition manager specific

  1. Partition already exists — two workers race to create the same one. Advisory lock, idempotent DDL handling.
  2. Missing dMaxValue — manual schema change leaves install partial. Schema validation at startup.
  3. Overflow partition too large to reorganize — millions of rows. Chunked migration, hard cutoff for manual intervention.
  4. Partition naming drift — names do not match convention. Inspect by boundary metadata, not name.
  5. Gap in partition sequence — yesterday missing. Detect gaps explicitly, not just "is today present".
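
Cases 4 and 5 share one fix: inspect boundary metadata, then check the whole sequence. A sketch, assuming the manager can already extract each partition's per-day boundary date (the input format here is illustrative):

```python
"""Sketch of explicit gap detection: given partition boundary dates,
report every missing day, not just whether today exists."""
from datetime import timedelta

def find_gaps(boundaries):
    """boundaries: iterable of per-day partition boundary dates (UTC).
    Returns every missing day between the earliest and latest present."""
    have = set(boundaries)
    if len(have) < 2:
        return []
    lo, hi = min(have), max(have)
    gaps, d = [], lo
    while d < hi:
        d += timedelta(days=1)
        if d not in have:
            gaps.append(d)
    return gaps
```

Driving this from boundary metadata rather than partition names makes it immune to naming drift by design.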

Scale and performance

  1. Huge cardinality fields — programs, hosts, tags explode. Careful index policy, top-N strategies for summaries.
  2. Hot partition overload — nearly all traffic lands on the current day. Minimize secondary indexes on the write path.
  3. Reconcile competing with user queries — bounded reconcile, lighter during peak windows.
  4. Large retention drops — many old partitions dropped at once after lag. Controlled cleanup cadence.
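
The top-N strategy from case 1 can be as simple as summarizing the heavy hitters instead of indexing every distinct value. A minimal sketch:

```python
"""Sketch of a top-N summary for high-cardinality fields: store
everything, but build summaries only for the most frequent values."""
from collections import Counter

def top_n(values, n=10):
    """Return the n most frequent values with their counts."""
    return Counter(values).most_common(n)
```
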

Replication and HA

  1. Replication lag — heavy DDL/writes lag replicas. Replication-aware maintenance.
  2. Failover during maintenance — primary changes mid-run. Leader election or advisory-lock strategy tied to active primary.

Security and abuse

  1. Log injection / huge payloads — max payload size, safe truncation, escaped display.
  2. Untrusted content breaks downstream — control chars, bad UTF-8. Normalize encoding; preserve raw, sanitize for display.
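
Both items reduce to one display-side guard: cap size, repair encoding, strip control characters, while the raw bytes are preserved elsewhere. A sketch with an illustrative size limit:

```python
"""Sketch of display-side sanitization: never crash on bad UTF-8,
truncate oversized payloads, drop control characters except tab.
The raw payload would be stored untouched on a separate path."""

MAX_PAYLOAD = 8192  # bytes; illustrative cap, truncate rather than reject

def sanitize_for_display(raw: bytes) -> str:
    """Decode with replacement characters, truncate, filter controls."""
    text = raw[:MAX_PAYLOAD].decode("utf-8", errors="replace")
    return "".join(ch for ch in text if ch == "\t" or ch.isprintable())
```
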

Operational

  1. Config changed mid-day — retention days, timezone, partition window changed live. Define what applies immediately vs next maintenance cycle.
  2. Manual DBA intervention — humans create or drop partitions by hand. Detect drift and report, do not bulldoze.
  3. Disk nearly full — partition creation needs space. Headroom checks, degraded mode behavior.
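
The headroom check in case 3 is a one-liner worth making explicit: refuse maintenance (and fall back to degraded, append-only behavior) before the disk actually fills. The path and threshold below are illustrative assumptions:

```python
"""Sketch of a pre-maintenance headroom check using stdlib
shutil.disk_usage; threshold and default path are illustrative."""
import shutil

MIN_FREE_BYTES = 5 * 1024**3  # 5 GiB of headroom before maintenance runs

def maintenance_allowed(path="/var/lib/mysql"):
    """True if there is enough free space to create partitions safely."""
    return shutil.disk_usage(path).free >= MIN_FREE_BYTES
```
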

Hardening priority order

If we harden one batch at a time, the order is:

  1. Midnight rollover race
  2. DST and local time mistakes
  3. Multi-day outage recovery
  4. Huge dMaxValue growth
  5. Late-arriving events
  6. Crash during reconcile
  7. Missing scheduler runs
  8. Far-future / ancient bogus timestamps
  9. Retention interacting badly with overflow
  10. Duplicate concurrent partition maintenance

How this compares to battle-tested systems

The design above is directionally right, but it is still a transitional architecture. Real log platforms separate ingestion, storage, compaction, retention, and query execution much more cleanly than a single MySQL schema can.

Where this design matches the proven systems. Append-oriented ingest, async background processing, bounded reconciliation, time-shard aware retention, and separation of write and read paths are all how scaled systems survive. Splunk documents a multi-phase data pipeline rather than writing directly into final search form. Loki splits responsibilities across write, read, and backend components with retention handled by a compactor. ClickHouse treats partitioning as a management primitive while relying on background merges for long-term efficiency. The shift away from "partitions are the whole strategy" is the right move.

Where this design is still weaker.

  • Overflow is a recovery hack, not a real ingest boundary. dMaxValue is useful, but it is still an emergency bucket inside the same database abstraction. Loki writes compressed chunks against object storage. Splunk has a formal ingest pipeline before data becomes searchable. ClickHouse uses MergeTree parts and background merges as the safety valve.
  • Ingest correctness is still coupled to MySQL operational health. Loki and ClickHouse assume heavy ingest, delayed compaction, and independent retention. This design still depends on partition coverage being healthy, reconcile succeeding, and MySQL staying responsive during repair windows.
  • Search and index strategy is underdeveloped. OpenSearch and Splunk are built around making large event corpora searchable, with explicit indexing and search-tuning controls. This design is much stronger on retention correctness than on query architecture.
  • Lifecycle management is bolted on, not native. Loki has a compactor. ClickHouse has TTL and partition management built into the storage model. Splunk has explicit indexer/index lifecycle. The proposed partition manager moves toward this, but is still on top of a general-purpose relational database.

Comparison at a glance.

Capability                      This design          Loki                  ClickHouse           OpenSearch   Splunk
Append-first ingest             partial (overflow)   yes (chunks)          yes (parts/merges)   tunable      yes (pipeline)
Storage decoupled from query    no                   yes (object storage)  yes (MergeTree)      partial      yes
Native lifecycle                no (manager)         yes (compactor)       yes (TTL)            partial      yes
Search-first query model        no                   partial               yes                  yes          yes
Distributed scale assumptions   no                   yes                   yes                  yes          yes

Honest verdict. As a near-term Cacti solution, sensible. As a long-term log platform, not enough yet. As a migration step toward a real platform, strong.

Recommended positioning.

A resilient MySQL-backed syslog control plane with an upgrade path to async and external storage.

That is credible. The roadmap then becomes:

  • Now — partition manager, startup validation, bounded reconcile
  • Next — append-only ingest table, async worker
  • After that — optional backend abstraction for ClickHouse or Loki-style external storage and query paths

Final framing

You are not building a partitioning system. You are building a log pipeline with survivability guarantees. That shift is what separates fragile systems from battle-tested ones.

What this issue is for

This is a discussion and tracking issue for v2 design direction. It is not a PR. Concrete implementation work should be split into separate, atomic PRs aligned with the evolution path above.

Related: #313, #314, #315.
