Cacti Syslog v2 — Design Principles (discussion / tracking) #317

@somethingwithproof

Mission

Build a reliable, scalable syslog pipeline that:

  • never loses data
  • degrades gracefully under failure
  • supports fast operational queries
  • evolves beyond MySQL constraints over time

Core principle

Ingest correctness > storage cleverness. If anything breaks, data must still land safely. Optimization can catch up later.

System model

Collector -> Ingest -> Buffer/Overflow -> Normalize -> Store -> Rollups -> Query
                          |
                      Reconcile

A side path for invalid input:

late / bad / out-of-range events -> quarantine or overflow -> bounded reconciler

Design principles

1. Append-first ingest (non-negotiable). Writes must always succeed. Minimal indexing on the ingest path. Use either an ingest table or the dMaxValue overflow partition as a safety net. Never block writes because partitions are not perfect.

2. UTC is law. All partitioning and storage in UTC. No date() logic for partition math. Local time is a presentation concern. This eliminates DST and timezone bugs entirely.

3. Partitioning is an optimization, not correctness. A missing partition is a performance issue, not a failure. dMaxValue absorbs mistakes safely. Stop treating partitions as the system backbone.

4. Pre-create, never react. Always maintain partitions ahead (3-7 days). Rotate partitions early (T-2h). Avoid midnight races completely.

5. Bounded reconciliation only. Move overflow data in batches. Never large blocking reorgs. Always resumable. Repair slowly, never crash the database.

6. Design for outages. Multi-day downtime must be survivable. On startup: validate partitions, assess overflow, schedule recovery. Recovery is a feature, not an afterthought.

7. Late data is normal. Accept out-of-order and delayed events. Define an explicit policy: move if cheap, otherwise tolerate slight misplacement. Do not fight the reality of distributed systems.

8. Retention is lifecycle, not cleanup. Partition drop is the fast path. Time-based delete is the fallback. Both must account for late-arriving data so retention does not corrupt correctness.

9. Observability is required. Track partition coverage, overflow size (dMaxValue), reconcile progress, last maintenance run. If you cannot see it, it will break silently.

10. Async everywhere (future direction). Move toward ingest -> queue/buffer, background processors, decoupled storage. Aligns with Splunk, ClickHouse, and Grafana Loki.
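
Principles 2 and 4 can be made concrete in a few lines. The sketch below computes the pre-create window purely in UTC; the partition naming scheme (`d_YYYYMMDD`) and the 5-day look-ahead are illustrative assumptions, not the plugin's actual schema.

```python
"""Sketch of principles 2 and 4: UTC-only partition math with a
pre-create window. Names and constants are illustrative."""
from datetime import datetime, timedelta, timezone

PRECREATE_DAYS = 5  # stay 3-7 days ahead, per principle 4

def partition_plan(now=None, days_ahead=PRECREATE_DAYS):
    """Return (name, upper_bound) pairs for today plus the look-ahead
    window. All math is done in UTC; local time never enters here."""
    now = now or datetime.now(timezone.utc)
    day0 = now.date()
    plan = []
    for offset in range(days_ahead + 1):
        day = day0 + timedelta(days=offset)
        upper = datetime(day.year, day.month, day.day,
                         tzinfo=timezone.utc) + timedelta(days=1)
        plan.append((f"d_{day:%Y%m%d}", upper))
    return plan
```

Because the boundaries are derived from a single UTC clock, DST transitions and local-time offsets simply never appear in partition math, which is the point of principle 2.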

Anti-patterns

  • "Partitions must always exist or system fails"
  • Large ALTER TABLE during peak hours
  • Local-time partition logic
  • Unbounded reconciliation
  • Assuming perfect scheduler execution
  • Treating MySQL as a true time-series or log engine

Operational guarantees

The system must always guarantee:

  1. Writes never fail due to partition issues
  2. Data is never lost, only delayed in organization
  3. Recovery is deterministic and bounded
  4. Performance degrades gradually, not catastrophically

Evolution path

Now (Stabilize) — partition manager, startup validation, overflow monitoring.

Next (Decouple) — ingest table, async worker, rollups.

Later (Scale) — storage abstraction, external backend (TSDB or log system).

Corner cases to design for

Grouped by category. Each must answer the test: can ingest continue safely even if optimization is broken?

Time and partition boundary

  1. Midnight rollover race — data arrives as the date flips before the next partition exists. Pre-create ahead, never just-in-time.
  2. DST transitions — local-time logic produces 23- or 25-hour days. UTC for storage and partitioning, local time only in display.
  3. Host clock drift — app server, DB, and scheduler disagree on time. One source of truth for partition math.
  4. Late-arriving events — yesterday's message arrives today. Explicit late-data policy with staging or bounded reconcile.
  5. Far-future timestamps — bad device sends year 2099. Reject, clamp, or quarantine.
  6. Ancient timestamps — bad parse sends 1970 or 1900. Minimum acceptable timestamp policy.
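
Cases 4-6 above amount to one explicit acceptance policy. A minimal sketch, assuming illustrative windows (5 minutes of skew tolerance, 30 days of late acceptance) that are not values from the actual plugin:

```python
"""Sketch of an explicit timestamp policy: late events go to the
bounded-reconcile path, bogus ones (2099, 1970) are quarantined."""
from datetime import datetime, timedelta, timezone

MAX_FUTURE = timedelta(minutes=5)  # tolerate small clock skew
MAX_AGE = timedelta(days=30)       # oldest timestamp filed normally

def classify_timestamp(ts, now=None):
    """Return 'accept', 'late', or 'quarantine' for an event time."""
    now = now or datetime.now(timezone.utc)
    if ts > now + MAX_FUTURE or ts < now - MAX_AGE:
        return "quarantine"
    if ts.date() < now.date():
        return "late"
    return "accept"
```

The key property is that every input gets a defined destination; nothing is silently dropped and nothing blocks the ingest path.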

Outage and recovery

  1. Multi-day outage — system down 2-10 days, then resumes. Startup reconciliation, bounded recovery.
  2. Scheduler missed runs — cron dies, plugin disabled, hook never fires. Startup check plus periodic check, not only cron dependence.
  3. Crash during reconcile — process dies mid-batch. Idempotent movement, transactional batches, post-batch verification.
  4. DB restart during ALTER — partition reorg interrupted. Detect incomplete maintenance state, fail safe and report loudly.
  5. Backlog flood after resume — devices dump backlog immediately. Backpressure strategy, separate ingest from heavy repair work.
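
Cases 1 and 3 both hinge on reconciliation being bounded and crash-safe. A sketch of the batch step, using SQLite stand-ins for the real tables (names and columns are illustrative, not the plugin's schema):

```python
"""Sketch of bounded, idempotent reconciliation: move overflow rows in
small transactions, copying and deleting together so a re-run after a
mid-batch crash never duplicates or loses rows."""
import sqlite3

BATCH = 500  # bounded: never one large blocking move

def reconcile_step(conn):
    """Move up to BATCH rows from overflow to main in one transaction.
    Returns rows moved; call repeatedly until it returns 0."""
    cur = conn.execute(
        "SELECT id, logtime, message FROM overflow ORDER BY id LIMIT ?",
        (BATCH,))
    rows = cur.fetchall()
    if not rows:
        return 0
    with conn:  # copy + delete commit atomically
        conn.executemany(
            "INSERT INTO main(logtime, message) VALUES (?, ?)",
            [(t, m) for _, t, m in rows])
        conn.executemany(
            "DELETE FROM overflow WHERE id = ?",
            [(i,) for i, _, _ in rows])
    return len(rows)
```

Because each batch commits copy and delete together, the loop is resumable from any interruption, and its per-step cost stays constant no matter how large the backlog grew during an outage.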

Data correctness

  1. Duplicate messages — collectors resend after retry. Decide whether dedupe matters; if so, define key and window.
  2. Out-of-order arrival — events for the same host arrive non-sequentially. No correctness logic should depend on arrival order.
  3. Parse failures — malformed lines, bad encoding, truncated payloads. Dead-letter or quarantine path; never let one bad message poison the pipeline.
  4. Null or missing key fields — no host, facility, severity, timestamp. Fallback defaults; quarantine for truly unusable records.
  5. Conflicting timezones in message vs collector — explicit precedence rule, documented.
  6. Retention deleting data still needed for reconcile — retention must understand overflow and late-arrival windows.
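
If dedupe does matter (case 1), the policy needs exactly two decisions: the key and the window. A sketch, where the key fields (host, program, message) and the 60-second window are assumptions chosen for illustration:

```python
"""Sketch of a dedupe policy: suppress a repeated content key within a
sliding time window, without depending on arrival order guarantees."""

DEDUP_WINDOW = 60.0  # seconds; illustrative default

class Deduper:
    def __init__(self, window=DEDUP_WINDOW):
        self.window = window
        self.last_seen = {}  # key -> last arrival time (epoch seconds)

    def is_duplicate(self, host, program, message, now):
        """True if this key was already seen inside the window."""
        key = (host, program, message)
        prev = self.last_seen.get(key)
        self.last_seen[key] = now
        return prev is not None and (now - prev) < self.window
```

The same event outside the window is deliberately treated as new; a bounded window keeps the state small and avoids turning dedupe into another unbounded correctness dependency.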

Partition manager specific

  1. Partition already exists — two workers race to create the same one. Advisory lock, idempotent DDL handling.
  2. Missing dMaxValue — manual schema change leaves install partial. Schema validation at startup.
  3. Overflow partition too large to reorganize — millions of rows. Chunked migration, hard cutoff for manual intervention.
  4. Partition naming drift — names do not match convention. Inspect by boundary metadata, not name.
  5. Gap in partition sequence — yesterday missing. Detect gaps explicitly, not just "is today present".
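
Cases 4 and 5 share one fix: inspect boundary metadata, then check the whole sequence. A sketch, assuming the manager can already extract each partition's per-day boundary date (the input format here is illustrative):

```python
"""Sketch of explicit gap detection: given partition boundary dates,
report every missing day, not just whether today exists."""
from datetime import timedelta

def find_gaps(boundaries):
    """boundaries: iterable of per-day partition boundary dates (UTC).
    Returns every missing day between the earliest and latest present."""
    have = set(boundaries)
    if len(have) < 2:
        return []
    lo, hi = min(have), max(have)
    gaps, d = [], lo
    while d < hi:
        d += timedelta(days=1)
        if d not in have:
            gaps.append(d)
    return gaps
```

Driving this from boundary metadata rather than partition names makes it immune to naming drift by design.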

Scale and performance

  1. Huge cardinality fields — programs, hosts, tags explode. Careful index policy, top-N strategies for summaries.
  2. Hot partition overload — nearly all traffic lands on the current day. Minimize secondary indexes on the write path.
  3. Reconcile competing with user queries — bounded reconcile, lighter during peak windows.
  4. Large retention drops — many old partitions dropped at once after lag. Controlled cleanup cadence.
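
The top-N strategy from case 1 can be as simple as summarizing the heavy hitters instead of indexing every distinct value. A minimal sketch:

```python
"""Sketch of a top-N summary for high-cardinality fields: store
everything, but build summaries only for the most frequent values."""
from collections import Counter

def top_n(values, n=10):
    """Return the n most frequent values with their counts."""
    return Counter(values).most_common(n)
```
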

Replication and HA

  1. Replication lag — heavy DDL/writes lag replicas. Replication-aware maintenance.
  2. Failover during maintenance — primary changes mid-run. Leader election or advisory-lock strategy tied to active primary.

Security and abuse

  1. Log injection / huge payloads — max payload size, safe truncation, escaped display.
  2. Untrusted content breaks downstream — control chars, bad UTF-8. Normalize encoding; preserve raw, sanitize for display.
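
Both items reduce to one display-side guard: cap size, repair encoding, strip control characters, while the raw bytes are preserved elsewhere. A sketch with an illustrative size limit:

```python
"""Sketch of display-side sanitization: never crash on bad UTF-8,
truncate oversized payloads, drop control characters except tab.
The raw payload would be stored untouched on a separate path."""

MAX_PAYLOAD = 8192  # bytes; illustrative cap, truncate rather than reject

def sanitize_for_display(raw: bytes) -> str:
    """Decode with replacement characters, truncate, filter controls."""
    text = raw[:MAX_PAYLOAD].decode("utf-8", errors="replace")
    return "".join(ch for ch in text if ch == "\t" or ch.isprintable())
```
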

Operational

  1. Config changed mid-day — retention days, timezone, partition window changed live. Define what applies immediately vs next maintenance cycle.
  2. Manual DBA intervention — humans create or drop partitions by hand. Detect drift and report, do not bulldoze.
  3. Disk nearly full — partition creation needs space. Headroom checks, degraded mode behavior.
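
The headroom check in case 3 is a one-liner worth making explicit: refuse maintenance (and fall back to degraded, append-only behavior) before the disk actually fills. The path and threshold below are illustrative assumptions:

```python
"""Sketch of a pre-maintenance headroom check using stdlib
shutil.disk_usage; threshold and default path are illustrative."""
import shutil

MIN_FREE_BYTES = 5 * 1024**3  # 5 GiB of headroom before maintenance runs

def maintenance_allowed(path="/var/lib/mysql"):
    """True if there is enough free space to create partitions safely."""
    return shutil.disk_usage(path).free >= MIN_FREE_BYTES
```
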

Hardening priority order

If we harden one batch at a time, the order is:

  1. Midnight rollover race
  2. DST and local time mistakes
  3. Multi-day outage recovery
  4. Huge dMaxValue growth
  5. Late-arriving events
  6. Crash during reconcile
  7. Missing scheduler runs
  8. Far-future / ancient bogus timestamps
  9. Retention interacting badly with overflow
  10. Duplicate concurrent partition maintenance

How this compares to battle-tested systems

The design above is directionally right, but it is still a transitional architecture. Real log platforms separate ingestion, storage, compaction, retention, and query execution much more cleanly than a single MySQL schema can.

Where this design matches the proven systems. Append-oriented ingest, async background processing, bounded reconciliation, time-shard aware retention, and separation of write and read paths are all how scaled systems survive. Splunk documents a multi-phase data pipeline rather than writing directly into final search form. Loki splits responsibilities across write, read, and backend components with retention handled by a compactor. ClickHouse treats partitioning as a management primitive while relying on background merges for long-term efficiency. The shift away from "partitions are the whole strategy" is the right move.

Where this design is still weaker.

  • Overflow is a recovery hack, not a real ingest boundary. dMaxValue is useful, but it is still an emergency bucket inside the same database abstraction. Loki writes compressed chunks against object storage. Splunk has a formal ingest pipeline before data becomes searchable. ClickHouse uses MergeTree parts and background merges as the safety valve.
  • Ingest correctness is still coupled to MySQL operational health. Loki and ClickHouse assume heavy ingest, delayed compaction, and independent retention. This design still depends on partition coverage being healthy, reconcile succeeding, and MySQL staying responsive during repair windows.
  • Search and index strategy is underdeveloped. OpenSearch and Splunk are built around making large event corpora searchable, with explicit indexing and search-tuning controls. This design is much stronger on retention correctness than on query architecture.
  • Lifecycle management is bolted on, not native. Loki has a compactor. ClickHouse has TTL and partition management built into the storage model. Splunk has explicit indexer/index lifecycle. The proposed partition manager moves toward this, but is still on top of a general-purpose relational database.

Comparison at a glance.

Capability                      This design          Loki                  ClickHouse           OpenSearch   Splunk
Append-first ingest             partial (overflow)   yes (chunks)          yes (parts/merges)   tunable      yes (pipeline)
Storage decoupled from query    no                   yes (object storage)  yes (MergeTree)      partial      yes
Native lifecycle                no (manager)         yes (compactor)       yes (TTL)            partial      yes
Search-first query model        no                   partial               yes                  yes          yes
Distributed scale assumptions   no                   yes                   yes                  yes          yes

Honest verdict. As a near-term Cacti solution, sensible. As a long-term log platform, not enough yet. As a migration step toward a real platform, strong.

Recommended positioning.

A resilient MySQL-backed syslog control plane with an upgrade path to async and external storage.

That is credible. The roadmap then becomes:

  • Now — partition manager, startup validation, bounded reconcile
  • Next — append-only ingest table, async worker
  • After that — optional backend abstraction for ClickHouse or Loki-style external storage and query paths

Final framing

You are not building a partitioning system. You are building a log pipeline with survivability guarantees. That shift is what separates fragile systems from battle-tested ones.

What this issue is for

This is a discussion and tracking issue for v2 design direction. It is not a PR. Concrete implementation work should be split into separate, atomic PRs aligned with the evolution path above.

Related: #313, #314, #315.
