Releases: gma1k/podtrace

Release v0.11.0

25 Apr 15:26
4134f64

Podtrace v0.11.0 — Kubernetes-native operator, agent, and session runtime

This is the largest release since the project began. v0.11.0 turns Podtrace from a CLI into a full Kubernetes operator with declarative PodTrace, PodTraceSession, ExporterConfig, and TracerConfig CRDs, a per-node agent DaemonSet, a session Job runtime that wires the existing --diagnose flow to Kubernetes, and a deploy/operational chart layer that gives you a working trace fleet end-to-end with helm install.

The standalone podtrace CLI is unchanged: every flag, every output, and every test on the existing binary still works.

Highlights

  • Four podtrace.io/v1alpha1 CRDs with admission validation:
    • PodTrace — continuous tracing over a dynamic pod selector
    • PodTraceSession — bounded "diagnose this pod for 30s" jobs
    • ExporterConfig — central exporter declaration (OTLP / Jaeger / Zipkin / Splunk / DataDog), referenced by name
    • TracerConfig — cluster-wide agent + session image and runtime knobs
  • Operator with leader election and three reconcilers (TracerConfig, PodTrace, PodTraceSession). Cross-namespace cleanup via finalizers; per-PodTrace exporter bundle synced into podtrace-system for agents to consume.
  • Per-node agent DaemonSet that watches PodTraces cluster-wide, merges all active CRs into one shared eBPF tracer, routes events to the right exporter per CR, and patches per-node status back via Server-Side Apply (no clobbering across nodes).
  • Session Jobs that mount the operator-rendered exporter bundle and call podtrace --diagnose, with three parallel artifact channels (summary file, kubelet termination message, ConfigMap/Secret upload) so the result of a session is durably visible regardless of how the Job ended.
  • Helm install gives you a working fleet: chart now renders a default TracerConfig, optional Prometheus ServiceMonitor / PodMonitor, and an opt-in OTLP starter ExporterConfig.
  • Hardened security posture: all systemic suppressions replaced with scoped os.Root helpers and saturating integer conversions; an overflow bug in alert thresholds fixed at the source.

What's new

Custom resources and webhooks (#69)

Four CRDs with kubebuilder validation markers, generated typed clientset under pkg/client/clientset/versioned/, and a validating webhook that enforces invariants the CRD schema can't express:

  • spec.selector and spec.podRefs are mutually exclusive — exactly one is required.
  • spec.exporterRef.name must resolve to an ExporterConfig in the same namespace.
  • ExporterConfig.spec.type must match the populated typed sub-field (otlp: with type: otlp, etc.).
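
The first invariant above (exactly one of spec.selector and spec.podRefs) can be pictured with a small check. This is a minimal sketch, not the project's webhook code: PodTraceSpec, its field types, and validateTargets are hypothetical stand-ins, and treating an empty selector as "not set" is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// PodTraceSpec is a hypothetical, trimmed-down stand-in for the real spec.
// Only the two fields named in the release notes are modelled.
type PodTraceSpec struct {
	Selector map[string]string // stand-in for spec.selector
	PodRefs  []string          // stand-in for spec.podRefs
}

// validateTargets enforces "exactly one of selector or podRefs".
// Assumption: an empty selector counts as unset.
func validateTargets(s PodTraceSpec) error {
	hasSelector := len(s.Selector) > 0
	hasPodRefs := len(s.PodRefs) > 0
	switch {
	case hasSelector && hasPodRefs:
		return errors.New("spec.selector and spec.podRefs are mutually exclusive")
	case !hasSelector && !hasPodRefs:
		return errors.New("one of spec.selector or spec.podRefs is required")
	}
	return nil
}

func main() {
	ok := PodTraceSpec{Selector: map[string]string{"app": "nginx"}}
	both := PodTraceSpec{Selector: map[string]string{"app": "nginx"}, PodRefs: []string{"nginx-0"}}
	fmt.Println(validateTargets(ok))
	fmt.Println(validateTargets(both))
	fmt.Println(validateTargets(PodTraceSpec{}))
}
```

A rule of this shape is hard to express in a plain CRD schema, which is presumably why the release places it in the validating webhook.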

A stable pkg/tracer engine seam (Engine, Exporter, TracerBackend, Target, TargetSet) becomes the single boundary between the eBPF core and every operational mode (CLI, agent, session job). The Helm chart, multi-stage Dockerfile, and make generate | manifests | clientset | docker-build | helm-lint | helm-template | envtest workflow all land here.

Operator control plane (#70)

podtrace operator now runs a controller-runtime manager with three reconcilers:

  • TracerConfigReconciler — owns the agent DaemonSet, ServiceAccount, ClusterRole/Binding, and (new in #73) a namespaced bundle Role.
  • PodTraceReconciler — resolves the referenced ExporterConfig, materializes a per-CR exporter bundle ConfigMap (+ optional credential Secret) in podtrace-system, rolls per-node statuses up into top-level conditions.
  • PodTraceSessionReconciler — fans the session out into one Job per node hosting matched pods, enforces an optional per-node concurrency cap, drives TTL-based cleanup.

Cross-namespace child cleanup uses a single shared finalizer; the operator runs unprivileged. End-to-end smoke verified on kind.

Per-node agent DaemonSet (#71)

The agent watches PodTraces cluster-wide, Pods on its own node, and exporter bundles in podtrace-system. Concrete behaviors:

  • Multi-CR merge router: every event is dispatched to the subset of per-CR exporters whose cgroup set contains the event's CgroupID and whose filter set includes the event type. One shared eBPF pipeline serves N overlapping CRs.
  • Exporter cache keyed by bundle ResourceVersion: credential rotation rebuilds the OTLP exporter; no-op reconciles don't reopen connections.
  • Server-Side Apply per-node status: the field-owner is podtrace-agent-<NODE_NAME>, so many agents write status.nodeStatus[] concurrently without clobbering each other.
  • Health probes: /healthz (stall-window liveness) and /readyz (flips true post-cache-sync).
  • Prometheus metrics: podtrace_agent_{events_exported_total, events_dropped_total, active_cgroups, active_crs, reconcile_total}, labeled per CR.

OTLP is the only concrete agent-side exporter for now; Jaeger/Zipkin/Splunk/DataDog return an explicit "not yet implemented" so a Degraded condition surfaces it on kubectl describe.

Session Jobs (#72)

PodTraceSession is the bounded-diagnose CRD. The operator now creates one Job per matched node that mounts the per-session exporter bundle at /etc/podtrace/exporter/, runs the existing podtrace --diagnose CLI with three new flags, and captures the result through three parallel channels:

  • --summary-file <path> — full JSON summary on a shared emptyDir.
  • --termination-message-path <path> — compact JSON on /dev/termination-log so kubelet surfaces it in Pod.Status.ContainerStatuses[…].State.Terminated.Message.
  • --report-to <kind>/<ns>/<name> — self-uploads the human-readable report to a ConfigMap or Secret via the in-cluster ServiceAccount.

A new podtrace report-uploader subcommand provides a native sidecar mode for clusters that need re-upload-on-completion semantics (TracerConfig.spec.session.sidecarUploader: true).

The session Job runs as a dedicated podtrace-session ServiceAccount with a narrow per-session Role + RoleBinding in the user namespace, not the agent SA. RBAC is scoped to resourceNames of the specific report sink.

A shared pkg/exporter/bundle package now provides the canonical Payload type plus FromConfigMapData/ToConfigMapData and FromYAML/ToYAML codecs, so the operator, agent, and CLI all consume byte-identical exporter config.

End-to-end verified on kind: a session against an nginx pod produces phase: Completed, aggregated status.summary, per-Job eventCount, and a human-readable report in a user-namespace ConfigMap.

Deploy / operational chart layer (#73)

helm install podtrace deploy/charts/podtrace now produces a working trace fleet end-to-end:

  • Default TracerConfig rendered from values.yaml (image, agent runtime, session caps, scheduling, BTF mode, sidecar uploader). Default-on with the operator; opt out via tracerConfig.create=false.
  • ServiceMonitor + PodMonitor templates for Prometheus scrape config (operator :8080, agent :9090). Double-gated on toggle and monitoring.coreos.com/v1 API availability — enabling them on a cluster without prometheus-operator is a silent no-op rather than a hard install failure.
  • Opt-in starter OTLP ExporterConfig so a fresh install can run a sample diagnose without first authoring exporter config.
  • Refined metrics.serviceMonitor / metrics.podMonitor values — labels, interval, path.
  • Post-install NOTES surface what was rendered and the right next-step command.

Agent RBAC narrowed: the agent ClusterRole no longer grants cluster-wide configmaps/secrets reads. Bundle reads are scoped to podtrace-system via a namespaced Role + RoleBinding the TracerConfigReconciler materializes alongside the agent ServiceAccount.

Operator RBAC: gains roles/rolebindings create/update/delete (per-session RBAC) and delete on podtrace.io CRDs (TTL-driven session cleanup — pre-existing bug fixed in passing).

ExporterConfig → PodTrace watch wired, so changes to a referenced ExporterConfig requeue every PodTrace that references it.

Documentation: full per-CRD reference docs (PodTrace, PodTraceSession, ExporterConfig, TracerConfig) plus a Chainsaw e2e suite that exercises lifecycle invariants — continuous, session, multi-CR overlap, agent restart, credential rotation, PSA enforcement.

Security hardening (#74)

  • Five new helper packages that own the trust boundaries:
    • internal/safeconv: saturating integer conversions; replaces 19 G115 sites.
    • internal/procfs: os.Root-scoped reads under /proc (~25 G304 sites).
    • internal/sysfs: os.Root-scoped reads under /sys/fs/cgroup.
    • internal/ldsoconf: os.Root-scoped /etc/ld.so.conf* parsing.
    • internal/hostfs: validated boundary helpers (Stat, WalkRegular, ReadFile, WriteFile) for paths that legitimately cross os.Root's scope.
  • Real correctness fixes that the suppressions had been hiding:
    • AlertWarnPct/CritPct/EmergPct are now clamped to [0, 100] at config-load time. A misconfigured PODTRACE_ALERT_WARN_PCT=4294967296 previously wrapped to 0 and made every event trip a "WARNING" alert; it now clamps to 100.
    • BPFHashMapSize and RingBufferSizeKB are clamped at the source via config.ClampUint32, with overflow-safe multiplication for the ring-buffer byte count.
    • EventResourceLimit utilization comparisons run in int end-to-end with a defensive negative-value guard.
    • string(rune(event.PID)) in internal/diagnose/tracker/trace_tracker.go (which produced a single Unicode codepoint per PID and silently broke for PIDs above 0x10FFFF) is replaced with strconv.FormatUint.
  • Envtest path fix: make envtest now runs the full CRD validation suite (api/v1alpha1, operator, agent) again — three suite_test.go files were looking under deploy/charts/podtrace/templates/crds/ but the chart layout puts CRDs at deploy/charts/podtrace/crds/...

Release v0.10.0

19 Apr 07:25
618b5eb

What's Changed

This release brings two major tracing capabilities, broader export options, and a round of security and stability work.

Features

Multi-pod tracing & cross-namespace support (#63)
Trace workloads that span multiple pods across namespaces in a single session.

Performance profiling integration (#67)
Slow events now come with the exact goroutine and CPU stacks that caused them. Podtrace fetches pprof data (heap, goroutine, CPU) from target pods and correlates it with BPF stack traces. Profiling is auto-triggered on configurable latency thresholds or started on demand via /profile/start. BPF kernel timestamps are now aligned to wall-clock time for precise correlation with application logs.

New trace exporters: Datadog and Zipkin (#68)
In addition to OTLP, Jaeger, and Splunk HEC, you can now export traces directly to Datadog and Zipkin backends.

Security & Stability (#66)

  • Hardened Podtrace security posture
  • Updated dependencies

🧪 Quality & Tooling (#67, #68)

  • Expanded test coverage
  • Upgraded CI workflows
  • Consolidated documentation into a single source

Upgrading

No breaking changes.

Full Changelog: v0.9.0...v0.10.0

Release v0.9.0

28 Mar 09:18
d16ce2c

What's Changed

  • fix: DaemonSet no-events and systemd cgroup driver by @gma1k in #55
  • feat: improve cross-environment portability, improve test coverage, and improve docs by @gma1k in #56
  • fix: improve cgroup resolution for cri by @gma1k in #57
  • feat: harden reliability with stack traces, channel depth metrics, pprof, fuzz test, govulncheck, gosec, and security fixes by @gma1k in #58
  • Replace many code lines of vmlinux.h fallback with a minimal stub, since all types it provided are already locally defined or forward-declared in common.h by @gma1k in #59
  • feat: add filesystem path extraction, IPv6 fix, event schema V4, runtime probe groups, configurable BPF map sizes, alert thresholds, and unlink/rename event types by @gma1k in #60
  • feat: add language-runtime adapters for Redis, Memcached, FastCGI, gRPC, Kafka, USDT, and PII redaction by @gma1k in #61

Full Changelog: v0.8.0...v0.9.0

Release v0.8.0

17 Dec 22:23
3246d64

What's Changed

  • fix: Replace bpf_d_path with inode-based path resolution for cross-kernel compatibility by @gma1k in #41
  • feat: Add resource limit monitoring for CPU, memory, and I/O by @gma1k in #42
  • fix: improving go ci test by @gma1k in #43
  • feat: Add connection pool monitoring by @gma1k in #44
  • fix: Replace pathresolver with kernel path resolution by @gma1k in #45
  • feat: Add real-time alerting system with webhook, Slack, etc by @gma1k in #46
  • fix: fixing vmlinux.h placeholder file by @gma1k in #47
  • fix: Bug fixes and security hardening by @gma1k in #48
  • fix: Optimize CI workflow, improve test coverage, fix linter errors, and migrate alerting config by @gma1k in #49
  • fix: Print version in cli by @gma1k in #50
  • feat: add CRI-O support and enhancing the diagnostics by @gma1k in #53

Full Changelog: v0.7.0...v0.8.0

Release v0.7.0

08 Dec 06:35
b6d2697

What's Changed

  • add issue template by @olamilekan000 in #23
  • fix: Code hardening by @gma1k in #24
  • feat: Comprehensive test suite by @gma1k in #25
  • chore: harden eBPF parser and cgroup filter testing by @gma1k in #26
  • Refactor: Centralize Configuration, Fix Bugs, and Improve Code Quality by @gma1k in #27
  • Refactor: the GenerateReport() function in diagnose.go by @gma1k in #28
  • feat: Add structured logging with zap and internal metrics tracking by @gma1k in #29
  • test: Improve test coverage by @gma1k in #30
  • test: improve test coverage by @gma1k in #31
  • test: improve test coverage by @gma1k in #32
  • refactor: improve error handling, add object pooling, reduce code duplication, improve cache management, and more by @gma1k in #33
  • test: Improve test coverage, refactor: reorganize codebase into modular folder structure by @gma1k in #34
  • feat: implement multi-architecture support with dynamic library discovery, priority-based event sampling, LRU caching, and enhanced error handling by @gma1k in #35
  • fix: fix runDiagnoseMode lint issue by @gma1k in #36
  • test: improve coverage test by @gma1k in #37
  • feat: Add Kubernetes context enrichment with pod-to-pod tracking, error correlation with root cause analysis by @gma1k in #38
  • feat: Add TLS/SSL handshake tracking by @gma1k in #39
  • feat: Add distributed tracing support with OTLP, Jaeger, and Splunk exporters by @gma1k in #40

Full Changelog: v0.6.0...v0.7.0

Release v0.6.0

02 Dec 13:50
5482223

What's Changed

  • chore: Improve podtrace validation and metrics configuration by @gma1k in #20
  • feat: Add file path tracking and I/O bandwidth metrics by @gma1k in #21
  • feat: Add advanced observability features: stack traces, lock contention, s… by @gma1k in #22

Full Changelog: v0.5.0...v0.6.0

Release v0.5.0

01 Dec 22:50
1f17c53

What's Changed

  • fix: Security hardening - input validation, rate limiting, and security headers by @gma1k in #15
  • Add podtrace docs by @gma1k in #16
  • feat: Implement remaining features by @gma1k in #17
  • feat: Add UDP/HTTP tracing, fix I/O bandwidth val, and OOM Kill detection by @gma1k in #18
  • refactor: Split large files by @gma1k in #19

Full Changelog: v0.4.0...v0.5.0

Release v0.4.0

28 Nov 19:21
ab69d5e

What's Changed

  • feat: Add ring buffer migration and enhanced monitoring features by @gma1k in #13
  • feat: add metric visualization using prometheus and grafana by @rif7t in #14

New Contributors

  • @rif7t made their first contribution in #14

Full Changelog: v0.3.0...v0.4.0

Release v0.3.0

23 Nov 13:33
1ce027e

What's Changed

  • feat: add CPU usage by @gma1k in #8
  • fix: resolve ShellCheck warnings and fix CodeQL Go setup in CI by @gma1k in #9
  • feat: add real-time mode with periodic updates by @gma1k in #10

Full Changelog: v0.2.0...v0.3.0

Release v0.2.0

22 Nov 19:12
6bcd379

What's Changed

  • chore: Update dependencies to latest versions and improve Actions workflows by @gma1k in #7

What's New

Major Dependency Updates

  • cilium/ebpf: v0.11.0 → v0.20.0
  • spf13/cobra: v1.8.0 → v1.10.1
  • k8s.io/client-go: v0.29.0 → v0.34.2
  • k8s.io/apimachinery: v0.29.0 → v0.34.2
  • golang.org/x/sys: v0.31.0 → v0.38.0

Go 1.24 Support

  • Upgraded from Go 1.19 to Go 1.24
  • Automatic toolchain management with GOTOOLCHAIN=auto
  • Makefile now auto-detects Go 1.24 installation

CI Workflow Improvements

  • Enhanced workflow triggers for dependency changes
  • Improved bash script validation
  • Better error handling and automation

Code Quality

  • All bash scripts pass ShellCheck validation
  • All scripts formatted with shfmt
  • Improved code consistency

Breaking Changes

  • Go 1.24+ is now required (previously Go 1.19+)
    • The Makefile will guide you through upgrading if needed
    • CI automatically handles Go 1.24 via toolchain auto-download

Installation

Upgrading Go

wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

Build

make build

Full Changelog: v0.1.0...v0.2.0