Releases: gma1k/podtrace
Release v0.11.0
Podtrace v0.11.0 — Kubernetes-native operator, agent, and session runtime
This is the largest release since the project began. v0.11.0 turns Podtrace from a CLI into a full Kubernetes operator with declarative PodTrace, PodTraceSession, ExporterConfig, and TracerConfig CRDs, a per-node agent DaemonSet, a session Job runtime that wires the existing --diagnose flow to Kubernetes, and a deploy/operational chart layer that gives you a working trace fleet end-to-end with helm install.
The standalone podtrace CLI is unchanged: every flag, every output, and every test on the existing binary still works.
Highlights
- Four `podtrace.io/v1alpha1` CRDs with admission validation:
  - `PodTrace` — continuous tracing over a dynamic pod selector
  - `PodTraceSession` — bounded "diagnose this pod for 30s" jobs
  - `ExporterConfig` — central exporter declaration (OTLP / Jaeger / Zipkin / Splunk / DataDog), referenced by name
  - `TracerConfig` — cluster-wide agent + session image and runtime knobs
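As a hedged sketch of what a continuous trace declaration could look like (field names and values beyond those quoted in these notes are assumptions, not copied from the repo):

```yaml
apiVersion: podtrace.io/v1alpha1
kind: PodTrace
metadata:
  name: checkout-latency        # hypothetical name
  namespace: shop               # hypothetical namespace
spec:
  selector:                     # dynamic pod selector (mutually exclusive with podRefs)
    matchLabels:
      app: checkout
  exporterRef:
    name: central-otlp          # must resolve to an ExporterConfig in the same namespace
```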
- Operator with leader election and three reconcilers (TracerConfig, PodTrace, PodTraceSession). Cross-namespace cleanup via finalizers; a per-PodTrace exporter bundle is synced into `podtrace-system` for agents to consume.
- Per-node agent DaemonSet that watches PodTraces cluster-wide, merges all active CRs into one shared eBPF tracer, routes events to the right exporter per CR, and patches per-node status back via Server-Side Apply (no clobbering across nodes).
- Session Jobs that mount the operator-rendered exporter bundle and call `podtrace --diagnose`, with three parallel artifact channels (summary file, kubelet termination message, ConfigMap/Secret upload) so the result of a session is durably visible regardless of how the Job ended.
- Helm install gives you a working fleet: the chart now renders a default TracerConfig, optional Prometheus ServiceMonitor / PodMonitor, and an opt-in OTLP starter ExporterConfig.
- Hardened security posture: all systemic suppressions replaced with scoped `os.Root` helpers and saturating integer conversions; an overflow bug in alert thresholds fixed at the source.
What's new
Custom resources and webhooks (#69)
Four CRDs with kubebuilder validation markers, generated typed clientset under pkg/client/clientset/versioned/, and a validating webhook that enforces invariants the CRD schema can't express:
- `spec.selector` and `spec.podRefs` are mutually exclusive — exactly one is required.
- `spec.exporterRef.name` must resolve to an `ExporterConfig` in the same namespace.
- `ExporterConfig.spec.type` must match the populated typed sub-field (an `otlp:` block with `type: otlp`, etc.).
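To illustrate the third invariant, a sketch of an ExporterConfig whose `type` matches its populated sub-field (field names beyond those quoted above, and the endpoint, are assumptions):

```yaml
apiVersion: podtrace.io/v1alpha1
kind: ExporterConfig
metadata:
  name: central-otlp            # hypothetical name
  namespace: shop
spec:
  type: otlp                    # must match the populated typed sub-field below
  otlp:                         # populating e.g. a jaeger: block here instead would be rejected
    endpoint: otel-collector.observability:4317   # hypothetical endpoint
```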
A stable `pkg/tracer` engine seam (`Engine`, `Exporter`, `TracerBackend`, `Target`, `TargetSet`) becomes the single boundary between the eBPF core and every operational mode (CLI, agent, session job). The Helm chart, multi-stage Dockerfile, and the `make generate | manifests | clientset | docker-build | helm-lint | helm-template | envtest` workflow all land here.
Operator control plane (#70)
podtrace operator now runs a controller-runtime manager with three reconcilers:
- TracerConfigReconciler — owns the agent DaemonSet, ServiceAccount, ClusterRole/Binding, and (new in #73) a namespaced bundle Role.
- PodTraceReconciler — resolves the referenced ExporterConfig, materializes a per-CR exporter bundle ConfigMap (+ optional credential Secret) in `podtrace-system`, and rolls per-node statuses up into top-level conditions.
- PodTraceSessionReconciler — fans the session out into one Job per node hosting matched pods, enforces an optional per-node concurrency cap, and drives TTL-based cleanup.
Cross-namespace child cleanup uses a single shared finalizer; the operator runs unprivileged. End-to-end smoke verified on kind.
Per-node agent DaemonSet (#71)
The agent watches PodTraces cluster-wide, Pods on its own node, and exporter bundles in podtrace-system. Concrete behaviors:
- Multi-CR merge router — every event is dispatched to the subset of per-CR exporters whose cgroup set contains the event's `CgroupID` and whose filter set includes the event type. One shared eBPF pipeline serves N overlapping CRs.
- Exporter cache keyed by bundle ResourceVersion — credential rotation rebuilds the OTLP exporter; no-op reconciles don't reopen connections.
- Server-Side Apply per-node status — the field owner is `podtrace-agent-<NODE_NAME>`, so many agents write `status.nodeStatus[]` concurrently without clobbering each other.
- Health probes — `/healthz` (stall-window liveness) + `/readyz` (flips true post-cache-sync).
- Prometheus metrics — `podtrace_agent_{events_exported_total, events_dropped_total, active_cgroups, active_crs, reconcile_total}`, labeled per CR.
OTLP is the only concrete agent-side exporter for now; Jaeger/Zipkin/Splunk/DataDog return an explicit "not yet implemented" so a Degraded condition surfaces it on kubectl describe.
Session Jobs (#72)
PodTraceSession is the bounded-diagnose CRD. The operator now creates one Job per matched node that mounts the per-session exporter bundle at /etc/podtrace/exporter/, runs the existing podtrace --diagnose CLI with three new flags, and captures the result through three parallel channels:
- `--summary-file <path>` — full JSON summary on a shared `emptyDir`.
- `--termination-message-path <path>` — compact JSON on `/dev/termination-log` so kubelet surfaces it in `Pod.Status.ContainerStatuses[…].State.Terminated.Message`.
- `--report-to <kind>/<ns>/<name>` — self-uploads the human-readable report to a ConfigMap or Secret via the in-cluster ServiceAccount.
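Putting those artifact channels together, a bounded session might be declared like this (a sketch only: every field name below beyond `podRefs` and `exporterRef` is an assumption illustrating the description above, not the repo's actual schema):

```yaml
apiVersion: podtrace.io/v1alpha1
kind: PodTraceSession
metadata:
  name: diagnose-nginx          # hypothetical name
  namespace: shop
spec:
  podRefs:                      # bounded session against a named pod (exclusive with selector)
    - name: nginx
  duration: 30s                 # hypothetical field for the "diagnose this pod for 30s" window
  exporterRef:
    name: central-otlp
  report:                       # hypothetical sink spec feeding --report-to <kind>/<ns>/<name>
    kind: ConfigMap
    name: nginx-diagnose-report
```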
A new podtrace report-uploader subcommand provides a native sidecar mode for clusters that need re-upload-on-completion semantics (TracerConfig.spec.session.sidecarUploader: true).
The session Job runs as a dedicated podtrace-session ServiceAccount with a narrow per-session Role + RoleBinding in the user namespace, not the agent SA. RBAC is scoped to resourceNames of the specific report sink.
A shared pkg/exporter/bundle package now provides the canonical Payload type plus FromConfigMapData/ToConfigMapData and FromYAML/ToYAML codecs, so the operator, agent, and CLI all consume byte-identical exporter config.
End-to-end verified on kind: a session against an nginx pod produces phase: Completed, aggregated status.summary, per-Job eventCount, and a human-readable report in a user-namespace ConfigMap.
Deploy / operational chart layer (#73)
helm install podtrace deploy/charts/podtrace now produces a working trace fleet end-to-end:
- Default TracerConfig rendered from `values.yaml` (image, agent runtime, session caps, scheduling, BTF mode, sidecar uploader). Default-on with the operator; opt out via `tracerConfig.create=false`.
- ServiceMonitor + PodMonitor templates for Prometheus scrape config (operator `:8080`, agent `:9090`). Double-gated on the toggle and `monitoring.coreos.com/v1` API availability — enabling them on a cluster without prometheus-operator is a silent no-op rather than a hard install failure.
- Opt-in starter OTLP ExporterConfig so a fresh install can run a sample diagnose without first authoring exporter config.
- Refined `metrics.serviceMonitor` / `metrics.podMonitor` values — labels, interval, path.
- Post-install NOTES surface what was rendered and the right next-step command.
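A sketch of the corresponding chart values, built only from the value keys named above; the exact nesting and defaults are assumptions:

```yaml
# Hypothetical values.yaml excerpt (structure assumed from the keys above)
tracerConfig:
  create: true                  # default-on with the operator; set false to opt out
metrics:
  serviceMonitor:
    enabled: true               # silent no-op if monitoring.coreos.com/v1 is absent
    interval: 30s
  podMonitor:
    enabled: false
```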
Agent RBAC narrowed: the agent ClusterRole no longer grants cluster-wide configmaps/secrets reads. Bundle reads are scoped to podtrace-system via a namespaced Role + RoleBinding the TracerConfigReconciler materializes alongside the agent ServiceAccount.
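The namespaced Role described could look roughly like this (standard Kubernetes RBAC syntax; the Role name is a hypothetical placeholder):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: podtrace-agent-bundle-reader   # hypothetical name
  namespace: podtrace-system           # bundle reads are scoped to this namespace only
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
```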
Operator RBAC: gains roles/rolebindings create/update/delete (per-session RBAC) and delete on podtrace.io CRDs (TTL-driven session cleanup — pre-existing bug fixed in passing).
ExporterConfig → PodTrace watch wired, so changes to a referenced ExporterConfig requeue every PodTrace that references it.
Documentation: full per-CRD reference docs (PodTrace, PodTraceSession, ExporterConfig, TracerConfig) plus a Chainsaw e2e suite that exercises lifecycle invariants — continuous, session, multi-CR overlap, agent restart, credential rotation, PSA enforcement.
Security hardening (#74)
- Five new helper packages that own the trust boundaries:
- `internal/safeconv` — saturating integer conversions; replaces 19 G115 sites.
  - `internal/procfs` — `os.Root`-scoped reads under `/proc` (~25 G304 sites).
  - `internal/sysfs` — `os.Root`-scoped reads under `/sys/fs/cgroup`.
  - `internal/ldsoconf` — `os.Root`-scoped `/etc/ld.so.conf*` parsing.
  - `internal/hostfs` — validated boundary helpers (`Stat`, `WalkRegular`, `ReadFile`, `WriteFile`) for paths that legitimately cross `os.Root`'s scope.
- Real correctness fixes that the suppressions had been hiding:
- `AlertWarnPct`/`CritPct`/`EmergPct` are now clamped to `[0, 100]` at config-load time. A misconfigured `PODTRACE_ALERT_WARN_PCT=4294967296` previously wrapped to `0` and made every event trip a "WARNING" alert; it now clamps to `100`.
  - `BPFHashMapSize` and `RingBufferSizeKB` are clamped at the source via `config.ClampUint32`, with overflow-safe multiplication for the ring-buffer byte count.
  - `EventResourceLimit` utilization comparisons run in `int` end-to-end with a defensive negative-value guard.
  - `string(rune(event.PID))` in `internal/diagnose/tracker/trace_tracker.go` (which produced a single Unicode codepoint per PID and silently broke for PIDs above `0x10FFFF`) is replaced with `strconv.FormatUint`.
- Envtest path fix: `make envtest` now runs the full CRD validation suite (api/v1alpha1, operator, agent) again — three suite_test.go files were looking under `deploy/charts/podtrace/templates/crds/`, but the chart layout puts CRDs at `deploy/charts/podtrace/crds/`.
Release v0.10.0
What's Changed
This release brings two major tracing capabilities, broader export options, and a round of security and stability work.
Features
Multi-pod tracing & cross-namespace support (#63)
Trace workloads that span multiple pods across namespaces in a single session.
Performance profiling integration (#67)
Slow events now come with the exact goroutine and CPU stacks that caused them. Podtrace fetches pprof data (heap, goroutine, CPU) from target pods and correlates it with BPF stack traces, auto-triggered on configurable latency thresholds or on demand via `/profile/start`. BPF kernel timestamps are now aligned to wall-clock time for precise correlation with application logs.
New trace exporters: Datadog and Zipkin (#68)
In addition to OTLP, Jaeger, and Splunk HEC, you can now export traces directly to Datadog and Zipkin backends.
Security & Stability (#66)
- Hardened Podtrace security posture
- Updated dependencies
🧪 Quality & Tooling (#67, #68)
- Expanded test coverage
- Upgraded CI workflows
- Consolidated documentation into a single source
Upgrading
No breaking changes.
Full Changelog: v0.9.0...v0.10.0
Release v0.9.0
What's Changed
- fix: DaemonSet no-events and systemd cgroup driver by @gma1k in #55
- feat: improve cross-environment portability, improve test coverage, and improve docs by @gma1k in #56
- fix: improve cgroup resolution for cri by @gma1k in #57
- feat: harden reliability with stack traces, channel depth metrics, pprof, fuzz test, govulncheck, gosec, and security fixes by @gma1k in #58
- Replace many code lines of vmlinux.h fallback with a minimal stub, since all types it provided are already locally defined or forward-declared in common.h by @gma1k in #59
- feat: add filesystem path extraction, IPv6 fix, event schema V4, runtime probe groups, configurable BPF map sizes, alert thresholds, and unlink/rename event types by @gma1k in #60
- feat: add language-runtime adapters for Redis, Memcached, FastCGI, gRPC, Kafka, USDT, and PII redaction by @gma1k in #61
Full Changelog: v0.8.0...v0.9.0
Release v0.8.0
What's Changed
- fix: Replace bpf_d_path with inode-based path resolution for cross-kernel compatibility by @gma1k in #41
- feat: Add resource limit monitoring for CPU, memory, and I/O by @gma1k in #42
- fix: improving go ci test by @gma1k in #43
- feat: Add connection pool monitoring by @gma1k in #44
- fix: Replace pathresolver with kernel path resolution by @gma1k in #45
- feat: Add real-time alerting system with webhook, Slack, etc by @gma1k in #46
- fix: fixing vmlinux.h placeholder file by @gma1k in #47
- fix: Bug fixes and security hardening by @gma1k in #48
- fix: Optimize CI workflow, improve test coverage, fix linter errors, and migrate alerting config by @gma1k in #49
- fix: Print version in cli by @gma1k in #50
- feat: add CRI-O support and enhancing the diagnostics by @gma1k in #53
Full Changelog: v0.7.0...v0.8.0
Release v0.7.0
What's Changed
- add issue template by @olamilekan000 in #23
- fix: Code hardening by @gma1k in #24
- feat: Comprehensive test suite by @gma1k in #25
- chore: harden eBPF parser and cgroup filter testing by @gma1k in #26
- Refactor: Centralize Configuration, Fix Bugs, and Improve Code Quality by @gma1k in #27
- Refactor: the `GenerateReport()` function in `diagnose.go` by @gma1k in #28
- feat: Add structured logging with zap and internal metrics tracking by @gma1k in #29
- test: Improve test coverage by @gma1k in #30
- test: improve test coverage by @gma1k in #31
- test: improve test coverage by @gma1k in #32
- refactor: improve error handling, add object pooling, reduce code duplication, improve cache management, and more by @gma1k in #33
- test: Improve test coverage, refactor: reorganize codebase into modular folder structure by @gma1k in #34
- feat: implement multi-architecture support with dynamic library discovery, priority-based event sampling, LRU caching, and enhanced error handling by @gma1k in #35
- fix: fix runDiagnoseMode lint issue by @gma1k in #36
- test: improve coverage test by @gma1k in #37
- feat: Add Kubernetes context enrichment with pod-to-pod tracking, error correlation with root cause analysis by @gma1k in #38
- feat: Add TLS/SSL handshake tracking by @gma1k in #39
- feat: Add distributed tracing support with OTLP, Jaeger, and Splunk exporters by @gma1k in #40
New Contributors
- @olamilekan000 made their first contribution in #23
Full Changelog: v0.6.0...v0.7.0
Release v0.6.0
Release v0.5.0
What's Changed
- fix: Security hardening - input validation, rate limiting, and security headers by @gma1k in #15
- Add podtrace docs by @gma1k in #16
- feat: Implement remaining features by @gma1k in #17
- feat: Add UDP/HTTP tracing, fix I/O bandwidth val, and OOM Kill detection by @gma1k in #18
- refactor: Split large files by @gma1k in #19
Full Changelog: v0.4.0...v0.5.0
Release v0.4.0
Release v0.3.0
Release v0.2.0
What's New
Major Dependency Updates
- cilium/ebpf: v0.11.0 → v0.20.0
- spf13/cobra: v1.8.0 → v1.10.1
- k8s.io/client-go: v0.29.0 → v0.34.2
- k8s.io/apimachinery: v0.29.0 → v0.34.2
- golang.org/x/sys: v0.31.0 → v0.38.0
Go 1.24 Support
- Upgraded from Go 1.19 to Go 1.24
- Automatic toolchain management with `GOTOOLCHAIN=auto`
- Makefile now auto-detects Go 1.24 installation
CI Workflow Improvements
- Enhanced workflow triggers for dependency changes
- Improved bash script validation
- Better error handling and automation
Code Quality
- All bash scripts pass ShellCheck validation
- All scripts formatted with shfmt
- Improved code consistency
Breaking Changes
- Go 1.24+ is now required (previously Go 1.19+)
- The Makefile will guide you through upgrading if needed
- CI automatically handles Go 1.24 via toolchain auto-download
Installation
Upgrading Go
```shell
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
```
Build
```shell
make build
```
Full Changelog: v0.1.0...v0.2.0