Open-source, cross-platform GPU Profiling & Replay System
Designed for AI compilers, deep learning frameworks, and GPU driver engineers
- High-Performance Event Capture: Collect 10,000+ GPU instruction-level call stacks without interrupting execution
- Lock-Free Ring Buffer: Minimal overhead event collection using SPSC (Single Producer Single Consumer) design
- SBT Binary Trace Format: Compact, efficient binary format with string interning and delta timestamp encoding
- Multi-Platform Support: NVIDIA CUDA (via CUPTI/nsys), AMD ROCm, Apple Metal + Instruments (xctrace), MetaX MACA (via MCPTI), Huawei Ascend (via CANN/msprof)
- Multi-GPU & Multi-Stream: Full support for complex GPU topologies and async execution
- Multi-GPU Cluster Profiling (v0.7.x): GPUTopology discovery, TimeSync (NTP/PTP/CUDA), NCCLTracker for distributed training
- Perfetto SDK Integration: Native protobuf export (85% smaller files) + JSON fallback
- Real-time Tracing: Thread-safe `TracingSession` with lock-free buffers (9K+ events/sec)
- Kineto-Compatible Schema: PyTorch profiler compatibility with thread tracking, flexible metadata, and structured flows
- Memory & Counter Profiling: `MemoryEvent` and `CounterEvent` for detailed resource tracking
- LLVM XRay Support: Import compiler-instrumented function traces
- eBPF Types (Linux): Kernel-level GPU event tracing support
- RenderDoc-style Frame Capture: F12-trigger capture with resource state snapshots
- GPU Memory Profiler: Allocation tracking, leak detection, peak usage monitoring
- CLI Tools: Easy-to-use command-line interface for recording and viewing traces
- GDB Integration (v0.10.0): GPU-aware debugging via GDB Remote Serial Protocol (RSP)
- Tracy Profiler Integration (v0.11.0): Real-time visualization with full GPU timeline support for Ascend/MetaX/ROCm
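The lock-free ring buffer listed among the features relies on a single-producer/single-consumer discipline: only the producer advances the write index and only the consumer advances the read index. The sketch below illustrates that index logic in Python. It is not TraceSmith's implementation; the class name `SpscRingBuffer` and its methods are hypothetical, and a real lock-free version would be written in C++ with atomic loads and stores rather than plain Python integers.

```python
from typing import Any, List, Optional


class SpscRingBuffer:
    """Bounded single-producer/single-consumer queue (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        # A power-of-two capacity lets us wrap indices with a bit mask.
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots: List[Any] = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer
        self.dropped = 0

    def try_push(self, item: Any) -> bool:
        """Producer side: never blocks; counts a drop when the buffer is full."""
        if self._tail - self._head == len(self._slots):
            self.dropped += 1
            return False
        self._slots[self._tail & self._mask] = item
        self._tail += 1  # publish the slot only after it has been written
        return True

    def try_pop(self) -> Optional[Any]:
        """Consumer side: returns None when the buffer is empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head & self._mask]
        self._head += 1
        return item


buf = SpscRingBuffer(4)
accepted = [buf.try_push(i) for i in range(6)]  # last two pushes are dropped
drained = [buf.try_pop() for _ in range(5)]     # fifth pop finds it empty
```

Dropping on overflow instead of blocking is one way to keep the producer from ever stalling the profiled application, at the cost of losing events when the consumer falls behind.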
📐 See also: Editable diagram (draw.io)
Core Modules:
| Module | Description |
|---|---|
| Capture | GPU profiling backends (CUPTI, Metal, BPF, Memory) |
| Common | Core types, lock-free ring buffer, stack capture, XRay import |
| Format | SBT binary trace format (read/write) |
| State | GPU state machine, timeline builder, Perfetto exporters |
| Replay | Trace replay engine, stream scheduler, determinism checker |
| Cluster | Multi-GPU profiling, time sync, NCCL tracking (v0.7.x) |
| GDB | GDB RSP backend, GPU breakpoints, trace replay debugging (v0.10.0) |
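The Format module's SBT files are described as using string interning and delta timestamp encoding. The actual on-disk layout is not documented here, so the following is only a sketch of those two techniques under assumed, simplified structures: events are `(timestamp_ns, name)` pairs, and the `encode` and `decode` helpers are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]    # (timestamp_ns, name)
Record = Tuple[int, int]   # (delta_ns since previous event, string-table id)


def encode(events: List[Event]) -> Tuple[List[str], List[Record]]:
    """Intern each distinct name once and store timestamps as deltas."""
    table: List[str] = []
    ids: Dict[str, int] = {}
    records: List[Record] = []
    previous = 0
    for timestamp, name in events:
        if name not in ids:
            ids[name] = len(table)
            table.append(name)
        records.append((timestamp - previous, ids[name]))
        previous = timestamp
    return table, records


def decode(table: List[str], records: List[Record]) -> List[Event]:
    """Rebuild absolute timestamps and names from the compact form."""
    events: List[Event] = []
    timestamp = 0
    for delta, name_id in records:
        timestamp += delta
        events.append((timestamp, table[name_id]))
    return events


events = [
    (1_000_000, "matmul_f32"),
    (1_000_420, "relu_f32"),
    (1_000_900, "matmul_f32"),
]
table, records = encode(events)
```

Deltas between neighbouring events are usually small numbers, so they fit into far fewer bytes than absolute nanosecond timestamps when combined with a variable-length integer encoding, and repeated kernel names cost one small integer each instead of a full string.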
Supported Backends:
| Platform | Backend | Status |
|---|---|---|
| NVIDIA | CUPTI SDK | ✅ Production |
| Apple | Metal API | ✅ Production |
| Apple | Instruments (xctrace) | ✅ Production |
| MetaX | MCPTI SDK | ✅ Production |
| Huawei | Ascend CANN | ✅ Production |
| Huawei | msprof | ✅ Production |
| AMD | ROCm (roctracer) | ✅ Production |
| Linux | eBPF | ✅ Available |
Output Formats:
- `.sbt` - TraceSmith Binary Trace (compact, indexed)
- `.json` - Perfetto JSON (chrome://tracing)
- `.perfetto` - Perfetto Protobuf (85% smaller)
- `.dot` - Graphviz dependency graph
- `.tracy` - Tracy profiler format (via import)
- ASCII Timeline - Terminal visualization
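For orientation, the `.json` output targets the Trace Event JSON format that chrome://tracing and ui.perfetto.dev can load. The sketch below writes "complete" events (`"ph": "X"`) by hand using only the standard library. The input dictionaries and their keys (`start_ns`, `duration_ns`, `device_id`, `stream_id`) are assumptions made for this example, not TraceSmith's event schema, and mapping devices to `pid` and streams to `tid` is just one possible convention.

```python
import json
from typing import Dict, List


def to_trace_event_json(events: List[Dict]) -> str:
    """Serialize events as Trace Event JSON; ts and dur are in microseconds."""
    trace_events = []
    for e in events:
        trace_events.append({
            "name": e["name"],
            "cat": "gpu",
            "ph": "X",                            # complete event: start + duration
            "ts": e["start_ns"] / 1000.0,         # nanoseconds -> microseconds
            "dur": e["duration_ns"] / 1000.0,
            "pid": e.get("device_id", 0),         # one "process" row per device
            "tid": e.get("stream_id", 0),         # one "thread" row per stream
        })
    return json.dumps({"traceEvents": trace_events}, indent=2)


sample = [
    {"name": "matmul_f32", "start_ns": 2_000, "duration_ns": 420_000, "stream_id": 1},
]
doc = json.loads(to_trace_event_json(sample))
```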
Integrations:
- Tracy Profiler - Real-time visualization (v0.11.0+)
- Perfetto SDK - Native protobuf export
- LLVM XRay - Compiler-instrumented traces
- PyTorch Kineto - Compatible schema
Before installing TraceSmith, install the required dependencies for your platform.
# Core build tools
sudo apt update
sudo apt install -y cmake g++ make git
# Optional: libunwind for call stack capture
sudo apt install -y libunwind-dev
# Optional: Python development headers (for Python bindings)
sudo apt install -y python3-dev python3-pip
# 1. Install CUDA Toolkit (includes CUPTI)
# Download from: https://developer.nvidia.com/cuda-downloads
# Or use package manager:
sudo apt install -y nvidia-cuda-toolkit
# 2. Install Nsight Systems (required for `tracesmith profile --nsys`)
# Option A: Install from CUDA repository (recommended)
sudo apt install -y cuda-nsight-systems-12-8 # Match your CUDA version
# Option B: Install standalone
sudo apt install -y nsight-systems
# 3. Verify installation
nvcc --version # CUDA compiler
nsys --version # Nsight Systems profiler (2024.x recommended)
ls /usr/local/cuda/include/cupti.h # CUPTI headers
# 4. Set environment variables (add to ~/.bashrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Install MACA SDK (includes MCPTI)
# Contact MetaX for SDK access: https://www.metax-tech.com
# Set MACA_ROOT environment variable
export MACA_ROOT=/opt/maca-3.0.0
# Install mcTracer (required for `tracesmith profile --mctracer`)
# mcTracer is included in MACA SDK
# Install CANN Toolkit (8.0+ recommended)
# Download from: https://www.hiascend.com/software/cann
# Set environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Verify installation
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg
# For NPU profiling with msprof
# msprof is included in CANN toolkit
/usr/local/Ascend/ascend-toolkit/latest/tools/profiler/bin/msprof --help
Tested Hardware:
| NPU | CANN Version | Driver | Status |
|---|---|---|---|
| Ascend 910B2 | 8.1.RC1 | 7.7.0.1.238 | ✅ Verified |
Multi-NPU Support:
TraceSmith supports multi-NPU profiling with separate contexts and streams per device:
import acl
# Initialize multiple virtual GPUs on a single physical device
acl.init()
for gpu_id in range(2):
    acl.rt.set_device(0)  # Map to physical device
    ctx, _ = acl.rt.create_context(0)
    stream, _ = acl.rt.create_stream()
    # Each "GPU" has independent context, stream, and memory
Verified Multi-GPU Features:
- ✅ Multiple Context creation and switching
- ✅ Multiple Stream creation and synchronization
- ✅ Per-device memory allocation
- ✅ H2D / D2H data transfer
- ✅ Cross-GPU memory copy (GPU0 → Host → GPU1)
- ✅ Resource cleanup
# Install Xcode Command Line Tools (includes Metal framework)
xcode-select --install
# Install CMake via Homebrew
brew install cmake
# Verify Metal support
xcrun metal --version
# Note: Instruments (xctrace) is included with Xcode
# Used for `tracesmith profile --xctrace`
# Install Visual Studio 2019+ with C++ workload
# Install CMake: https://cmake.org/download/
# For CUDA support:
# Install CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
# Install Nsight Systems: https://developer.nvidia.com/nsight-systems
| Dependency | Minimum Version | Recommended | Notes |
|---|---|---|---|
| CMake | 3.16 | 3.22+ | Build system |
| GCC | 8.0 | 11+ | C++17 support |
| Clang | 8.0 | 14+ | C++17 support |
| CUDA Toolkit | 11.0 | 12.x | CUPTI included |
| Nsight Systems | 2022.x | 2024.x | GPU profiling |
| libunwind | 1.3 | 1.6+ | Stack capture |
| Python | 3.7 | 3.10+ | Python bindings |
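The minimum versions in the table can be checked from a script before building. The sketch below is a hypothetical helper, not part of TraceSmith: it looks a tool up on `PATH`, runs it with `--version`, and compares the first dotted number it finds against the minimum from the table. It assumes each tool prints its version in that form, which holds for typical `cmake`, `gcc`, and `clang` builds but is not guaranteed.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Minimum versions taken from the dependency table above.
MINIMUM = {"cmake": (3, 16), "gcc": (8, 0), "clang": (8, 0)}


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Return the first dotted version number in text, e.g. (3, 22, 1)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def check_tool(tool: str) -> str:
    """Classify a tool as 'missing', 'unknown version', 'too old', or 'ok'."""
    path = shutil.which(tool)
    if path is None:
        return "missing"
    output = subprocess.run(
        [path, "--version"], capture_output=True, text=True
    ).stdout
    found = parse_version(output)
    if found is None:
        return "unknown version"
    return "ok" if found >= MINIMUM[tool] else "too old"


if __name__ == "__main__":
    for name in MINIMUM:
        print(f"{name}: {check_tool(name)}")
```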
# Install from PyPI (auto-detects GPU platform)
pip install tracesmith
# Platform-specific installation:
# CUDA/CUPTI (NVIDIA GPU)
TRACESMITH_CUDA=1 pip install tracesmith
# ROCm (AMD GPU)
TRACESMITH_ROCM=1 pip install tracesmith
# Metal (Apple GPU)
TRACESMITH_METAL=1 pip install tracesmith
# Verify installation
python -c "import tracesmith; print(tracesmith.__version__, tracesmith.detect_platform())"
# Or install from source
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
TRACESMITH_CUDA=1 pip install . # with CUDA support
Prerequisites:
- CMake 3.16+
- C++17 compatible compiler (GCC 8+, Clang 8+, MSVC 2019+)
- Python 3.7+ (for Python bindings)
- (Optional) NVIDIA CUDA Toolkit with CUPTI
- (Optional) Nsight Systems for `--nsys` profiling
- (Optional) Xcode Command Line Tools (for Metal on macOS)
Basic Build:
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build
cmake ..
cmake --build . -j$(nproc)
CMake Build Options:
| Option | Default | Description |
|---|---|---|
| `CMAKE_BUILD_TYPE` | Debug | Build type: Debug, Release, RelWithDebInfo |
| `TRACESMITH_ENABLE_CUDA` | OFF | Enable NVIDIA CUDA/CUPTI support |
| `TRACESMITH_ENABLE_ROCM` | OFF | Enable AMD ROCm support |
| `TRACESMITH_ENABLE_METAL` | OFF | Enable Apple Metal support |
| `TRACESMITH_ENABLE_MACA` | OFF | Enable MetaX MACA/MCPTI support |
| `TRACESMITH_ENABLE_ASCEND` | OFF | Enable Huawei Ascend/CANN support |
| `TRACESMITH_BUILD_PYTHON` | OFF | Build Python bindings (pybind11) |
| `TRACESMITH_BUILD_TESTS` | ON | Build unit tests (Google Test) |
| `TRACESMITH_BUILD_EXAMPLES` | ON | Build example programs |
| `TRACESMITH_BUILD_CLI` | ON | Build command-line interface |
| `TRACESMITH_USE_PERFETTO_SDK` | OFF | Use Perfetto SDK for protobuf export |
Build Examples:
# Release build with Metal support (macOS)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_METAL=ON
cmake --build . -j$(sysctl -n hw.ncpu) # macOS does not ship nproc by default
# CUDA build (Linux/Windows with NVIDIA GPU)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_CUDA=ON
cmake --build . -j$(nproc)
# Full build with all features
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DTRACESMITH_ENABLE_METAL=ON \
-DTRACESMITH_BUILD_PYTHON=ON \
-DTRACESMITH_USE_PERFETTO_SDK=ON
cmake --build . -j$(nproc)
# Minimal build (library only, no tests/examples/CLI)
cmake .. -DTRACESMITH_BUILD_TESTS=OFF \
-DTRACESMITH_BUILD_EXAMPLES=OFF \
-DTRACESMITH_BUILD_CLI=OFF
cmake --build . -j$(nproc)
Install:
# Install to default location (/usr/local)
sudo cmake --install .
# Install to custom prefix
cmake --install . --prefix /path/to/install
# Installed files:
# bin/tracesmith - CLI executable
# include/tracesmith/ - Header files
# lib/libtracesmith-*.a - Static libraries
Run Tests:
# Run all tests
ctest --output-on-failure
# Run specific test
./bin/tracesmith_tests --gtest_filter="RingBuffer*"
docker build -t tracesmith .
docker run -it tracesmith
import tracesmith as ts
# Create profiler for your GPU platform
profiler = ts.create_profiler(ts.PlatformType.CUDA) # or ROCm, Metal
# Configure and capture
config = ts.ProfilerConfig()
config.capture_kernels = True
config.capture_memcpy = True
profiler.initialize(config)
profiler.start_capture()
# ... your GPU code here (CUDA kernels, etc.) ...
profiler.stop_capture()
# Get captured events
events = profiler.get_events()
print(f"Captured {len(events)} events")
# Build timeline and analyze
timeline = ts.build_timeline(events)
print(f"GPU Utilization: {timeline.gpu_utilization * 100:.1f}%")
print(f"Max Concurrent Ops: {timeline.max_concurrent_ops}")
# Export to Perfetto (chrome://tracing or ui.perfetto.dev)
ts.export_perfetto(events, "trace.json")
# Save to TraceSmith binary format
writer = ts.SBTWriter("trace.sbt")
writer.write_events(events)
writer.finalize()
import tracesmith as ts
# Create tracing session with custom config
config = ts.TracingConfig()
config.buffer_size_kb = 8192 # 8MB buffer
config.enable_counter_tracks = True
session = ts.TracingSession()
session.start(config)
# Emit events from your application (thread-safe!)
event = ts.TraceEvent()
event.type = ts.EventType.KernelLaunch
event.name = "my_kernel"
event.thread_id = 12345
event.metadata["grid_dim"] = "256x256x1"
session.emit(event)
# Emit counter metrics
session.emit_counter("GPU Memory (MB)", 1024.5)
session.emit_counter("SM Occupancy %", 85.2)
# Stop and export
session.stop()
session.export_to_file("realtime_trace.perfetto-trace")
# Get statistics
stats = session.get_statistics()
print(f"Duration: {stats.duration_ms():.1f}ms")
print(f"Events: {stats.events_emitted} emitted, {stats.events_dropped} dropped")
TraceSmith provides a comprehensive CLI with ASCII banner and colored output:
████████╗██████╗ █████╗ ██████╗███████╗███████╗███╗ ███╗██╗████████╗██╗ ██╗
╚══██╔══╝██╔══██╗██╔══██╗██╔════╝██╔════╝██╔════╝████╗ ████║██║╚══██╔══╝██║ ██║
██║ ██████╔╝███████║██║ █████╗ ███████╗██╔████╔██║██║ ██║ ███████║
██║ ██╔══██╗██╔══██║██║ ██╔══╝ ╚════██║██║╚██╔╝██║██║ ██║ ██╔══██║
██║ ██║ ██║██║ ██║╚██████╗███████╗███████║██║ ╚═╝ ██║██║ ██║ ██║ ██║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ ╚═╝
GPU Profiling & Replay System v0.9.0
Available Commands:
| Command | Description |
|---|---|
| `profile` | Profile a command (record + execute in one step) |
| `record` | Record GPU events to a trace file |
| `view` | View contents of a trace file |
| `info` | Show detailed information about a trace file |
| `export` | Export trace to Perfetto or other formats |
| `analyze` | Analyze trace for performance insights |
| `replay` | Replay a captured trace |
| `benchmark` | Run 10K GPU call stacks benchmark |
| `devices` | List available GPU devices (CUDA, Metal, MACA, ROCm) |
| `version` | Show version information |
| `help` | Show help message |
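What `analyze` computes internally is not specified here, but metrics such as the GPU utilization and maximum concurrent operations printed by the Python example earlier can be derived from event intervals alone. The sketch below shows one plausible derivation under stated assumptions: each operation is a `(start_ns, end_ns)` pair with an exclusive end, utilization is the fraction of a time window covered by at least one operation, and both function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns), end exclusive


def gpu_utilization(intervals: List[Interval], window_start: int, window_end: int) -> float:
    """Fraction of the window during which at least one operation is running."""
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        # Clamp each operation to the window and skip anything outside it.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start  # close the previous merged span
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)      # overlapping: extend the span
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)


def max_concurrent_ops(intervals: List[Interval]) -> int:
    """Peak number of simultaneously running operations (sweep line)."""
    points = []
    for start, end in intervals:
        points.append((start, 1))
        points.append((end, -1))
    # At equal timestamps -1 sorts before +1, so back-to-back ops do not overlap.
    points.sort()
    active = peak = 0
    for _, delta in points:
        active += delta
        peak = max(peak, active)
    return peak


ops = [(0, 40), (30, 60), (80, 100)]
utilization = gpu_utilization(ops, 0, 100)  # busy spans: 0-60 and 80-100
peak = max_concurrent_ops(ops)
```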
IMPORTANT: GPU Profiling API Limitations
CUPTI/MCPTI can only profile the calling process, not child processes. For CUDA/MACA platforms, you MUST use system-level profilers:
| Platform | Required Option | Tool | Min Version |
|---|---|---|---|
| NVIDIA CUDA | `--nsys` | Nsight Systems | 2022.x (2024.x recommended) |
| MetaX MACA | `--mctracer` | mcTracer | MACA SDK 3.0+ |
| Apple Metal | `--xctrace` | Instruments | Xcode 14+ |
The `record` command is not supported for CUDA/MACA platforms. Use `profile --nsys` or `profile --mctracer` instead.
- Nsight Systems Version: Use version 2024.x or later. Older versions (2021.x) may have compatibility issues with newer CUDA drivers.
# Check your nsys version
nsys --version
# Install latest version (Ubuntu with CUDA 12.x)
sudo apt install cuda-nsight-systems-12-8
- CUPTI Conflict: When using `--nsys`, your program should not use CUPTI directly (nsys subscribes to CUPTI). If your program uses CUPTI, run it without `--nsys`:
# For programs using CUPTI internally
./bin/cupti_example # Run directly, not with --nsys
- Root/Admin Not Required: Modern nsys (2024.x) does not require root privileges for basic profiling.
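Because each platform needs a different system-level profiler flag, scripts that launch `tracesmith profile` may want to pick the flag from the platform name. The helper below is a hypothetical wrapper that only assembles the argument list; the flags and the `-o` option are the ones used in this README's CLI examples, while the platform keys and the function name are assumptions, and `-o` is only shown in those examples together with `--nsys` and `--xctrace`.

```python
from typing import List, Optional, Sequence

# Flags as documented for `tracesmith profile`; the dictionary keys are made up.
PROFILER_FLAG = {
    "cuda": "--nsys",
    "maca": "--mctracer",
    "metal": "--xctrace",
    "ascend": "--msprof",
}


def build_profile_command(
    platform: str,
    target: Sequence[str],
    output: Optional[str] = None,
    binary: str = "tracesmith",
) -> List[str]:
    """Return an argv list such as: tracesmith profile --nsys -- python train.py"""
    try:
        flag = PROFILER_FLAG[platform]
    except KeyError:
        raise ValueError(f"no system profiler flag known for {platform!r}") from None
    command = [binary, "profile", flag]
    if output is not None:
        command += ["-o", output]
    # Everything after "--" is the command being profiled.
    return command + ["--"] + list(target)


cmd = build_profile_command("cuda", ["python", "train.py"], output="model.sbt")
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the profiled command's own arguments.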
C++ CLI Examples:
# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys --perfetto -- ./my_cuda_app
./bin/tracesmith profile --nsys -o model.sbt -- python train.py --epochs 10
# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py
# Huawei Ascend - Use --msprof (REQUIRED for NPU profiling)
./bin/tracesmith profile --msprof -- ./my_ascend_app
./bin/tracesmith profile --msprof --perfetto -- python train.py
# Apple Metal - Use --xctrace for real Metal GPU events
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --keep-trace -- python mps_benchmark.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app
# View trace with statistics
./bin/tracesmith view trace.sbt --stats
# Show trace file info
./bin/tracesmith info trace.sbt
# Export to Perfetto (view at ui.perfetto.dev)
./bin/tracesmith export trace.sbt -f perfetto
# Analyze performance
./bin/tracesmith analyze trace.sbt
# Replay trace (dry-run)
./bin/tracesmith replay trace.sbt --mode dry-run
# List available GPUs
./bin/tracesmith devices
# Disable colored output
./bin/tracesmith --no-color help
Python CLI Examples:
# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
tracesmith-cli profile --nsys -- python train.py
tracesmith-cli profile --nsys --perfetto -- ./my_cuda_app
tracesmith-cli profile --nsys -o model.sbt -- python train.py --epochs 10
# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
tracesmith-cli profile --mctracer -- ./my_maca_app
tracesmith-cli profile --mctracer --perfetto -- python train.py
# Huawei Ascend - Use --msprof (REQUIRED for NPU profiling)
tracesmith-cli profile --msprof -- ./my_ascend_app
tracesmith-cli profile --msprof --perfetto -- python train.py
# Apple Metal - Use --xctrace for real Metal GPU events
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -- python mps_benchmark.py
# Other commands
tracesmith-cli info
tracesmith-cli devices
# View trace contents
tracesmith-cli view trace.sbt --stats
# Export to Perfetto
tracesmith-cli export trace.sbt -o trace.json
# Analyze trace
tracesmith-cli analyze trace.sbt
# Replay trace
tracesmith-cli replay trace.sbt --mode dry-run
TraceSmith integrates with NVIDIA Nsight Systems for system-wide GPU profiling, providing comprehensive CUDA kernel and memory operation tracing.
Usage:
# Profile with nsys (system-wide profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys -o trace.sbt -- ./my_cuda_app
# With custom nsys options
./bin/tracesmith profile --nsys --nsys-args="-t cuda,nvtx" -- python benchmark.py
Features:
- System-wide CUDA kernel tracing
- Memory transfer profiling (H2D, D2H, D2D)
- NVTX annotation support
- Multi-GPU profiling
- Automatic .nsys-rep to TraceSmith format conversion
On macOS, TraceSmith integrates with Apple Instruments (xctrace) for capturing real Metal GPU events. This provides accurate GPU timing and event capture that the Metal Frame Capture API cannot achieve programmatically.
Why use xctrace?
- Captures real Metal GPU execution events (kernel launches, command buffer submissions)
- Accurate GPU timing from hardware counters
- Works with any Metal application (PyTorch MPS, TensorFlow Metal, custom Metal apps)
Usage:
# Python CLI (recommended - includes event parsing)
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -o model.sbt -- python inference.py
tracesmith-cli profile --xctrace --perfetto -- python benchmark.py
# C++ CLI (calls xctrace, outputs raw .trace file)
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app
# Python API
from tracesmith.xctrace import XCTraceProfiler, profile_with_xctrace
# Simple usage
events, trace_file = profile_with_xctrace(
["python", "train.py"],
duration=60,
template="Metal System Trace"
)
# Full control
profiler = XCTraceProfiler()
events = profiler.profile_command(["python", "train.py"])
profiler.export_perfetto("metal_trace.json")
Available Templates:
- `Metal System Trace` - Most detailed Metal profiling (default)
- `GPU Driver` - Driver-level analysis
- `Game Performance` - Frame rate and GPU time
- `Animation Hitches` - Animation performance
Output:
TraceSmith supports MetaX GPUs (C500, C550, etc.) using the MCPTI (MACA Profiling Tools Interface), which provides an API compatible with NVIDIA CUPTI.
📖 Full documentation: See docs/MACA_PROFILING.md for detailed setup and usage guide.
Tested Hardware:
| GPU | Memory | Compute Units | Driver | Status |
|---|---|---|---|---|
| MetaX C500 | 64 GB | 104 CUs | 3.0.11 | ✅ Verified |
| MetaX C550 | - | - | - | 🔜 Planned |
Benchmark Results (MetaX C500, MACA 3.0.0):
| Test | Data Size | Bandwidth |
|---|---|---|
| Host → Device | 256 MB | 10.1 GB/s |
| Device → Host | 256 MB | 9.9 GB/s |
| Device → Device | 256 MB | 608 GB/s |
| MCPTI Overhead | - | < 2% (negligible) |
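Bandwidth figures like those in the table follow from the bytes moved and the measured transfer duration. The sketch below shows the arithmetic, assuming decimal gigabytes (10^9 bytes) and nanosecond timestamps; whether the benchmark above uses decimal or binary units is not stated, and both function names are hypothetical.

```python
def bandwidth_gb_per_s(num_bytes: int, duration_ns: int) -> float:
    """Bytes per nanosecond equals decimal GB/s (1e9 bytes per 1e9 ns)."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return num_bytes / duration_ns


def transfer_time_ms(num_bytes: int, gb_per_s: float) -> float:
    """Time a transfer should take at a given sustained bandwidth."""
    if gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (gb_per_s * 1e9) * 1e3


# A 256 MB (decimal) copy that took 25 ms.
measured = bandwidth_gb_per_s(256_000_000, 25_000_000)
# How long the same copy would take at 10 GB/s.
expected_ms = transfer_time_ms(256_000_000, 10.0)
```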
CLI Device Detection:
$ ./bin/tracesmith devices
MetaX MACA:
✓ MACA available
Devices: 1
Driver: 3000
Device 0: MetaX C500
Vendor: MetaX
Compute: 10.0
Memory: 63.62 GB
SMs: 104
Clock: 1600 MHz
System-Wide Profiling with mcTracer:
TraceSmith integrates with MetaX's mcTracer tool for comprehensive system-wide GPU profiling (similar to NVIDIA nsys):
# Profile with mcTracer
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py
# Output is Perfetto-compatible JSON
# View at: https://ui.perfetto.dev
mcTracer captures:
- All MACA API calls (mcInit, mcMalloc, mcMemcpy, etc.)
- GPU memory operations with bandwidth
- Stream operations and synchronization
- CPU-GPU launch flow arrows
Build with MetaX support:
# On MetaX system (MACA SDK auto-detected at /opt/maca-3.0.0)
cmake -DTRACESMITH_ENABLE_MACA=ON ..
make -j4
# Run examples
./bin/metax_example # Basic profiling demo
./bin/metax_benchmark # Memory bandwidth test
C++ API:
#include <tracesmith/tracesmith.hpp>
// Check MetaX GPU availability
if (tracesmith::isMACAAvailable()) {
std::cout << "MetaX devices: " << tracesmith::getMACADeviceCount() << std::endl;
}
// Create MCPTI profiler
auto profiler = tracesmith::createProfiler(tracesmith::PlatformType::MACA);
// Configure
tracesmith::ProfilerConfig config;
config.capture_kernels = true;
config.capture_memcpy = true;
profiler->initialize(config);
// Capture events
profiler->startCapture();
// ... GPU code using MACA runtime ...
profiler->stopCapture();
// Get events
std::vector<tracesmith::TraceEvent> events;
profiler->getEvents(events);
// Export to Perfetto
tracesmith::PerfettoExporter exporter;
exporter.exportToFile(events, "metax_trace.json");
Python API:
import tracesmith as ts
# Check MetaX availability
if ts.is_maca_available():
    print(f"MetaX devices: {ts.get_maca_device_count()}")
# Create profiler
profiler = ts.create_profiler(ts.PlatformType.MACA)
profiler.initialize(ts.ProfilerConfig())
profiler.start_capture()
# ... GPU code ...
profiler.stop_capture()
events = profiler.get_events()
# Save trace
writer = ts.SBTWriter("metax_trace.sbt")
writer.write_events(events)
writer.finalize()
MCPTI Captured Events:
| Event Type | Description |
|---|---|
| KernelLaunch/Complete | Kernel execution timing with grid/block dimensions |
| MemcpyH2D/D2H/D2D | Memory transfers with bandwidth calculation |
| MemsetDevice | Memory initialization operations |
| StreamSync/DeviceSync | Synchronization events with duration |
Output:
- `.sbt` - TraceSmith Binary Trace format
- `.json` - Perfetto JSON (view at https://ui.perfetto.dev)
- Device info: name, memory, compute capability, clock rates
TraceSmith provides full support for Huawei Ascend NPUs via CANN/ACL integration.
CLI Device Detection:
$ ./bin/tracesmith devices
Huawei Ascend:
✓ Ascend available
Devices: 1
CANN: [7.7.0.1.238:8.1.RC1]
Device 0: Ascend 910B2
Memory: 64 GB HBM
System-Wide Profiling with msprof:
# Profile with msprof (NPU profiling)
./bin/tracesmith profile --msprof -- ./my_ascend_app
./bin/tracesmith profile --msprof --perfetto -- python train.py
# Output directory contains Perfetto-compatible trace
# View at: https://ui.perfetto.dev
C++ API:
#include <tracesmith/tracesmith.hpp>
// Check Ascend NPU availability
if (tracesmith::isAscendAvailable()) {
std::cout << "Ascend devices: " << tracesmith::getAscendDeviceCount() << std::endl;
std::cout << "CANN version: " << tracesmith::getAscendCANNVersion() << std::endl;
}
// Create Ascend profiler
auto profiler = tracesmith::createProfiler(tracesmith::PlatformType::Ascend);
// Configure and capture
tracesmith::ProfilerConfig config;
config.capture_kernels = true;
profiler->initialize(config);
profiler->startCapture();
// ... NPU code using ACL runtime ...
profiler->stopCapture();
// Get events and export
std::vector<tracesmith::TraceEvent> events;
profiler->getEvents(events);
tracesmith::PerfettoExporter exporter;
exporter.exportToFile(events, "ascend_trace.json");
Python API:
import tracesmith as ts
# Check Ascend availability
if ts.is_ascend_available():
    print(f"Ascend devices: {ts.get_ascend_device_count()}")
    print(f"CANN version: {ts.get_ascend_cann_version()}")
# Create profiler
profiler = ts.create_profiler(ts.PlatformType.Ascend)
profiler.initialize(ts.ProfilerConfig())
profiler.start_capture()
# ... NPU code ...
profiler.stop_capture()
events = profiler.get_events()
ts.export_perfetto(events, "ascend_trace.json")
TraceSmith provides native support for AMD GPUs via ROCm's roctracer API, enabling comprehensive profiling of HIP applications.
Supported Hardware:
| GPU Series | Architecture | Status |
|---|---|---|
| AMD Instinct MI300 | CDNA 3 (gfx942) | ✅ Verified |
| AMD Instinct MI200 | CDNA 2 (gfx90a) | ✅ Verified |
| AMD Instinct MI100 | CDNA (gfx908) | ✅ Verified |
| AMD Radeon Pro W7900 | RDNA 3 (gfx1100) | ✅ Verified |
| AMD Radeon RX 7900 | RDNA 3 (gfx1100) | ✅ Verified |
| AMD Radeon RX 6000 | RDNA 2 (gfx1030) | ✅ Verified |
Prerequisites:
# Install ROCm SDK (6.0+ recommended)
# Download from: https://rocm.docs.amd.com/en/latest/deploy/linux/index.html
# Ubuntu/Debian
sudo apt install rocm-dev roctracer-dev roctx-dev
# Set environment variables (add to ~/.bashrc)
export ROCM_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
# Verify installation
hipcc --version
rocm-smi
CLI Device Detection:
$ ./bin/tracesmith devices
AMD ROCm:
✓ ROCm available
Devices: 2
Driver: 60000
Device 0: AMD Instinct MI100
Vendor: AMD
Arch: gfx908
Memory: 32.00 GB HBM2
CUs: 120
Device 1: AMD Instinct MI100
Vendor: AMD
Arch: gfx908
Memory: 32.00 GB HBM2
CUs: 120
C++ API:
#include <tracesmith/tracesmith.hpp>
// Check ROCm availability
if (tracesmith::isROCmAvailable()) {
std::cout << "ROCm devices: " << tracesmith::getROCmDeviceCount() << std::endl;
std::cout << "GPU arch: " << tracesmith::getROCmGpuArch(0) << std::endl;
}
// Create ROCm profiler
auto profiler = tracesmith::createProfiler(tracesmith::PlatformType::ROCm);
// Configure
tracesmith::ProfilerConfig config;
config.capture_kernels = true;
config.capture_memcpy = true;
profiler->initialize(config);
// Capture events
profiler->startCapture();
// ... HIP GPU code ...
profiler->stopCapture();
// Get events
std::vector<tracesmith::TraceEvent> events;
profiler->getEvents(events);
// Export to Perfetto
tracesmith::PerfettoExporter exporter;
exporter.exportToFile(events, "rocm_trace.json");
Python API:
import tracesmith as ts
# Check ROCm availability
if ts.is_rocm_available():
    print(f"ROCm devices: {ts.get_rocm_device_count()}")
# Create profiler
profiler = ts.create_profiler(ts.PlatformType.ROCm)
profiler.initialize(ts.ProfilerConfig())
profiler.start_capture()
# ... HIP GPU code ...
profiler.stop_capture()
events = profiler.get_events()
ts.export_perfetto(events, "rocm_trace.json")
Captured Events:
| Event Type | Description |
|---|---|
| KernelLaunch/Complete | HIP kernel execution with grid/block dimensions |
| MemcpyH2D/D2H/D2D | Memory transfers with bandwidth metrics |
| MemsetDevice | Memory initialization operations |
| StreamSync/DeviceSync | Synchronization events with duration |
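Launch and completion records, as listed in the first row of the table, have to be matched up before a kernel's duration can be reported. The sketch below pairs them by a correlation identifier, which is how tracing APIs in the CUPTI family typically link related records. The `RawEvent` and `KernelSpan` types, the `correlation_id` field, and the `"KernelComplete"` kind string are all assumptions for this example rather than TraceSmith's actual schema.

```python
from typing import Dict, Iterable, List, NamedTuple


class RawEvent(NamedTuple):
    kind: str            # "KernelLaunch" or "KernelComplete" (assumed names)
    name: str
    correlation_id: int  # hypothetical field linking a launch to its completion
    timestamp_ns: int


class KernelSpan(NamedTuple):
    name: str
    start_ns: int
    duration_ns: int


def pair_kernel_events(events: Iterable[RawEvent]) -> List[KernelSpan]:
    """Match each completion to its launch and emit one span per kernel."""
    pending: Dict[int, RawEvent] = {}
    spans: List[KernelSpan] = []
    for event in sorted(events, key=lambda e: e.timestamp_ns):
        if event.kind == "KernelLaunch":
            pending[event.correlation_id] = event
        elif event.kind == "KernelComplete":
            launch = pending.pop(event.correlation_id, None)
            if launch is not None:  # ignore completions without a known launch
                spans.append(KernelSpan(
                    launch.name,
                    launch.timestamp_ns,
                    event.timestamp_ns - launch.timestamp_ns,
                ))
    return spans


raw = [
    RawEvent("KernelLaunch", "matmul_f32", 7, 100),
    RawEvent("KernelLaunch", "relu_f32", 8, 150),
    RawEvent("KernelComplete", "matmul_f32", 7, 520),
    RawEvent("KernelComplete", "relu_f32", 8, 560),
]
spans = pair_kernel_events(raw)
```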
Build with ROCm support:
# On ROCm system (auto-detected at /opt/rocm)
cmake -DTRACESMITH_ENABLE_ROCM=ON ..
make -j$(nproc)
# Run examples
./bin/rocm_example # Basic profiling demo
./bin/rocm_benchmark # Memory bandwidth test
TraceSmith provides bidirectional integration with Tracy Profiler, enabling real-time visualization of GPU profiling data alongside Tracy's existing CPU profiling capabilities.
Features:
- Full GPU Timeline for Ascend, MetaX, ROCm (not just messages!)
- Export TraceSmith events to Tracy for real-time visualization
- Import Tracy captures (`.tracy` files) into TraceSmith format
- Unified profiling macros that work with both profilers
- GPU zone emission for kernel timing visualization
- Memory allocation tracking in Tracy
- Counter/plot data for metrics visualization
- Frame marking for game/real-time applications
GPU Timeline Support:
| GPU Platform | Tracy Native | TraceSmith Integration |
|---|---|---|
| NVIDIA CUDA | ✅ Native | ✅ Full Timeline |
| Vulkan | ✅ Native | ✅ Full Timeline |
| Metal | ✅ Native | ✅ Full Timeline |
| Huawei Ascend | ❌ None | ✅ Full Timeline |
| MetaX MACA | ❌ None | ✅ Full Timeline |
| AMD ROCm | ❌ None | ✅ Full Timeline |
TraceSmith enables full GPU timeline visualization in Tracy for platforms that Tracy doesn't natively support, including Ascend NPUs and MetaX GPUs.
Quick Start:
# Build with Tracy integration
cmake .. -DTRACESMITH_ENABLE_TRACY=ON
make -j$(nproc)
# Run example (connect Tracy server for visualization)
./bin/tracy_integration_example
Full GPU Timeline API (Ascend/MetaX/ROCm):
#include <tracesmith/tracy/tracy_gpu_context.hpp>
using namespace tracesmith;
// Create GPU context for Ascend NPU
auto& ascend_ctx = tracy::getOrCreateGpuContext(
"Ascend 910B NPU", tracy::GpuContextType::Ascend, 0);
// Create GPU context for MetaX GPU
auto& metax_ctx = tracy::getOrCreateGpuContext(
"MetaX C500 GPU", tracy::GpuContextType::MACA, 0);
// Method 1: Emit GPU zone directly (creates timeline bar in Tracy)
ascend_ctx.emitGpuZone("AscendMatMul",
cpu_start, cpu_end, // CPU timestamps
gpu_start, gpu_end, // GPU timestamps
thread_id, color);
// Method 2: Use RAII macro
{
TracySmithGpuZone(metax_ctx, "MetaXGEMM");
// ... kernel execution ...
} // Zone emitted on scope exit
// Method 3: Convert TraceSmith events to GPU timeline
std::vector<TraceEvent> events;
profiler->getEvents(events);
ascend_ctx.emitGpuZones(events); // Batch convert to timeline
Tracy Visualization Result:
┌─────────────────────────────────────────────────────────┐
│ CPU Timeline │
│ ████ submit ████████████████████ submit ████ │
├─────────────────────────────────────────────────────────┤
│ Ascend 910B NPU │
│ ▓▓▓▓▓▓▓▓ AscendMatMul ▓▓▓▓▓▓ AscendConv2D │
├─────────────────────────────────────────────────────────┤
│ MetaX C500 GPU │
│ ▓▓▓▓ MetaXGEMM ▓▓▓▓▓▓▓▓▓▓▓▓ MetaXReduce │
└─────────────────────────────────────────────────────────┘
Basic C++ API:
#include <tracesmith/tracy/tracy_client.hpp>
#include <tracesmith/tracy/tracy_exporter.hpp>
using namespace tracesmith;
// Use unified profiling macros
TracySmithZoneScopedC("MyKernel", tracy::colors::KernelLaunch);
// Create Tracy exporter
tracy::TracyExporter exporter;
exporter.initialize();
// Export TraceSmith events to Tracy
TraceEvent event;
event.type = EventType::KernelLaunch;
event.name = "matmul_f32";
event.duration = 1500000; // 1.5ms
exporter.emitEvent(event);
// Frame marking
tracy::markFrame("RenderFrame");
// Plot metrics
exporter.emitPlotValue("GPU Utilization %", 85.0);
Import Tracy Captures:
#include <tracesmith/tracy/tracy_importer.hpp>
tracy::TracyImporter importer;
auto result = importer.importFile("profile.tracy");
if (result.success()) {
// Access TraceSmith events
for (const auto& event : result.record.events()) {
std::cout << event.name << ": " << event.duration << " ns\n";
}
}
CMake Options:
| Option | Default | Description |
|---|---|---|
| `TRACESMITH_ENABLE_TRACY` | ON | Enable Tracy profiler integration |
TraceSmith provides a GDB Remote Serial Protocol (RSP) backend for GPU-aware debugging, enabling developers to debug GPU applications with full visibility into GPU state.
Features:
- Automatic GPU state capture at CPU breakpoints
- GPU kernel call history tracking
- GPU memory monitoring
- Trace capture and replay debugging
- GPU-specific breakpoints (kernel launch, memcpy, memset)
- Custom GDB `monitor` commands
Quick Start:
# Start TraceSmith GDB Server
./bin/tracesmith-gdbserver --port 1234 -- ./my_cuda_app
# Or attach to running process
./bin/tracesmith-gdbserver --port 1234 --attach <pid>
# Connect from GDB
(gdb) target remote :1234
TraceSmith GDB Commands:
| Command | Description |
|---|---|
| `monitor ts help` | Show all TraceSmith commands |
| `monitor ts status` | Show GPU state summary |
| `monitor ts kernels` | List kernel call history |
| `monitor ts memory` | Show GPU memory usage |
| `monitor ts break kernel <pattern>` | Set GPU kernel breakpoint |
| `monitor ts trace start` | Start GPU event capture |
| `monitor ts trace stop` | Stop capture |
| `monitor ts trace save <file>` | Save trace to SBT file |
| `monitor ts trace load <file>` | Load trace for replay |
| `monitor ts replay start` | Start trace replay |
| `monitor ts replay step` | Step to next event |
| `monitor ts replay seek <n>` | Seek to event N |
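In the GDB Remote Serial Protocol, a `monitor` command reaches the server as a `qRcmd` packet whose argument is the hex-encoded command text, framed as `$<payload>#<checksum>` where the checksum is the sum of the payload bytes modulo 256 written as two hex digits. The sketch below builds and parses such a packet for `ts status`. The helper names are hypothetical and this is not TraceSmith's server code; it only illustrates what a server handling these commands has to decode.

```python
def rsp_checksum(payload: bytes) -> str:
    """Two-digit hex checksum: sum of payload bytes modulo 256."""
    return f"{sum(payload) % 256:02x}"


def frame_packet(payload: bytes) -> bytes:
    """Wrap a payload as $<payload>#<checksum>."""
    return b"$" + payload + b"#" + rsp_checksum(payload).encode("ascii")


def monitor_packet(command: str) -> bytes:
    """Packet GDB sends for `monitor <command>`: qRcmd,<hex-encoded command>."""
    hex_command = command.encode("ascii").hex().encode("ascii")
    return frame_packet(b"qRcmd," + hex_command)


def parse_monitor_packet(packet: bytes) -> str:
    """Validate framing and checksum, then recover the command text."""
    if not packet.startswith(b"$") or packet[-3:-2] != b"#":
        raise ValueError("not a framed RSP packet")
    payload = packet[1:-3]
    checksum = packet[-2:].decode("ascii").lower()
    if rsp_checksum(payload) != checksum:
        raise ValueError("checksum mismatch")
    prefix = b"qRcmd,"
    if not payload.startswith(prefix):
        raise ValueError("not a qRcmd packet")
    return bytes.fromhex(payload[len(prefix):].decode("ascii")).decode("ascii")


packet = monitor_packet("ts status")
command = parse_monitor_packet(packet)
```

Because the command is hex-encoded, the payload never contains the characters `$`, `#`, or `}` that would otherwise need escaping inside an RSP packet.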
Example Debugging Session:
# Terminal 1: Start GDB server with your CUDA application
./bin/tracesmith-gdbserver --port 1234 -- ./my_cuda_app
# Terminal 2: Connect with GDB
gdb ./my_cuda_app
(gdb) target remote :1234
(gdb) break main
(gdb) continue
# At breakpoint, check GPU state
(gdb) monitor ts status
GPU State: Idle
Devices: 1 (NVIDIA GeForce RTX 4090)
Streams: 2 active
Memory: 1.2 GB / 24 GB
# Set kernel breakpoint
(gdb) monitor ts break kernel matmul*
GPU breakpoint 1: kernel launch matching "matmul*"
# Start trace capture
(gdb) monitor ts trace start
GPU trace capture started
# Continue execution
(gdb) continue
# When kernel breakpoint hits
(gdb) monitor ts kernels
Kernel History (last 10):
[0] matmul_f32 - 256x256x1, 1024 threads, 0.42ms
[1] relu_f32 - 1024x1x1, 256 threads, 0.01ms
# Save trace
(gdb) monitor ts trace stop
(gdb) monitor ts trace save debug_trace.sbt
# Load and replay trace
(gdb) monitor ts trace load debug_trace.sbt
(gdb) monitor ts replay start
(gdb) monitor ts replay step
Event 0: KernelLaunch "matmul_f32"
Platform Support:
| Platform | Process Control | GPU Profiling |
|---|---|---|
| Linux | ✅ Full (ptrace) | ✅ CUPTI/MCPTI |
| macOS | ✅ Basic (Mach API) | ✅ Metal |
Build with GDB Support:
cmake .. -DTRACESMITH_BUILD_GDB=ON
make tracesmith-gdbserver

All Python examples support multiple GPU platforms with automatic device detection:
# Run examples on specific device
python examples/basic_usage.py --device cuda # NVIDIA GPU
python examples/basic_usage.py --device mps # Apple Silicon
python examples/basic_usage.py --device rocm # AMD GPU
python examples/basic_usage.py --device cpu # CPU fallback
# Run all examples with test runner
python examples/run_tests.py # Best available device
python examples/run_tests.py --all-devices # Test on all devices
python examples/run_tests.py --test pytorch # Run specific test
python examples/run_tests.py --list # List available tests

Using DeviceManager for cross-platform code:
from examples.device_utils import DeviceManager, benchmark
# Auto-detect best device
dm = DeviceManager() # or DeviceManager(prefer_device="mps")
print(f"Using: {dm.get_device_name()}") # Apple Silicon GPU (mps:0, 25.2 GB)
# Create tensors on device
x = dm.randn(1000, 1000)
y = dm.randn(1000, 1000)
# Benchmark with proper synchronization
results = benchmark(lambda: x @ y, warmup=3, iterations=10, dm=dm)
print(f"Mean: {results['mean_ms']:.2f} ms")
# Device-agnostic operations
dm.synchronize()
print(f"Memory: {dm.memory_allocated() / 1024**2:.1f} MB")

#include <vector>
#include <tracesmith/tracesmith.hpp>

using namespace tracesmith;

int main() {
    // Create a profiler for the CUDA backend
    auto profiler = createProfiler(PlatformType::CUDA);

    // Configure the event buffer size
    ProfilerConfig config;
    config.buffer_size = 1000000;
    profiler->initialize(config);

    // Start capture
    profiler->startCapture();

    // ... run GPU code ...

    // Stop capture
    profiler->stopCapture();

    // Get the captured events
    std::vector<TraceEvent> events;
    profiler->getEvents(events);

    // Write the events to an SBT file
    SBTWriter writer("trace.sbt");
    writer.writeEvents(events);
    writer.finalize();

    return 0;
}

#include <tracesmith/tracesmith.hpp>
#include <tracesmith/state/timeline_builder.hpp>
#include <tracesmith/state/timeline_viewer.hpp>
#include <tracesmith/state/perfetto_exporter.hpp>
#include <iostream>
#include <vector>

using namespace tracesmith;

int main() {
    // Capture events as in the previous example
    // (captureEvents() is a placeholder for that capture code)
    std::vector<TraceEvent> events = captureEvents();

    // Build timeline
    TimelineBuilder builder;
    builder.addEvents(events);
    Timeline timeline = builder.build();

    // Print ASCII visualization
    TimelineViewer viewer;
    std::cout << viewer.render(timeline);

    // Export to Perfetto with enhanced GPU tracks
    PerfettoExporter exporter;
    exporter.setEnableGPUTracks(true);   // GPU-specific tracks
    exporter.setEnableFlowEvents(true);  // Dependency visualization
    exporter.exportToFile(events, "trace.json");
    // Open https://ui.perfetto.dev and load trace.json

    // Get statistics
    std::cout << "GPU Utilization: " << timeline.gpu_utilization << std::endl;
    std::cout << "Max Concurrent Ops: " << timeline.max_concurrent_ops << std::endl;

    return 0;
}

TraceSmith uses a custom binary format (SBT, TraceSmith Binary Trace) optimized for:
- Compactness: Variable-length integer encoding, string interning
- Streaming: Support for streaming writes during capture
- Fast Access: Indexed sections for random access
File structure:
┌──────────────────┐
│ Header (64 bytes)│ Magic, version, offsets
├──────────────────┤
│ Metadata Section │ Application info, timestamps
├──────────────────┤
│ Device Info │ GPU device details
├──────────────────┤
│ Events Section │ Trace events (variable length)
├──────────────────┤
│ String Table │ Deduplicated strings
├──────────────────┤
│ EOF Marker │
└──────────────────┘
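
The compactness techniques named above (variable-length integers and string interning, together with the delta timestamp encoding mentioned in the feature list) can be sketched as follows. This is a minimal illustration, assuming a LEB128-style unsigned varint and a simple `(timestamp delta, string id)` record; the actual SBT byte layout is not specified in this section, and `StringTable` and `encode_events` are hypothetical names.

```python
def encode_varint(value):
    """LEB128-style unsigned varint: 7 payload bits per byte, high bit = more bytes follow."""
    if value < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


class StringTable:
    """Interns strings so each distinct name is stored once and referenced by id."""

    def __init__(self):
        self.ids = {}
        self.strings = []

    def intern(self, text):
        if text not in self.ids:
            self.ids[text] = len(self.strings)
            self.strings.append(text)
        return self.ids[text]


def encode_events(events):
    """Encode (timestamp_ns, name) pairs, sorted by timestamp, as varint pairs."""
    table = StringTable()
    body = bytearray()
    previous = 0
    for timestamp, name in events:
        body += encode_varint(timestamp - previous)  # delta from previous event
        body += encode_varint(table.intern(name))    # id into the string table
        previous = timestamp
    return bytes(body), table.strings


events = [(1_000_000, "matmul_f32"), (1_000_420, "relu_f32"), (1_000_900, "matmul_f32")]
body, strings = encode_events(events)
print(len(body), strings)
```

Small deltas and small string ids fit in one or two bytes each, which is where the savings over fixed-width timestamps and repeated kernel names come from.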
- Project structure and build system
- Core data structures (TraceEvent, DeviceInfo)
- SBT binary trace format
- Lock-free ring buffer
- Platform abstraction interface
- CLI tools (record, view, info)
- Cross-platform stack capture (macOS/Linux/Windows)
- Symbol resolution with demangling
- GPU kernel call chain capture
- Instruction stream builder
- Dependency analysis
- GPU state machine with stream tracking
- Timeline builder with span generation
- Perfetto export (chrome://tracing format)
- ASCII timeline visualization
- Concurrent operation analysis
- Replay engine with full orchestration
- Stream scheduler with dependency tracking
- Determinism checker with validation
- Partial replay (time/operation ranges)
- Dry-run mode for analysis
- Python bindings (pybind11)
- pip-installable package
- Comprehensive documentation
- Docker support
- Example programs
- TraceSmith Studio GUI (future)
- Homebrew formula (future)
- Perfetto SDK Integration (85% smaller traces)
- Real-time TracingSession with lock-free buffers
- Kineto-compatible schema (thread_id, metadata, FlowInfo)
- Memory profiling (MemoryEvent, MemoryCategory)
- Counter tracks (CounterEvent)
- LLVM XRay import support
- eBPF types for Linux kernel tracing
Contributions are welcome! Please read our Contributing Guide before submitting PRs.
TraceSmith is licensed under the Apache License 2.0. See LICENSE for details.
Tested on NVIDIA GeForce RTX 4090 D (24GB, CUDA 12.8, Driver 570.124.06)
╔═══════════════════════════════════════════════════════════════════════╗
║ Non-intrusive capture of 10,000+ instruction-level GPU call stacks ║
║ ✅ VERIFIED! ║
╚═══════════════════════════════════════════════════════════════════════╝
| Metric | Result | Note |
|---|---|---|
| CUDA Kernels Launched | 10,000 | Real __global__ kernels |
| GPU Events (CUPTI) | 20,011 | Instruction-level events |
| Kernel Launches | 10,000 | Each kernel captured |
| Kernel Completes | 10,000 | Full lifecycle |
| Host Call Stacks | 10,000 | 7 frames/stack avg |
| Events with Stacks | 19,989 (99.9%) | GPU + Host merged |
| Total Time | 107 ms | Non-intrusive |
| Throughput | 93,457 kernels/sec | High performance |
Verified Capabilities:
- ✅ Real CUDA kernels executed on GPU
- ✅ CUPTI captured instruction-level GPU events
- ✅ Host call stacks attached to GPU events
- ✅ Non-intrusive profiling
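
The derived figures in the results table follow from the raw counts. The short check below recomputes them, assuming the 99.9% figure is the share of the 20,011 GPU events that carry a stack (the table does not state the denominator explicitly).

```python
kernels = 10_000
total_time_s = 0.107          # 107 ms total time from the table
gpu_events = 20_011
events_with_stacks = 19_989

throughput = kernels / total_time_s               # kernels per second
per_kernel_us = total_time_s / kernels * 1e6      # average wall time per kernel
coverage = events_with_stacks / gpu_events        # share of events with stacks

print(f"{int(throughput):,} kernels/sec")
print(f"{per_kernel_us:.1f} µs per kernel")
print(f"{coverage:.1%} of events have stacks")
```

The table's throughput value corresponds to the truncated (not rounded) quotient.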
# On NVIDIA GPU server with CUDA
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build
# Build with CUDA support
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make benchmark_10k_stacks -j8
# Run the benchmark
./bin/benchmark_10k_stacks

| Kernel | Duration (ns) | Duration (µs) | Duration (ms) |
|---|---|---|---|
| vectorAdd (1M elements) | 5,313 | 5.31 | 0.0053 |
| matrixMul (512×512) | 66,912 | 66.91 | 0.0669 |
| relu (1M elements) | 4,704 | 4.70 | 0.0047 |
| TOTAL | 76,929 | 76.93 | 0.0769 |
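
The unit columns and the total row can be cross-checked from the nanosecond values alone. The snippet also prints each kernel's share of the total, which is a derived figure and not part of the original table.

```python
durations_ns = {
    "vectorAdd (1M elements)": 5_313,
    "matrixMul (512x512)": 66_912,
    "relu (1M elements)": 4_704,
}

total_ns = sum(durations_ns.values())
for name, ns in durations_ns.items():
    # ns -> µs is a division by 1e3, ns -> ms a division by 1e6
    print(f"{name}: {ns / 1e3:.2f} µs, {ns / 1e6:.4f} ms, {ns / total_ns:.1%} of total")
print(f"TOTAL: {total_ns} ns")
```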
| Phase | Operation | Memory |
|---|---|---|
| Parameters | 5× cudaMalloc | 31 MB |
| Activations | 8× cudaMalloc | 72 MB |
| Gradients | 5× cudaMalloc | 31 MB |
| Workspace | 3× cudaMalloc | 96 MB |
| Total Allocated | 21 operations | 230 MB |
| Total Freed | 16 cudaFree | 199 MB |
| Test Duration | - | 5 ms |
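
The totals in the memory table can be reproduced from the per-phase rows, and the difference between allocated and freed memory is the kind of figure a leak check would report. The sketch below does that bookkeeping by hand; it does not use the TraceSmith MemoryProfiler API.

```python
# (number of cudaMalloc calls, megabytes) per phase, copied from the table
phases = {
    "Parameters": (5, 31),
    "Activations": (8, 72),
    "Gradients": (5, 31),
    "Workspace": (3, 96),
}
freed_calls, freed_mb = 16, 199

alloc_calls = sum(calls for calls, _ in phases.values())
alloc_mb = sum(mb for _, mb in phases.values())

live_calls = alloc_calls - freed_calls   # allocations never freed
live_mb = alloc_mb - freed_mb            # memory still held at the end

print(f"allocated: {alloc_calls} calls, {alloc_mb} MB")
print(f"still live: {live_calls} allocations, {live_mb} MB")
```

The remainder happens to match the count and size of the Parameters row, though the table does not say which buffers stayed allocated.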
| Feature | Performance |
|---|---|
| GPU Event Capture | 93K+ kernels/sec |
| Ring Buffer Throughput | 10K+ events/sec |
| Event Collection Overhead | < 1% |
| SBT File Compression | ~3x vs JSON |
| Perfetto Protobuf | 85% smaller than JSON |
| Stack Capture (no symbols) | ~5 µs/stack |
| Stack Capture (with symbols) | ~13 µs/stack |
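
The ring buffer row refers to the single-producer/single-consumer (SPSC) design listed among the features. The index discipline behind such a buffer can be sketched as below. This is only an illustration of the head/tail logic under the assumption of a power-of-two capacity: Python offers no control over atomics or memory ordering, so it is not a lock-free implementation, and it is not TraceSmith's actual code. In C++ the two indices would typically be `std::atomic` values with acquire/release ordering.

```python
class SpscRingBuffer:
    """Single-producer/single-consumer ring buffer (index logic only)."""

    def __init__(self, capacity):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # next write position, advanced only by the producer
        self._tail = 0  # next read position, advanced only by the consumer

    def try_push(self, item):
        """Producer side: return False instead of blocking when full."""
        if self._head - self._tail == len(self._slots):
            return False
        self._slots[self._head & self._mask] = item
        self._head += 1  # publish only after the slot has been written
        return True

    def try_pop(self):
        """Consumer side: return None when empty (so None cannot be stored)."""
        if self._tail == self._head:
            return None
        item = self._slots[self._tail & self._mask]
        self._tail += 1  # release the slot only after it has been read
        return item


buf = SpscRingBuffer(4)
accepted = [buf.try_push(i) for i in range(5)]  # fifth push finds the buffer full
drained = [buf.try_pop() for _ in range(5)]     # fifth pop finds it empty
print(accepted, drained)
```

Because each index has exactly one writer, neither side ever needs a lock; the producer only reads the tail and the consumer only reads the head.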
✅ RingBuffer Tests (9/9) - Lock-free SPSC buffer
✅ SBT Format Tests (7/7) - Binary trace format
✅ Types Tests (12/12) - Core data structures
✅ Kineto Schema Tests (7/7) - PyTorch compatibility
✅ Kineto V2 Tests (6/6) - Memory & Counter events
✅ TracingSession Tests (10/10) - Real-time tracing
✅ XRay Importer Tests (5/5) - LLVM XRay support
✅ BPF Types Tests (6/6) - eBPF integration
✅ FrameCapture Tests (12/12) - RenderDoc-style capture
✅ MemoryProfiler Tests (12/12) - GPU memory tracking
✅ CUPTI Profiler (14/14) - Real GPU profiling
# Basic installation
pip install tracesmith==0.9.0
# With CuPy for real GPU profiling in Python CLI (choose one):
pip install tracesmith[cuda12] # CUDA 12.x
pip install tracesmith[cuda11] # CUDA 11.x
pip install tracesmith[cuda118] # CUDA 11.8 specific
pip install tracesmith[cuda120] # CUDA 12.0 specific
# With visualization tools
pip install tracesmith[visualization]
# With PyTorch integration
pip install tracesmith[torch]
# All optional dependencies
pip install tracesmith[all]

With CuPy installed, you can run real GPU profiling from Python:
# Install CuPy first
pip install tracesmith[cuda12]
# Run real GPU benchmark
tracesmith-cli benchmark --real-gpu -n 10000

Tested on NVIDIA GPU Server (RTX 4090):
| Feature | Status |
|---|---|
| Core Types (69 exports) | ✅ |
| CUPTIProfiler | ✅ |
| MemoryProfiler | ✅ |
| Frame Capture | ✅ |
| Stack Capture | ✅ |
| BPF Tracing | ✅ (Linux) |
| CLI Tools | ✅ |
TraceSmith provides a comprehensive validation example that tests all features from PLANNING.md:
# Build and run feature validation
cd build
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make goal_validation_example
./bin/goal_validation_example

The benchmark_10k_stacks example uses real CUDA kernels and CUPTI profiling:
// Real CUDA kernel executed on GPU
__global__ void benchmark_kernel(float* data, int n, int kernel_id) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f + static_cast<float>(kernel_id);
    }
}

// Launches 10,000 real kernels with CUPTI profiling
for (int i = 0; i < 10000; ++i) {
    benchmark_kernel<<<blocks, threads>>>(d_data, n, i);
}

| Platform | Profiler | Test Command |
|---|---|---|
| NVIDIA CUDA | CUPTIProfiler | ./bin/cupti_example |
| Apple Metal | MetalProfiler | ./bin/metal_example |
| MetaX MACA | MCPTIProfiler | ./bin/metax_example |
| Huawei Ascend | AscendProfiler | ./bin/tracesmith profile --msprof |
| CPU Fallback | StackCapture | ./bin/stack_capture_example |
| Version | Date | Highlights |
|---|---|---|
| v0.11.1 | 2026-01 | Native ROCm Support - Full AMD GPU profiling via roctracer API, HIP kernel/memory tracing |
| v0.11.0 | 2026-01 | Tracy Integration - Bidirectional Tracy profiler integration, full GPU timeline for Ascend/MetaX/ROCm, real-time visualization, unified profiling macros, 45+ unit tests |
| v0.10.0 | 2025-12 | GDB Integration - GPU-aware debugging via RSP, kernel breakpoints, trace replay debugging, 85 unit tests |
| v0.9.0 | 2025-12 | Huawei Ascend NPU - Full CANN/ACL integration, msprof profiling, Multi-GPU simulation verified |
| v0.8.2 | 2025-12 | CLI Breaking Change - Enforce --nsys/--mctracer for CUDA/MACA, record command blocked, clearer API limitation messages |
| v0.8.1 | 2025-12 | nsys & MACA Enhancement - NVIDIA Nsight Systems integration, MetaX CLI device detection, MACA cluster module support |
| v0.8.1 | 2025-12 | mcTracer Integration - MetaX system-wide profiling, Enhanced MACA CLI, Cluster module support |
| v0.8.0 | 2025-12 | xctrace Integration - Apple Instruments, Cross-Platform Device Utils, Enhanced Examples |
| v0.7.1 | 2025-12 | Multi-GPU Phase 2 - TimeSync, NCCLTracker, ClockCorrelator, CommAnalysis |
| v0.7.0 | 2025-12 | Multi-GPU Cluster - GPUTopology, MultiGPUProfiler, GitHub Actions CI/CD |
| v0.6.9 | 2025-12 | Include reorganization - Directory structure matches src/ layout |
| v0.6.8 | 2025-12 | Enhanced CLI - ASCII banner, all commands, Python CLI |
| v0.6.7 | 2025-12 | Real GPU benchmark - 10K+ CUDA kernels with CUPTI |
| v0.6.5 | 2025-12 | StackCapture bindings, OverflowPolicy, detect_leaks |
| v0.6.2 | 2025-12 | PyPI release, Native extension packaging fix |
| v0.6.0 | 2025-12 | NVIDIA CUPTI integration, Full GPU testing |
| v0.5.0 | 2025-12 | RenderDoc-style frame capture, Resource tracking |
| v0.4.0 | 2025-12 | LLVM XRay, eBPF types, TracingSession, Counter tracks |
| v0.3.0 | 2025-12 | Real-time tracing, Counter events, Memory events |
| v0.2.0 | 2025-12 | Perfetto SDK (85% smaller traces), Kineto schema |
| v0.1.1 | 2025-11 | libunwind, Enhanced Perfetto export, Flow events |
| v0.1.0 | 2025-11 | Initial release: SBT format, Ring buffer, Replay |
TraceSmith draws inspiration from:
- GitHub Issues: Report a bug
- Discussions: Ask questions