add GKE TPU skills: exec-gke-tpu and profile-tpu-kernel#19

Open: cjx0709 wants to merge 1 commit into main from feat/gke-tpu-v2

Conversation

cjx0709 (Contributor) commented Mar 30, 2026

  • exec-gke-tpu: provision GKE TPU workloads via xpk, sync code, run multi-process benchmarks on TPU pods (e.g. TPU v7x-8)
  • profile-tpu-kernel: profile Pallas/JAX kernels with xprof LLO utilization (MXU, Vector ALU, etc.)
  • Register gke-tpu plugin in marketplace.json

Summary by CodeRabbit

  • New Features

    • Introduced a new GKE TPU plugin for provisioning and executing multi-process TPU v7x workloads on Google Kubernetes Engine.
    • Added kernel profiling capabilities with detailed performance utilization metrics.
  • Documentation

    • Comprehensive guides for TPU workload provisioning, execution, and kernel profiling with troubleshooting sections.


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

This PR introduces a new gke-tpu plugin with metadata and comprehensive documentation for provisioning and executing TPU v7x workloads on GKE via xpk, along with guidance for profiling Pallas/JAX kernels using xprof to measure LLO utilization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Plugin Registration: `.claude-plugin/marketplace.json` | Registered new gke-tpu plugin entry with version 2.0.0, Apache-2.0 license, and infrastructure category including keywords for GKE, TPU, xpk, Pallas, xprof, profiling, and benchmarking. |
| Plugin Manifest: `plugins/gke-tpu/.claude-plugin/plugin.json` | Created plugin manifest with metadata for the gke-tpu plugin, including name, description covering TPU workload provisioning and profiling capabilities, and version 2.0.0. |
| Skill Documentation: `plugins/gke-tpu/skills/exec-gke-tpu/SKILL.md`, `plugins/gke-tpu/skills/profile-tpu-kernel/SKILL.md` | Added comprehensive guides: exec-gke-tpu covers prerequisites, cluster setup, multi-container workload provisioning/execution workflow, and troubleshooting; profile-tpu-kernel documents end-to-end Pallas kernel profiling with xprof, including trace capture, artifact transfer via GCS, TensorBoard visualization setup, and failure resolution. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A plugin hops in with TPUs so grand,
Xpk and xprof across the land,
Multi-process kernels profiled with care,
GKE clusters blooming everywhere! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately summarizes the main changes: adding two new GKE TPU skills (exec-gke-tpu and profile-tpu-kernel) to the repository. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


gemini-code-assist bot left a comment

Code Review

This pull request introduces the gke-tpu plugin, which provides skills for provisioning and executing workloads on GKE-based TPUs using xpk, as well as profiling Pallas/JAX kernels with xprof. The documentation covers prerequisites, cluster creation, multi-process execution, and TensorBoard integration for trace analysis. Feedback was provided to improve the reusability of the documentation by replacing hardcoded project IDs with placeholders, updating deprecated package references like msgpack-python, and adding explanatory comments for specific dependency version pins.


# 4. Auth
gcloud auth login
gcloud config set project tpu-service-473302

medium

Hardcoding the project ID tpu-service-473302 makes this skill less reusable for other users or projects. Consider using a placeholder like <YOUR_PROJECT_ID> or an environment variable to make it more flexible.

Suggested change
- gcloud config set project tpu-service-473302
+ gcloud config set project <YOUR_PROJECT_ID>
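
The environment-variable alternative mentioned in this comment could look roughly like the sketch below. `PROJECT_ID` and `require_project_id` are names assumed here for illustration; neither appears in the skill under review.

```shell
# Sketch only: PROJECT_ID and require_project_id are hypothetical names.
# Prints the project ID, or fails with a message if the variable is unset or empty.
require_project_id() {
  if [ -z "${PROJECT_ID:-}" ]; then
    echo "PROJECT_ID is not set; export it before running the setup steps" >&2
    return 1
  fi
  printf '%s\n' "$PROJECT_ID"
}

# Intended use (requires gcloud on PATH, so it is left commented out here):
#   gcloud config set project "$(require_project_id)"
```

Failing early when the variable is missing avoids silently configuring gcloud with an empty project.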

huggingface-hub safetensors transformers tiktoken \
setproctitle psutil pandas httpx openai aiohttp \
pybase64 partial_json_parser omegaconf \
msgpack-python requests typing-extensions

medium

The msgpack-python package is deprecated. Please use msgpack instead for better compatibility and to avoid potential issues with outdated versions.

Suggested change
- msgpack-python requests typing-extensions
+ msgpack fastapi orjson uvicorn jinja2 pydantic python-multipart \

'tensorboard-plugin-profile>=2.22' \
'xprof>=2.22' \
'protobuf>=5,<7' \
'setuptools<81'

medium

Pinning setuptools to <81 is a workaround for the pkg_resources issue (as noted in troubleshooting). While necessary for now, this can lead to dependency conflicts. Consider adding a comment here explaining the reason for the pin, or explore a more robust solution that doesn't rely on pkg_resources or is compatible with newer setuptools versions.

Suggested change
- 'setuptools<81'
+ 'setuptools<81' # Pinned due to pkg_resources removal in setuptools >= 82
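
Another place to record the reason for the pin is a requirements file, where full-line `#` comments are allowed. The sketch below assumes the four pins quoted above are the complete set for this step; the temporary file and the `req_file` variable are illustrative and not part of the skill.

```shell
# Sketch only: write the quoted pins to a requirements file with an explanatory comment.
req_file="$(mktemp)"   # hypothetical location; a checked-in file would work equally well

cat > "$req_file" <<'EOF'
tensorboard-plugin-profile>=2.22
xprof>=2.22
protobuf>=5,<7
# Pinned as a workaround for the pkg_resources issue described in troubleshooting.
setuptools<81
EOF

# Intended use (left commented out so the sketch has no side effects):
#   pip install -r "$req_file"
```

Keeping the comment on its own line also sidesteps any question about how a trailing `#` interacts with backslash-continued shell commands.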

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/gke-tpu/skills/exec-gke-tpu/SKILL.md`:
- Around line 15-37: The macOS-specific prerequisites and PATH instructions in
SKILL.md (lines showing brew installs, /Users/... path, and pipx with Python
3.13) need to be scoped and supplemented: update the doc to explicitly label the
current commands as macOS/Homebrew instructions and add brief Linux alternatives
(apt/yum or curl/install steps for Google Cloud SDK, kubectl install, pipx
install command for system python, and an equivalent PATH example using $HOME
instead of /Users/$(whoami)). Also add a short note that Windows users should
follow Cloud SDK and kubectl Windows installers or WSL, and ensure the PATH
section references a cross-platform pattern (e.g., $HOME/.local/bin) rather than
an absolute macOS-only path.
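
To scope instructions per operating system as this prompt asks, a setup script could branch on `uname -s`. The following is a minimal sketch; the function name `detect_platform` and its labels are assumptions, and it deliberately does not spell out Linux or Windows install commands, which the skill would still need to document.

```shell
# Sketch only: map the kernel name to a coarse platform label.
detect_platform() {
  case "$(uname -s)" in
    Darwin) echo "macos" ;;   # the Homebrew-based instructions apply here
    Linux)  echo "linux" ;;   # would need the distribution's own install steps
    *)      echo "other" ;;   # e.g. Windows shells; WSL itself reports Linux
  esac
}

echo "Detected platform: $(detect_platform)"
```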

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2bbb198d-4715-47b4-9450-01d8dc09dcb5

📥 Commits

Reviewing files that changed from the base of the PR and between 0c4dd4d and a5ae946.

📒 Files selected for processing (4)
  • .claude-plugin/marketplace.json
  • plugins/gke-tpu/.claude-plugin/plugin.json
  • plugins/gke-tpu/skills/exec-gke-tpu/SKILL.md
  • plugins/gke-tpu/skills/profile-tpu-kernel/SKILL.md

Comment on lines +15 to +37
The following tools must be installed locally. Install via:

```bash
# 1. Google Cloud SDK
brew install --cask google-cloud-sdk

# 2. kubectl + auth plugin
gcloud components install kubectl gke-gcloud-auth-plugin beta --quiet

# 3. xpk (must use Python 3.13, NOT 3.14 which has argparse incompatibility)
brew install pipx
pipx install xpk --python python3.13

# 4. Auth
gcloud auth login
gcloud config set project tpu-service-473302
gcloud auth application-default login
```

**PATH setup** (needed in every shell/command):
```bash
export PATH="/Users/$(whoami)/.local/bin:/opt/homebrew/bin:/opt/homebrew/share/google-cloud-sdk/bin:/usr/bin:$PATH"
```

⚠️ Potential issue | 🟡 Minor

Clarify OS scope for setup commands (currently macOS-specific).

The prerequisite and PATH instructions are Homebrew/macOS-specific (brew, /Users/...) but the doc doesn’t explicitly scope this section to macOS or provide Linux alternatives. This can cause setup failures for non-macOS users.
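
A PATH setup that avoids the `/Users/...` assumption might look like the sketch below. It uses `$HOME` and adds a directory only when it exists, so the Homebrew entries are skipped on machines without them. The helper name `prepend_path` is hypothetical, and the Homebrew directories are copied from the quoted snippet rather than verified for every installation.

```shell
# Sketch only: prepend a directory to PATH if it exists and is not already listed.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;   # already on PATH, nothing to do
  esac
  if [ -d "$1" ]; then
    PATH="$1:$PATH"
  fi
}

prepend_path "$HOME/.local/bin"                          # pipx-installed tools such as xpk
prepend_path "/opt/homebrew/bin"                         # Homebrew prefix from the quoted snippet
prepend_path "/opt/homebrew/share/google-cloud-sdk/bin"  # Cloud SDK path from the quoted snippet
export PATH
```

Because missing directories are ignored, the same lines can be sourced on macOS and Linux without edits.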

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 28-28: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

