
Claude/debug lam colab q ub zi#100

Open
mirai-gpro wants to merge 206 commits into aigc3d:master from mirai-gpro:claude/debug-lam-colab-qUbZI

Conversation

@mirai-gpro

No description provided.

claude and others added 30 commits February 7, 2026 13:02
Root cause: defaults.py's default_setup() and default_config_parser()
assume a distributed training environment with writable filesystem.
On Cloud Run (read-only /app), this causes silent init failures.

Changes:
- app.py: Skip default_setup() entirely, manually set CPU/single-process config
- app.py: Redirect save_path to /tmp (only writable dir on Cloud Run)
- app.py: Add GCS FUSE mount path resolution with Docker-baked fallback
- cloudbuild.yaml: Add Cloud Storage FUSE volume mount for model serving
- cloudbuild.yaml: Increase max-instances to 4
- Include handoff docs and full LAM_Audio2Expression codebase

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
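The save-path redirect above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code; the function name and directory names are placeholders:

```python
import os
import tempfile

def resolve_save_path(preferred, fallback=None):
    """Return a writable output directory. On Cloud Run the app
    filesystem (/app) is read-only, so fall back to /tmp, the only
    writable location. Names here are illustrative."""
    fallback = fallback or os.path.join(tempfile.gettempdir(), "lam_outputs")
    try:
        os.makedirs(preferred, exist_ok=True)
        return preferred
    except OSError:
        # Read-only or otherwise uncreatable: use the writable fallback.
        os.makedirs(fallback, exist_ok=True)
        return fallback
```

Trying the directory creation (rather than probing with `os.access`) avoids a TOCTOU gap and also covers the "parent exists but is read-only" case.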
The LAM model file was misidentified as .tar but is actually a PyTorch
weights file. Gemini renamed it to .pth on GCS. Also source wav2vec2
config.json from the model directory instead of LAM configs/.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
- Import gourmet-sp from implementation-testing branch
- Add sendAudioToExpression() to shop introduction TTS flow
  (firstShop and remainingShops now get lip sync data before playback)
- Remove legacy event hooks in concierge-controller init()
  (replaced with clean linkTtsPlayer helper)
- Clean up LAMAvatar.astro: remove legacy frame playback code
  (startFramePlaybackFromQueue, stopFramePlayback, frameQueue, etc.)
- Simplify to single sync mechanism: frameBuffer + ttsPlayer.currentTime
- Reduce health check interval from 2s to 10s

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Using official LAM sample avatar as placeholder. Will be replaced with
custom-generated avatar later.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
- Add fade-in/fade-out smoothing (6 frames / 200ms) to prevent
  Gaussian Splat visual distortion at speech start/end
- Parallelize expression generation with TTS synthesis:
  remaining sentence expression is pre-fetched during first
  sentence playback, eliminating wait time between segments
- Add fetchExpressionFrames() for background expression fetch
  with pendingExpressionFrames buffer swap pattern
- Apply same optimization to shop introduction flow

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
sendAudioToExpression fetch could hang indefinitely (Cloud Run cold
start / service down), blocking await and preventing TTS play().

- Add AbortController timeout (8s) to all expression API fetches
- Wrap expression await with Promise.race so TTS plays even if
  expression API is slow/down (lip sync degrades gracefully)
- Applied to speakTextGCP, speakResponseInChunks, and shop flow

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Root cause: sendAudioToExpression fetch hung in browser, blocking
await and preventing TTS play() from ever being called.

Fix: all expression API calls are now fire-and-forget, so TTS playback
starts immediately without waiting for expression frames. Frames
arrive asynchronously and getExpressionData() picks them up in
real-time from the frameBuffer.

- Remove await/Promise.race from all sendAudioToExpression calls
- Remove fetchExpressionFrames and pendingExpressionFrames
  (no longer needed; direct fire-and-forget is simpler)
- Keep AbortController timeout (8s) inside sendAudioToExpression
  to prevent leaked connections

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
… calls

Architecture change: expression frames are now returned WITH TTS audio
from the backend, instead of the frontend calling audio2exp directly.

Backend (app_customer_support_modified.py):
- Replace fire-and-forget send_to_audio2exp with get_expression_frames
  that returns {names, frames, frame_rate}
- Send MP3 directly to audio2exp (no separate PCM generation needed)
- TTS response: {success, audio, expression: {...}}
- Server-to-server communication: no CORS, stable, fast

Frontend (concierge-controller.ts):
- New queueExpressionFromTtsResponse() reads expression from TTS response
- Remove sendAudioToExpression (direct browser→audio2exp REST calls)
- Remove audio2expApiUrl, audio2expWsUrl, connectLAMAvatarWebSocket
- Remove EXPRESSION_API_TIMEOUT_MS, AbortController timeout
- Existing 1st-sentence-ahead pattern now automatically includes
  expression data (no separate API call needed)

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
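The combined payload described above can be sketched as follows. Field names ({success, audio, expression: {names, frames, frame_rate}}) follow the commit text; the helper itself is a hypothetical sketch, not code from the PR:

```python
import base64

def build_tts_response(audio_bytes, names, frames, frame_rate=30.0):
    """Bundle TTS audio and expression frames into one response so the
    browser never calls audio2exp directly (server-to-server, no CORS)."""
    return {
        "success": True,
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "expression": {
            "names": names,            # blendshape/parameter names
            "frames": frames,          # per-frame coefficient rows
            "frame_rate": frame_rate,  # playback rate in fps
        },
    }
```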
…orget proxy

- Backend: TTS endpoint no longer blocks on expression generation
- Backend: New /api/audio2expression proxy (server-to-server, CORS-free)
- Frontend: All expression calls use fireAndForgetExpression() (never blocks TTS play)
- Removes ~2s first-sentence delay caused by synchronous expression in TTS

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
…aining

Two bugs fixed:
1. Buffer corruption: frames from segment 1 mixed with segment 2
   (ttsPlayer.currentTime resets but frameBuffer was concatenated)
   → Now clear buffer before each new TTS segment

2. 3-second delay: expression frames arrived after TTS started playing
   → Pre-fetch remaining segment's expression during first segment playback
   → When second segment starts, pre-fetched frames are immediately available

New prefetchExpression() method returns Promise with parsed frames,
applied non-blocking via .then() to never delay TTS playback.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Architecture change: backend includes expression data in TTS response
(server-to-server audio2exp call ~150ms) instead of separate proxy.

- Backend TTS endpoint calls audio2exp synchronously, includes result
- Frontend applyExpressionFromTts(): instant buffer queue from TTS data
- Proxy fireAndForgetExpression kept as fallback (timeout/error cases)
- All 5 call sites (speakTextGCP, speakResponseInChunks x2, shop x2) updated
- Removes prefetch complexity (TTS response already carries expression)

Result: lip sync starts from frame 0, no 2-3 second gap.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Architecture redesign for true zero-delay TTS playback:
- Backend TTS endpoint starts audio2exp in background thread, returns
  audio + expression_token immediately (no blocking)
- New /api/expression/poll endpoint: frontend polls for result
- Frontend pollExpression(): fire-and-forget polling at 150ms intervals
- Removes sync expression, proxy, and prefetch approaches

Timeline: TTS returns ~500ms, audio2exp completes ~150ms later (background),
frontend first poll arrives ~200ms after TTS → expression available ~350ms
after playback starts. Previous: 2-3 seconds delay or TTS blocked.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
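The token-plus-polling design above can be sketched as below, assuming an in-process cache; the real backend presumably serves the result via /api/expression/poll, and all names here are illustrative:

```python
import threading
import uuid

# In-process result cache keyed by expression_token (illustrative).
_expression_cache = {}

def start_expression_job(audio, compute_fn):
    """Run audio2exp in a background thread and return a token at once,
    so the TTS response is never blocked on expression generation."""
    token = uuid.uuid4().hex
    def worker():
        _expression_cache[token] = compute_fn(audio)
    threading.Thread(target=worker, daemon=True).start()
    return token

def poll_expression(token):
    """Return the result if ready, else None (the frontend retries
    roughly every 150 ms)."""
    return _expression_cache.get(token)
```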
…aster response

Backend: revert to sync expression in TTS response (remove async cache/polling).
Frontend: replace pollExpression with applyExpressionFromTts (sync from TTS response).
Frontend: fire sendMessage() immediately while ack plays (don't await firstAckPromise).
pendingAckPromise is awaited before TTS playback to prevent ttsPlayer conflict.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
…nterrupt)

unlockAudioParams() does play→pause→reset on ttsPlayer for iOS unlock.
When called during ack playback (parallel LLM mode), it kills the ack audio.
Skip it when pendingAckPromise is active (audio already unlocked by ack).

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
…rentAudio safety

Root cause: the ack "はい" ("yes") gets paused (not ended) by some interruption, so
pendingAckPromise never resolves → speakResponseInChunks stuck forever.
Fix 1: resolve pendingAckPromise on both 'ended' and 'pause' events.
Fix 2: call stopCurrentAudio() after pendingAckPromise resolves to ensure
ttsPlayer is clean before new TTS playback.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
- Container: max-height 650px → height calc(100dvh - 40px), max-height 960px
- Avatar stage: 140px → 300px (desktop), 100px → 200px (mobile)
- Chat area: min-height 150px guaranteed for message display

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Post-init camera: Z 1→0.6 (closer), Y 1.8→1.75 (slight down), FOV 50→36 (zoom in).
Eliminates wasted space above avatar head in the 300px avatar-stage.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Previous: lookAt y=1.8 (head center) + tight zoom → mouth cut off at bottom.
Fix: lower target to y=1.62 (nose/mouth center), adjust OrbitControls target
to match. Camera Z=0.55, FOV=38 for balanced framing.

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
targetY 1.62→1.66 (avatar lower in frame), camera Y 1.62→1.72
(above target, slight downward angle instead of looking up from below)

https://claude.ai/code/session_01C6n4TZ9PPdx46jCevmVo7P
Key improvements over existing lam_modal.py:
- @modal.asgi_app() + Gradio 4.x instead of subprocess + patching
- Direct Python integration with LAM pipeline (no regex patching)
- Blender 4.2 included for GLB generation (OpenAvatarChat format)
- Focused UI for concierge.zip generation with progress feedback
- Proper ASGI serving resolves Gradio UI display issue on Modal

Pipeline: Image → FLAME Tracking → LAM Inference → Blender GLB → ZIP

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
Major update to concierge_modal.py:
- Custom video upload: VHAP FLAME tracking extracts per-frame
  expression/pose parameters from user's own motion video
- Video preprocessing pipeline: frame extraction, face detection
  (VGGHead), background matting, landmark detection per frame
- VHAP GlobalTracker integration for multi-frame optimization
- Export to NeRF dataset format (transforms.json + flame_param/*.npz)
- Gradio UI: motion source selector (custom video or sample)
- Preview video with optional audio from source video
- Max 300 frames (10s@30fps) cap for manageable processing

This enables generating high-quality concierge.zip with custom
expressions/movements instead of being limited to pre-set samples.

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
- Replace add_local_dir("./assets") with HuggingFace downloads for all
  required model assets (FLAME tracking, parametric models, LAM assets)
- Remove REQUIRED_ASSET local check since assets are fetched at build time
- Build VHAP config programmatically instead of loading from YAML file
- Remove deprecated allow_concurrent_inputs parameter
- Add flame_vhap symlink for VHAP tracking compatibility
- Add critical file verification in _download_models()

Fixes FileNotFoundError: flame2023.pkl not found in container

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
Replace container-build-time HuggingFace downloads with add_local_dir
to mount model files from the user's local LAM repo. This is faster
and avoids dependency on HuggingFace availability.

- Add _has_model_zoo / _has_assets detection at module level
- Mount ./model_zoo and ./assets via add_local_dir (conditional)
- Add _setup_paths() to bridge directory layout differences:
  - assets/human_parametric_models → model_zoo/human_parametric_models
  - flame_assets/flame2023.pkl → flame_assets/flame/ (flat layout)
  - flame_vhap symlink for VHAP tracker
- Add model file verification with find-based search

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
Modal requires add_local_dir to be the last image build step.
Move _setup_model_paths() from run_function (build time) to
_init_lam_pipeline() (container startup) to comply with this.

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
User keeps all models under assets/ (not model_zoo/).
Instead of symlinking individual subdirectories, symlink the entire
model_zoo -> assets when model_zoo doesn't exist. This bridges
lam_models, flame_tracking_models, and human_parametric_models
all at once.

Also adds model.safetensors to the verification checklist.

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
Three files are not available locally and must be downloaded:
- model.safetensors (LAM-20K model weights from 3DAIGC/LAM-20K)
- template_file.fbx, animation.glb (from Ethan18/test_model LAM_assets.tar)

Download runs via run_function BEFORE add_local_dir to satisfy
Modal's ordering constraint. Downloads are cached in the image layer.

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
1. Downloaded LAM assets (template_file.fbx, animation.glb) were
   being overwritten by the add_local_dir mount of assets/.
   Fix: copy extracted assets into model_zoo/ during build so they
   survive the mount. Update all path references accordingly.

2. Pin gradio==4.44.0 and gradio_client==1.3.0 to avoid the
   json_schema_to_python_type TypeError on additionalProperties.

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
1. Switch assets download from Ethan18/test_model (incomplete) to
   official 3DAIGC/LAM-assets which includes sample_oac/ with
   template_file.fbx and animation.glb.

2. Monkey-patch gradio_client._json_schema_to_python_type to handle
   boolean additionalProperties schema (TypeError on bool).

https://claude.ai/code/session_01XXVR6KsYFAQiJjHvdzCzoK
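A sketch of the monkey-patch pattern for point 2. The attribute name `_json_schema_to_python_type` and its failure on boolean schemas are taken from the commit message; its exact module location may differ across gradio_client versions, so the wiring is shown only in comments:

```python
def tolerate_bool_schema(fn):
    """Wrap a json-schema-to-type converter so boolean schemas
    (`additionalProperties: true/false` in JSON Schema) map to "Any"
    instead of raising TypeError on a bool."""
    def wrapper(schema, *args, **kwargs):
        if isinstance(schema, bool):
            return "Any"
        return fn(schema, *args, **kwargs)
    return wrapper

# Usage sketch against gradio_client (assumed location of the helper):
# import gradio_client.utils as gc_utils
# gc_utils._json_schema_to_python_type = tolerate_bool_schema(
#     gc_utils._json_schema_to_python_type)
```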
claude and others added 30 commits March 1, 2026 11:31
Removed .force_build(), which is absent from the Gemini original.
Now matches a9bffb4 (the user-uploaded version) exactly.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Leave app_modal.py untouched and build from scratch on the lam_avatar_batch.py side.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
force_build() is not an Image method.
A from-scratch build is requested via the environment variable MODAL_FORCE_BUILD=1.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
… step

The Modal image build failed because the sed command referenced a
non-existent file "pixlwise.py". The actual filename is "pixelwise.py".

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Runs the same pipeline as official app_hf_space.py directly on Colab
with Google Drive mounted for model weights. Useful for debugging
the bird-monster issue independently of the Modal environment.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
…book

The official code has [1.0, 1,0], which Python parses as the 3-element
list [1.0, 1, 0]. Fixed to [1.0, 1.0] (2 elements) for correctness.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Change prepare_motion_seqs enlarge_ratio from [1.0, 1.0] to [1.0, 1, 0]
to match the official LAM_Large_Avatar_Model/app.py exactly.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Cache pytorch3d, diff-gaussian-rasterization, and simple-knn wheels
to /content/drive/MyDrive/LAM/wheel_cache/. On first run, builds from
source and saves the wheel. On subsequent sessions, installs from
cache in seconds instead of 15-25 min rebuild.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
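The install-from-cache-or-build decision can be sketched as a helper that returns the pip command to run. This is an illustration under assumed conventions (wheel filenames use underscores; the source spec is a placeholder), not the notebook's actual cell:

```python
import glob
import os
import sys

def pip_command_for(pkg, source, cache_dir):
    """Return the pip invocation: install straight from a cached wheel
    when one exists, otherwise build a wheel from source into the cache
    so the next Colab session installs in seconds instead of a
    15-25 minute rebuild."""
    cached = sorted(glob.glob(
        os.path.join(cache_dir, f"{pkg.replace('-', '_')}*.whl")))
    if cached:
        return [sys.executable, "-m", "pip", "install", cached[0]]
    return [sys.executable, "-m", "pip", "wheel", source,
            "--wheel-dir", cache_dir]
```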
The official GitHub repo (aigc3d/LAM) has proven non-functional after
100+ hours of testing. Switch to the ModelScope Studio source, which is
the working reference implementation.

- Cell 14: clone from modelscope.cn/studios/Damo_XR_Lab/LAM_Large_Avatar_Model
- Cell 20: handle ModelScope layout (files at root, not in tools/)

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
…Scope clone

ModelScope (China server) is too slow from Colab. Use the LAM_Large_Avatar_Model
subdirectory from our own repo with shallow clone for fast setup (~150MB).

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
…n Colab

Colab Python 3.12 ships numpy 2.x. Force-installing 1.26.4 breaks
torch._dynamo and other prebuilt packages (dtype size mismatch).

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
- Force-reinstall numpy 1.26.x before other packages to ensure
  consistent binary interface across scipy, torch, pytorch3d etc.
- Add importlib.reload(np) in the path-setup cell to refresh the numpy
  module before the torch._dynamo import triggers numpy.random.mtrand

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
…n mismatch

torch._dynamo.eval_frame may fail with ImportError when Colab's
preinstalled PyTorch version conflicts with cu121 xformers.
The monkey-patch on torch.compile is sufficient to disable compilation.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
When Colab upgrades PyTorch, cached .so files compiled against the old
version fail with 'undefined symbol' errors. Cache path now includes
torch.version.cuda (e.g. wheel_cache/torch12.1/) so wheels are
automatically rebuilt when the CUDA toolkit version changes.

Affects: pytorch3d, diff-gaussian-rasterization, simple-knn cells.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
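The version-keyed cache path amounts to one join; a minimal sketch, with `cuda_version` standing in for `torch.version.cuda` so the example stays import-free:

```python
import os

def wheel_cache_dir(base, cuda_version):
    """Key the wheel cache by CUDA toolkit version (wheel_cache/torch12.1/)
    so .so files compiled against an older toolchain are rebuilt rather
    than reused after Colab upgrades PyTorch."""
    return os.path.join(base, f"torch{cuda_version}")
```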
scipy compiled against numpy 2.x causes infinite recursion when used
with numpy 1.26.x. Now both are pinned together in cell-7 before any
other packages are installed.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
- cell-7: xformers --no-deps to prevent numpy 2.x override
- cell-7: remove 2>/dev/null so install errors are visible
- cell-9: add numpy pin + post-install assertion to catch overrides
- cell-20: remove fragile importlib.reload(np), just import normally

Root cause: later pip installs (rembg, scikit-image, xformers) could
silently upgrade numpy to 2.x, breaking scipy's C extensions and
causing RecursionError on import.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
onnxruntime-gpu==1.18.1 forces numpy 2.0.2 as a dependency,
overriding our numpy<2 pin. Fix by moving the force-reinstall
to AFTER all pip installs complete, then verifying with assert.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
- cell-7: detect preinstalled xformers first, skip cu121-specific index
  (Colab now ships torch 2.10+cu128, not cu121)
- cell-9: verify numpy version via subprocess (new Python process) instead
  of importlib.reload() which cannot reload C extension modules in-memory

The previous assert was a false positive: numpy 1.26.4 was correctly
installed on disk, but the in-memory module still showed 2.0.2.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
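The subprocess-based check can be sketched as below; the helper name is illustrative, and the point is that a fresh interpreter sees the on-disk install rather than the stale C-extension module loaded in the current kernel:

```python
import subprocess
import sys

def on_disk_version(module):
    """Report a module's installed version from a fresh interpreter.
    importlib.reload() cannot reload C-extension modules, so checking
    in-process can show the stale in-memory version (here, numpy 2.0.2
    even after 1.26.4 was written to disk)."""
    out = subprocess.run(
        [sys.executable, "-c",
         f"import {module}; print({module}.__version__)"],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

# In the notebook: assert on_disk_version("numpy").startswith("1.26")
```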
Root cause: importing numpy in cell-7 locks the version in memory.
When onnxruntime-gpu later installs numpy 2.x, the force-reinstall
puts 1.26.4 back on disk but memory keeps the stale version.
importlib.reload() cannot reload C extension modules.

Solution: remove all `import numpy` from install cells (7, 9).
The first real import now happens in cell-20, after all pip installs
are complete and numpy 1.26.4 is the final on-disk version.
No kernel restart needed.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Cells 23 and 26 import from the `lam` package but fail with
ModuleNotFoundError if cell-20 was skipped or errored.
Add a defensive sys.path.insert(0, '/content/LAM') check at the
top of each cell so they work even if run out of order.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
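The defensive guard described above is a two-liner, placed at the top of each affected cell:

```python
import sys

LAM_ROOT = "/content/LAM"  # where the repo is cloned on Colab

# Guard for cells that import the `lam` package, so they still work
# when the setup cell was skipped or cells are run out of order.
if LAM_ROOT not in sys.path:
    sys.path.insert(0, LAM_ROOT)
```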
…tion

Colab doesn't show cell indices, making it hard to identify cells
when discussing errors. Added clear labels like [1.4], [4.4], [5.2]
to the top of every code cell so users can easily find and reference
specific cells.

Section mapping:
  [0.1]       Google Drive mount
  [1.1]-[1.9] Environment setup (CUDA, packages, builds)
  [2.1]-[2.3] LAM repo clone & setup
  [3.1]       Image upload
  [4.1]-[4.5] Pipeline initialization
  [5.1]-[5.5] Inference execution
  [6.1]-[6.3] Results display & download
  [Debug]     Optional diagnostics

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
Show a clear error message telling the user to run [4.1]-[4.3] first,
instead of a cryptic NameError.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
chumpy 0.70 uses inspect.getargspec which was removed in Python 3.11.
Colab now runs Python 3.12, causing AttributeError when FLAME model
loads chumpy via pickle. Added sed patch in [1.4] to replace
getargspec with getfullargspec in chumpy/ch.py.

https://claude.ai/code/session_01XchDoYiekhyaAWFvV5DXVq
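A Python equivalent of the [1.4] sed patch, shown as a sketch; the target path is wherever pip installed chumpy. Plain string replacement is safe and idempotent here because "getfullargspec" does not contain "getargspec" as a substring:

```python
from pathlib import Path

def patch_chumpy_getargspec(path):
    """Rewrite inspect.getargspec -> inspect.getfullargspec in a legacy
    module such as chumpy/ch.py; getargspec was removed in Python 3.11."""
    p = Path(path)
    p.write_text(p.read_text().replace("getargspec", "getfullargspec"))
```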
Merge Drive mount, env setup, CUDA builds, LAM clone, symlinks into
one cell for one-click execution. Image upload stays separate (interactive).

Before: 15 cells (cell-2 through cell-16) requiring manual sequential execution
After:  1 cell [Setup] that runs everything automatically after Drive auth

https://claude.ai/code/session_016GSif9xGhfv1eGTDmYj6RR
After pip installing numpy 1.26 (replacing Colab's numpy 2.x), the kernel
process still has the old numpy C extensions in memory. This causes
ValueError when importing torch._dynamo in [4.1].

Added os.kill(os.getpid(), 9) at end of [Setup] to force kernel restart.
Drive mount, cloned repos, and uploaded files survive the restart.

https://claude.ai/code/session_016GSif9xGhfv1eGTDmYj6RR
… crash

Colab's Jupyter kernel passes '-f kernel-xxx.json' in sys.argv.
FlameTrackingSingleImage.__init__ uses argparse internally which does
not recognize -f, causing SystemExit: 2.

Temporarily set sys.argv = [''] before instantiation, restore after.

https://claude.ai/code/session_016GSif9xGhfv1eGTDmYj6RR
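The save/restore around instantiation reads naturally as a context manager; a sketch (constructor arguments elided, as in the commit):

```python
import sys
from contextlib import contextmanager

@contextmanager
def clean_argv():
    """Temporarily hide Jupyter's '-f kernel-xxx.json' arguments so code
    that parses sys.argv via argparse (here, FlameTrackingSingleImage's
    __init__) doesn't exit with SystemExit: 2."""
    saved = sys.argv
    sys.argv = [""]
    try:
        yield
    finally:
        sys.argv = saved

# Usage sketch:
# with clean_argv():
#     tracking = FlameTrackingSingleImage(...)
```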
…ault change)

PyTorch 2.6 changed torch.load default to weights_only=True, breaking:
- VGGDetector.py: TorchScript archive -> use torch.jit.load instead
- All other torch.load calls: add weights_only=False for trusted model files

Also patches all torch.load calls in LAM repo (excluding audio2exp-service
which already has weights_only=False).

https://claude.ai/code/session_016GSif9xGhfv1eGTDmYj6RR
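The bulk torch.load rewrite can be sketched as a regex pass over a source file. This covers only the weights_only part (not the separate torch.jit.load fix for the TorchScript archive) and is deliberately crude: it handles single-line calls with no nested parentheses and skips calls that already pass the argument:

```python
import re
from pathlib import Path

def force_weights_only_false(path):
    """Append weights_only=False to bare torch.load(...) calls,
    restoring pre-2.6 behaviour for trusted checkpoint files."""
    def repl(m):
        args = m.group(1)
        if "weights_only" in args:
            return m.group(0)  # already explicit: leave untouched
        return f"torch.load({args}, weights_only=False)"
    p = Path(path)
    p.write_text(re.sub(r"torch\.load\(([^()]*)\)", repl, p.read_text()))
```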