feat(factory): put few known_issues cards into factory and add log#21

Open
Dong1017 wants to merge 1 commit into vigo999:refactor-arch-4 from Dong1017:refactor-arch-4

@Dong1017
Collaborator

Summary

Expand incubating/factory/cards/known_issues with a first batch of high-frequency runtime failure cards so Factory can route common Ascend/GPU failures before deeper reasoning.

What Changed

Added these known_issue cards:

  • missing-cann-environment
  • device-out-of-memory
  • distributed-communication-timeout
  • ms-context-empty
  • ms-tbe-operator-compilation-error
  • stack-version-mismatch

Also updated the Factory manifest inventory in incubating/factory/manifests/pack.yaml so the new cards are included in the stable pack.

Why

The existing known_issues set covered only a small slice of failure signatures. In practice, many first-line diagnosis requests cluster around:

  • missing or incomplete CANN setup
  • device OOM
  • distributed/HCCL timeout
  • MindSpore context initialization failures
  • TBE compile failures on Ascend
  • mixed-version / ABI mismatch after partial upgrades

These are strong candidates for Factory because they are:

  • frequent
  • easy to detect from stable log/error patterns
  • high-value for early routing
  • low-ambiguity compared with deeper implementation bugs

Card Design Notes

Each card follows the current known_issue schema and includes:

  • stable id
  • symptom: failure
  • severity and lifecycle metadata
  • platform tags
  • detection regex/patterns
  • short description
  • concise first-line fix summary

The intent is to improve triage and reuse, not to encode full repair playbooks.
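
To make the card shape above concrete, here is a minimal Python sketch of a card with the listed fields and a pattern check. The field names (`severity`, `lifecycle`, `platforms`, `patterns`, `fix_summary`) are assumptions derived from the bullet list, not the actual schema keys, and every value except the card id is a placeholder rather than content from the PR.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownIssueCard:
    # Field names mirror the bullet list above; the real schema keys may differ.
    id: str
    symptom: str
    severity: str
    lifecycle: str
    platforms: list[str]
    patterns: list[str]
    description: str
    fix_summary: str

    def matches(self, log_text: str) -> bool:
        """Return True if any detection pattern is found in the log text."""
        return any(re.search(p, log_text, re.IGNORECASE) for p in self.patterns)


# Illustrative card: the id comes from this PR, all other values are placeholders.
oom_card = KnownIssueCard(
    id="device-out-of-memory",
    symptom="failure",
    severity="high",
    lifecycle="stable",
    platforms=["ascend", "gpu"],
    patterns=[r"out of memory"],
    description="The device ran out of memory while running the workload.",
    fix_summary="Reduce batch size or free device memory, then retry.",
)
```

Calling `oom_card.matches(log_text)` on a captured log would then tell a router whether this card applies. Keeping the fix as a one-line summary is in line with the triage-first intent stated above.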

Impact

This increases the known_issues inventory in the pack and gives Factory better coverage for bootstrap/runtime failures that appear before model- or operator-specific analysis.

Expected user-facing effect:

  • faster routing for common infrastructure/runtime failures
  • fewer cases falling through to generic “investigate manually” paths
  • cleaner separation between known operational failures and deeper code-level issues
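
The routing effect described above could look roughly like the following sketch: a log is checked against each card's pattern, and only unmatched logs fall through to the generic path. The detection patterns, the `route` function, and the `investigate-manually` fallback name are all hypothetical and are not taken from the cards in this PR.

```python
import re

# Hypothetical detection table: card id -> compiled pattern.
# The ids exist in this PR; the patterns are illustrative placeholders.
DETECTORS = {
    "device-out-of-memory": re.compile(r"out of memory", re.IGNORECASE),
    "distributed-communication-timeout": re.compile(
        r"hccl.*timeout|timeout.*hccl", re.IGNORECASE
    ),
}

# Hypothetical name for the generic path used when no card matches.
FALLBACK = "investigate-manually"


def route(log_text: str) -> str:
    """Return the id of the first card whose pattern matches, else the fallback."""
    for card_id, pattern in DETECTORS.items():
        if pattern.search(log_text):
            return card_id
    return FALLBACK
```

First-match routing is the simplest possible policy; a real router might rank by severity or require several patterns to agree before choosing a card.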

Validation

  • Confirm new YAML files conform to the current known_issue card shape.
  • Confirm incubating/factory/manifests/pack.yaml includes the new card entries and updated known_issues count.
  • Spot-check detection patterns and summaries against the intended runtime signatures.
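
The first two checks could be automated along these lines. This is a sketch under stated assumptions: cards and the manifest are represented as plain dicts (as they might look after loading the YAML), and the key names `patterns`, `fix_summary`, `known_issues`, and `known_issues_count` are hypothetical stand-ins for whatever the real schema and `pack.yaml` use.

```python
import re

# Assumed required keys, derived from the Card Design Notes; real keys may differ.
REQUIRED_KEYS = {
    "id", "symptom", "severity", "lifecycle",
    "platforms", "patterns", "description", "fix_summary",
}


def validate_card(card: dict) -> list[str]:
    """Check one card for missing keys and detection patterns that do not compile."""
    errors = []
    card_id = card.get("id", "<no id>")
    missing = REQUIRED_KEYS - set(card)
    if missing:
        errors.append(f"{card_id}: missing keys {sorted(missing)}")
    for pattern in card.get("patterns", []):
        try:
            re.compile(pattern)
        except re.error as exc:
            errors.append(f"{card_id}: invalid pattern {pattern!r}: {exc}")
    return errors


def validate_manifest(manifest: dict, cards: list[dict]) -> list[str]:
    """Check that every card is listed in the manifest and the count is consistent."""
    errors = []
    listed = set(manifest.get("known_issues", []))
    card_ids = {c["id"] for c in cards if "id" in c}
    for card_id in sorted(card_ids - listed):
        errors.append(f"manifest does not list card {card_id}")
    if manifest.get("known_issues_count") != len(listed):
        errors.append("known_issues_count does not match the number of listed cards")
    return errors
```

Both functions return a list of error strings, so an empty list means the check passed. The spot-check of patterns against real runtime signatures still needs sample logs and a human reviewer.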

Dong1017 changed the title from "add: put few known_issues cards into factory and add log" to "feat(factory): put few known_issues cards into factory and add log" on Mar 21, 2026