From e74c730883d7e8a58230edd5a95474fd854ed330 Mon Sep 17 00:00:00 2001 From: Facebook GitHub Bot Date: Fri, 22 May 2026 17:51:11 -0700 Subject: [PATCH] Re-sync with internal repository The internal and external repositories are out of sync. This Pull Request attempts to brings them back in sync by patching the GitHub repository. Please carefully review this patch. You must disable ShipIt for your project in order to merge this pull request. DO NOT IMPORT this pull request. Instead, merge it directly on GitHub using the MERGE BUTTON. Re-enable ShipIt after merging. fbshipit-source-id: a316528d8af69b3aef9b9261735e9f62996a2019 --- fboss/skills/fboss-code-standards/SKILL.md | 81 ++++++++++++ .../references/agent-patterns.md | 92 ++++++++++++++ .../references/contributing.md | 38 ++++++ .../references/general-patterns.md | 117 ++++++++++++++++++ .../references/platform-config-patterns.md | 28 +++++ .../references/sai-sdk-patterns.md | 48 +++++++ .../references/testing-patterns.md | 93 ++++++++++++++ .../references/thrift-cow-fsdb-patterns.md | 28 +++++ fboss/skills/fboss-review/SKILL.md | 86 +++++++++++++ .../fboss-review/references/contributing.md | 32 +++++ .../references/fboss-specific-reviewers.md | 117 ++++++++++++++++++ .../references/generic-reviewers.md | 68 ++++++++++ .../fboss-review/references/verifier.md | 44 +++++++ 13 files changed, 872 insertions(+) create mode 100644 fboss/skills/fboss-code-standards/SKILL.md create mode 100644 fboss/skills/fboss-code-standards/references/agent-patterns.md create mode 100644 fboss/skills/fboss-code-standards/references/contributing.md create mode 100644 fboss/skills/fboss-code-standards/references/general-patterns.md create mode 100644 fboss/skills/fboss-code-standards/references/platform-config-patterns.md create mode 100644 fboss/skills/fboss-code-standards/references/sai-sdk-patterns.md create mode 100644 fboss/skills/fboss-code-standards/references/testing-patterns.md create mode 100644 fboss/skills/fboss-code-standards/references/thrift-cow-fsdb-patterns.md create mode 100644 fboss/skills/fboss-review/SKILL.md create mode 100644 fboss/skills/fboss-review/references/contributing.md create mode 100644 fboss/skills/fboss-review/references/fboss-specific-reviewers.md create mode 100644 fboss/skills/fboss-review/references/generic-reviewers.md create mode 100644 fboss/skills/fboss-review/references/verifier.md diff --git a/fboss/skills/fboss-code-standards/SKILL.md b/fboss/skills/fboss-code-standards/SKILL.md new file mode 100644 index 0000000000000..c29329ea53245 --- /dev/null +++ b/fboss/skills/fboss-code-standards/SKILL.md @@ -0,0 +1,81 @@ +--- +name: fboss-code-standards +description: FBOSS coding standards and patterns. Auto-loaded when writing code in fboss/ to catch architecture violations, SAI/SDK misuse, thrift_cow pitfalls, platform config errors, and testing gaps. For explicit multi-reviewer review, use /fboss-review instead. +--- + +# FBOSS Code Standards + +## Overview + +Passive coding guidance for FBOSS. Applied automatically while writing or modifying code under `fboss/`. + +## Scope + +Currently `fboss/` only. TODO: extend to `configerator/source/neteng/fboss`, `neteng/netcastle`, `neteng/fboss`. + +## Quick Checklist + +| Area | Pattern | Check | +|------|---------|-------| +| Agent | Mono/multi-switch duality | State changes must work in both modes | +| Agent | Warmboot serialization | New SwitchState fields must serialize/deserialize | +| SAI | SaiApiTable registration | New SAI attributes must be registered | +| SAI | SaiStore consistency | SAI objects must be tracked, no orphans | +| FSDB | State/Stats duality | Always handle both trees | +| FSDB | extern template | New ThriftStructNode instantiations need extern template | +| thrift_cow | COW modification | Use `modify()`, never mutate shared nodes | +| Platform | JSON + Thrift sync | Config changes update both schemas | +| Testing | Naming convention | Follow `AgentHwTest` pattern | +| Testing | NSDB impact | Core FSDB changes must run NSDB tests | +| General | Follow local patterns | Be consistent with existing code in the directory | +| Testing | No GTEST_SKIP | Use ProductionFeatures filtering, never `GTEST_SKIP()` | +| Testing | DSF counters | Check reassembly errors on fabric ports, not discards | +| Agent | Non-coalescing behavioral deltas | Mark delta-significant updates non-coalescing | +| Agent | Rolled-out flag cleanup | Remove feature flags that are fully rolled out | +| Agent | Early return on empty deltas | Check for empty deltas and return early | +| Agent | No hardcoded ASIC types | Use feature/property lookups, not ASIC name checks | +| SAI | Explicit cancellation | Distinguish reconnectable errors from shutdown signals | +| General | No private fn defaults | Don't default parameters all callers override | +| General | Explicit state flags | Pass all enable/disable flags, no direction assumptions | +| General | Verify before deleting | Provide evidence files are unused before deleting | +| General | CHECK over assert | Use CHECK for production invariants, not assert() | +| General | emplace over operator[] | Use emplace/insert for map insertions | +| General | ASIC feature gating | Gate HW-specific constraints with ASIC features | +| General | Concise method naming | Don't include class name in method name | +| General | Explicit operator precedence | Parenthesize compound boolean/comparison expressions | +| General | Inline trivial comparisons | Don't wrap one-line checks in helper methods | +| General | Fix lint before landing | Run `arc f` and `arc lint` before submitting | +| Testing | Targeted unit tests | New modules need dedicated unit tests, not just HW tests | +| Agent | Documentation-value methods | Keep per-ASIC limit methods as architecture docs | +| General | Rate-limit verbose logging | Use XLOG_EVERY_MS, never log at line rate | +| General | Perspective-aware naming | Use fromX/toX instead of ambiguous Rx/Tx | +| General | Update OSS build files | Update CMake alongside BUCK when adding/moving files | +| Testing | Deep comparison over size checks | Compare content, not just collection size | +| Testing | Verify negative cases | Explicitly check absence, not just no-throw | +| Testing | Test overflow and error paths | Cover resource exhaustion, not just happy path | +| Testing | Optimize test execution time | Use smallest packet sizes, avoid unnecessary waits | +| Agent | ODS counters for debugging | Add ODS counters for drop/error events, not just logs | +| Agent | Validate external config limits | Reject unreasonable externally configurable values | +| Agent | Populate warmboot cache first | Fill all caches from warmboot state before processing | +| General | Empty vector over optional\ | Use empty vector, not optional\ | +| General | Map-based config | Use map lookups, not if-else chains for platform values | +| General | Log IDs + names | Include both numeric ID and name in log messages | +| Agent | Delta processing order | Remove → Change → Add; document in comments | +| Agent | Validator extraction | Extract complex validation into dedicated classes | +| Agent | Gate validations with flags | New enforcement behind FLAGS for gradual rollout | +| Testing | Test transient states | Test intermediate invalid states, not just final | +| Testing | Consolidate test helpers | Move shared helpers to base classes | +| Testing | Validation UTs | Pure validation → unit tests, not just HW tests | +| Testing | No setup changes to existing tests | Create new tests instead; setup changes break warmboot roundtrip CI | + +## When to Load References + +| Topic | Action | +|-------|--------| +| Team coding conventions (C++ style, control flow, naming) | Read `references/general-patterns.md` | +| Agent code (SwSwitch, HwSwitch, mono/multi-switch, warmboot) | Read `references/agent-patterns.md` | +| SAI/SDK layer (SaiApi, SaiStore, SaiManager, vendor SDK) | Read `references/sai-sdk-patterns.md` | +| FSDB or thrift_cow (subscriptions, COW nodes, PatchBuilder) | Read `references/thrift-cow-fsdb-patterns.md` | +| Platform services or config (sensor_service, fan_service, JSON configs) | Read `references/platform-config-patterns.md` | +| Tests (agent HW tests, multinode tests, naming, fixtures) | Read `references/testing-patterns.md` | +| Adding new patterns | Read `references/contributing.md` | diff --git a/fboss/skills/fboss-code-standards/references/agent-patterns.md b/fboss/skills/fboss-code-standards/references/agent-patterns.md new file mode 100644 index 0000000000000..01f010ec96fe9 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/agent-patterns.md @@ -0,0 +1,92 @@ +# Agent Architecture Patterns + +Patterns for code under `fboss/agent/`, including SwSwitch, HwSwitch, and state management. + +### 1. Mono/Multi-Switch Duality +**Check**: Any state change or control path must work in both mono mode (`FLAGS_multi_switch=false`, `agent/single/`) and multi-switch mode (`FLAGS_multi_switch=true`, `agent/mnpu/`) +**Why**: FBOSS supports both single-process and multi-process architectures. Code that only works in one mode causes production failures on the other deployment type. +**Confidence**: HIGH (always a bug if only one mode is handled) + +### 2. Warmboot State Preservation +**Check**: New fields added to SwitchState or its child nodes must be serialized and deserialized during warmboot. Verify the field appears in the warmboot state Thrift structure and is handled in `fromThrift()`/`toThrift()`. +**Why**: Missing serialization causes state loss during warmboot, leading to traffic disruption or config drift after agent restart. +**Confidence**: HIGH (always a bug if new state fields are not serialized) + +### 3. OperDelta Sync in Multi-Switch +**Check**: In multi-switch mode, HwSwitch state updates must go through OperDelta sync (`OperDeltaSyncer`), never by direct HwSwitch mutation from SwSwitch. +**Why**: Direct mutation bypasses the Thrift IPC boundary, causing state divergence between SwSwitch and HwSwitch processes. +**Confidence**: HIGH (always a bug in multi-switch mode) + +### 4. State Delta Ordering +**Check**: When applying state deltas, respect dependency order: VLANs before interfaces, interfaces before routes, ACLs before applying to interfaces. +**Why**: Out-of-order delta application causes transient failures or crashes when referenced objects don't exist yet. +**Confidence**: HIGH (ordering violations cause crashes) + +### 5. Lock Ordering in SwSwitch +**Check**: When acquiring multiple locks, follow the established lock hierarchy. Never hold `updateLock_` while acquiring `stateLock_`; `stateLock_` must be acquired first if both are needed. +**Why**: Lock ordering violations cause deadlocks that hang the agent. +**Confidence**: HIGH (always a bug) + +### 6. Non-Coalescing Updates for Behavioral Deltas +**Check**: State updates where the delta carries behavioral significance (e.g., deltaApplicationBehavior is set) must use non-coalescing update paths. Do not allow the state update framework to coalesce such deltas. +**Why**: Coalesced updates collapse intermediate states. A sequence of enable→disable→enable is different from just enable, but coalescing reduces it to the final state, losing behavioral intent. +**Confidence**: HIGH (always a bug if coalesced when delta behavior matters) + +### 7. Clean Up Fully Rolled-Out Feature Flags +**Check**: When a feature is fully rolled out across all platforms, remove the feature flag and the dead code path. Do not leave conditional checks for flags that always evaluate to true. +**Why**: Dead flag branches add complexity, confuse readers, and accumulate debt. If every platform has the feature, the flag serves no purpose. +**Confidence**: MEDIUM (requires confirming rollout is complete on all platforms) + +### 8. Early Return on Empty Deltas +**Check**: Delta processing functions must check for empty deltas at the start and return early. Do not process all state updates when only a specific delta changed. +**Why**: Processing empty deltas wastes CPU and can trigger unnecessary side effects (e.g., recomputing local switch IDs on every state update instead of only when DSF node map changes). +**Confidence**: HIGH (always add an early return for empty deltas) + +### 9. Don't Hardcode ASIC Types +**Check**: Do not check for specific ASIC types (e.g., `if asicType == J3`) when the information can be derived from config, DSF node map, or ASIC feature flags. Look up ASIC properties dynamically. +**Why**: Hardcoded ASIC checks break when new ASICs are added and create maintenance burden. Use feature-based gating instead. +**Confidence**: HIGH (always use feature/property lookups over hardcoded ASIC names) + +### 10. Retain Documentation-Value Per-ASIC Methods +**Check**: Keep per-ASIC limit methods or constants that document platform-specific capabilities (e.g., max neighbor entries per ASIC), even when a unified gflag-based limit is used at runtime. Do not delete them during cleanup unless the platform is fully deprecated. +**Why**: These methods serve as architecture documentation. Without them, future engineers must reverse-engineer per-platform limits from scattered config files or vendor docs. +**Confidence**: MEDIUM (requires judgment on whether the platform is still relevant) + +### 11. Use ODS Counters for Production Debugging +**Check**: Add ODS counters (not just logs) for events that need production debugging — packet drops, errors, resource exhaustion. Add follow-up configerator diffs to permit export of new counters to ODS. +**Why**: Logs are ephemeral and hard to aggregate. ODS counters provide exact counts, time-series trending, and alerting. Without counters, production packet drops require log scraping to debug. +**Confidence**: HIGH (always add ODS counters for drop/error events) + +### 12. Validate External Configuration Limits +**Check**: Validate externally configurable limits (thrift config, gflags, configerator) to prevent values that could destabilize the system. Reject unreasonable values (e.g., 1M pkts/ms rate limit) that would undo built-in protections like thrift overload guards. +**Why**: External users can configure arbitrary values. Without bounds checking, a misconfiguration can cause CPU exhaustion, memory exhaustion, or bypass overload protection. +**Confidence**: HIGH (always validate external config bounds) + +### 13. Populate Warmboot Cache Before Processing +**Check**: During warmboot, populate all caches and ID maps from the warmboot state before processing any new state changes. Do not lazily discover IDs as routes or entries are added. +**Why**: Lazy discovery causes ID collisions — new allocations may reuse IDs that exist in the warmboot cache but haven't been discovered yet, corrupting forwarding state. +**Confidence**: HIGH (always populate warmboot caches before processing new state) + +### 14. Delta Processing Order +**Check**: Process deltas in consistent order: removals → changes → additions (per interface, per object type). Document the ordering in comments. +**Why**: Wrong order creates transient invalid states (e.g., two neighbors with same MAC during update). Documented order prevents future refactors from breaking it. +**Source**: D96253394 reviewer feedback +**Confidence**: HIGH + +### 15. Extract Validation into Dedicated Classes +**Check**: Complex validation logic should be extracted into dedicated validator classes with clear `isValidDelta()` and `stateChanged()` APIs, not embedded in state update handlers. +**Why**: Single responsibility, easier to unit test in isolation, easier to maintain ordering semantics. +**Source**: D96253394 reviewer feedback +**Confidence**: MEDIUM + +### 16. Gate New Validations Behind Feature Flags +**Check**: New validation/enforcement logic must be gated behind a FLAGS (e.g., `FLAGS_enforce_single_nbr_mac_per_intf`) for gradual rollout. +**Why**: Allows disabling if issues arise in production, enables A/B comparison, prevents big-bang rollout risk. +**Source**: D96253394 reviewer feedback +**Confidence**: HIGH + +### 17. Rebuild Validator State on Rejection +**Check**: When `updateRejected()` is called after partial validation, rebuild validator state from scratch via `stateChanged(empty → oldState)` rather than attempting incremental undo. +**Why**: Incremental undo is error-prone. Clean rebuild ensures validator state stays in sync even when validation fails partway through. +**Source**: D96253394 reviewer feedback +**Confidence**: MEDIUM diff --git a/fboss/skills/fboss-code-standards/references/contributing.md b/fboss/skills/fboss-code-standards/references/contributing.md new file mode 100644 index 0000000000000..ad2329bc15090 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/contributing.md @@ -0,0 +1,38 @@ +# Contributing Patterns + +## Where to Add Your Pattern + +``` +Is your pattern FBOSS-specific? ++-- YES -> Which area? +| +-- Agent/SwSwitch/HwSwitch -> agent-patterns.md +| +-- SAI/SDK -> sai-sdk-patterns.md +| +-- FSDB/thrift_cow -> thrift-cow-fsdb-patterns.md +| +-- Platform/Config -> platform-config-patterns.md +| +-- Testing -> testing-patterns.md +| +-- Cross-cutting design judgment (not a single Check/Why rule) +| -> fboss-review/references/fboss-specific-reviewers.md (Architect reviewer) ++-- NO -> Is it a coding convention the team wants enforced? + +-- YES -> general-patterns.md + +-- NO -> Probably doesn't belong here +``` + +## Pattern Template + +Every pattern must follow this format: + +```markdown +### N. Pattern Name +**Check**: What specifically to look for +**Why**: Consequence of violation +**Confidence**: HIGH (always a bug) or MEDIUM (context-dependent) +``` + +## Guidelines + +- **Evidence-based only**: Patterns must come from real bugs, SEVs, or repeated review feedback. No hypothetical issues. +- **False positive test**: Before adding, check 3-5 recent diffs to confirm the pattern would not produce false positives. +- **HIGH confidence**: Use only when violation is always a bug, regardless of context. +- **MEDIUM confidence**: Use when violation is usually a problem but some exceptions exist. +- **No style preferences**: Style-only preferences (brace placement, blank lines) do not belong here. +- **Diff prefix**: When submitting pattern changes, use `[fboss][claude]` prefix in the diff title. diff --git a/fboss/skills/fboss-code-standards/references/general-patterns.md b/fboss/skills/fboss-code-standards/references/general-patterns.md new file mode 100644 index 0000000000000..0629c1eb8d107 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/general-patterns.md @@ -0,0 +1,117 @@ +# General Coding Patterns + +Team-wide coding conventions for FBOSS. These apply across all FBOSS subsystems. + +### 1. Switch-Case Over Long If-Else Chains +**Check**: When branching on an enum or discrete value with 3+ cases, use switch-case instead of if-else +**Why**: Switch-case gives compiler warnings for unhandled enum values, preventing silent bugs when new values are added +**Confidence**: MEDIUM (context-dependent, some cases warrant if-else) + +### 2. XLOG Over cout/printf +**Check**: Use `XLOG()` macros (XLOG_INFO, XLOG_ERR, XLOG_DBG) instead of `std::cout`, `printf`, or `LOG()` +**Why**: XLOG integrates with folly logging, supports structured logging, and can be filtered by category +**Confidence**: HIGH (always use XLOG in FBOSS code) + +### 3. Consistent Error Reporting +**Check**: Use `FbossError` or `FbossHwError` for FBOSS-specific exceptions, not generic `std::runtime_error` +**Why**: FBOSS exception types carry additional context (switch index, error code) and integrate with error handling infrastructure +**Confidence**: HIGH (always use FBOSS exception types) + +### 4. Const Correctness +**Check**: Mark methods, parameters, and variables `const` when they do not modify state +**Why**: Prevents accidental mutation, enables compiler optimizations, documents intent +**Confidence**: MEDIUM (follow local conventions in the file) + +### 5. Follow Local Naming Conventions +**Check**: Match the naming style of surrounding code (camelCase for methods, camelCase for variables, UPPER_SNAKE for constants) +**Why**: Consistency within a file/directory is more important than a global rule +**Confidence**: HIGH (always match local style) + +### 6. No Default Parameters on Private Functions +**Check**: Private/internal functions where all callers always pass the argument explicitly should not have a default value. If every caller overrides the default, the default is misleading. +**Why**: Defaults on private functions hide intent. Readers assume the default is common, but if all callers override it, the "default" is never used. +**Confidence**: MEDIUM (context-dependent; defaults on public APIs are fine) + +### 7. Pass All State Explicitly +**Check**: When configuring state with multiple directions or modes (e.g., RX/TX, enable/disable), pass all flags explicitly. Do not hardcode assumptions like "only RX is enabled." +**Why**: Hardcoded assumptions break when the other direction is later enabled. Use standard SAI attributes or explicit parameters for all directions. +**Confidence**: MEDIUM (depends on whether the assumption is documented) + +### 8. Verify Before Deleting +**Check**: When deleting files (especially via codemods), verify they are actually unused. Provide evidence: `buck targets`, dependency checks, or grep results. +**Why**: Automated codemods can incorrectly identify files as unused. Blindly trusting a codemod and deleting active files causes build breaks. +**Confidence**: HIGH (always provide evidence before approving file deletions) + +### 9. CHECK Over assert for Runtime Invariants +**Check**: Use `CHECK` / `CHECK_EQ` (folly) for invariants that must hold in production. Do not use `assert()` for runtime safety checks — assert only triggers in dbg/dev builds and is compiled out in opt builds. +**Why**: Assertions in production builds are no-ops. Critical invariant violations go undetected, leading to silent corruption or crashes later. +**Confidence**: HIGH (always use CHECK for production safety checks) + +### 10. Prefer emplace/insert Over operator[] for Maps +**Check**: When inserting into a map, use `emplace()` or `insert()` instead of `operator[]`. The latter default-constructs the value then assigns, wasting a construction. +**Why**: `operator[]` creates a default object and then copy/move-assigns. `emplace` constructs in place. For complex value types, this avoids unnecessary work. `insert` also returns whether the insertion happened, useful for logging. +**Confidence**: MEDIUM (performance impact depends on value type complexity) + +### 11. Gate Hardware-Specific Constraints with ASIC Features +**Check**: Hardware-specific constraints (e.g., single MAC per port RIF) must be gated by ASIC/ProductionFeature checks, not assumed globally. Different ASICs (J2 vs J3AI vs Spectrum4) have different capabilities. +**Why**: Applying constraints globally either breaks platforms that support the feature or leaves unsupported platforms unprotected. +**Confidence**: HIGH (always gate with ASIC feature checks) + +### 12. Concise Method Naming +**Check**: Method names should not redundantly include the class or module name. For example, a method in `ResourceAccountant` should be `getMaxNeighborTableSize()` not `getMaxNeighborTableSizeByResourceAccountant()`. +**Why**: Redundant context makes names unwieldy and harder to read. The class name already provides context at the call site. +**Confidence**: HIGH (always trim redundant prefixes/suffixes) + +### 13. Explicit Operator Precedence +**Check**: Parenthesize compound boolean and comparison expressions for clarity, even when operator precedence produces the correct result. E.g., `isValidUpdate &= (count <= getMaxSize());` not `isValidUpdate &= count <= getMaxSize();`. +**Why**: Relying on implicit precedence in compound expressions forces the reader to recall precedence rules. Parentheses make intent explicit and prevent subtle bugs during future edits. +**Confidence**: HIGH (always parenthesize compound expressions) + +### 14. Inline Trivial Comparisons +**Check**: Do not wrap one-line comparisons in helper methods. Inline the expression at the call site when it is equally readable. E.g., `count <= getMaxSize()` is more readable than `checkResource(count)` when the helper does nothing but that comparison. +**Why**: Unnecessary helper methods add indirection without abstraction benefit. The reader must jump to the helper to understand a trivial check. +**Confidence**: MEDIUM (judgment call — helpers are fine when they encapsulate non-trivial logic) + +### 15. Fix Lint Before Landing +**Check**: All lint errors and warnings must be resolved before landing a diff. Run `arc f` (format) and `arc lint` before submitting. Do not land with known lint issues. +**Why**: Accumulated lint creates broken-window syndrome. Each unlinted diff makes it harder for the next author to distinguish their lint from pre-existing issues. +**Confidence**: HIGH (always fix lint before landing) + +### 16. Rate-Limit Verbose Logging +**Check**: Rate-limit per-event logging (e.g., per-packet, per-request). Use `XLOG_EVERY_MS` or similar throttling instead of logging at line rate. Do not add an XLOG inside a tight loop or per-packet handler without rate limiting. +**Why**: Per-packet logging causes CPU hog and can destabilize the agent under load. One log line per 5 seconds is sufficient for debugging without impacting performance. +**Confidence**: HIGH (always rate-limit hot-path logging) + +### 17. Perspective-Aware Naming +**Check**: When naming involves direction (Rx/Tx, send/receive, ingress/egress), make the reference frame explicit. Use `fromX`/`toX` naming (e.g., `fromSidebandAgent`, `toSidebandAgent`) instead of ambiguous `Rx`/`Tx` which can be interpreted from either the switch or the external agent perspective. +**Why**: Rx and Tx are ambiguous — readers default to the switch perspective, but the code may mean the opposite. Explicit directional naming eliminates confusion. +**Confidence**: MEDIUM (judgment call — Rx/Tx is fine when the perspective is unambiguous) + +### 18. Update OSS Build Files Alongside Internal Changes +**Check**: When adding or moving source files, update OSS CMake files alongside BUCK targets. Also update the scope resolver when adding new state types or getters to SwitchState. +**Why**: OSS builds use CMake, not Buck. Forgetting to update CMake files breaks the open-source build, which is caught only after landing and delays vendor collaboration. +**Confidence**: HIGH (always update OSS CMake when modifying BUCK targets) + +### 19. Empty Vector Over optional\ +**Check**: Use `vector` (empty = no values) instead of `optional>` which is ambiguous +**Why**: `optional` conflates "no answer" with "empty answer". Callers can check `.empty()`. +**Source**: D59879844 reviewer feedback +**Confidence**: HIGH + +### 20. Map-Based Config Over If-Chains +**Check**: Replace platform-specific if-else chains with `std::unordered_map` lookups +**Why**: Adding a new platform requires only a map entry, not a code change. Reduces error-prone branching. +**Source**: D93164680 reviewer feedback +**Confidence**: MEDIUM + +### 21. Include Both ID and Name in Logs +**Check**: When logging switchId, portId, or similar identifiers, include both numeric ID and human-readable name +**Why**: Names are better mnemonics for debugging. Numeric-only logs require cross-referencing. +**Source**: D61140726 reviewer feedback +**Confidence**: MEDIUM + +### 22. No Hardcoded Magic Numbers When Enums/Constants Exist +**Check**: Never use hardcoded numeric literals when a named enum, macro, or constant exists for that value +**Why**: Magic numbers hide intent and break when enum values change. Existing usage of the same magic number is not a valid precedent. +**Source**: D100107691 reviewer feedback +**Confidence**: HIGH diff --git a/fboss/skills/fboss-code-standards/references/platform-config-patterns.md b/fboss/skills/fboss-code-standards/references/platform-config-patterns.md new file mode 100644 index 0000000000000..f664ebdd0ac31 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/platform-config-patterns.md @@ -0,0 +1,28 @@ +# Platform & Config Patterns + +Patterns for code under `fboss/platform/`, platform services (sensor_service, fan_service), and configuration files. + +### 1. JSON Config Schema Sync +**Check**: When modifying platform config, update both the JSON schema and the corresponding Thrift structure. Changes to one without the other cause deserialization failures. +**Why**: JSON configs are deserialized into Thrift structs. Schema mismatch causes config load failures at agent startup. +**Confidence**: HIGH (always a bug) + +### 2. Per-Platform Testing +**Check**: Platform-specific code changes must include test coverage for affected platforms. Check that platform-specific test targets exist and are updated. +**Why**: Platform code often has subtle differences across hardware (Memory/memory, memory/memory, memory). Untested platforms may break silently. +**Confidence**: MEDIUM (depends on scope of change) + +### 3. Startup Order Dependencies +**Check**: When adding new service dependencies during platform initialization, declare them explicitly in the startup sequence. Check for circular dependencies. +**Why**: Undeclared startup dependencies cause race conditions where services access uninitialized components. +**Confidence**: HIGH (causes crashes or undefined behavior) + +### 4. Config Compile-Time Generation +**Check**: After modifying config templates or Thrift config structures, regenerate compiled configs using the appropriate build targets. +**Why**: Stale compiled configs cause runtime mismatches between expected and actual configuration. +**Confidence**: HIGH (always required after config schema changes) + +### 5. OSS/Internal Split +**Check**: Changes to files that are open-sourced must also update the corresponding CMake build files. Check if the file is under an OSS-exported directory. +**Why**: Missing CMake updates break the open-source build, which is tested separately and may not fail until the OSS CI runs. +**Confidence**: MEDIUM (only applies to OSS-exported code) diff --git a/fboss/skills/fboss-code-standards/references/sai-sdk-patterns.md b/fboss/skills/fboss-code-standards/references/sai-sdk-patterns.md new file mode 100644 index 0000000000000..91934885ef418 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/sai-sdk-patterns.md @@ -0,0 +1,48 @@ +# SAI/SDK Layer Patterns + +Patterns for code under `fboss/agent/hw/sai/`, `fboss/lib/phy/`, and vendor SDK integrations. + +### 1. SaiApiTable Registration +**Check**: New SAI attributes must be registered in the corresponding `SaiApiTable` entry. Check that the attribute is added to the attribute list in the relevant `SaiApi` (e.g., `SaiPortApi`, `SaiRouteApi`). +**Why**: Unregistered attributes are silently ignored, causing features to appear configured but not actually programmed in hardware. +**Confidence**: HIGH (always a bug) + +### 2. SaiStore Consistency +**Check**: SAI objects must be tracked in `SaiStore`. Creating a SAI object without adding it to the store, or removing it without cleaning the store entry, creates orphans. +**Why**: Orphaned SAI objects leak hardware resources and cause inconsistency during warmboot reconstruction. +**Confidence**: HIGH (always a bug) + +### 3. SAI Attribute Mapping +**Check**: When adding new Thrift fields that map to hardware configuration, ensure the corresponding SAI attribute mapping exists in the `SaiManager` (e.g., `SaiPortManager`, `SaiRouteManager`). +**Why**: Missing mappings mean the Thrift config is accepted but never programmed, causing silent configuration failures. +**Confidence**: HIGH (always a bug if the field is meant to be programmed) + +### 4. SDK Version Guards +**Check**: Vendor-specific SAI behavior or attributes must be guarded by SDK version checks (e.g., `#if defined(MEMORY_SAI_VERSION_X_Y_Z)`). +**Why**: Using attributes not supported by older SDK versions causes compilation failures or runtime crashes on platforms with older SDKs. +**Confidence**: HIGH (always required for vendor-specific features) + +### 5. SAI Test Coverage +**Check**: New SAI features need both `SaiManagerTest` (unit test for manager logic) and `AgentHwTest` (integration test on real/simulated hardware). +**Why**: Manager tests alone don't verify hardware interaction; HW tests alone don't verify state management logic. +**Confidence**: MEDIUM (test coverage is important but some features may only need one type) + +### 6. Avoid Static/Thread-Local Buffers in SaiAttribute _fill Calls +**Check**: `_fill` functions that convert between SAI list types and C++ value types must not use `static` or `thread_local` intermediate buffers. Instead, choose a C++ value type whose memory layout matches the SAI type so data can be pointed to directly. +**Why**: Static/thread-local buffers are shared across calls. When multiple objects are serialized (e.g., during warmboot adapter host key serialization), the buffer is overwritten on each call, so only the last value survives. This caused a real bug (D98397282) where SRV6 segment lists were corrupted — only the final segment list value appeared in warmboot state. The fix was to change `SaiSrv6SidListTraits::SegmentList` from `std::vector` to `std::vector>`, which has the same memory layout as `sai_ip6_t` and can be pointed to directly without any conversion buffer. Type conversions (e.g., from `folly::IPAddressV6`) should happen at the caller boundary using helpers like `toSaiIp6List()` in `AddressUtil`. +**Confidence**: HIGH (caused a warmboot data corruption bug) + +### 7. Keep Locking Contained in SaiSwitch +**Check**: Locking logic (mutexes, lock guards, lock policies) must not leak into individual `SaiManager` classes (e.g., `SaiNextHopManager`, `SaiFdbManager`). Concurrency control should remain constrained to `SaiSwitch` — the rest of the SAI switch layer code should not need to worry about concurrency. +**Why**: The SAI switch layer is designed so that `SaiSwitch` owns the `saiSwitchMutex_` and manages all locking via `FineGrainedLockPolicy`. If individual managers start accepting lock guards or lock policies in their APIs, locking responsibilities become distributed and hard to reason about, increasing the risk of deadlocks and lock-ordering violations. This was flagged in D92985792 where a change attempted to pass lock policies into manager `listManagedObjects` calls — the correct approach is to acquire and release locks within `SaiSwitch` methods and call manager APIs without exposing locking to them. +**Confidence**: HIGH (architectural invariant) + +### 8. Never Silently Ignore SwitchState Programming Requests +**Check**: SAI Manager code must not use `if` conditions that silently skip programming a field from `SwitchState`. If the control plane asks the SAI layer to program something (ACL qualifier, next hop attribute, port config, etc.), the SAI layer must either program it or throw an error — never silently ignore it. +**Why**: Silent skips mask bugs. When a SwitchState delta requests a configuration change and the SAI layer silently drops it because a field is unset or an ASIC capability is missing, the system reports success while the hardware is misconfigured. This is especially dangerous with vendor-contributed code that may implement partial support — the feature appears to work in review but silently fails on certain inputs. The correct behavior is to throw `FbossError` or `SaiApiError` when a requested configuration cannot be programmed, so the failure is visible. Do not accept PRs from vendors that add `if (!field.has_value()) { return; }` guards that skip programming — these hide incomplete implementations. (Flagged in D104291402) +**Confidence**: HIGH (silent config drops are always a bug) + +### 9. Explicit Cancellation for Thrift/Stream Clients +**Check**: Reconnecting Thrift and stream clients must use explicit cancellation signals, not treat all exceptions as cancellation. Disconnections during normal operation should trigger reconnection, not shutdown. +**Why**: Treating any exception as cancellation causes clients to permanently stop when they should reconnect. Only explicit cancellation (e.g., shutdown signal) should terminate the client. +**Confidence**: HIGH (always distinguish reconnectable errors from shutdown signals) diff --git a/fboss/skills/fboss-code-standards/references/testing-patterns.md b/fboss/skills/fboss-code-standards/references/testing-patterns.md new file mode 100644 index 0000000000000..2dda292ec0be0 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/testing-patterns.md @@ -0,0 +1,93 @@ +# Testing Patterns + +Patterns for FBOSS test code, including agent HW tests, multinode tests, and test infrastructure. + +### 1. Test Naming Conventions +**Check**: Agent hardware tests follow the `AgentHwTest` naming pattern (e.g., `AgentHwAclTest`, `AgentHwRouteTest`). Multi-switch variants use `AgentHwMultiSwitchTest`. +**Why**: Consistent naming enables test discovery, filtering, and CI configuration. Non-standard names may be excluded from test suites. +**Confidence**: HIGH (always follow the naming convention) + +### 2. SAI/Non-SAI Test Variants +**Check**: When adding agent HW tests, include both SAI and non-SAI (if applicable) test variants unless the feature is SAI-only. +**Why**: FBOSS supports multiple hardware abstraction layers. Tests that only cover SAI miss regressions on other platforms. +**Confidence**: MEDIUM (depends on feature scope) + +### 3. Test Fixture Hierarchy +**Check**: Use existing test fixtures (`AgentHwTest`, `AgentTest`, `HwTest`) rather than creating new base classes. Extend fixtures via composition or inheritance from the established hierarchy. +**Why**: Custom fixtures often miss shared setup/teardown logic, causing flaky tests or incorrect test environments. +**Confidence**: MEDIUM (new fixtures are sometimes warranted) + +### 4. Test Plan Format +**Check**: Diff test plans should reference actual test execution results. Include paste numbers or links to test runs. +**Why**: "Tested locally" without evidence doesn't demonstrate actual test coverage. +**Confidence**: MEDIUM (enforcement varies by reviewer) + +### 5. NSDB Impact Testing +**Check**: Changes to core FSDB, thrift_cow, or state management code must include NSDB (Network State Database) test runs in the test plan. +**Why**: NSDB is a critical consumer of FSDB. Core changes that break NSDB cause production network state visibility outages. +**Confidence**: HIGH (always required for core FSDB changes) + +### 6. No GTEST_SKIP in Agent HW Tests +**Check**: Agent hardware tests must NOT use `GTEST_SKIP()`. Use ProductionFeatures filtering via netcastle to exclude unsupported tests per platform. +**Why**: GTEST_SKIP wastes test infra resources (ASIC init runs before the skip). The pattern also gets copied to new tests. Use production feature filtering so netcastle selects tests per platform. +**Confidence**: HIGH (always wrong in agent HW tests) + +### 7. DSF Test Counters +**Check**: DSF (Distributed Switching Fabric) tests must check reassembly errors, not `outDiscards`/`inDiscards`. Discard counters are not supported on fabric ports. +**Why**: Unsupported counters stay zero, producing false passes. Check reassembly errors and work backwards to identify affected switches/ports. +**Confidence**: HIGH (discard counters always wrong on fabric ports) + +### 8. Targeted Unit Tests for New Modules +**Check**: New classes or modules must have dedicated unit tests, not just integration or HW tests. When reviewing, ask: "Do we have targeted tests for [NewClass]?" +**Why**: Integration tests may cover the new code incidentally but don't verify it in isolation. When the integration test fails, it's unclear whether the new module or its consumers are at fault. +**Confidence**: HIGH (always require targeted tests for new modules) + +### 9. Deep Comparison Over Size Checks +**Check**: Verify state correctness with deep comparison (compare full objects or field-by-field), not just collection size checks. `EXPECT_EQ(map.size(), expected)` does not prove content equality. +**Why**: Size equality can mask content corruption — an insert paired with an accidental delete preserves size but changes data. Deep comparison catches these bugs. +**Confidence**: HIGH (always compare content, not just size) + +### 10. Verify Negative Cases Explicitly +**Check**: Tests must explicitly verify negative/absence conditions, not just assert no exception is thrown. Check that values are absent, counters are zero, or states are not set. Ask: "What does this check? Just no throw?" +**Why**: A test that only checks for no-throw provides no signal — it passes even if the code does nothing. Explicit negative assertions (e.g., `EXPECT_FALSE(map.count(key))`) prove the invariant holds. +**Confidence**: HIGH (always verify both positive and negative conditions) + +### 11. Test Overflow and Error Paths +**Check**: Tests must cover resource exhaustion and overflow conditions, not just the happy path. Verify the system handles table-full, buffer-full, and limit-exceeded gracefully instead of crashing. +**Why**: Happy-path-only tests miss the most dangerous production failures. Overflow handling code is rarely exercised and often has latent bugs. +**Confidence**: HIGH (always test error paths for resource-bounded operations) + +### 12. Optimize Test Execution Time +**Check**: Minimize test execution time by using smallest valid packet sizes, tightest loops, and avoiding unnecessary sleeps or waits. If a test takes 15 seconds due to large packets, parameterize the size and use 64-byte packets when full-size is not required. +**Why**: Long-running tests waste CI hardware resources and slow developer iteration. Each second saved is multiplied across thousands of test runs. +**Confidence**: MEDIUM (balance speed against test fidelity) + +### 13. Test Transient Invalid States +**Check**: Test that validation catches invalid intermediate states during delta processing, not just final state correctness. E.g., test that N1(MAC_A→MAC_B) and N2(MAC_A→MAC_B) doesn't create a transient duplicate. +**Why**: State machine bugs only manifest during transitions. Final-state-only tests miss ordering-dependent failures. +**Source**: D96253394 reviewer feedback +**Confidence**: HIGH + +### 14. Consolidate Test Helpers into Base Classes +**Check**: When similar test helper methods exist across multiple test files, consolidate into the shared base class (e.g., `AgentArsBase`) rather than duplicating. +**Why**: Duplicated helpers diverge over time, causing inconsistent test behavior and maintenance burden. +**Source**: D91603952, D91606153 reviewer feedback +**Confidence**: HIGH + +### 15. Prefer Unit Tests Over Integration Tests for Validation Logic +**Check**: Pure validation logic (e.g., neighbor MAC uniqueness checking) should have dedicated unit tests, not rely solely on agent_hw_test integration tests. +**Why**: Unit tests are faster, more targeted, and easier to debug. Integration tests may incidentally cover validation but don't isolate it. +**Source**: D93297587, D96253394 reviewer feedback +**Confidence**: MEDIUM + +### 16. Use initialConfig() Factory Pattern for Test Setup +**Check**: Tests should define configuration via `initialConfig()` override rather than ad-hoc setup lambdas or inline config construction. +**Why**: Decouples config setup from test startup. Config applied only on cold boot, reused on warmboot. Standard pattern across agent HW tests. +**Source**: D42425625 reviewer feedback +**Confidence**: MEDIUM + +### 17. Do Not Modify the Setup of Existing Tests +**Check**: Never change the `setup` lambda or `initialConfig` of an existing agent HW test. If new functionality needs different setup, create a new test instead. +**Why**: Existing tests run in trunk-to-prod (and prod-to-trunk) warmboot roundtrip CI. When you modify a test's setup, the trunk version programs different HW state than the prod version. On warmboot from trunk back to prod, the prod binary encounters unexpected programmed state, causing test failures that block releases. The setup defines the HW contract between cold boot and warmboot — changing it breaks that contract across commits. +**Source**: D102269880 — modified setup of an existing RouteTest to program v4 routes; broke warmboot roundtrip because prod binary didn't expect v4 route state programmed by trunk. +**Confidence**: HIGH (always create a new test instead of modifying existing setup) diff --git a/fboss/skills/fboss-code-standards/references/thrift-cow-fsdb-patterns.md b/fboss/skills/fboss-code-standards/references/thrift-cow-fsdb-patterns.md new file mode 100644 index 0000000000000..163e5f4365f19 --- /dev/null +++ b/fboss/skills/fboss-code-standards/references/thrift-cow-fsdb-patterns.md @@ -0,0 +1,28 @@ +# FSDB & thrift_cow Patterns + +Patterns for code under `fboss/fsdb/`, `fboss/thrift_cow/`, and FSDB client usage. + +### 1. State/Stats Duality +**Check**: FSDB operations must handle both the state tree and the stats tree. When adding a new subscription, publisher, or path handler, ensure both `OperState` and `OperStats` are covered. +**Why**: Handling only one tree causes missing telemetry (stats) or missing config reconciliation (state). +**Confidence**: HIGH (always a bug if only one tree is handled) + +### 2. extern template for ThriftStructNode +**Check**: New `ThriftStructNode` instantiations must have corresponding `extern template` declarations in the header and explicit instantiations in a `.cpp` file. +**Why**: Without `extern template`, every translation unit that includes the header instantiates the template, causing O(n) build time regression. +**Confidence**: HIGH (always required for new ThriftStructNode types) + +### 3. Subscription Type: Prefer Patch +**Check**: New FSDB subscriptions should use `PatchSubscription` (preferred), not `DeltaSubscription` (deprecated). +**Why**: Delta subscriptions are being deprecated. Patch subscriptions are more efficient and support partial updates. +**Confidence**: MEDIUM (existing delta subscriptions don't need immediate migration, but new code should use patch) + +### 4. COW Node Modification +**Check**: To modify a COW (copy-on-write) node, always use `modify()` to get a mutable copy. Never cast away const or mutate a shared node directly. +**Why**: Mutating a shared node corrupts other references to the same node, causing data races and state corruption. +**Confidence**: HIGH (always a bug) + +### 5. Publisher Queue Bounds +**Check**: FSDB publishers must respect queue size bounds. When publishing updates, check that the queue is not full before enqueuing. +**Why**: Unbounded publishing can cause memory exhaustion in the FSDB server process. +**Confidence**: MEDIUM (depends on publish frequency and data size) diff --git a/fboss/skills/fboss-review/SKILL.md b/fboss/skills/fboss-review/SKILL.md new file mode 100644 index 0000000000000..9a21ed6877437 --- /dev/null +++ b/fboss/skills/fboss-review/SKILL.md @@ -0,0 +1,86 @@ +--- +name: fboss-review +description: Comprehensive FBOSS code review with 11 parallel reviewers (5 generic + 6 FBOSS-specific). Covers reliability, engineering, code quality, silent failures, agent architecture, SAI/SDK, FSDB/thrift_cow, platform/config, testing, and cross-cutting design. Self-contained alternative to /review-diff for FBOSS diffs. Findings shown to user only, never auto-posted. +user-invocable: true +--- + +# FBOSS Review + +## Overview + +Multi-reviewer code review for FBOSS diffs. Subsumes general review (`/review-diff` functionality) plus FBOSS-specific domain expertise. + +## Scope + +Currently `fboss/` only. TODO: extend to `configerator/source/neteng/fboss`, `neteng/netcastle`, `neteng/fboss`. + +## IMPORTANT: No Auto-Posting + +All findings are shown to the user only. Never post to Phabricator automatically. + +## Review Workflow + +### Step 1: Identify Changes + +- Run `sl status` and `sl diff` to get changed files +- If user specifies a diff number, fetch diff details via `mcp__plugin_meta_mux__get_phabricator_diff_details` + +### Step 2: Determine Applicable FBOSS Reviewers + +- Files in `agent/`, `SwSwitch`, `HwSwitch` -> Agent Reviewer (#6) +- Files in `fsdb/`, `thrift_cow/` -> FSDB/thrift_cow Reviewer (#7) +- Files in `platform/`, config, sensor/fan -> Platform Reviewer (#8) +- Files in `sai/`, `hw/sai/`, SDK -> SAI/SDK Reviewer (#9) +- Test files or coverage-affecting changes -> Testing Reviewer (#10) +- Changes spanning SwSwitch + HwSwitch, new SwitchState fields, new counters, BUCK target changes, new client infrastructure, or 3+ directory scope -> Architect Reviewer (#11) + +### Step 3: Dispatch Reviewers in Parallel + +Use Agent tool (sonnet model) to dispatch all applicable reviewers simultaneously: + +- **Generic reviewers 1-5**: ALWAYS dispatch. Use personas from `references/generic-reviewers.md`. +- **FBOSS reviewers 6-10**: Dispatch ONLY if their area is touched. +- Each reviewer receives: full diff content + their persona instructions. +- FBOSS reviewers additionally load patterns from `../fboss-code-standards/references/-patterns.md`. + +| # | Reviewer | Focus | Patterns File | +|---|----------|-------|---------------| +| 1 | Reliability | Error handling, RAII, logging, timeout/retry, graceful degradation | - | +| 2 | Engineering / Performance | Algorithmic complexity, unnecessary copies, lock contention, modern C++ | - | +| 3 | Code Quality | Readability, modularity, duplication, API design | `../fboss-code-standards/references/general-patterns.md` | +| 4 | Summary & Test Plan | Title accuracy, summary completeness, test plan adequacy, diff coherence | - | +| 5 | Silent Failure Finder | Logic errors, lossy conversions, race conditions, silent data loss | - | +| 6 | Agent Architecture | Mono/multi-switch, state management, warmboot, HwSwitch/SwSwitch boundary | `../fboss-code-standards/references/agent-patterns.md` | +| 7 | FSDB & thrift_cow | State/Stats duality, COW node safety, subscriptions, build time | `../fboss-code-standards/references/thrift-cow-fsdb-patterns.md` | +| 8 | Platform & Config | Platform services, JSON configs, startup order, OSS sync | `../fboss-code-standards/references/platform-config-patterns.md` | +| 9 | SAI/SDK Integration | SAI API usage, object lifecycle, vendor SDK compat, attributes | `../fboss-code-standards/references/sai-sdk-patterns.md` | +| 10 | Testing Standards | Coverage, naming, fixtures, NSDB impact, HW test patterns | `../fboss-code-standards/references/testing-patterns.md` | +| 11 | FBOSS Architect | Cross-cutting design: layering, state derivability, abstraction, code org, infra reuse | `references/fboss-specific-reviewers.md` | + +### Step 4: Verification + +Dispatch a single verifier agent (loads `references/verifier.md`) to: +- Deduplicate findings across reviewers +- Calibrate confidence scores +- Filter false positives +- Threshold: only report findings with confidence >= 0.7 + +### Step 5: Present Findings + +Output a structured report: + +``` +| File:Line | Reviewer | Severity | Issue | Confidence | +|-----------|----------|----------|-------|------------| +``` + +Never post findings to Phabricator. + +## When to Load References + +| Topic | Action | +|-------|--------| +| Generic reviewer personas | Read `references/generic-reviewers.md` | +| Verification/dedup logic | Read `references/verifier.md` | +| FBOSS-specific reviewer personas | Read `references/fboss-specific-reviewers.md` | +| Adding/modifying reviewers | Read `references/contributing.md` | diff --git a/fboss/skills/fboss-review/references/contributing.md b/fboss/skills/fboss-review/references/contributing.md new file mode 100644 index 0000000000000..dd980ca12e825 --- /dev/null +++ b/fboss/skills/fboss-review/references/contributing.md @@ -0,0 +1,32 @@ +# Contributing to FBOSS Review + +## Adding a New Reviewer + +1. Decide if the reviewer is generic (applies to all code) or FBOSS-specific (applies to FBOSS domain). +2. Generic reviewers: add persona to `generic-reviewers.md` following the existing format. +3. FBOSS-specific reviewers: add patterns to the appropriate file in `../fboss-code-standards/references/`. The reviewer dispatching logic in SKILL.md determines which pattern file each reviewer loads. + +## Modifying Reviewer Personas + +- Keep focus areas specific and non-overlapping with other reviewers. +- Include severity guidance so the reviewer calibrates findings consistently. +- Test changes by running `/fboss-review` on 2-3 recent diffs and checking for false positives. + +## Adjusting Verification + +- Modify `verifier.md` to change confidence calibration rules or the reporting threshold. +- The default threshold of 0.7 balances signal vs noise. Lower only if reviewers consistently under-rate real issues. + +## Pattern Files vs Reviewer Personas + +- **Pattern files** (in `fboss-code-standards/references/`): specific, checkable rules. Team-extensible. +- **Reviewer personas** (in `generic-reviewers.md`): broad focus areas and severity guidance. Stable, rarely edited. +- New domain-specific rules go in pattern files. New review dimensions go as reviewer personas. + +## Adding Architectural Concerns + +For cross-cutting design concerns that don't fit a single pattern file: +1. Add the principle to the Architect reviewer (#11) persona in `fboss-specific-reviewers.md`. +2. Include: principle name, what to look for, example from a real diff with quote, and severity. +3. Keep principles at MEDIUM severity — architectural feedback suggests improvements, not bugs. +4. Test by running `/fboss-review` on 2-3 diffs where the concern would apply. diff --git a/fboss/skills/fboss-review/references/fboss-specific-reviewers.md b/fboss/skills/fboss-review/references/fboss-specific-reviewers.md new file mode 100644 index 0000000000000..f939d74a5fb6c --- /dev/null +++ b/fboss/skills/fboss-review/references/fboss-specific-reviewers.md @@ -0,0 +1,117 @@ +# FBOSS-Specific Reviewer Personas + +## Reviewers 6-10 (Pattern-Based) + +These reviewers load their respective pattern files from `../fboss-code-standards/references/`. + +- **#6 Agent Architecture**: agent-patterns.md — mono/multi-switch, warmboot, OperDelta, delta ordering, locks, empty delta returns, ASIC type hardcoding +- **#7 FSDB & thrift_cow**: thrift-cow-fsdb-patterns.md — state/stats duality, extern template, COW safety +- **#8 Platform & Config**: platform-config-patterns.md — JSON/Thrift sync, startup order, OSS split +- **#9 SAI/SDK Integration**: sai-sdk-patterns.md — SaiApiTable, SaiStore, attribute mapping, SDK guards, client cancellation +- **#10 Testing Standards**: testing-patterns.md — naming, fixtures, NSDB, GTEST_SKIP, DSF counters + +## Reviewer 11: FBOSS Architect + +Focus: Cross-cutting design judgment on layering, abstraction boundaries, state management, +code organization, and system-level efficiency. This reviewer looks beyond individual patterns +to assess whether the overall design direction is sound. + +This reviewer does NOT load a pattern file. It applies architectural reasoning informed by +these principles: + +### A. Prefer SwSwitch Layer for HW-Agnostic Logic +When validation or monitoring logic does not require ASIC calls, implement it at the SwSwitch +layer rather than in a HwSwitch manager (e.g., SaiRouteManager). +- **Benefits**: easier to unit test, single enforcement point, CLI-accessible, works across all HW backends +- **Pattern**: Use RouteObserver or similar observer patterns for monitoring at SwSwitch layer +- **Example**: Route metadata validation belongs in SwSwitch, not SaiRouteManager (D67951124). "Can this just be a UT now? since all validation is happening in SwSwitch layer" (D93297587) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D67951124%22&type=commits , https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D93297587%22&type=commits +- **Severity**: MEDIUM + +### B. No Unjustified Library/Target Splits +Question library splits that lack measured build performance improvement. Splits add cognitive +friction (more BUCK targets, more headers, more import paths). +- **When reviewing**: Ask if the split was motivated by a build dependency cycle or measured perf gain +- **Example**: "we always build all the tests together. So what do we get from this?... reader has to decide which group a test belongs to" (D93501799). Led to 3 reverts. +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D93501799%22&type=commits +- **Severity**: MEDIUM + +### C. Per-Port Stats Over Device-Level Aggregates +When adding telemetry or counters, prefer per-port stats over device-level aggregates. Use +portIDs (globally unique across MNPU switches) to avoid dependency on switch state. Consider +splitting counters for better observability (e.g., activePortsWithoutReachability vs +inactivePortsWithReachability instead of a single inconsistency counter). +- **Example**: "Don't we want a per port stat?" (D74778889). "Can we break this into 2 counters" (D72295344) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D74778889%22&type=commits , https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D72295344%22&type=commits +- **Severity**: LOW + +### D. Don't Add Derivable State to SwitchState +Before adding new fields to SwitchState, check if the information can be derived from existing +state. If field X can be computed from fields Y and Z that already exist in SwitchState or +config, do not add X as a stored field. +- **When reviewing**: Ask "can this be derived from existing information instead of stored?" +- **Example**: "Why do we need additional field for this... local switch Ids can just be derived in SaiSwitch" (D59879844). "couldn't we simply derive this in SaiSwitch, by iterating over ports" (D61140726) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D59879844%22&type=commits , https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D61140726%22&type=commits +- **Severity**: MEDIUM (important — every SwitchState field is serialized during warmboot) + +### E. Build Shared Abstractions for Processing Order +When validation and programming must process deltas in the same order, build a shared abstraction +used by both layers rather than duplicating ordering logic. +- **Example**: "How do we ensure that this is the way SaiSwitch will continue to process updates? I suggest building a small abstraction" (D96253394) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D96253394%22&type=commits +- **Severity**: MEDIUM + +### F. Reuse Existing State Mechanisms +Before adding a new state field, check if existing state already captures the concept. Piggyback +on port active/inactive, interface admin state, or other established fields. +- **Example**: "We should just piggy back on port active/inactive state" (D86535930) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D86535930%22&type=commits +- **Severity**: MEDIUM + +### G. Composition Over Inheritance +Prefer composition (member variable) over inheritance when there is no true "is-a" relationship. +Relax encapsulation only when needed, start with maximum restriction. +- **Example**: "Do we need to derive from AgentIntiallizer? Just having AgentIntializer as a member seems like a less coupled approach" (D42425625) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D42425625%22&type=commits +- **Severity**: LOW + +### H. Share Infrastructure Code, Don't Duplicate +When building infrastructure (reconnection logic, stream clients, delta processors), check if +similar infrastructure already exists (e.g., FsdbStreamClient). Abstract a shared base class to +get existing test coverage for free. +- **Example**: "A bunch of logic in ReconnectingThriftClient is common b/w FsdbStreamClient. Can we abstract out a base class and share this code. Fsdb has a pretty rich set of UTs" (D47879519) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D47879519%22&type=commits +- **Severity**: MEDIUM + +### I. Config-Driven Over Code Changes +When behavior can be controlled via configuration or collection parameters, prefer that over +code changes that require an agent push. Don't slow down shared infrastructure for one consumer. +- **Example**: "This change requires new agent code to roll out, if we just control this via collections, then they can put in whatever interval they want w/o waiting for agent push" (D80991639) +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D80991639%22&type=commits +- **Severity**: LOW + +### J. Minimize Library Dependencies +When reviewing new library dependencies, question whether they bloat build times for downstream +consumers. Prefer small, focused libraries. If a library is "small now," consider that future +additions to it will impact all dependents. +- **Example**: "many of the libraries in fboss folder end up taking a long time to build. fboss/lib:thrift_service_utils is small/fast now. But I am not sure what will get added there as dependency in the future" (D93124830) +- **Github PR**: https://github.com/facebook/fboss/pull/902 +- **Severity**: MEDIUM + +### K. Platform Services Uniformity +All platform services (fan_service, sensor_service, platform_manager, etc.) should use identical +patterns for thrift setup, signal handling, and initialization. Consolidate boilerplate into +shared helpers (e.g., `helpers::runThriftService()`). +- **Example**: D91612218 consolidated thrift server setup across 6 platform services +- **Github PR**: https://github.com/search?q=repo%3Afacebook%2Ffboss+%22Differential+Revision%3A+D91612218%22&type=commits +- **Severity**: LOW + +### Dispatch Criteria +Dispatch this reviewer when: +- Changes touch both SwSwitch AND HwSwitch code paths +- New state fields are being added to SwitchState +- New stats/counters are being introduced +- BUCK target structure is being modified (new targets, library splits) +- New Thrift client/connection infrastructure is being created +- Changes span 3+ directories under fboss/ +- Class inheritance hierarchies are being introduced or extended diff --git a/fboss/skills/fboss-review/references/generic-reviewers.md b/fboss/skills/fboss-review/references/generic-reviewers.md new file mode 100644 index 0000000000000..3cb4ebdbc4407 --- /dev/null +++ b/fboss/skills/fboss-review/references/generic-reviewers.md @@ -0,0 +1,68 @@ +# Generic Reviewer Personas + +## Reviewer 1: Reliability + +Focus: Error handling, resource management, logging, timeout/retry, graceful degradation. + +Review for: +- Missing error handling on fallible operations (SAI calls, Thrift RPCs, file I/O) +- Resource leaks: RAII violations, missing cleanup in error paths +- Missing or insufficient logging at error boundaries +- Timeout/retry logic: missing timeouts, unbounded retries, no backoff +- Graceful degradation: does failure in a non-critical path crash the process? + +Severity guide: Resource leaks and missing error handling = HIGH. Logging gaps = MEDIUM. + +## Reviewer 2: Engineering / Performance + +Focus: Algorithmic complexity, unnecessary copies, lock contention, modern C++17/20, const/constexpr. + +Review for: +- O(n^2) or worse algorithms where O(n) or O(n log n) is possible +- Unnecessary copies of large objects (use const ref or move semantics) +- Lock contention: holding locks during I/O, broad lock scopes +- Modern C++ opportunities: structured bindings, std::optional, constexpr +- Unnecessary allocations in hot paths +- Library dependency growth: does adding this dependency risk build time bloat for downstream consumers? + +Severity guide: Algorithmic issues in hot paths = HIGH. Style modernization = LOW. Library dependency bloat = MEDIUM. + +## Reviewer 3: Code Quality + +Focus: Readability, modularity, duplication, API design. Also loads `../fboss-code-standards/references/general-patterns.md` for team conventions. + +Review for: +- Code duplication that should be extracted +- Functions doing too many things (> 50 lines warrants scrutiny) +- Unclear variable/function names +- API design: are interfaces minimal, consistent, and hard to misuse? +- Team coding conventions from general-patterns.md +- Hardcoded magic numbers when named enums/constants exist + +Severity guide: Duplication causing maintenance burden = MEDIUM. Naming = LOW. Magic numbers = MEDIUM. + +## Reviewer 4: Summary & Test Plan + +Focus: Diff metadata quality. + +Review for: +- Title accuracy: does it describe what changed? +- Summary completeness: does it explain WHY, not just WHAT? Are design decisions documented? +- Test plan adequacy: are tests listed? Do they cover the change? +- Diff coherence: is this one logical change, or should it be split? + +Severity guide: Missing test plan = HIGH. Title/summary improvements = LOW. + +## Reviewer 5: Silent Failure Finder + +Focus: Logic errors, lossy conversions, race conditions, silent data loss. + +Review for: +- Off-by-one errors in loops and bounds checks +- Wrong comparison operators (< vs <=, == vs !=) +- Lossy numeric conversions (int64 -> int32, double -> int) without bounds checks +- Race conditions: shared mutable state accessed without synchronization +- Silent data loss: ignored return values, unchecked optional access +- Method override signature mismatches: any child class overriding a parent method must preserve the parent's full signature and forward all arguments. Read the parent class to verify — silently dropped arguments cause hard-to-detect data loss. + +Severity guide: Logic errors and race conditions = HIGH. Lossy conversions = MEDIUM. diff --git a/fboss/skills/fboss-review/references/verifier.md b/fboss/skills/fboss-review/references/verifier.md new file mode 100644 index 0000000000000..14ae379f91079 --- /dev/null +++ b/fboss/skills/fboss-review/references/verifier.md @@ -0,0 +1,44 @@ +# Verifier: Dedup, Confidence Calibration, False Positive Filtering + +## Deduplication + +- Same file:line reported by multiple reviewers = single finding +- Same underlying issue across nearby lines = single finding (use the most specific location) +- When merging, keep the highest severity and combine reviewer attributions + +## Confidence Calibration + +Starting confidence is the reviewer's assigned value (0.0-1.0). Adjust: + +- **+0.2** if 2 or more reviewers independently flagged the same issue +- **-0.2** if the flagged pattern is a common, established pattern in the surrounding codebase (check 5+ nearby files) +- **+0.1** if the pattern matches a HIGH confidence item from fboss-code-standards +- Cap final confidence at 1.0 + +## Anti-Downgrade Rules + +Do NOT reduce severity below MEDIUM for these, even if surrounding code follows the same pattern: +- **Hardcoded magic numbers when enums/constants exist** + +## False Positive Filtering + +Remove findings that match any of these: +- File has `@generated` tag (auto-generated code, do not review) +- Issue exists in code not touched by this diff (pre-existing, out of scope) +- Test-only finding with severity LOW (not worth flagging) +- Style-only preference with no functional impact + +## Threshold + +Only report findings with final confidence >= 0.7. + +## Output Format + +Present verified findings as a structured table: + +``` +| File:Line | Reviewer(s) | Severity | Issue | Confidence | +|-----------|-------------|----------|-------|------------| +``` + +Sort by severity (HIGH first), then by confidence (highest first).