Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 81 additions & 0 deletions fboss/skills/fboss-code-standards/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
---
name: fboss-code-standards
description: FBOSS coding standards and patterns. Auto-loaded when writing code in fboss/ to catch architecture violations, SAI/SDK misuse, thrift_cow pitfalls, platform config errors, and testing gaps. For explicit multi-reviewer review, use /fboss-review instead.
---

# FBOSS Code Standards

## Overview

Passive coding guidance for FBOSS. Applied automatically while writing or modifying code under `fboss/`.

## Scope

Currently `fboss/` only. TODO: extend to `configerator/source/neteng/fboss`, `neteng/netcastle`, `neteng/fboss`.

## Quick Checklist

| Area | Pattern | Check |
|------|---------|-------|
| Agent | Mono/multi-switch duality | State changes must work in both modes |
| Agent | Warmboot serialization | New SwitchState fields must serialize/deserialize |
| SAI | SaiApiTable registration | New SAI attributes must be registered |
| SAI | SaiStore consistency | SAI objects must be tracked, no orphans |
| FSDB | State/Stats duality | Always handle both trees |
| FSDB | extern template | New ThriftStructNode instantiations need extern template |
| thrift_cow | COW modification | Use `modify()`, never mutate shared nodes |
| Platform | JSON + Thrift sync | Config changes update both schemas |
| Testing | Naming convention | Follow `AgentHw<Feature>Test` pattern |
| Testing | NSDB impact | Core FSDB changes must run NSDB tests |
| General | Follow local patterns | Be consistent with existing code in the directory |
| Testing | No GTEST_SKIP | Use ProductionFeatures filtering, never `GTEST_SKIP()` |
| Testing | DSF counters | Check reassembly errors on fabric ports, not discards |
| Agent | Non-coalescing behavioral deltas | Mark delta-significant updates non-coalescing |
| Agent | Rolled-out flag cleanup | Remove feature flags that are fully rolled out |
| Agent | Early return on empty deltas | Check for empty deltas and return early |
| Agent | No hardcoded ASIC types | Use feature/property lookups, not ASIC name checks |
| SAI | Explicit cancellation | Distinguish reconnectable errors from shutdown signals |
| General | No private fn defaults | Don't default parameters all callers override |
| General | Explicit state flags | Pass all enable/disable flags, no direction assumptions |
| General | Verify before deleting | Provide evidence files are unused before deleting |
| General | CHECK over assert | Use CHECK for production invariants, not assert() |
| General | emplace over operator[] | Use emplace/insert for map insertions |
| General | ASIC feature gating | Gate HW-specific constraints with ASIC features |
| General | Concise method naming | Don't include class name in method name |
| General | Explicit operator precedence | Parenthesize compound boolean/comparison expressions |
| General | Inline trivial comparisons | Don't wrap one-line checks in helper methods |
| General | Fix lint before landing | Run `arc f` and `arc lint` before submitting |
| Testing | Targeted unit tests | New modules need dedicated unit tests, not just HW tests |
| Agent | Documentation-value methods | Keep per-ASIC limit methods as architecture docs |
| General | Rate-limit verbose logging | Use XLOG_EVERY_MS, never log at line rate |
| General | Perspective-aware naming | Use fromX/toX instead of ambiguous Rx/Tx |
| General | Update OSS build files | Update CMake alongside BUCK when adding/moving files |
| Testing | Deep comparison over size checks | Compare content, not just collection size |
| Testing | Verify negative cases | Explicitly check absence, not just no-throw |
| Testing | Test overflow and error paths | Cover resource exhaustion, not just happy path |
| Testing | Optimize test execution time | Use smallest packet sizes, avoid unnecessary waits |
| Agent | ODS counters for debugging | Add ODS counters for drop/error events, not just logs |
| Agent | Validate external config limits | Reject unreasonable externally configurable values |
| Agent | Populate warmboot cache first | Fill all caches from warmboot state before processing |
| General | Empty vector over optional\<vector\> | Use empty vector, not optional\<vector\> |
| General | Map-based config | Use map lookups, not if-else chains for platform values |
| General | Log IDs + names | Include both numeric ID and name in log messages |
| Agent | Delta processing order | Remove → Change → Add; document in comments |
| Agent | Validator extraction | Extract complex validation into dedicated classes |
| Agent | Gate validations with flags | New enforcement behind FLAGS for gradual rollout |
| Testing | Test transient states | Test intermediate invalid states, not just final |
| Testing | Consolidate test helpers | Move shared helpers to base classes |
| Testing | Validation UTs | Pure validation → unit tests, not just HW tests |
| Testing | No setup changes to existing tests | Create new tests instead; setup changes break warmboot roundtrip CI |

## When to Load References

| Topic | Action |
|-------|--------|
| Team coding conventions (C++ style, control flow, naming) | Read `references/general-patterns.md` |
| Agent code (SwSwitch, HwSwitch, mono/multi-switch, warmboot) | Read `references/agent-patterns.md` |
| SAI/SDK layer (SaiApi, SaiStore, SaiManager, vendor SDK) | Read `references/sai-sdk-patterns.md` |
| FSDB or thrift_cow (subscriptions, COW nodes, PatchBuilder) | Read `references/thrift-cow-fsdb-patterns.md` |
| Platform services or config (sensor_service, fan_service, JSON configs) | Read `references/platform-config-patterns.md` |
| Tests (agent HW tests, multinode tests, naming, fixtures) | Read `references/testing-patterns.md` |
| Adding new patterns | Read `references/contributing.md` |
92 changes: 92 additions & 0 deletions fboss/skills/fboss-code-standards/references/agent-patterns.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# Agent Architecture Patterns

Patterns for code under `fboss/agent/`, including SwSwitch, HwSwitch, and state management.

### 1. Mono/Multi-Switch Duality
**Check**: Any state change or control path must work in both mono mode (`FLAGS_multi_switch=false`, `agent/single/`) and multi-switch mode (`FLAGS_multi_switch=true`, `agent/mnpu/`)
**Why**: FBOSS supports both single-process and multi-process architectures. Code that only works in one mode causes production failures on the other deployment type.
**Confidence**: HIGH (always a bug if only one mode is handled)

### 2. Warmboot State Preservation
**Check**: New fields added to SwitchState or its child nodes must be serialized and deserialized during warmboot. Verify the field appears in the warmboot state Thrift structure and is handled in `fromThrift()`/`toThrift()`.
**Why**: Missing serialization causes state loss during warmboot, leading to traffic disruption or config drift after agent restart.
**Confidence**: HIGH (always a bug if new state fields are not serialized)

### 3. OperDelta Sync in Multi-Switch
**Check**: In multi-switch mode, HwSwitch state updates must go through OperDelta sync (`OperDeltaSyncer`), never by direct HwSwitch mutation from SwSwitch.
**Why**: Direct mutation bypasses the Thrift IPC boundary, causing state divergence between SwSwitch and HwSwitch processes.
**Confidence**: HIGH (always a bug in multi-switch mode)

### 4. State Delta Ordering
**Check**: When applying state deltas, respect dependency order: VLANs before interfaces, interfaces before routes, ACLs before applying to interfaces.
**Why**: Out-of-order delta application causes transient failures or crashes when referenced objects don't exist yet.
**Confidence**: HIGH (ordering violations cause crashes)

### 5. Lock Ordering in SwSwitch
**Check**: When acquiring multiple locks, follow the established lock hierarchy. Never hold `updateLock_` while acquiring `stateLock_`; `stateLock_` must be acquired first if both are needed.
**Why**: Lock ordering violations cause deadlocks that hang the agent.
**Confidence**: HIGH (always a bug)

### 6. Non-Coalescing Updates for Behavioral Deltas
**Check**: State updates where the delta carries behavioral significance (e.g., deltaApplicationBehavior is set) must use non-coalescing update paths. Do not allow the state update framework to coalesce such deltas.
**Why**: Coalesced updates collapse intermediate states. A sequence of enable→disable→enable is different from just enable, but coalescing reduces it to the final state, losing behavioral intent.
**Confidence**: HIGH (always a bug if coalesced when delta behavior matters)

### 7. Clean Up Fully Rolled-Out Feature Flags
**Check**: When a feature is fully rolled out across all platforms, remove the feature flag and the dead code path. Do not leave conditional checks for flags that always evaluate to true.
**Why**: Dead flag branches add complexity, confuse readers, and accumulate debt. If every platform has the feature, the flag serves no purpose.
**Confidence**: MEDIUM (requires confirming rollout is complete on all platforms)

### 8. Early Return on Empty Deltas
**Check**: Delta processing functions must check for empty deltas at the start and return early. Do not process all state updates when only a specific delta changed.
**Why**: Processing empty deltas wastes CPU and can trigger unnecessary side effects (e.g., recomputing local switch IDs on every state update instead of only when DSF node map changes).
**Confidence**: HIGH (always add an early return for empty deltas)

### 9. Don't Hardcode ASIC Types
**Check**: Do not check for specific ASIC types (e.g., `if asicType == J3`) when the information can be derived from config, DSF node map, or ASIC feature flags. Look up ASIC properties dynamically.
**Why**: Hardcoded ASIC checks break when new ASICs are added and create maintenance burden. Use feature-based gating instead.
**Confidence**: HIGH (always use feature/property lookups over hardcoded ASIC names)

### 10. Retain Documentation-Value Per-ASIC Methods
**Check**: Keep per-ASIC limit methods or constants that document platform-specific capabilities (e.g., max neighbor entries per ASIC), even when a unified gflag-based limit is used at runtime. Do not delete them during cleanup unless the platform is fully deprecated.
**Why**: These methods serve as architecture documentation. Without them, future engineers must reverse-engineer per-platform limits from scattered config files or vendor docs.
**Confidence**: MEDIUM (requires judgment on whether the platform is still relevant)

### 11. Use ODS Counters for Production Debugging
**Check**: Add ODS counters (not just logs) for events that need production debugging — packet drops, errors, resource exhaustion. Add follow-up configerator diffs to permit export of new counters to ODS.
**Why**: Logs are ephemeral and hard to aggregate. ODS counters provide exact counts, time-series trending, and alerting. Without counters, production packet drops require log scraping to debug.
**Confidence**: HIGH (always add ODS counters for drop/error events)

### 12. Validate External Configuration Limits
**Check**: Validate externally configurable limits (thrift config, gflags, configerator) to prevent values that could destabilize the system. Reject unreasonable values (e.g., 1M pkts/ms rate limit) that would undo built-in protections like thrift overload guards.
**Why**: External users can configure arbitrary values. Without bounds checking, a misconfiguration can cause CPU exhaustion, memory exhaustion, or bypass overload protection.
**Confidence**: HIGH (always validate external config bounds)

### 13. Populate Warmboot Cache Before Processing
**Check**: During warmboot, populate all caches and ID maps from the warmboot state before processing any new state changes. Do not lazily discover IDs as routes or entries are added.
**Why**: Lazy discovery causes ID collisions — new allocations may reuse IDs that exist in the warmboot cache but haven't been discovered yet, corrupting forwarding state.
**Confidence**: HIGH (always populate warmboot caches before processing new state)

### 14. Delta Processing Order
**Check**: Process deltas in consistent order: removals → changes → additions (per interface, per object type). Document the ordering in comments.
**Why**: Wrong order creates transient invalid states (e.g., two neighbors with same MAC during update). Documented order prevents future refactors from breaking it.
**Source**: D96253394 reviewer feedback
**Confidence**: HIGH

### 15. Extract Validation into Dedicated Classes
**Check**: Complex validation logic should be extracted into dedicated validator classes with clear `isValidDelta()` and `stateChanged()` APIs, not embedded in state update handlers.
**Why**: Single responsibility, easier to unit test in isolation, easier to maintain ordering semantics.
**Source**: D96253394 reviewer feedback
**Confidence**: MEDIUM

### 16. Gate New Validations Behind Feature Flags
**Check**: New validation/enforcement logic must be gated behind a FLAGS (e.g., `FLAGS_enforce_single_nbr_mac_per_intf`) for gradual rollout.
**Why**: Allows disabling if issues arise in production, enables A/B comparison, prevents big-bang rollout risk.
**Source**: D96253394 reviewer feedback
**Confidence**: HIGH

### 17. Rebuild Validator State on Rejection
**Check**: When `updateRejected()` is called after partial validation, rebuild validator state from scratch via `stateChanged(empty → oldState)` rather than attempting incremental undo.
**Why**: Incremental undo is error-prone. Clean rebuild ensures validator state stays in sync even when validation fails partway through.
**Source**: D96253394 reviewer feedback
**Confidence**: MEDIUM
38 changes: 38 additions & 0 deletions fboss/skills/fboss-code-standards/references/contributing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Contributing Patterns

## Where to Add Your Pattern

```
Is your pattern FBOSS-specific?
+-- YES -> Which area?
| +-- Agent/SwSwitch/HwSwitch -> agent-patterns.md
| +-- SAI/SDK -> sai-sdk-patterns.md
| +-- FSDB/thrift_cow -> thrift-cow-fsdb-patterns.md
| +-- Platform/Config -> platform-config-patterns.md
| +-- Testing -> testing-patterns.md
| +-- Cross-cutting design judgment (not a single Check/Why rule)
| -> fboss-review/references/fboss-specific-reviewers.md (Architect reviewer)
+-- NO -> Is it a coding convention the team wants enforced?
+-- YES -> general-patterns.md
+-- NO -> Probably doesn't belong here
```

## Pattern Template

Every pattern must follow this format:

```markdown
### N. Pattern Name
**Check**: What specifically to look for
**Why**: Consequence of violation
**Confidence**: HIGH (always a bug) or MEDIUM (context-dependent)
```

## Guidelines

- **Evidence-based only**: Patterns must come from real bugs, SEVs, or repeated review feedback. No hypothetical issues.
- **False positive test**: Before adding, check 3-5 recent diffs to confirm the pattern would not produce false positives.
- **HIGH confidence**: Use only when violation is always a bug, regardless of context.
- **MEDIUM confidence**: Use when violation is usually a problem but some exceptions exist.
- **No style preferences**: Style-only preferences (brace placement, blank lines) do not belong here.
- **Diff prefix**: When submitting pattern changes, use `[fboss][claude]` prefix in the diff title.
Loading
Loading