
[Celestica] Icecube: Config: Implement Software Overtemp Protection (OTP) for TH6 ASIC #1208

Open
zhongedward wants to merge 1 commit into facebook:main from zhongedward:Implement_software_OTP_for_TH6_ASIC



zhongedward commented May 19, 2026

Pre-submission checklist

  • I've run the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run
icecube_fan_otp

Summary

This PR introduces a software-based Overtemp Protection (OTP) mechanism for the Icecube platform. It integrates sensor monitoring with the fan_service shutdown logic so that the TH6 ASIC is powered off during a thermal anomaly before it reaches the hardware protection limit.

Currently, the Icecube platform lacks a software-driven emergency power-off sequence for the TH6 ASIC. Relying solely on hardware-level protection can be risky if the thermal ramp is too steep. This change establishes a "Soft OTP" layer to trigger an orderly shutdown when the TH6 temperature hits the critical threshold.

Key Changes

  • platform_manager.json: Exported the SMB_CPLD sysfs path to ensure sensor_service has consistent access to temperature registers.

  • sensor_service.json: Defined the TH6_TEMP sensor (mapped to SMB_CPLD) with a critical threshold (upperCriticalVal) of 101.0°C.

  • fan_service.json:

    • Implemented shutdownCondition triggered by TH6_TEMP.
    • Defined shutdownCmd to explicitly disable TH6 power via SMB_CPLD (echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en).
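
The interaction between the threshold and the shutdown command can be sketched in a few lines of Python. This is only an illustration, not the fan_service implementation: the function names are hypothetical, and the use of `>=` for the comparison is an assumption, since the exact comparison fan_service applies is not shown in this PR. The threshold value and the sysfs path are the ones listed above.

```python
from pathlib import Path

# Values taken from this PR's description.
UPPER_CRITICAL_C = 101.0  # upperCriticalVal for TH6_TEMP
TH6_PWR_EN = Path("/run/devmap/cplds/SMB_CPLD/th6_pwr_en")  # target of shutdownCmd


def should_shutdown(temp_c: float, limit_c: float = UPPER_CRITICAL_C) -> bool:
    """Return True once the reading reaches the critical threshold.

    Assumption: ">=" is used here; the real comparison may differ.
    """
    return temp_c >= limit_c


def disable_th6_power(pwr_en: Path = TH6_PWR_EN) -> None:
    """Write 0 to the enable file, like `echo 0 > <pwr_en>`."""
    pwr_en.write_text("0\n")


def check_once(temp_c: float, pwr_en: Path = TH6_PWR_EN) -> bool:
    """Run one evaluation; return True if the shutdown action was taken."""
    if should_shutdown(temp_c):
        disable_th6_power(pwr_en)
        return True
    return False
```

Passing the enable path as a parameter lets the logic be exercised against a temporary file rather than the real CPLD attribute.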

Test Plan

  1. Syntax Validation: Validated JSON syntax.

  2. Formatting: Pretty-printed configurations using the jq command for readability.

  3. Build & Config Tests: Compilation and configuration validation tests passed successfully.

  4. Service Verification: Confirmed that the following services start and run without errors:

    • platform_manager/platform_hw_test/platform_manager_hw_test
    • sensor_service/sensor_service_client/sensor_service_sw_test/sensor_service_hw_test
    • fan_service/fan_service_sw_test/fan_service_hw_test
  5. End-to-End Thermal Protection Verification (Soft OTP)
    To verify the effectiveness of the software shutdown logic, we performed a controlled thermal stress test:

  • Methodology:

    • Hardware Guardrail Adjustment: Temporarily increased the hardware initialization threshold of the TMP432 sensor to 110°C via platform_manager. This keeps the hardware-level protection from tripping before the 101°C software threshold during the test window, so the software logic acts first.
    • Controlled Thermal Ramp: Adjusted the CDU (Cooling Distribution Unit) to allow the TH6 ASIC temperature to rise naturally.
    • Observation: Monitored the fan_service polling cycle and system logs to capture the exact trigger point.
  • Test Result:

    • Trigger Point: Once TH6_TEMP hit the software-defined threshold of 101°C, the fan_service successfully identified the shutdownCondition.
    • Action Executed: The shutdownCmd was immediately triggered, executing:
      echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en
    • Conclusion: Confirmed that the TH6 power rail was successfully disabled by the software trigger, preventing the temperature from reaching the 110°C hardware limit.
    • Audit Logs: The attached .zip contains specific evidence:
      • temp_shutdown_monitor: Captures the thermal ramp and the fan_service trigger event at 101.604°C.
      • pcie_shutdown_monitor: Verifies the physical removal of TH6 from the PCIe bus post-shutdown, confirming successful power-off.
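
One way to read the logged trigger value of 101.604°C against the 101°C threshold: a polled sensor is only evaluated at discrete sample times, so the first reading at or above the threshold is usually slightly past it. The Python sketch below illustrates that idea. The ramp samples and timestamps are hypothetical, chosen only so that the first crossing matches the 101.604°C value from the log; the function name and the `>=` comparison are also assumptions.

```python
from typing import Iterable, Optional, Tuple

SOFT_OTP_C = 101.0  # software threshold from this PR

Sample = Tuple[float, float]  # (seconds since start, temperature in °C)


def first_trigger(samples: Iterable[Sample],
                  limit_c: float = SOFT_OTP_C) -> Optional[Sample]:
    """Return the first polled sample at or above the threshold, if any."""
    for t_s, temp_c in samples:
        if temp_c >= limit_c:
            return (t_s, temp_c)
    return None


# Hypothetical polled readings during a slow thermal ramp (not measured data).
ramp = [(0.0, 99.8), (5.0, 100.4), (10.0, 100.9), (15.0, 101.604), (20.0, 102.3)]

hit = first_trigger(ramp)
print(hit)
```

With these illustrative samples, the reading taken just before the crossing is still below 101°C, so the shutdown condition is first seen on the next poll.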

Attachment:
icecube_sw_OTP_test_2026_04_24_log.zip

@meta-cla meta-cla Bot added the CLA Signed label May 19, 2026
@zhongedward zhongedward marked this pull request as ready for review May 19, 2026 08:11
@zhongedward zhongedward requested a review from a team as a code owner May 19, 2026 08:11

meta-codesync Bot commented May 22, 2026

@mikechoifb has imported this pull request. If you are a Meta employee, you can view this in D106028063.
