Skip to content

thermal: only poll Cosmo T6 temperatures in A0+HP#2391

Open
hawkw wants to merge 2 commits intomasterfrom
eliza/t6-thermals-in-ao+hp-only
Open

thermal: only poll Cosmo T6 temperatures in A0+HP#2391
hawkw wants to merge 2 commits intomasterfrom
eliza/t6-thermals-in-ao+hp-only

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Feb 14, 2026

Presently, the thermal loop on Cosmo will attempt to read the T6 temperature sensors whilst in the A0 or A0+HP power states. This is wrong, as the NIC is only known to be powered in the A0+HP state on Cosmo (since its sequencing is controlled by the host OS).1

This means that if the T6 MAPOs but the host processor is still up, the thermal loop will continue trying to read temperatures from the T6 whilst it is not powered. Because it's not powered, we will just read I2C NACKs from it. This causes the logic in the thermal loop that attempts to extrapolate a worst-case temperature increase for devices which have transient sensor read errors to imagine a T6 temperature which continues to increase, eventually leading to a thermal shutdown. This is described in #2384.

This commit adds a bit in the Cosmo thermal task's PowerBitmasks to differentiate between the A0 and A0+HP power states, and only consider the T6 temperature if the A0+HP bit is set. This will fix #2384...alongside #2390, which is necessary to fix Cosmo ignoring the sequencer IRQ that would tell it to transition from A0+HP back to A0 if the NIC MAPOs.

Footnotes

  1. This is not the case on Gimlet, as the TMP451's V3P3_NIC_A0HP rail was folded into the A0 domain by the FPGA logic on Gimlet. Per @rmustacc this was while debugging the T6 PERST issues. If this ever were to change in the Gimlet FPGA logic, we would want to do a similar thing in the thermal task, but for now, let's avoid the spurious state transition and always care about T6 temps in A0 on Gimlets.

Presently, the thermal loop on Gimlet and Cosmo will attempt to read the
T6 temperature sensors whilst in the A0 *or* A0+HP power states. This is
wrong, as the NIC is only known to be powered in the A0+HP state (since
its sequencing is controlled by the host OS).

This means that if the T6 MAPOs but the host processor is still up, the
thermal loop will continue trying to read temperatures from the T6
whilst it is not powered. Because it's not powered, we will just read
I2C NACKs from it. This causes the logic in the thermal loop that
attempts to extrapolate a worst-case temperature increase for devices
which have transient sensor read errors to imagine a T6 temperature
which continues to increase, eventually leading to a thermal shutdown.
This is described in #2384.

This commit adds a bit in the Gimlet and Cosmo thermal tasks'
`PowerBitmask`s to differentiate between the A0 and A0+HP power states,
and only consider the T6 temperature if the A0+HP bit is set. This will
fix #2384...alongside #2390, which is necessary to fix Cosmo ignoring
the sequencer IRQ that would tell it to transition from A0+HP back to A0
if the NIC MAPOs.
@hawkw hawkw self-assigned this Feb 14, 2026
@hawkw hawkw added gimlet cosmo SP5 Board labels Feb 14, 2026
@hawkw
Copy link
Member Author

hawkw commented Feb 14, 2026

Testing this in integration with #2390 (which is necessary for Cosmo to actually set the state back to A0 when the NIC MAPOs), it appears to work properly. Note that the thermal ringbuf just shows a power bitmask change, while previously, it would show an alleged T6 overheat:

eliza@alfred ~ $ pfexec humility pmbus -r V12_SYS_A2 -w VOUT_COMMAND=8 \
  && pfexec humility pmbus -r V12_SYS_A2 -w VOUT_COMMAND=12 \
  && pfexec humility ringbuf cosmo_seq \
  && pfexec humility ringbuf thermal
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: I2C4, port F, dev 0x67: successfully wrote VOUT_COMMAND
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: I2C4, port F, dev 0x67: successfully wrote VOUT_COMMAND
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: ring buffer drv_cosmo_seq_server::__RINGBUF in cosmo_seq:
   TOTAL VARIANT
   14159 RegStateValues
     534 ContinueBitstreamLoad
     112 CPUPresent
       3 SetState(InitialPowerOn)
       2 PmbusAlert
       2 EreportSent
       1 FpgaInit
       1 WaitForDone
       1 Programmed
       1 Startup
       1 SequencerInterrupt
       1 Coretype
       1 ResetCounts
       1 NicMapoInterrupt
 NDX LINE      GEN    COUNT PAYLOAD
  76  601        5        1 CPUPresent(true)
 ...
  65  530        6       48 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: Ok(Done) }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: Ok(Done) } }
  66 1082        6        1 SequencerInterrupt { our_state: A0PlusHP, seq_state: Ok(Done) }
  67  830        6        1 SequencerIfr(IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: false, amd_pwrok_fedge: false, amd_rstn_fedge: false, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: true, vr_v3p3_sys_to_fpga1_alert: true, vr_v5p0_sys_to_fpga1_alert: true, v0p96_nic_to_fpga1_alert: true, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: false, pwr_cont3_to_fpga1_alert: false })
  68  878        6        1 PmbusAlert { now: 0x43aab }
  69 1157        6        1 EreportSent(0x87)
  70 1157        6        1 EreportSent(0x88)
  71  830        6        1 SequencerIfr(IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: true, amd_pwrok_fedge: false, amd_rstn_fedge: true, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: true, vr_v3p3_sys_to_fpga1_alert: true, vr_v5p0_sys_to_fpga1_alert: true, v0p96_nic_to_fpga1_alert: true, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: true, pwr_cont3_to_fpga1_alert: false })
  72  878        6        1 PmbusAlert { now: 0x43b35 }
  73  933        6        1 ResetCounts { rstn: 0x1, pwrokn: 0x0 }
  74  939        6        1 NicMapoInterrupt
  75  530        6      668 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: Ok(Done) }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: Ok(Idle) } }
humility: ring buffer drv_cosmo_seq_server::vcore::__RINGBUF in cosmo_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0  199        1        1 Initializing
   1  208        1        1 LimitsLoaded
   2  252        1        1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x0), vddcr_cpu1: StatusWord(0x0) }
   3  219        1        1 Initialized
   4  267        1        1 PmbusAlert { timestamp: 0x43aab, alerted: Vrms { pwr_cont1: true, pwr_cont2: false } }
   5  360        1        1 StatusWord(VddcrCpu0, Ok(0x2001))
   6  400        1        1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }
   7  423        1        1 StatusInput(VddcrCpu0, Ok(0x20))
   8  435        1        1 StatusVout(VddcrCpu0, Ok(0x0))
   9  439        1        1 StatusIout(VddcrCpu0, Ok(0x0))
  10  444        1        1 StatusTemperature(VddcrCpu0, Ok(0x0))
  11  448        1        1 StatusCml(VddcrCpu0, Ok(0x0))
  12  453        1        1 StatusMfrSpecific(VddcrCpu0, Ok(0x0))
  13  360        1        1 StatusWord(VddcrCpu1, Ok(0x2001))
  14  400        1        1 RegulatorStatus { rail: VddcrCpu1, power_good: true, faulted: true }
  15  423        1        1 StatusInput(VddcrCpu1, Ok(0x20))
  16  435        1        1 StatusVout(VddcrCpu1, Ok(0x0))
  17  439        1        1 StatusIout(VddcrCpu1, Ok(0x0))
  18  444        1        1 StatusTemperature(VddcrCpu1, Ok(0x0))
  19  448        1        1 StatusCml(VddcrCpu1, Ok(0x0))
  20  453        1        1 StatusMfrSpecific(VddcrCpu1, Ok(0x0))
  21  333        1        1 Reading { timestamp: 0x43ac9, vddcr_cpu0_vin: Volts(8.39), vddcr_cpu1_vin: Volts(8.3) }
  22  333        1        1 Reading { timestamp: 0x43acc, vddcr_cpu0_vin: Volts(8.14), vddcr_cpu1_vin: Volts(8.06) }
  23  333        1        1 Reading { timestamp: 0x43ace, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
  24  333        1        1 Reading { timestamp: 0x43ad2, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  25  333        1        1 Reading { timestamp: 0x43ade, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.01) }
  26  333        1        1 Reading { timestamp: 0x43ae0, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  27  333        1        1 Reading { timestamp: 0x43ae3, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  28  333        1        1 Reading { timestamp: 0x43ae5, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
  29  333        1        1 Reading { timestamp: 0x43ae8, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
  30  333        1        1 Reading { timestamp: 0x43aec, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  31  333        1        1 Reading { timestamp: 0x43af8, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  32  333        1        1 Reading { timestamp: 0x43afa, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  33  333        1        1 Reading { timestamp: 0x43afd, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  34  333        1        1 Reading { timestamp: 0x43aff, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  35  333        1        1 Reading { timestamp: 0x43b02, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  36  333        1        1 Reading { timestamp: 0x43b06, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.01) }
  37  333        1        1 Reading { timestamp: 0x43b12, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  38  333        1        1 Reading { timestamp: 0x43b14, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  39  333        1        1 Reading { timestamp: 0x43b17, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
  40  333        1        1 Reading { timestamp: 0x43b19, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
  41  333        1        1 Reading { timestamp: 0x43b1c, vddcr_cpu0_vin: Volts(8.000001), vddcr_cpu1_vin: Volts(8.02) }
  42  333        1        1 Reading { timestamp: 0x43b20, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.01) }
  43  333        1        1 Reading { timestamp: 0x43b2c, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
  44  333        1        1 Reading { timestamp: 0x43b2e, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
  45  333        1        1 Reading { timestamp: 0x43b31, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.01) }
  46  252        1        1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x2001) }
  47  267        1        1 PmbusAlert { timestamp: 0x43b35, alerted: Vrms { pwr_cont1: true, pwr_cont2: true } }
  48  252        1      301 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x2001) }
  49  252        1        1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x0) }
  50  252        1        1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x0), vddcr_cpu1: NotFaulted }
humility: ring buffer drv_i2c_devices::bmr491::__RINGBUF in cosmo_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0  186        1        1 MitigationApplied(AlreadyApplied)
humility: ring buffer drv_oxide_vpd::__RINGBUF in cosmo_seq:
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in cosmo_seq:
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
     281 ControlPwm
       4 AutoState(Boot)
       3 AutoState(Running)
       6 FanAdded
       3 MiscReadFailed
       3 PowerModeChanged
       1 Start
       1 ThermalMode(Auto)
       1 SensorReadFailed
       1 FanControllerInitialized
       1 SetFanWatchdogOk
 NDX LINE      GEN    COUNT PAYLOAD
  28 1352        1        1 ControlPwm(0xa)
  29 1352        1        1 ControlPwm(0x6)
  30 1352        1        2 ControlPwm(0x4)
  31 1352        1        3 ControlPwm(0x5)
   0 1352        2        1 ControlPwm(0x6)
   1 1352        2        8 ControlPwm(0x5)
   2 1352        2        1 ControlPwm(0x4)
   3 1352        2        1 ControlPwm(0x6)
   4 1352        2        6 ControlPwm(0x5)
   5 1352        2        1 ControlPwm(0x6)
   6 1352        2        8 ControlPwm(0x5)
   7 1352        2       14 ControlPwm(0x4)
   8 1352        2        2 ControlPwm(0x3)
   9 1352        2        1 ControlPwm(0x50)
  10 1352        2        1 ControlPwm(0x41)
  11 1352        2        1 ControlPwm(0x34)
  12 1352        2        1 ControlPwm(0x2b)
  13 1352        2        1 ControlPwm(0x26)
  14 1352        2        1 ControlPwm(0x21)
  15 1352        2        1 ControlPwm(0x1e)
  16 1352        2        1 ControlPwm(0x19)
  17 1352        2        1 ControlPwm(0x16)
  18 1352        2        1 ControlPwm(0x13)
  19 1352        2        1 ControlPwm(0x10)
  20 1352        2        1 ControlPwm(0xc)
  21 1352        2        1 ControlPwm(0x7)
  22 1352        2        1 ControlPwm(0x3)
  23 1352        2       72 ControlPwm(0x0)
  24 1100        2        1 PowerModeChanged(PowerBitmask(0b10))
  25  924        2        1 AutoState(Boot)
  26 1388        2        1 AutoState(Running)
  27 1352        2       11 ControlPwm(0x0)
eliza@alfred ~ $

@rmustacc informs me that this does indeed only apply on Cosmo, as:

> Yeah, Gimlet has the TMP451 on V3P3_NIC_A0HP.
> This was folded into the A0 domain by the FPGA logic.

So, we can ignore this on Gimlet, which simplifies things there a bit.
@hawkw hawkw removed the gimlet label Feb 14, 2026
@hawkw hawkw requested review from cbiffle and removed request for nathanaelhuffman February 14, 2026 19:50
@hawkw hawkw changed the title thermal: only poll T6 temperatures in A0+HP thermal: only poll Cosmo T6 temperatures in A0+HP Feb 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cosmo SP5 Board

Projects

None yet

Development

Successfully merging this pull request may close these issues.

On Cosmo, hubris misses NIC_MAPO leading to a fake "Overheat"

1 participant