thermal: only poll Cosmo T6 temperatures in A0+HP#2391
Open
Conversation
Presently, the thermal loop on Gimlet and Cosmo will attempt to read the T6 temperature sensors whilst in the A0 *or* A0+HP power states. This is wrong, as the NIC is only known to be powered in the A0+HP state (since its sequencing is controlled by the host OS). This means that if the T6 MAPOs but the host processor is still up, the thermal loop will continue trying to read temperatures from the T6 whilst it is not powered. Because it's not powered, we will just read I2C NACKs from it. This causes the logic in the thermal loop that attempts to extrapolate a worst-case temperature increase for devices which have transient sensor read errors to imagine a T6 temperature which continues to increase, eventually leading to a thermal shutdown. This is described in #2384. This commit adds a bit in the Gimlet and Cosmo thermal tasks' `PowerBitmask`s to differentiate between the A0 and A0+HP power states, and only consider the T6 temperature if the A0+HP bit is set. This will fix #2384...alongside #2390, which is necessary to fix Cosmo ignoring the sequencer IRQ that would tell it to transition from A0+HP back to A0 if the NIC MAPOs.
Member
Author
|
Testing this in integration with #2390 (which is necessary for Cosmo to actually set the state back to A0 when the NIC MAPOs), it appears to work properly. Note that the eliza@alfred ~ $ pfexec humility pmbus -r V12_SYS_A2 -w VOUT_COMMAND=8 \
&& pfexec humility pmbus -r V12_SYS_A2 -w VOUT_COMMAND=12 \
&& pfexec humility ringbuf cosmo_seq \
&& pfexec humility ringbuf thermal
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: I2C4, port F, dev 0x67: successfully wrote VOUT_COMMAND
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: I2C4, port F, dev 0x67: successfully wrote VOUT_COMMAND
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: ring buffer drv_cosmo_seq_server::__RINGBUF in cosmo_seq:
TOTAL VARIANT
14159 RegStateValues
534 ContinueBitstreamLoad
112 CPUPresent
3 SetState(InitialPowerOn)
2 PmbusAlert
2 EreportSent
1 FpgaInit
1 WaitForDone
1 Programmed
1 Startup
1 SequencerInterrupt
1 Coretype
1 ResetCounts
1 NicMapoInterrupt
NDX LINE GEN COUNT PAYLOAD
76 601 5 1 CPUPresent(true)
...
65 530 6 48 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: Ok(Done) }, nic_api_status: NicApiStatusView { nic_sm: Ok(Done) }, nic_raw_status: NicRawStatusView { hw_sm: Ok(Done) } }
66 1082 6 1 SequencerInterrupt { our_state: A0PlusHP, seq_state: Ok(Done) }
67 830 6 1 SequencerIfr(IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: false, amd_pwrok_fedge: false, amd_rstn_fedge: false, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: true, vr_v3p3_sys_to_fpga1_alert: true, vr_v5p0_sys_to_fpga1_alert: true, v0p96_nic_to_fpga1_alert: true, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: false, pwr_cont3_to_fpga1_alert: false })
68 878 6 1 PmbusAlert { now: 0x43aab }
69 1157 6 1 EreportSent(0x87)
70 1157 6 1 EreportSent(0x88)
71 830 6 1 SequencerIfr(IfrView { fanfault: false, thermtrip: false, smerr_assert: false, a0mapo: false, nicmapo: true, amd_pwrok_fedge: false, amd_rstn_fedge: true, fan_central_hsc_alert: false, fan_east_hsc_alert: false, fan_west_hsc_alert: false, ibc_alert: false, m2_hsc_alert: false, nic_hsc_alert: false, v12_ddr5_abcdef_hsc_alert: false, v12_ddr5_ghijkl_hsc_alert: false, v12_mcio_a0hp_hsc_alert: false, main_hsc_alert: false, vr_v1p8_sys_to_fpga1_alert: true, vr_v3p3_sys_to_fpga1_alert: true, vr_v5p0_sys_to_fpga1_alert: true, v0p96_nic_to_fpga1_alert: true, pwr_cont1_to_fpga1_alert: true, pwr_cont2_to_fpga1_alert: true, pwr_cont3_to_fpga1_alert: false })
72 878 6 1 PmbusAlert { now: 0x43b35 }
73 933 6 1 ResetCounts { rstn: 0x1, pwrokn: 0x0 }
74 939 6 1 NicMapoInterrupt
75 530 6 668 RegStateValues { seq_api_status: SeqApiStatusView { a0_sm: Ok(Done) }, seq_raw_status: SeqRawStatusView { hw_sm: Ok(Done) }, nic_api_status: NicApiStatusView { nic_sm: Ok(Idle) }, nic_raw_status: NicRawStatusView { hw_sm: Ok(Idle) } }
humility: ring buffer drv_cosmo_seq_server::vcore::__RINGBUF in cosmo_seq:
NDX LINE GEN COUNT PAYLOAD
0 199 1 1 Initializing
1 208 1 1 LimitsLoaded
2 252 1 1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x0), vddcr_cpu1: StatusWord(0x0) }
3 219 1 1 Initialized
4 267 1 1 PmbusAlert { timestamp: 0x43aab, alerted: Vrms { pwr_cont1: true, pwr_cont2: false } }
5 360 1 1 StatusWord(VddcrCpu0, Ok(0x2001))
6 400 1 1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }
7 423 1 1 StatusInput(VddcrCpu0, Ok(0x20))
8 435 1 1 StatusVout(VddcrCpu0, Ok(0x0))
9 439 1 1 StatusIout(VddcrCpu0, Ok(0x0))
10 444 1 1 StatusTemperature(VddcrCpu0, Ok(0x0))
11 448 1 1 StatusCml(VddcrCpu0, Ok(0x0))
12 453 1 1 StatusMfrSpecific(VddcrCpu0, Ok(0x0))
13 360 1 1 StatusWord(VddcrCpu1, Ok(0x2001))
14 400 1 1 RegulatorStatus { rail: VddcrCpu1, power_good: true, faulted: true }
15 423 1 1 StatusInput(VddcrCpu1, Ok(0x20))
16 435 1 1 StatusVout(VddcrCpu1, Ok(0x0))
17 439 1 1 StatusIout(VddcrCpu1, Ok(0x0))
18 444 1 1 StatusTemperature(VddcrCpu1, Ok(0x0))
19 448 1 1 StatusCml(VddcrCpu1, Ok(0x0))
20 453 1 1 StatusMfrSpecific(VddcrCpu1, Ok(0x0))
21 333 1 1 Reading { timestamp: 0x43ac9, vddcr_cpu0_vin: Volts(8.39), vddcr_cpu1_vin: Volts(8.3) }
22 333 1 1 Reading { timestamp: 0x43acc, vddcr_cpu0_vin: Volts(8.14), vddcr_cpu1_vin: Volts(8.06) }
23 333 1 1 Reading { timestamp: 0x43ace, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
24 333 1 1 Reading { timestamp: 0x43ad2, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
25 333 1 1 Reading { timestamp: 0x43ade, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.01) }
26 333 1 1 Reading { timestamp: 0x43ae0, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
27 333 1 1 Reading { timestamp: 0x43ae3, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
28 333 1 1 Reading { timestamp: 0x43ae5, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
29 333 1 1 Reading { timestamp: 0x43ae8, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.02) }
30 333 1 1 Reading { timestamp: 0x43aec, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
31 333 1 1 Reading { timestamp: 0x43af8, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
32 333 1 1 Reading { timestamp: 0x43afa, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
33 333 1 1 Reading { timestamp: 0x43afd, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
34 333 1 1 Reading { timestamp: 0x43aff, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
35 333 1 1 Reading { timestamp: 0x43b02, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
36 333 1 1 Reading { timestamp: 0x43b06, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.01) }
37 333 1 1 Reading { timestamp: 0x43b12, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
38 333 1 1 Reading { timestamp: 0x43b14, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
39 333 1 1 Reading { timestamp: 0x43b17, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.02) }
40 333 1 1 Reading { timestamp: 0x43b19, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
41 333 1 1 Reading { timestamp: 0x43b1c, vddcr_cpu0_vin: Volts(8.000001), vddcr_cpu1_vin: Volts(8.02) }
42 333 1 1 Reading { timestamp: 0x43b20, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.01) }
43 333 1 1 Reading { timestamp: 0x43b2c, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
44 333 1 1 Reading { timestamp: 0x43b2e, vddcr_cpu0_vin: Volts(7.9800005), vddcr_cpu1_vin: Volts(8.030001) }
45 333 1 1 Reading { timestamp: 0x43b31, vddcr_cpu0_vin: Volts(7.9900007), vddcr_cpu1_vin: Volts(8.01) }
46 252 1 1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x2001) }
47 267 1 1 PmbusAlert { timestamp: 0x43b35, alerted: Vrms { pwr_cont1: true, pwr_cont2: true } }
48 252 1 301 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x2001) }
49 252 1 1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x2001), vddcr_cpu1: StatusWord(0x0) }
50 252 1 1 TriedToClearFaults { vddcr_cpu0: StatusWord(0x0), vddcr_cpu1: NotFaulted }
humility: ring buffer drv_i2c_devices::bmr491::__RINGBUF in cosmo_seq:
NDX LINE GEN COUNT PAYLOAD
0 186 1 1 MitigationApplied(AlreadyApplied)
humility: ring buffer drv_oxide_vpd::__RINGBUF in cosmo_seq:
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in cosmo_seq:
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
TOTAL VARIANT
281 ControlPwm
4 AutoState(Boot)
3 AutoState(Running)
6 FanAdded
3 MiscReadFailed
3 PowerModeChanged
1 Start
1 ThermalMode(Auto)
1 SensorReadFailed
1 FanControllerInitialized
1 SetFanWatchdogOk
NDX LINE GEN COUNT PAYLOAD
28 1352 1 1 ControlPwm(0xa)
29 1352 1 1 ControlPwm(0x6)
30 1352 1 2 ControlPwm(0x4)
31 1352 1 3 ControlPwm(0x5)
0 1352 2 1 ControlPwm(0x6)
1 1352 2 8 ControlPwm(0x5)
2 1352 2 1 ControlPwm(0x4)
3 1352 2 1 ControlPwm(0x6)
4 1352 2 6 ControlPwm(0x5)
5 1352 2 1 ControlPwm(0x6)
6 1352 2 8 ControlPwm(0x5)
7 1352 2 14 ControlPwm(0x4)
8 1352 2 2 ControlPwm(0x3)
9 1352 2 1 ControlPwm(0x50)
10 1352 2 1 ControlPwm(0x41)
11 1352 2 1 ControlPwm(0x34)
12 1352 2 1 ControlPwm(0x2b)
13 1352 2 1 ControlPwm(0x26)
14 1352 2 1 ControlPwm(0x21)
15 1352 2 1 ControlPwm(0x1e)
16 1352 2 1 ControlPwm(0x19)
17 1352 2 1 ControlPwm(0x16)
18 1352 2 1 ControlPwm(0x13)
19 1352 2 1 ControlPwm(0x10)
20 1352 2 1 ControlPwm(0xc)
21 1352 2 1 ControlPwm(0x7)
22 1352 2 1 ControlPwm(0x3)
23 1352 2 72 ControlPwm(0x0)
24 1100 2 1 PowerModeChanged(PowerBitmask(0b10))
25 924 2 1 AutoState(Boot)
26 1388 2 1 AutoState(Running)
27 1352 2 11 ControlPwm(0x0)
eliza@alfred ~ $ |
@rmustacc informs me that this does indeed only apply on Cosmo, as: > Yeah, Gimlet has the TMP451 on V3P3_NIC_A0HP. > This was folded into the A0 domain by the FPGA logic. So, we can ignore this on Gimlet, which simplifies things there a bit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Presently, the thermal loop on Cosmo will attempt to read the T6 temperature sensors whilst in the A0 or A0+HP power states. This is wrong, as the NIC is only known to be powered in the A0+HP state on Cosmo (since its sequencing is controlled by the host OS).1
This means that if the T6 MAPOs but the host processor is still up, the thermal loop will continue trying to read temperatures from the T6 whilst it is not powered. Because it's not powered, we will just read I2C NACKs from it. This causes the logic in the thermal loop that attempts to extrapolate a worst-case temperature increase for devices which have transient sensor read errors to imagine a T6 temperature which continues to increase, eventually leading to a thermal shutdown. This is described in #2384.
This commit adds a bit in the Cosmo thermal task's
PowerBitmasks to differentiate between the A0 and A0+HP power states, and only consider the T6 temperature if the A0+HP bit is set. This will fix #2384...alongside #2390, which is necessary to fix Cosmo ignoring the sequencer IRQ that would tell it to transition from A0+HP back to A0 if the NIC MAPOs.Footnotes
This is not the case on Gimlet, as the TMP451's
V3P3_NIC_A0HPrail was folded into the A0 domain by the FPGA logic on Gimlet. Per @rmustacc this was while debugging the T6 PERST issues. If this ever were to change in the Gimlet FPGA logic, we would want to do a similar thing in the thermal task, but for now, let's avoid the spurious state transition and always care about T6 temps in A0 on Gimlets. ↩