sidecar thermal shutdown on mfg line #2369

@nathanaelhuffman

I was called in to debug a system whose sidecar was in A2 after a mupdate to 17.2.

 humility -a /data/local/images/sidecar/d/sp/build-sidecar-d-image-default-v1.0.56.zip --ip  fe80::aa40:25ff:fe05:8e00%dut2 ringbuf seq
humility: connecting to fe80::aa40:25ff:fe05:8e00%5
humility: ring buffer drv_oxide_vpd::__RINGBUF in sequencer:
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in sequencer:
humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
  17  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  18  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  19  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  20  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  21  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  22  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  23  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  24  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  25  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  26  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  27  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  28  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  29  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
  30  350      406        1 TofinoSequencerTick(Disabled, A2 { error: None })
  31  368      406        1 TofinoSequencerPolicyUpdate(Disabled)
   0  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
   1  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
   2  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
   3  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
   4  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
   5  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
   6  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
   7  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
   8  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
   9  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
  10  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
  11  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
  12  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
  13  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
  14  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })
  15  368      407        1 TofinoSequencerPolicyUpdate(Disabled)
  16  350      407        1 TofinoSequencerTick(Disabled, A2 { error: None })

Looks like a commanded power-off of some kind, so let's check thermal:

humility -a /data/local/images/sidecar/d/sp/build-sidecar-d-image-default-v1.0.56.zip --ip  fe80::aa40:25ff:fe05:8e00%dut2 ringbuf thermal
humility: connecting to fe80::aa40:25ff:fe05:8e00%5
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
    6678 PowerDownAt
      97 ControlPwm
      11 AutoState(Boot)
       2 AutoState(Running)
       2 AutoState(Uncontrollable)
       1 AutoState(Overheated)
       8 FanAdded
       8 AddedDynamicInput
       2 PowerDownDueTo
       2 PowerModeChanged
       2 FanControllerInitialized
       1 Start
       1 ThermalMode(Auto)
       1 CriticalDueTo
       1 SetFanWatchdogOk
 NDX LINE      GEN    COUNT PAYLOAD
   8 1210      211        1 PowerDownAt(0x66e840)
   9 1210      211        1 PowerDownAt(0x66ec28)
  10 1210      211        1 PowerDownAt(0x66f010)
  11 1210      211        1 PowerDownAt(0x66f3f8)
  12 1210      211        1 PowerDownAt(0x66f7e8)
  13 1210      211        1 PowerDownAt(0x66fbc8)
  14 1210      211        1 PowerDownAt(0x66ffb0)
  15 1210      211        1 PowerDownAt(0x670398)
  16 1210      211        1 PowerDownAt(0x670780)
  17 1210      211        1 PowerDownAt(0x670b68)
  18 1210      211        1 PowerDownAt(0x670f50)
  19 1210      211        1 PowerDownAt(0x671338)
  20 1210      211        1 PowerDownAt(0x671720)
  21 1210      211        1 PowerDownAt(0x671b08)
  22 1210      211        1 PowerDownAt(0x671ef0)
  23 1210      211        1 PowerDownAt(0x6722d8)
  24 1210      211        1 PowerDownAt(0x6726c0)
  25 1210      211        1 PowerDownAt(0x672aa8)
  26 1210      211        1 PowerDownAt(0x672e90)
  27 1210      211        1 PowerDownAt(0x673278)
  28 1210      211        1 PowerDownAt(0x673660)
  29 1210      211        1 PowerDownAt(0x673a48)
  30 1210      211        1 PowerDownAt(0x673e30)
  31 1210      211        1 PowerDownAt(0x674218)
   0 1210      212        1 PowerDownAt(0x674600)
   1 1210      212        1 PowerDownAt(0x6749e8)
   2 1210      212        1 PowerDownAt(0x674dd0)
   3 1210      212        1 PowerDownAt(0x6751b8)
   4 1210      212        1 PowerDownAt(0x6755a0)
   5 1210      212        1 PowerDownAt(0x675988)
   6 1210      212        1 PowerDownAt(0x675d70)
   7 1210      212        1 PowerDownAt(0x676158)
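
As a quick sanity check on the cadence of these entries, the payloads can be differenced. The sketch below is a minimal Python example using the first few PowerDownAt values copied from the dump above; treating the payload as a timestamp in milliseconds is an assumption here, not something the ring buffer output itself confirms, and the helper name `deltas` is made up for illustration.

```python
# First few PowerDownAt payloads copied from the thermal ring buffer dump above.
payloads = [0x66e840, 0x66ec28, 0x66f010, 0x66f3f8]


def deltas(values):
    """Return the difference between each consecutive pair of values."""
    return [b - a for a, b in zip(values, values[1:])]


# Assumption: the payload is a tick count in milliseconds.
gaps = deltas(payloads)
print(gaps)  # [1000, 1000, 1000]
```

If that assumption holds, a new entry lands roughly once per second, so the 32 slots shown here cover only about half a minute of history.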

This looks like a thermal shutdown. Two things of note here: the repeated PowerDownAt entries seem like a bug, and they push the other, more useful entries out of the ring buffer. The variant totals do show a CriticalDueTo logged at some point in the history, so I'm postulating that this was due to a xcvr read error. Additional attempts to go back to A0 did not work, as we were immediately shut down again:

humility -a /data/local/images/sidecar/d/sp/build-sidecar-d-image-default-v1.0.56.zip --ip  fe80::aa40:25ff:fe05:8e00%dut2 ringbuf seq
humility: connecting to fe80::aa40:25ff:fe05:8e00%5
humility: ring buffer drv_oxide_vpd::__RINGBUF in sequencer:
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in sequencer:
humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
  17  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  18  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  19  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  20  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  21  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  22  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  23  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  24  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  25  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  26  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  27  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  28  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  29  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
  30  350      409        1 TofinoSequencerTick(Disabled, A2 { error: None })
  31  368      409        1 TofinoSequencerPolicyUpdate(Disabled)
   0  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
   1  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
   2  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
   3  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
   4  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
   5  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
   6  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
   7  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
   8  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
   9  403      410        1 ClearingTofinoSequencerFault(None)
  10  368      410        1 TofinoSequencerPolicyUpdate(LatchOffOnFault)
  11  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
  12  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
  13  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
  14  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })
  15  368      410        1 TofinoSequencerPolicyUpdate(Disabled)
  16  350      410        1 TofinoSequencerTick(Disabled, A2 { error: None })

I took a hubris dump for further debugging, stored at /staff/core/hubris-2369, and then proceeded to ignition-cycle the sidecar, at which point it came back online with no problem.

Some thoughts:

  1. This shouldn't be so sticky as to require an ignition cycle.
  2. Spamming the ring buffer with PowerDownAt entries, each carrying a new timestamp, is really unhelpful.
  3. If this is a xcvr temp issue as suspected, totally shutting down the switch is most probably incorrect behavior!
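
To illustrate point 2, here is a toy Python model of a fixed-size ring buffer that merges consecutive identical entries into one slot with a count, which is what the COUNT column in the dumps suggests. It is a simplified sketch and not the Hubris implementation: `RingBuf`, `simulate`, and the bare-string payloads are hypothetical names chosen for the example. It compares logging PowerDownAt with a changing timestamp against logging it without one.

```python
from collections import deque


class RingBuf:
    """Toy fixed-size ring buffer that merges consecutive identical payloads."""

    def __init__(self, size):
        # Each entry is [payload, count]; the oldest entry is evicted when full.
        self.entries = deque(maxlen=size)

    def record(self, payload):
        if self.entries and self.entries[-1][0] == payload:
            self.entries[-1][1] += 1  # same as the last entry: bump its count
        else:
            self.entries.append([payload, 1])


def simulate(with_timestamp, ticks=100, size=32):
    """Log two interesting events, then one PowerDownAt per tick."""
    buf = RingBuf(size)
    buf.record("CriticalDueTo")
    buf.record("PowerDownDueTo")
    for t in range(ticks):
        # A changing timestamp makes every entry unique, defeating the merge.
        buf.record(("PowerDownAt", t * 1000) if with_timestamp else "PowerDownAt")
    return [payload for payload, _ in buf.entries]


spammy = simulate(with_timestamp=True)
quiet = simulate(with_timestamp=False)
print("CriticalDueTo" in spammy)  # False: evicted by unique PowerDownAt entries
print("CriticalDueTo" in quiet)   # True: repeats collapsed into a single slot
```

In this model the cause of the shutdown survives only when the repeated entry is identical each time. Logging the deadline once when it is first set, or keeping the timestamp out of the repeated entry, would be two possible ways to get that effect.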
