Skip to content

thermal: distinguish between real and made up temperatures #2386

Open
hawkw wants to merge 2 commits intomasterfrom
eliza/fix-thermal-gaslighting
Open

thermal: distinguish between real and made up temperatures #2386
hawkw wants to merge 2 commits intomasterfrom
eliza/fix-thermal-gaslighting

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Feb 12, 2026

Presently, the thermal task's ringbuf attempts to gaslight the reader
by presenting totally fake temperatures as though they were real
measurements from a sensor. This is because the thermal loop contains
logic which attempts to extrapolate a worst-case temperature projection
based on the last temperature reading received from a device and the
time which has elapsed since that reading:

/// Returns the worst-case temperature, given a current time and thermal
/// model for this part.
///
/// This only matters when samples are dropped or if there is significant
/// lag in the sensors system; if we received a reading on this control
/// cycle, then time_ms ≈ now_ms, so this is close to v.value (i.e. the most
/// recent reading).
///
/// Typically, time_ms is earlier (less) than now_ms, so this subtraction is
/// safe. If there's invalid data in the sensors task (i.e. readings
/// claiming to be from the future), then this will saturate instead of
/// underflowing.
fn worst_case(&self, now_ms: u64, model: &ThermalProperties) -> Celsius {
Celsius(
self.value.0
+ now_ms.saturating_sub(self.time_ms) as f32 / 1000.0
* model.temperature_slew_deg_per_sec,
)
}

This logic is not wrong, per se. Extrapolating a worst-case
temperature is the correct behavior in the absence of sensor data, even
when it leads to a thermal shutdown which is not actually necessary: the
thermal loop is correct to assume the worst in the absence of real data.
However, the way we presently present these fake temperatures in the
ringbuf is very confusing to the reader, as they appear to be real
life temperature measurements that actually came from the device.
Recently (see #2385), this reporting of fake temperatures as though they
were real lead @nathanaelhuffman and I to hypothesize about a
manufacturing defect with T6 heatsinks when the actual issue was just
that we couldn't read the NIC's temperature sensor as it had MAPO-ed a
while ago (see #2384).

This commit is my attempt to make the ringbuf entries less misleading.
Now, in addition to recording the worst case estimate that lead to a
thermal control state transition (to the Critical or Uncontrollable
states), we also record the real temperature reading from the device
alongside it. The names of the ringbuf fields now attempt to indicate
which is a real temperature and which is a worst-case estimate.

Note that the fake temperatures are only presented to a reader as though
they were real in the thermal loop's ringbuf; they are not sent to the
sensors task and will not be reported externally through the
control-plane-agent's component_details or read_sensor_value APIs.
Instead, we will just report nodata_now to the sensors task when
encountering an error reading a sensor measurement --- which is correct.
See:

match s.sensor.read_temp(self.i2c_task) {
Ok(v) => self.sensor_api.post_now(s.sensor.sensor_id, v.0),
Err(e) => {
// Record an error errors if the sensor is not removable
// or we get a unexpected error from a removable sensor
if !(matches!(
s.ty,
ChannelType::Removable
| ChannelType::RemovableAndErrorProne
) && e
== SensorReadError::I2cError(
ResponseCode::NoDevice,
))
{
ringbuf_entry!(Trace::SensorReadFailed(
s.sensor.sensor_id,
e
));
self.err_blackbox.push(s.sensor.sensor_id, e);
}
self.sensor_api.nodata_now(s.sensor.sensor_id, e.into())

So, only the ringbuf must be changed to make this less misleading.

Fixes #2385

Presently, the `thermal` task's ringbuf attempts to gaslight the reader
by presenting totally fake temperatures as though they were real
measurements from a sensor. This is because the thermal loop contains
logic which attempts to extrapolate a worst-case temperature projection
based on the last temperature reading received from a device and the
time which has elapsed since that reading:
https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L508-L526

This logic is not _wrong_, per se. Extrapolating a worst-case
temperature is the correct behavior in the absence of sensor data, even
when it leads to a thermal shutdown which is not actually necessary: the
thermal loop is correct to assume the worst in the absence of real data.
However, the way we presently present these fake temperatures in the
ringbuf is *very* confusing to the reader, as they appear to be real
life temperature measurements that actually came from the device.
Recently (see #2385), this reporting of fake temperatures as though they
were real lead @nathanaelhuffman and I to hypothesize about a
manufacturing defect with T6 heatsinks when the actual issue was just
that we couldn't read the NIC's temperature sensor as it had MAPO-ed a
while ago (see #2384).

This commit is my attempt to make the ringbuf entries less misleading.
Now, in addition to recording the worst case estimate that lead to a
thermal control state transition (to the `Critical` or `Uncontrollable`
states), we *also* record the real temperature reading from the device
alongside it. The names of the ringbuf fields now attempt to indicate
which is a real temperature and which is a worst-case estimate.

Note that the fake temperatures are only presented to a reader as though
they were real in the thermal loop's ringbuf; they are _not_ sent to the
`sensors` task and will not be reported externally through the
`control-plane-agent`'s `component_details` or `read_sensor_value` APIs.
Instead, we will just report `nodata_now` to the `sensors` task when
encountering an error reading a sensor measurement --- which is correct.
See:
https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L881-L901

So, only the ringbuf must be changed to make this less misleading.

Fixes #2385
@hawkw hawkw self-assigned this Feb 12, 2026
pub(crate) struct WorstCaseTemperature {
/// The worst-case temperature estimate from the thermal model, projected
/// from the `last_reading`.
worst_case_temp: Celsius,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +113 to 128
/// The last actual temperature measurement reported by a sensor.
///
/// This is recorded after every [`CriticalDueTo`] or [`PowerDownDueTo`]
/// entry so that the last known real life temperature can be compared to
/// the worst-case temperature projection that caused a thermal loop state
/// transition.
#[count(skip)]
LastRealTemperature {
sensor_id: SensorId,
/// The most recent real life (not fake) temperature measurement from
/// the sensor.
temperature: units::Celsius,
/// The (approximate) time, in seconds, since the real life temperature
/// measurement was received.
age_s: f32,
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split this into its own ringbuf entries because adding two more 32-bit fields to CriticalDoTo and PowerDownDueTo would have probably made them the largest ringbuf entry and made each ringbuf slot a word bigger, which I wanted to avoid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Thermal loop doesn't distinguish between "real" and "fake" temperatures

1 participant