thermal: distinguish between real and made up temperatures by hawkw · Pull Request #2386 · oxidecomputer/hubris

hawkw · 2026-02-12T18:26:20Z

Presently, the thermal task's ringbuf attempts to gaslight the reader
by presenting totally fake temperatures as though they were real
measurements from a sensor. This is because the thermal loop contains
logic which attempts to extrapolate a worst-case temperature projection
based on the last temperature reading received from a device and the
time which has elapsed since that reading:

hubris/task/thermal/src/control.rs

Lines 508 to 526 in ddfb878

    
               /// Returns the worst-case temperature, given a current time and thermal 
        
               /// model for this part. 
        
               /// 
        
               /// This only matters when samples are dropped or if there is significant 
        
               /// lag in the sensors system; if we received a reading on this control 
        
               /// cycle, then time_ms ≈ now_ms, so this is close to v.value (i.e. the most 
        
               /// recent reading). 
        
               /// 
        
               /// Typically, time_ms is earlier (less) than now_ms, so this subtraction is 
        
               /// safe.  If there's invalid data in the sensors task (i.e. readings 
        
               /// claiming to be from the future), then this will saturate instead of 
        
               /// underflowing. 
        
               fn worst_case(&self, now_ms: u64, model: &ThermalProperties) -> Celsius { 
        
                   Celsius( 
        
                       self.value.0 
        
                           + now_ms.saturating_sub(self.time_ms) as f32 / 1000.0 
        
                               * model.temperature_slew_deg_per_sec, 
        
                   ) 
        
               }

This logic is not wrong, per se. Extrapolating a worst-case
temperature is the correct behavior in the absence of sensor data, even
when it leads to a thermal shutdown which is not actually necessary: the
thermal loop is correct to assume the worst in the absence of real data.
However, the way we presently present these fake temperatures in the
ringbuf is very confusing to the reader, as they appear to be real
life temperature measurements that actually came from the device.
Recently (see #2385), this reporting of fake temperatures as though they
were real lead @nathanaelhuffman and I to hypothesize about a
manufacturing defect with T6 heatsinks when the actual issue was just
that we couldn't read the NIC's temperature sensor as it had MAPO-ed a
while ago (see #2384).

This commit is my attempt to make the ringbuf entries less misleading.
Now, in addition to recording the worst case estimate that lead to a
thermal control state transition (to the Critical or Uncontrollable
states), we also record the real temperature reading from the device
alongside it. The names of the ringbuf fields now attempt to indicate
which is a real temperature and which is a worst-case estimate.

Note that the fake temperatures are only presented to a reader as though
they were real in the thermal loop's ringbuf; they are not sent to the
sensors task and will not be reported externally through the
control-plane-agent's component_details or read_sensor_value APIs.
Instead, we will just report nodata_now to the sensors task when
encountering an error reading a sensor measurement --- which is correct.
See:

hubris/task/thermal/src/control.rs

Lines 881 to 901 in ddfb878

    
           match s.sensor.read_temp(self.i2c_task) { 
        
               Ok(v) => self.sensor_api.post_now(s.sensor.sensor_id, v.0), 
        
               Err(e) => { 
        
                   // Record an error errors if the sensor is not removable 
        
                   // or we get a unexpected error from a removable sensor 
        
                   if !(matches!( 
        
                       s.ty, 
        
                       ChannelType::Removable 
        
                           | ChannelType::RemovableAndErrorProne 
        
                   ) && e 
        
                       == SensorReadError::I2cError( 
        
                           ResponseCode::NoDevice, 
        
                       )) 
        
                   { 
        
                       ringbuf_entry!(Trace::SensorReadFailed( 
        
                           s.sensor.sensor_id, 
        
                           e 
        
                       )); 
        
                       self.err_blackbox.push(s.sensor.sensor_id, e); 
        
                   } 
        
                   self.sensor_api.nodata_now(s.sensor.sensor_id, e.into())

So, only the ringbuf must be changed to make this less misleading.

Fixes #2385

@nathanaelhuffman

Presently, the `thermal` task's ringbuf attempts to gaslight the reader by presenting totally fake temperatures as though they were real measurements from a sensor. This is because the thermal loop contains logic which attempts to extrapolate a worst-case temperature projection based on the last temperature reading received from a device and the time which has elapsed since that reading: https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L508-L526 This logic is not _wrong_, per se. Extrapolating a worst-case temperature is the correct behavior in the absence of sensor data, even when it leads to a thermal shutdown which is not actually necessary: the thermal loop is correct to assume the worst in the absence of real data. However, the way we presently present these fake temperatures in the ringbuf is *very* confusing to the reader, as they appear to be real life temperature measurements that actually came from the device. Recently (see #2385), this reporting of fake temperatures as though they were real lead @nathanaelhuffman and I to hypothesize about a manufacturing defect with T6 heatsinks when the actual issue was just that we couldn't read the NIC's temperature sensor as it had MAPO-ed a while ago (see #2384). This commit is my attempt to make the ringbuf entries less misleading. Now, in addition to recording the worst case estimate that lead to a thermal control state transition (to the `Critical` or `Uncontrollable` states), we *also* record the real temperature reading from the device alongside it. The names of the ringbuf fields now attempt to indicate which is a real temperature and which is a worst-case estimate. Note that the fake temperatures are only presented to a reader as though they were real in the thermal loop's ringbuf; they are _not_ sent to the `sensors` task and will not be reported externally through the `control-plane-agent`'s `component_details` or `read_sensor_value` APIs. Instead, we will just report `nodata_now` to the `sensors` task when encountering an error reading a sensor measurement --- which is correct. See: https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L881-L901 So, only the ringbuf must be changed to make this less misleading. Fixes #2385

hawkw · 2026-02-12T18:27:23Z

task/thermal/src/control.rs

+pub(crate) struct WorstCaseTemperature {
+    /// The worst-case temperature estimate from the thermal model, projected
+    /// from the `last_reading`.
+    worst_case_temp: Celsius,


https://www.youtube.com/watch?v=GM-e46xdcUo

hawkw · 2026-02-12T18:28:33Z

task/thermal/src/main.rs

+    /// The last actual temperature measurement reported by a sensor.
+    ///
+    /// This is recorded after every [`CriticalDueTo`] or [`PowerDownDueTo`]
+    /// entry so that the last known real life temperature can be compared to
+    /// the worst-case temperature projection that caused a thermal loop state
+    /// transition.
+    #[count(skip)]
+    LastRealTemperature {
+        sensor_id: SensorId,
+        /// The most recent real life (not fake) temperature measurement from
+        /// the sensor.
        temperature: units::Celsius,
+        /// The (approximate) time, in seconds, since the real life temperature
+        /// measurement was received.
+        age_s: f32,
    },


I split this into its own ringbuf entries because adding two more 32-bit fields to CriticalDoTo and PowerDownDueTo would have probably made them the largest ringbuf entry and made each ringbuf slot a word bigger, which I wanted to avoid.

hawkw added 2 commits February 12, 2026 10:19

oh cool we can also fit the age of the reading in the ringbuf

34ec66b

hawkw requested review from labbott, nathanaelhuffman and sdonnan February 12, 2026 18:26

hawkw self-assigned this Feb 12, 2026

hawkw commented Feb 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

thermal: distinguish between real and made up temperatures #2386

thermal: distinguish between real and made up temperatures #2386
hawkw wants to merge 2 commits intomasterfrom
eliza/fix-thermal-gaslighting

hawkw commented Feb 12, 2026

Uh oh!

hawkw Feb 12, 2026

Uh oh!

hawkw Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	/// Returns the worst-case temperature, given a current time and thermal
	/// model for this part.
	///
	/// This only matters when samples are dropped or if there is significant
	/// lag in the sensors system; if we received a reading on this control
	/// cycle, then time_ms ≈ now_ms, so this is close to v.value (i.e. the most
	/// recent reading).
	///
	/// Typically, time_ms is earlier (less) than now_ms, so this subtraction is
	/// safe. If there's invalid data in the sensors task (i.e. readings
	/// claiming to be from the future), then this will saturate instead of
	/// underflowing.
	fn worst_case(&self, now_ms: u64, model: &ThermalProperties) -> Celsius {
	Celsius(
	self.value.0
	+ now_ms.saturating_sub(self.time_ms) as f32 / 1000.0
	* model.temperature_slew_deg_per_sec,
	)
	}

	match s.sensor.read_temp(self.i2c_task) {
	Ok(v) => self.sensor_api.post_now(s.sensor.sensor_id, v.0),
	Err(e) => {
	// Record an error errors if the sensor is not removable
	// or we get a unexpected error from a removable sensor
	if !(matches!(
	s.ty,
	ChannelType::Removable
	\| ChannelType::RemovableAndErrorProne
	) && e
	== SensorReadError::I2cError(
	ResponseCode::NoDevice,
	))
	{
	ringbuf_entry!(Trace::SensorReadFailed(
	s.sensor.sensor_id,
	e
	));
	self.err_blackbox.push(s.sensor.sensor_id, e);
	}
	self.sensor_api.nodata_now(s.sensor.sensor_id, e.into())

Conversation

hawkw commented Feb 12, 2026

Uh oh!

hawkw Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

hawkw Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant