thermal: distinguish between real and made up temperatures #2386
Open
thermal: distinguish between real and made up temperatures #2386
Conversation
Presently, the `thermal` task's ringbuf attempts to gaslight the reader by presenting totally fake temperatures as though they were real measurements from a sensor. This is because the thermal loop contains logic which attempts to extrapolate a worst-case temperature projection based on the last temperature reading received from a device and the time which has elapsed since that reading: https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L508-L526 This logic is not _wrong_, per se. Extrapolating a worst-case temperature is the correct behavior in the absence of sensor data, even when it leads to a thermal shutdown which is not actually necessary: the thermal loop is correct to assume the worst in the absence of real data. However, the way we presently present these fake temperatures in the ringbuf is *very* confusing to the reader, as they appear to be real life temperature measurements that actually came from the device. Recently (see #2385), this reporting of fake temperatures as though they were real lead @nathanaelhuffman and I to hypothesize about a manufacturing defect with T6 heatsinks when the actual issue was just that we couldn't read the NIC's temperature sensor as it had MAPO-ed a while ago (see #2384). This commit is my attempt to make the ringbuf entries less misleading. Now, in addition to recording the worst case estimate that lead to a thermal control state transition (to the `Critical` or `Uncontrollable` states), we *also* record the real temperature reading from the device alongside it. The names of the ringbuf fields now attempt to indicate which is a real temperature and which is a worst-case estimate. Note that the fake temperatures are only presented to a reader as though they were real in the thermal loop's ringbuf; they are _not_ sent to the `sensors` task and will not be reported externally through the `control-plane-agent`'s `component_details` or `read_sensor_value` APIs. Instead, we will just report `nodata_now` to the `sensors` task when encountering an error reading a sensor measurement --- which is correct. See: https://github.com/oxidecomputer/hubris/blob/ddfb8782336a86cf5c04e48eedc05d6a0ee25756/task/thermal/src/control.rs#L881-L901 So, only the ringbuf must be changed to make this less misleading. Fixes #2385
hawkw
commented
Feb 14, 2026
| pub(crate) struct WorstCaseTemperature { | ||
| /// The worst-case temperature estimate from the thermal model, projected | ||
| /// from the `last_reading`. | ||
| worst_case_temp: Celsius, |
Member
Author
There was a problem hiding this comment.
Comment on lines
+113
to
128
| /// The last actual temperature measurement reported by a sensor. | ||
| /// | ||
| /// This is recorded after every [`CriticalDueTo`] or [`PowerDownDueTo`] | ||
| /// entry so that the last known real life temperature can be compared to | ||
| /// the worst-case temperature projection that caused a thermal loop state | ||
| /// transition. | ||
| #[count(skip)] | ||
| LastRealTemperature { | ||
| sensor_id: SensorId, | ||
| /// The most recent real life (not fake) temperature measurement from | ||
| /// the sensor. | ||
| temperature: units::Celsius, | ||
| /// The (approximate) time, in seconds, since the real life temperature | ||
| /// measurement was received. | ||
| age_s: f32, | ||
| }, |
Member
Author
There was a problem hiding this comment.
I split this into its own ringbuf entries because adding two more 32-bit fields to CriticalDoTo and PowerDownDueTo would have probably made them the largest ringbuf entry and made each ringbuf slot a word bigger, which I wanted to avoid.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Presently, the
thermaltask's ringbuf attempts to gaslight the readerby presenting totally fake temperatures as though they were real
measurements from a sensor. This is because the thermal loop contains
logic which attempts to extrapolate a worst-case temperature projection
based on the last temperature reading received from a device and the
time which has elapsed since that reading:
hubris/task/thermal/src/control.rs
Lines 508 to 526 in ddfb878
This logic is not wrong, per se. Extrapolating a worst-case
temperature is the correct behavior in the absence of sensor data, even
when it leads to a thermal shutdown which is not actually necessary: the
thermal loop is correct to assume the worst in the absence of real data.
However, the way we presently present these fake temperatures in the
ringbuf is very confusing to the reader, as they appear to be real
life temperature measurements that actually came from the device.
Recently (see #2385), this reporting of fake temperatures as though they
were real lead @nathanaelhuffman and I to hypothesize about a
manufacturing defect with T6 heatsinks when the actual issue was just
that we couldn't read the NIC's temperature sensor as it had MAPO-ed a
while ago (see #2384).
This commit is my attempt to make the ringbuf entries less misleading.
Now, in addition to recording the worst case estimate that lead to a
thermal control state transition (to the
CriticalorUncontrollablestates), we also record the real temperature reading from the device
alongside it. The names of the ringbuf fields now attempt to indicate
which is a real temperature and which is a worst-case estimate.
Note that the fake temperatures are only presented to a reader as though
they were real in the thermal loop's ringbuf; they are not sent to the
sensorstask and will not be reported externally through thecontrol-plane-agent'scomponent_detailsorread_sensor_valueAPIs.Instead, we will just report
nodata_nowto thesensorstask whenencountering an error reading a sensor measurement --- which is correct.
See:
hubris/task/thermal/src/control.rs
Lines 881 to 901 in ddfb878
So, only the ringbuf must be changed to make this less misleading.
Fixes #2385