The Coded Frame Processing algorithm does not define what happens when, during mid-stream appending, the incoming coded frame's presentation timestamp coincides (exactly, or within the 1-microsecond tolerance the algorithm already recognizes elsewhere) with an existing coded frame's presentation timestamp in the track buffer.
The gap is reachable only when all three of the following hold at the time the incoming frame is processed:
- last decode timestamp is set, so the "last decode timestamp unset" overlap check is gated out.
- highest end timestamp is set and strictly greater than presentation timestamp, so neither branch of "Remove existing coded frames in track buffer" fires (the first branch requires highest end timestamp unset; the second requires highest end timestamp ≤ presentation timestamp).
- An existing coded frame has presentation timestamp equal (or within 1 microsecond of) the incoming coded frame's.
Under those three conditions, none of steps 1.13, 1.14, or 1.15 targets the colliding existing frame. The "Add the coded frame" step then inserts the incoming frame alongside it, leaving two coded frames at the same presentation timestamp. The track buffer model and the "Add the coded frame" algorithm do not currently specify what happens when an incoming coded frame's presentation timestamp compares equal (including within the algorithm's 1-microsecond tolerance) to an existing coded frame's presentation timestamp.
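The three conditions can be read as a single predicate. The following is a minimal sketch, not spec text: `TrackState`, `reaches_gap`, and the use of `None` to model "unset" are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

TOLERANCE = 1e-6  # 1 microsecond, expressed in seconds


@dataclass
class TrackState:
    last_decode_timestamp: Optional[float]  # None models "unset"
    highest_end_timestamp: Optional[float]  # None models "unset"


def reaches_gap(state: TrackState, buffered_pts: Sequence[float],
                incoming_pts: float) -> bool:
    """Return True when all three triggering conditions hold."""
    if state.last_decode_timestamp is None:
        return False  # the "last decode timestamp unset" overlap check runs instead
    if state.highest_end_timestamp is None:
        return False  # first removal branch applies
    if state.highest_end_timestamp <= incoming_pts:
        return False  # second removal branch applies
    # Third condition: some buffered frame collides within the tolerance.
    return any(abs(pts - incoming_pts) <= TOLERANCE for pts in buffered_pts)


state = TrackState(last_decode_timestamp=1.120, highest_end_timestamp=1.200)
print(reaches_gap(state, [1.080, 1.040, 1.120, 1.160], 1.040))  # True
```

The values in the usage line are the ones from the worked example in this issue.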
Two implementations making opposite choices — one enforcing a single coded frame per presentation timestamp (by replacing the existing one, dropping the incoming one, or signalling an error) and one allowing coded frames with identical presentation timestamps to coexist — are both conformant under the current text, and produce different user-visible behaviour on the same content.
Downstream algorithms keyed on presentation timestamp (seeking, buffered reporting, duration change) return ambiguous results when two coded frames share the same presentation timestamp.
abort() and other paths that run the Reset Parser State algorithm unset last decode timestamp and highest end timestamp (spec §"Reset Parser State", bullets 2, 3, and 4), so any incoming frame after such a reset goes through the "last decode timestamp unset" overlap check, which resolves the collision correctly via the 1-microsecond window. Those paths do not reach this gap. Only continuous mid-stream appending, or other paths that leave last decode timestamp and highest end timestamp both set, can produce the collision.
Triggering shapes include:
- Fragmented MP4 whose edit-list or mid-stream composition-time-offset structure lands an incoming frame's effective presentation timestamp on an already-buffered frame's presentation timestamp, across a fragment boundary but within a single coded frame group (no DTS discontinuity).
- Open-GOP / bidirectional-prediction content where a new GOP's random-access point has a presentation timestamp coinciding with a still-buffered frame from the prior GOP.
- Content produced by segmenters that, for alignment or splicing reasons, emit overlapping presentation timestamps at fragment boundaries without resetting state.
This is a separate spec gap from #187 (which is scoped to "SAP Type 2" decode-shadowed orphans and does not cover presentation-timestamp coincidence at positions where the existing cleanup branches do not fire).
Worked example
Timestamps are seconds-valued doubles (the representation the algorithm defines). Frame durations are 40 ms; no two presentation intervals overlap.
Segment 1 — two adjacent GOPs already in the track buffer:
| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type | Presentation Interval |
|---|---|---|---|---|---|
| F1 | 1.000 | 1.080 | 0.040 | sync (I) | [1.080, 1.120) |
| F2 | 1.040 | 1.040 | 0.040 | non-sync | [1.040, 1.080) |
| F3 | 1.080 | 1.120 | 0.040 | sync (next-GOP I) | [1.120, 1.160) |
| F4 | 1.120 | 1.160 | 0.040 | non-sync | [1.160, 1.200) |
The MSE algorithm distinguishes only sync vs non-sync frames; the codec-side interpretation of why each frame's presentation timestamp stands in a given relationship to its decode timestamp is immaterial to the algorithm's behaviour. Any frame shape that satisfies the three triggering conditions (listed above) reaches this gap.
After Segment 1 (no reset has run since): last decode timestamp = 1.120, last frame duration = 0.040, highest end timestamp = 1.200, need random access point flag = false (cleared when F1 was processed).
Segment 2 — the incoming coded frame, with presentation timestamp colliding with F2's:
| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type |
|---|---|---|---|---|
| incoming | 1.160 | 1.040 | 0.040 | sync |
The incoming frame's decode timestamp (1.160) > last decode timestamp (1.120), and 1.160 − 1.120 = 0.040 is not greater than 2 × last frame duration = 0.080 — so the DTS-discontinuity step does not fire. No state variables are reset. The incoming frame's presentation timestamp (1.040) equals F2's.
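The discontinuity arithmetic above can be checked mechanically. This is a sketch under the assumption that the discontinuity step fires when the decode timestamp moves backwards or jumps by more than twice the last frame duration; `dts_discontinuity` is a hypothetical helper name.

```python
def dts_discontinuity(last_dts: float, last_frame_duration: float,
                      dts: float) -> bool:
    """True when the decode timestamp goes backwards or jumps too far ahead."""
    backwards = dts < last_dts
    too_far = (dts - last_dts) > 2 * last_frame_duration
    return backwards or too_far


# Values from the worked example: the step does not fire.
print(dts_discontinuity(last_dts=1.120, last_frame_duration=0.040, dts=1.160))  # False
```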
Current spec trace
- need random access point flag is false (cleared when F1 was processed earlier in Segment 1) → no gate.
- last decode timestamp is set → skip overlap check.
- Remove existing coded frames:
  - First branch requires highest end timestamp unset; it is set (1.200). Skip.
  - Second branch requires highest end timestamp ≤ presentation timestamp, i.e., 1.200 ≤ 1.040. False. Skip.
- Remove decoding dependencies: no frames removed above → no-op.
- Add the coded frame: the incoming frame is added.
Resulting buffer (presentation order):
F2 (pres=1.040, dur=0.040, non-sync) ┐
incoming (pres=1.040, dur=0.040, sync) ┘ both at presentation timestamp 1.040
F1 (pres=1.080, dur=0.040, sync)
F3 (pres=1.120, dur=0.040, sync)
F4 (pres=1.160, dur=0.040, non-sync)
F2 and the incoming frame share presentation timestamp 1.040. Algorithms that refer to "the coded frame in track buffer with a presentation interval that contains t" (used by the overlap step and by several sibling algorithms) presuppose exactly one such coded frame, but two now have a presentation interval starting at 1.040. Conformant implementations resolve this differently, producing different user-visible behaviour on the same content.
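The current-spec trace can be reproduced with a small model. This is an illustrative sketch only: `Frame` and `append_current` are hypothetical names, the track buffer is modelled as a plain list, and the function assumes last decode timestamp is set, so the overlap check is skipped as in the trace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    name: str
    pts: float       # presentation timestamp, seconds
    duration: float  # seconds


def append_current(buffer: List[Frame], frame: Frame,
                   highest_end_timestamp: Optional[float]) -> List[Frame]:
    """Model of the two removal branches followed by the add step."""
    frame_end = frame.pts + frame.duration
    if highest_end_timestamp is None:
        # First branch: drop frames starting in [pts, frame end).
        buffer = [f for f in buffer if not (frame.pts <= f.pts < frame_end)]
    elif highest_end_timestamp <= frame.pts:
        # Second branch: drop frames starting in [highest end, frame end).
        buffer = [f for f in buffer
                  if not (highest_end_timestamp <= f.pts < frame_end)]
    # Neither branch fires when highest end timestamp > pts.
    return sorted(buffer + [frame], key=lambda f: f.pts)


segment_1 = [Frame("F1", 1.080, 0.040), Frame("F2", 1.040, 0.040),
             Frame("F3", 1.120, 0.040), Frame("F4", 1.160, 0.040)]
result = append_current(segment_1, Frame("incoming", 1.040, 0.040), 1.200)
print([f.name for f in result])  # ['F2', 'incoming', 'F1', 'F3', 'F4']
```

Because Python's sort is stable, F2 stays ahead of the incoming frame in the output. A real implementation's ordering of two frames at the same presentation timestamp is exactly what the current text leaves undefined.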
Amended spec trace
The new step runs before "Remove existing coded frames":
Remove all coded frames whose presentation timestamp is within 1 microsecond of 1.040.
Applied to each existing frame:
| Frame | Presentation Timestamp | Within 1 µs of 1.040? | Action |
|---|---|---|---|
| F1 | 1.080 | no | preserved |
| F2 | 1.040 | yes | removed |
| F3 | 1.120 | no | preserved |
| F4 | 1.160 | no | preserved |
Dependency sweep: next random access point after F2 in decode order is F3. Decode-order range strictly between {F2} and F3 is empty; no new removals.
Resulting buffer (presentation order):
incoming (pres=1.040, dur=0.040, sync) ← single frame at presentation timestamp 1.040
F1 (pres=1.080, dur=0.040, sync)
F3 (pres=1.120, dur=0.040, sync)
F4 (pres=1.160, dur=0.040, non-sync)
Exactly one coded frame at presentation timestamp 1.040 — the incoming frame. Downstream algorithms that refer to "the coded frame … containing t = 1.040" have an unambiguous answer.
Implementations converge on identical buffer state.
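The amended trace can be sketched the same way. This is a simplified model, not spec text: `append_amended` is a hypothetical name, frames are `(name, presentation timestamp)` tuples, and the existing removal branches and the dependency sweep are omitted because neither removes anything in this example.

```python
TOLERANCE = 1e-6  # 1 microsecond, expressed in seconds


def append_amended(buffer, incoming):
    """Drop frames colliding with the incoming timestamp, then add the frame."""
    _, incoming_pts = incoming
    kept = [f for f in buffer if abs(f[1] - incoming_pts) > TOLERANCE]
    return sorted(kept + [incoming], key=lambda f: f[1])


segment_1 = [("F1", 1.080), ("F2", 1.040), ("F3", 1.120), ("F4", 1.160)]
result = append_amended(segment_1, ("incoming", 1.040))
print([name for name, _ in result])  # ['incoming', 'F1', 'F3', 'F4']
```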
Proposed amendment
Insert a new step into the per-coded-frame loop, immediately after the existing "If last decode timestamp for track buffer is unset and presentation timestamp falls within the presentation interval of a coded frame in track buffer" step, and before the "Remove existing coded frames in track buffer" step:
<li>Remove all coded frames from |track buffer| whose [=presentation timestamp=] is within 1 microsecond of |presentation timestamp|.
<p class="note">
This uses the same 1-microsecond tolerance as the `last decode timestamp unset` overlap step earlier in this algorithm, and for the same reason: to compensate for minor errors in frame
timestamp computations that can appear when converting back and forth between double precision floating point numbers and rationals. After this step, the track buffer cannot hold two coded
frames sharing the same [=presentation timestamp=], so the subsequent "Add the coded frame … to the track buffer" step unambiguously makes the incoming coded frame the one at that [=presentation
timestamp=].
</p>
</li>
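To illustrate why the note asks for a tolerance rather than exact equality, here is a small sketch. `same_presentation_timestamp` is a hypothetical helper, and the `0.1 + 0.2` sum merely stands in for any timestamp that picked up rounding error during conversion.

```python
TOLERANCE = 1e-6  # 1 microsecond, expressed in seconds


def same_presentation_timestamp(a: float, b: float) -> bool:
    """Compare two seconds-valued timestamps within the 1-microsecond window."""
    return abs(a - b) <= TOLERANCE


accumulated = 0.1 + 0.2  # carries a tiny binary rounding error
print(accumulated == 0.3)                              # False
print(same_presentation_timestamp(accumulated, 0.3))   # True
print(same_presentation_timestamp(1.040, 1.080))       # False
```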
Update the immediately-following "Remove all possible decoding dependencies …" step to extend its sweep to include frames removed by this new step:
- Remove all possible decoding dependencies on the coded frames removed in the previous two steps by removing all coded frames from |track buffer| between those frames removed in the previous
two steps and the next random access point after those removed frames.
+ Remove all possible decoding dependencies on the coded frames removed in the previous three steps by removing all coded frames from |track buffer| between those frames removed in the previous
three steps and the next random access point after those removed frames.
Interaction with #187
If the amendment proposed in #187 lands first, merge the predicate from this issue into the step added there (combining both clauses into a single <li> list of conditions) and keep the dependency-cleanup step's "previous three steps" wording. If this issue lands first, the #187 amendment should do the same in reverse.
Scope and side effects
The step is a no-op when no existing coded frame is within 1 microsecond of the incoming presentation timestamp, which is the common case in continuous appending.
As with the other removal steps, step 1.15's conservative dependency sweep may remove additional non-sync coded frames that lie between the removed frame and the next random access point in decode order. This is a pre-existing property of step 1.15, not introduced by this amendment.
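The conservative sweep can be sketched as follows. This is one reading of "between those frames removed and the next random access point", not spec text: the range is assumed to be strictly between the removed frame and the next sync frame in decode order, and to extend to the end of the buffer when no later sync frame exists. `Frame` and `dependency_sweep` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    dts: float  # decode timestamp, seconds
    sync: bool  # True for a random access point


def dependency_sweep(buffer, removed):
    """Drop frames between each removed frame and the next random access point."""
    survivors = sorted(buffer, key=lambda f: f.dts)
    for gone in removed:
        later = [f for f in survivors if f.dts > gone.dts]
        next_rap = next((f for f in later if f.sync), None)
        limit = next_rap.dts if next_rap else float("inf")
        survivors = [f for f in survivors if not (gone.dts < f.dts < limit)]
    return survivors


f1, f2 = Frame("F1", 1.000, True), Frame("F2", 1.040, False)
f3, f4 = Frame("F3", 1.080, True), Frame("F4", 1.120, False)

# Removing F2 (the worked example): nothing lies between F2 and F3.
print([f.name for f in dependency_sweep([f1, f3, f4], [f2])])  # ['F1', 'F3', 'F4']
# Removing F3 instead: F4 depends on it and is swept as well.
print([f.name for f in dependency_sweep([f1, f2, f4], [f3])])  # ['F1', 'F2']
```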
What this does not cover
Coded frames with the same decode timestamp but different presentation timestamps are not addressed by this amendment. The spec's current storage model tolerates them, and they do not produce cross-implementation divergence of the same kind as the presentation-timestamp collision this issue describes.