fix: rollback malformed handshake consumption#143
Conversation
Snapshot handshake consumption state before each state-machine step and roll back transcript, DTLS 1.2 peer sequence, and tentative queued handshakes when a transient parse error rejects the message body. Add DTLS 1.2 and DTLS 1.3 regression tests for malformed ClientHello extensions followed by clean retransmissions. Co-Authored-By: Claude <noreply@anthropic.com>
zRedShift
left a comment
There was a problem hiding this comment.
Two cases look like blockers to me:
rollback_handshake_progress()decides whether to drop anIncomingby looking only atincoming.first().first_handshake().
That misses packets where the malformed/consumed handshake is later in the same incoming packet. For example:
- DTLS 1.2
CCS + Finished: the first record is CCS, so rollback keeps the packet even ifFinishedwas marked handled before body parse failed. - DTLS 1.3 full-flight retransmits: if
ServerHellowas already consumed and a later handshake fails, the incoming still starts with the stale/handled first handshake. A clean retransmitted full flight can then be dropped before the expected later handshake is visible.
So I think rollback needs to scan all records/handshakes in the Incoming, or otherwise rebuild/drop based on the first unprocessed handshake, not only the physical first record.
- The DTLS 1.3 HRR path can still mutate client-local state before a later transient parse error.
In await_server_hello, hrr_selected_group can be written while parsing KeyShare, then a later malformed Cookie extension returns a transient parse error. The new rollback restores engine transcript/queue progress, but it does not restore that local field. A later clean HRR that doesn’t overwrite the group could then use stale data from the rejected packet.
I think the HRR extension data should be parsed into locals and committed only after the whole HRR is validated.
The current tests cover the initial malformed ClientHello server path, but not these mixed/full-flight or HRR-local-state cases.
Scan all handshakes in each incoming datagram when deciding rollback rejection, including mixed records where the handshake is not first. Mark rejected datagrams handled and let the existing purge path recycle buffers back into the reusable pool instead of rebuilding the receive queue. Key queued handshakes by the first unhandled relevant handshake while preserving duplicate resend behavior, and delay DTLS 1.3 HRR client-state commits until validation completes. Extend regressions for non-handshake-leading malformed handshakes and clean HRR retry after a rejected HRR. Co-Authored-By: Claude <noreply@anthropic.com>
zRedShift
left a comment
There was a problem hiding this comment.
Rollback now scans all handshakes in an incoming packet, and HRR parsing now
stages KeyShare/Cookie/SupportedVersions locally before committing client state.
Some additional issues came up:
- DTLS 1.3 rollback can still drop valid app data
Lines 870 to 884 in 7f64bdc
Lines 2549 to 2561 in 7f64bdc
Lines 554 to 575 in 7f64bdc
Lines 221 to 229 in 7f64bdc
rollback_handshake_progress() now drops an entire Incoming when any
handshake in it is at or beyond the snapshot sequence, and
mark_incoming_handled() marks non-handshake records handled too.
In post-handshake DTLS 1.3, this can consume a valid application-data record
coalesced with a malformed KeyUpdate before poll_output() ever gets a chance
to deliver it.
I think rollback needs to preserve non-handshake records after app data is
released, or only mark the rejected handshake fragments handled. A regression
should cover:
- valid application data + malformed KeyUpdate in one datagram;
- clean KeyUpdate retransmission afterward;
- the application data is still delivered.
- DTLS 1.2 full-flight retransmits can still be discarded
Lines 281 to 298 in 7f64bdc
Lines 1270 to 1277 in 7f64bdc
Lines 1057 to 1073 in 7f64bdc
Lines 959 to 975 in 7f64bdc
The first_relevant_handshake() scan fixes part of the mixed-flight issue, but
DTLS 1.2 still drops the whole incoming when peer encryption is enabled and the
physical first record is epoch 0.
After CCS has been consumed, a clean retransmitted full flight can start with
stale epoch-0 handshake material and contain the expected encrypted Finished
later in the same datagram. That later Finished never becomes visible because
the incoming is discarded based on the first record.
I think stale plaintext discard needs to be per-record/per-handshake once a
later relevant protected handshake exists. The useful regression is:
- malformed Finished after CCS;
- clean
CCS + Finishedretransmit; - clean full
ClientKeyExchange + CCS + Finishedretransmit.
- HRR KeyShare + bad Cookie is still under-proved
Lines 486 to 536 in 7f64bdc
Lines 119 to 155 in 7f64bdc
Lines 557 to 569 in 7f64bdc
The code shape looks right now, but the test only proves malformed Cookie retry.
It does not force a KeyShare extension before the bad Cookie, so it does not
prove that hrr_selected_group is not poisoned by the rejected HRR.
I would add a test where HRR contains KeyShare + malformed Cookie, assert no CH2
is sent, then replay the clean HRR and assert CH2 is emitted.
Two related transactionality concerns also showed up in review: valid ACK
prefixes can mutate retransmit state before a later malformed handshake, and
SRTP extension parsing can commit local state before a later extension parse
error. I would not pull those into this PR automatically if the intended scope
is only handshake queue rollback, but they are the same shape of issue and
probably worth tracking separately.
Summary
Fixes #142.
Snapshot handshake consumption state before each state-machine step, then roll back transcript bytes, DTLS 1.2 peer sequence advancement, and tentative queued handshakes when a transient parse error rejects a malformed handshake body.
This keeps malformed UDP input discard-only: corrupt ClientHello extensions no longer poison the transcript or make clean retransmissions look stale/duplicate.
Add DTLS 1.2 and DTLS 1.3 regression coverage for malformed ClientHello extensions followed by clean retransmissions.