Skip to content

fix: rollback malformed handshake consumption#143

Open
algesten wants to merge 2 commits into
mainfrom
fix-handshake-validate-before-consume
Open

fix: rollback malformed handshake consumption#143
algesten wants to merge 2 commits into
mainfrom
fix-handshake-validate-before-consume

Conversation

@algesten

Copy link
Copy Markdown
Owner

Summary

Fixes #142.

Snapshot handshake consumption state before each state-machine step, then roll back transcript bytes, DTLS 1.2 peer sequence advancement, and tentative queued handshakes when a transient parse error rejects a malformed handshake body.

This keeps malformed UDP input discard-only: corrupt ClientHello extensions no longer poison the transcript or make clean retransmissions look stale/duplicate.

Add DTLS 1.2 and DTLS 1.3 regression coverage for malformed ClientHello extensions followed by clean retransmissions.

Snapshot handshake consumption state before each state-machine step and roll back transcript, DTLS 1.2 peer sequence, and tentative queued handshakes when a transient parse error rejects the message body.

Add DTLS 1.2 and DTLS 1.3 regression tests for malformed ClientHello extensions followed by clean retransmissions.

Co-Authored-By: Claude <noreply@anthropic.com>

@zRedShift zRedShift left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two cases look like blockers to me:

  1. rollback_handshake_progress() decides whether to drop an Incoming by looking only at incoming.first().first_handshake().

That misses packets where the malformed/consumed handshake is later in the same incoming packet. For example:

  • DTLS 1.2 CCS + Finished: the first record is CCS, so rollback keeps the packet even if Finished was marked handled before body parse failed.
  • DTLS 1.3 full-flight retransmits: if ServerHello was already consumed and a later handshake fails, the incoming still starts with the stale/handled first handshake. A clean retransmitted full flight can then be dropped before the expected later handshake is visible.

So I think rollback needs to scan all records/handshakes in the Incoming, or otherwise rebuild/drop based on the first unprocessed handshake, not only the physical first record.

  1. The DTLS 1.3 HRR path can still mutate client-local state before a later transient parse error.

In await_server_hello, hrr_selected_group can be written while parsing KeyShare, then a later malformed Cookie extension returns a transient parse error. The new rollback restores engine transcript/queue progress, but it does not restore that local field. A later clean HRR that doesn’t overwrite the group could then use stale data from the rejected packet.

I think the HRR extension data should be parsed into locals and committed only after the whole HRR is validated.

The current tests cover the initial malformed ClientHello server path, but not these mixed/full-flight or HRR-local-state cases.

Scan all handshakes in each incoming datagram when deciding rollback rejection, including mixed records where the handshake is not first.

Mark rejected datagrams handled and let the existing purge path recycle buffers back into the reusable pool instead of rebuilding the receive queue.

Key queued handshakes by the first unhandled relevant handshake while preserving duplicate resend behavior, and delay DTLS 1.3 HRR client-state commits until validation completes.

Extend regressions for non-handshake-leading malformed handshakes and clean HRR retry after a rejected HRR.

Co-Authored-By: Claude <noreply@anthropic.com>

@zRedShift zRedShift left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rollback now scans all handshakes in an incoming packet, and HRR parsing now
stages KeyShare/Cookie/SupportedVersions locally before committing client state.

Some additional issues came up:

  1. DTLS 1.3 rollback can still drop valid app data

dimpl/src/dtls13/engine.rs

Lines 870 to 884 in 7f64bdc

pub(crate) fn rollback_handshake_progress(&mut self, snapshot: HandshakeProgressSnapshot) {
self.transcript.resize(snapshot.transcript_len, 0);
for incoming in self.queue_rx.iter() {
let reject_incoming = incoming
.records()
.iter()
.flat_map(|r| r.handshakes())
.any(|h| h.header.message_seq >= snapshot.peer_handshake_seq_no);
if reject_incoming {
mark_incoming_handled(incoming);
}
}
}

dimpl/src/dtls13/engine.rs

Lines 2549 to 2561 in 7f64bdc

fn mark_incoming_handled(incoming: &Incoming) {
for record in incoming.records().iter() {
if record.handshakes().is_empty() {
if !record.is_handled() {
record.set_handled();
}
} else {
for handshake in record.handshakes() {
handshake.set_handled();
}
}
}
}

dimpl/src/dtls13/engine.rs

Lines 554 to 575 in 7f64bdc

fn poll_app_data<'a>(&mut self, buf: &'a mut [u8]) -> Result<&'a [u8], &'a mut [u8]> {
if !self.release_app_data {
return Err(buf);
}
let mut unhandled = self
.queue_rx
.iter()
.flat_map(|i| i.records().iter())
.filter(|r| r.record().content_type == ContentType::ApplicationData)
.skip_while(|r| r.is_handled());
let Some(next) = unhandled.next() else {
return Err(buf);
};
let record_buffer = next.buffer();
let fragment = next.record().fragment(record_buffer);
let len = fragment.len();
assert!(
len <= buf.len(),

dimpl/src/dtls13/client.rs

Lines 221 to 229 in 7f64bdc

pub fn handle_packet(&mut self, packet: &[u8]) -> Result<(), Error> {
match self
.engine
.parse_packet(packet)
.and_then(|_| self.make_progress())
{
Ok(()) => Ok(()),
Err(e) => e.into_public_error().map_or(Ok(()), Err),
}

rollback_handshake_progress() now drops an entire Incoming when any
handshake in it is at or beyond the snapshot sequence, and
mark_incoming_handled() marks non-handshake records handled too.

In post-handshake DTLS 1.3, this can consume a valid application-data record
coalesced with a malformed KeyUpdate before poll_output() ever gets a chance
to deliver it.

I think rollback needs to preserve non-handshake records after app data is
released, or only mark the rejected handshake fragments handled. A regression
should cover:

  • valid application data + malformed KeyUpdate in one datagram;
  • clean KeyUpdate retransmission afterward;
  • the application data is still delivered.
  1. DTLS 1.2 full-flight retransmits can still be discarded

dimpl/src/dtls12/engine.rs

Lines 281 to 298 in 7f64bdc

mark_stale_handshakes(&incoming, self.peer_handshake_seq_no);
let Some(handshake) = first_relevant_handshake(&incoming, self.peer_handshake_seq_no)
else {
return Ok(());
};
let key_current = (
handshake.header.message_seq,
handshake.header.fragment_offset,
);
if self.peer_encryption_enabled && incoming.first().record().sequence.epoch == 0 {
// Keep old plaintext handshake records available long enough to
// trigger flight resends above, but never queue or process them as
// new messages after peer encryption is enabled.
return Ok(());
}

dimpl/src/dtls12/engine.rs

Lines 1270 to 1277 in 7f64bdc

fn first_relevant_handshake(incoming: &Incoming, peer_handshake_seq_no: u16) -> Option<&Handshake> {
incoming
.records()
.iter()
.flat_map(|r| r.handshakes())
.filter(|h| !h.is_handled())
.find(|h| h.header.message_seq >= peer_handshake_seq_no)
}

dimpl/src/dtls12/client.rs

Lines 1057 to 1073 in 7f64bdc

fn await_change_cipher_spec(self, client: &mut Client) -> Result<Self, InternalError> {
let maybe = client.engine.next_record(ContentType::ChangeCipherSpec);
let Some(_) = maybe else {
// Stay in same state
return Ok(self);
};
// Drop any extra CCS resends to avoid being blocked
trace!("Dropping any pending CCS resends from peer");
client.engine.drop_pending_ccs();
// Expect every record to be decrypted from now on.
trace!("Received ChangeCipherSpec; enabling peer encryption");
client.engine.enable_peer_encryption()?;
Ok(Self::AwaitNewSessionTicket)

dimpl/src/dtls12/server.rs

Lines 959 to 975 in 7f64bdc

fn await_change_cipher_spec(self, server: &mut Server) -> Result<Self, InternalError> {
let maybe = server.engine.next_record(ContentType::ChangeCipherSpec);
let Some(_) = maybe else {
// Stay in same state
return Ok(self);
};
// Drop any extra CCS resends to avoid being blocked
trace!("Dropping any pending CCS resends from peer");
server.engine.drop_pending_ccs();
// Expect every record to be decrypted from now on.
trace!("Received ChangeCipherSpec; enabling peer encryption");
server.engine.enable_peer_encryption()?;
Ok(Self::AwaitFinished)

The first_relevant_handshake() scan fixes part of the mixed-flight issue, but
DTLS 1.2 still drops the whole incoming when peer encryption is enabled and the
physical first record is epoch 0.

After CCS has been consumed, a clean retransmitted full flight can start with
stale epoch-0 handshake material and contain the expected encrypted Finished
later in the same datagram. That later Finished never becomes visible because
the incoming is discarded based on the first record.

I think stale plaintext discard needs to be per-record/per-handshake once a
later relevant protected handshake exists. The useful regression is:

  • malformed Finished after CCS;
  • clean CCS + Finished retransmit;
  • clean full ClientKeyExchange + CCS + Finished retransmit.
  1. HRR KeyShare + bad Cookie is still under-proved

dimpl/src/dtls13/client.rs

Lines 486 to 536 in 7f64bdc

let mut hrr_selected_group = None;
let mut saved_cookie = None;
let mut hrr_version_ok = false;
if let Some(ref extensions) = server_hello.extensions {
for ext in extensions {
match ext.extension_type {
ExtensionType::KeyShare => {
let ext_data = ext.extension_data(&client.defragment_buffer);
let (_, hrr_ks) = KeyShareHelloRetryRequest::parse(ext_data)
.map_err(InternalError::from)?;
hrr_selected_group = Some(hrr_ks.selected_group);
}
ExtensionType::Cookie => {
let ext_data = ext.extension_data(&client.defragment_buffer);
parse_cookie_extension(ext_data).map_err(InternalError::from)?;
let mut cookie = Buf::new();
cookie.extend_from_slice(ext_data);
saved_cookie = Some(cookie);
}
ExtensionType::SupportedVersions => {
let ext_data = ext.extension_data(&client.defragment_buffer);
let (_, sv) = SupportedVersionsServerHello::parse(ext_data)
.map_err(InternalError::from)?;
hrr_version_ok = sv.selected_version == ProtocolVersion::DTLS1_3;
}
_ => {}
}
}
}
// Validate HRR cipher suite
if !client
.engine
.is_cipher_suite_allowed(server_hello.cipher_suite)
{
return Err((Error::SecurityError(
crate::SecurityError::HrrSelectedDisallowedCipherSuite,
))
.into());
}
if !hrr_version_ok {
return Err(
(Error::SecurityError(crate::SecurityError::HrrDidNotSelectDtls13)).into(),
);
}
client.engine.set_cipher_suite(server_hello.cipher_suite);
client.hrr_selected_group = hrr_selected_group;
client.saved_cookie = saved_cookie;

server.handle_timeout(now).expect("server timeout");
let server_out = drain_outputs(&mut server);
let hrr = server_out
.packets
.into_iter()
.next()
.expect("server should emit HRR");
let mut malformed_hrr = hrr.clone();
assert!(
shrink_dtls13_cookie_extension_inner_len(&mut malformed_hrr),
"fixture should contain a Cookie extension"
);
client
.handle_packet(&malformed_hrr)
.expect("malformed HRR Cookie extension should be discarded");
client
.handle_timeout(now)
.expect("client timeout after error");
let client_out = drain_outputs(&mut client);
assert!(
client_out.packets.is_empty(),
"client must not send CH2 after malformed HRR Cookie"
);
client
.handle_packet(&hrr)
.expect("clean HRR retransmission should be accepted");
client
.handle_timeout(now)
.expect("client timeout after clean HRR");
let client_out = drain_outputs(&mut client);
assert!(
!client_out.packets.is_empty(),
"client should send CH2 after clean HRR retransmission"
);

dimpl/src/dtls13/server.rs

Lines 557 to 569 in 7f64bdc

// Determine which group to request if key_share is missing
let hrr_group = if !has_matching_key_share {
let client_groups = client_supported_groups
.as_ref()
.map(|v| v.as_slice())
.unwrap_or(&[]);
our_groups
.iter()
.find(|g| client_groups.contains(g))
.copied()
} else {
None
};

The code shape looks right now, but the test only proves malformed Cookie retry.
It does not force a KeyShare extension before the bad Cookie, so it does not
prove that hrr_selected_group is not poisoned by the rejected HRR.

I would add a test where HRR contains KeyShare + malformed Cookie, assert no CH2
is sent, then replay the clean HRR and assert CH2 is emitted.

Two related transactionality concerns also showed up in review: valid ACK
prefixes can mutate retransmit state before a later malformed handshake, and
SRTP extension parsing can commit local state before a later extension parse
error. I would not pull those into this PR automatically if the intended scope
is only handshake queue rollback, but they are the same shape of issue and
probably worth tracking separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Malformed handshake extensions poison the transcript: validate full message before consuming/transcript-append

2 participants