Skip to content

vnc: diagnostics for client negotiation and the dirty-rect channel#3628

Closed
bitranox wants to merge 5 commits into
microsoft:mainfrom
bitranox:vnc-diagnostics
Closed

vnc: diagnostics for client negotiation and the dirty-rect channel#3628
bitranox wants to merge 5 commits into
microsoft:mainfrom
bitranox:vnc-diagnostics

Conversation

@bitranox

@bitranox bitranox commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Two diagnostics for the VNC server, both off by default and gated on OPENVMM_LOG. They answer two questions you currently cannot answer without a debugger: what did a connecting viewer actually negotiate, and does the device-to-worker dirty-rect channel ever back up?

Client negotiation report (vnc)

Right now the server logs three booleans about client encodings (zlib, desktop_resize, cursor) and a partial pixel format line. That does not tell you what the viewer offered or what the server picked.

This adds encoding_name, which decodes RFB encoding numbers: the standard encodings, the Tight compression and JPEG quality ranges, the common pseudo-encodings, and the VMware and extended-clipboard vendor blocks. Anything it does not recognize falls back to its number and hex, so nothing a client sends is lost. With that, the server logs once per connection, at debug:

  • the full offered encoding list, by name
  • the selected security type
  • the full requested pixel format
  • what the server will actually use: the wire encoding (zlib or raw), whether the host-native no-convert fast path applies, and which capabilities are honored

The existing info-level lines stay; the detail is at debug.

Dirty-rect channel backlog metric (vnc_worker)

The device-to-worker dirty-rect channel is an unbounded mesh channel, and there was no way to know whether its queue grows. The coordinator now drains it on each wakeup (receive one, then try_recv the rest) and records how many were waiting, which is the backlog at that moment. It keeps a session high-water mark, logs a backlog above one at debug, a backlog of one at trace, and the high-water on channel close.

Behavior does not change: each drained message is still broadcast to clients individually, exactly as if they had arrived one at a time. The only difference is the receive shape, one-at-a-time becomes drain-all.

In practice the queue sits at 1 under normal load, and only grew (to a few hundred) when the worker was CPU-starved during boot. I also ran it with 16 simultaneous clients, 8 negotiating zlib and 8 falling back to raw, and the shared queue barely moved:

Backlog depth Count over 90s (16 clients)
2 441
3 115
4 21
5 1
7 1

It sat at 2 almost the whole time and peaked at 7. The single consumer's fan-out to the per-client channels is non-blocking, so adding clients does not grow the shared queue. That is the relevant data point for the broadcast-ring discussion.

Cost when off

The negotiation report runs once per connection, and the field expressions live inside the tracing macros, so they are not evaluated when the level is off. The backlog count reuses values already computed and only formats inside the macros. Nothing measurable when the levels are off.

Enabling

OPENVMM_LOG=info,vnc=debug,vnc_worker=debug

Add vnc_worker=trace to see each drain confirm a backlog of 1.

Testing

  • Unit tests for encoding_name: known encodings, the ranges, the vendor blocks, and the unknown fallback.
  • The existing vnc and vnc_worker tests pass, including the dirty-channel-close path that the drain change touches.
  • Ran live against a Windows guest with RealVNC and TigerVNC. RealVNC negotiated zlib; TigerVNC fell back to raw because it does not offer zlib(6). The backlog metric matched what is described above.
  • The 16-client run (8 zlib, 8 raw) above confirmed client count does not grow the shared queue.

Copilot AI review requested due to automatic review settings June 2, 2026 13:26
@bitranox bitranox requested a review from a team as a code owner June 2, 2026 13:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves diagnostics and observability for the VNC worker by adding richer debug logging during client negotiation and tracking the dirty-rect channel backlog.

Changes:

  • Added encoding_name helper in rfb.rs to map RFB encoding numbers to human-readable names, with unit tests.
  • Added detailed debug logs in server.rs for security type selection, negotiated pixel format, and offered/negotiated encodings.
  • Reworked the dirty-rect handling in lib.rs to drain queued messages in a batch, track high-water backlog, and log when the producer outpaces the consumer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
workers/vnc_worker/vnc/src/rfb.rs Adds encoding_name decoder for diagnostics, plus unit tests covering known/pseudo/vendor/unknown cases.
workers/vnc_worker/vnc/src/server.rs Emits debug logs for chosen security type, negotiated pixel format, and offered/agreed encodings.
workers/vnc_worker/src/lib.rs Drains the dirty-rect channel into batches to measure backlog while preserving per-message broadcast semantics.

Comment thread workers/vnc_worker/vnc/src/rfb.rs Outdated
Copilot AI review requested due to automatic review settings June 2, 2026 13:58

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread workers/vnc_worker/vnc/src/rfb.rs Outdated
Comment thread workers/vnc_worker/src/lib.rs
bitranox added 2 commits June 2, 2026 16:09
The server logged only a three-flag summary of client encodings (zlib, desktop_resize, cursor) and a partial pixel format line. That is not enough to see what a viewer actually offered or what the server settled on, which matters when comparing viewers and diagnosing compression.

Add encoding_name() to decode RFB encoding numbers: the known encodings, the Tight compression and JPEG quality ranges, the common pseudo-encodings, and the VMware and extended-clipboard vendor blocks, with a numeric and hex fallback so nothing a client sends is lost. Use it to log the complete offered encoding list, the selected security type, the full requested pixel format, and what the server will actually use (wire encoding, whether the host-native no-convert fast path applies, and the honored capabilities).

The report is on the once-per-connection path and the field expressions stay inside the tracing macros, so it costs nothing when the level is off. The full detail is at debug; the existing info summaries stay.
The device-to-worker dirty-rect channel is an unbounded mesh channel, and whether its queue ever grows was unknown. Drain it fully on each wakeup (receive one message, then try_recv the rest) and record how many were queued: that count is the channel backlog. Track a session high-water mark, log a backlog above one at debug, a backlog of one at trace, and the high-water on channel close.

Broadcasting is unchanged: each drained message still goes out individually, so behavior matches receiving them one at a time. This answers the unbounded-memory question for this channel: under normal load the queue stays at one, growing only when the consumer is CPU-starved.
}

#[cfg(test)]
mod tests {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these tests really add much value? Just running a constant through a match statement doesn't feel like it really needs test coverage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I dropped encoding_name_known_and_pseudo, that one really was the match arms written twice.

The other three I'd keep. They aren't constants through a match, they check logic that's easy to get wrong:

  • encoding_name_ranges pins the offset math at the range edges: -256..=-247 => TightCompressionLevel(enc + 256) has to give 0 at -256 and 9 at -247. An off-by-one there is a silent bug.
  • encoding_name_vendor_blocks covers the u32 as i32 round-trip for the high-bit encodings like 0x574d5664, which is exactly where it's easy to slip: writing the value as unsigned hex but matching it in the signed space.
  • encoding_name_unknown_keeps_value makes sure the fallback still prints the raw number and hex, so an unknown encoding stays debuggable.

Comment thread workers/vnc_worker/vnc/src/rfb.rs Outdated
let raw = enc as u32;
// Vendor blocks reserved as hex ranges, outside the contiguous
// signed-number space handled by the match below.
if (0x574d5600..=0x574d56ff).contains(&raw) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these ranges move into the match instead of coming first?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of them can't fold into the enc match cleanly. Three of the four are high-bit values that are negative as i32 (0xc0a1e5ce is -1063163442, 0xffff0000 is -65536), and a range pattern can't take a 0x… as i32 bound, so they'd turn into opaque negative decimals or a named const per bound, either way losing the hex that documents what the block is. Only VMware is positive and would fit, and pulling just that one in would split the vendor blocks across both matches.

So instead I turned the if-chain into its own match raw. The vendor blocks are now a single dispatch on the value where the hex reads naturally, with the signed encodings still in match enc. Still two passes, but it reads as a match now instead of a stack of ifs.

@jstarks

jstarks commented Jun 3, 2026

Copy link
Copy Markdown
Member

does the device-to-worker dirty-rect channel ever back up?

I am wondering whether we care whether it backs up in practice--even if it doesn't, we still need to ensure that it's structurally impossible to back up.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

@bitranox

bitranox commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@jstarks :

In practice it stays at single digits, and the ways it could grow are either self-healing or already stop the producer. It is not a hard constant bound by construction, and I don't think it's worth making it one. Details below.

Measured. I instrumented the channel and left the instrumentation in. The worker drains the whole queue each wakeup, so backlog is the number of messages pulled in one pass: trace at depth 1, debug once it exceeds 1 (carrying a running high-water backlog_max), and a rate-limited error if it ever crosses 1000. The trace lines and the error are in the code, so you can switch them on and watch the channel yourself under any workload rather than relying on my numbers.

  • 1 client, full boot + fullscreen video + guest reboot: peak 5.
  • 16 clients (8 TigerVNC + 8 RealVNC), fullscreen video: steady at 1, peak 7. Fan-out width does not grow the queue.
  • The only real excursion was ~505, once, at guest boot under heavvy host CPU contention. The worker thread was descheduled and then drained it in a single pass.

Why it stays small. The two sides are asymmetric. Producing a message is a vmbus ring read, a packet parse, and a rect map on the device, paced by the guest's report rate. Draining is moving Vecs out of the channel plus an Arc and a non-blocking try_send per client; the encode/zlib/socket write sits behind the per-client bounded(4), not in the drain loop. So the consumer returns to recv() immediately and outpaces the producer except when it is starved of CPU.

Why it can't run away. Two properties already hold:

  • The device only forwards dirt while a client needs updates (the updates_needed/FeatureChange gate). With no demand the producer is silent.
  • If the worker goes away the receiver drops, the channel closes, and the device stops sending (it handles the closed sender). So a dead consumer cannot grow the queue. Only a live-but-descheduled one can, transiently, and it self-heals on the next drain. Each entry is a Vec at 16 bytes per rect, so even the 505 spike was some kilobytes and freed on the next pass.

On "structurally impossible." You're right that it is an unbounded mesh::channel(), so the bound is behavioral (producer rate times scheduling latency), not a constant. Making it constant by construction means a bounded shared buffer: a fixed ring the device copies dirt into on every update, plus overflow and wraparound handling. That puts a copy in the hot path and adds real complexity to defend against a condition that is transient, self-healing, sub-megabyte, and now logged. I don't think that trade is worth it. And in the unlikely case it would bite us, we still have the option to implement the ringbuffer.

The drain reads the whole device dirty-rect channel each wakeup, so the live backlog stays in single digits (measured peak 5 across boot, fullscreen video, and guest reboot; peak 7 with 16 clients). Add a rate-limited error if it ever crosses 1000, so a wedged consumer letting the unbounded mesh queue climb is surfaced loudly instead of buffering silently. No change on the normal path.
Copilot AI review requested due to automatic review settings June 4, 2026 01:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

Comment thread workers/vnc_worker/src/lib.rs
Comment thread workers/vnc_worker/src/lib.rs
bitranox added 2 commits June 4, 2026 03:58
encoding_name_known_and_pseudo just restated the match arms, so it tracked the code rather than checking it. Keep the range/cast/fallback tests, which pin the offset arithmetic, the u32-as-i32 vendor-block handling, and the unknown fallback format.
The vendor blocks are high u32 hex ranges; most are negative as i32 and a range pattern can't take a `0x.. as i32` bound, so they can't fold into the signed enc match without losing the hex or adding const bounds. Group them in their own match raw instead of a chain of ifs, so they read as one dispatch on the value where the hex is natural.
@bitranox

bitranox commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

merged into #3665

@bitranox bitranox closed this Jun 4, 2026
@bitranox bitranox deleted the vnc-diagnostics branch June 9, 2026 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants