vnc: diagnostics for client negotiation and the dirty-rect channel by bitranox · Pull Request #3628 · microsoft/openvmm

bitranox · 2026-06-02T13:26:42Z

Two diagnostics for the VNC server, both off by default and gated on OPENVMM_LOG. They answer two questions you currently cannot answer without a debugger: what did a connecting viewer actually negotiate, and does the device-to-worker dirty-rect channel ever back up?

Client negotiation report (`vnc`)

Right now the server logs three booleans about client encodings (zlib, desktop_resize, cursor) and a partial pixel format line. That does not tell you what the viewer offered or what the server picked.

This adds encoding_name, which decodes RFB encoding numbers: the standard encodings, the Tight compression and JPEG quality ranges, the common pseudo-encodings, and the VMware and extended-clipboard vendor blocks. Anything it does not recognize falls back to its number and hex, so nothing a client sends is lost. With that, the server logs once per connection, at debug:

the full offered encoding list, by name
the selected security type
the full requested pixel format
what the server will actually use: the wire encoding (zlib or raw), whether the host-native no-convert fast path applies, and which capabilities are honored

The existing info-level lines stay; the detail is at debug.

Dirty-rect channel backlog metric (`vnc_worker`)

The device-to-worker dirty-rect channel is an unbounded mesh channel, and there was no way to know whether its queue grows. The coordinator now drains it on each wakeup (receive one, then try_recv the rest) and records how many were waiting, which is the backlog at that moment. It keeps a session high-water mark, logs a backlog above one at debug, a backlog of one at trace, and the high-water on channel close.

Behavior does not change: each drained message is still broadcast to clients individually, exactly as if they had arrived one at a time. The only difference is the receive shape, one-at-a-time becomes drain-all.

In practice the queue sits at 1 under normal load, and only grew (to a few hundred) when the worker was CPU-starved during boot. I also ran it with 16 simultaneous clients, 8 negotiating zlib and 8 falling back to raw, and the shared queue barely moved:

Backlog depth	Count over 90s (16 clients)
2	441
3	115
4	21
5	1
7	1

It sat at 2 almost the whole time and peaked at 7. The single consumer's fan-out to the per-client channels is non-blocking, so adding clients does not grow the shared queue. That is the relevant data point for the broadcast-ring discussion.

Cost when off

The negotiation report runs once per connection, and the field expressions live inside the tracing macros, so they are not evaluated when the level is off. The backlog count reuses values already computed and only formats inside the macros. Nothing measurable when the levels are off.

Enabling

OPENVMM_LOG=info,vnc=debug,vnc_worker=debug

Add vnc_worker=trace to see each drain confirm a backlog of 1.

Testing

Unit tests for encoding_name: known encodings, the ranges, the vendor blocks, and the unknown fallback.
The existing vnc and vnc_worker tests pass, including the dirty-channel-close path that the drain change touches.
Ran live against a Windows guest with RealVNC and TigerVNC. RealVNC negotiated zlib; TigerVNC fell back to raw because it does not offer zlib(6). The backlog metric matched what is described above.
The 16-client run (8 zlib, 8 raw) above confirmed client count does not grow the shared queue.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves diagnostics and observability for the VNC worker by adding richer debug logging during client negotiation and tracking the dirty-rect channel backlog.

Changes:

Added encoding_name helper in rfb.rs to map RFB encoding numbers to human-readable names, with unit tests.
Added detailed debug logs in server.rs for security type selection, negotiated pixel format, and offered/negotiated encodings.
Reworked the dirty-rect handling in lib.rs to drain queued messages in a batch, track high-water backlog, and log when the producer outpaces the consumer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
workers/vnc_worker/vnc/src/rfb.rs	Adds `encoding_name` decoder for diagnostics, plus unit tests covering known/pseudo/vendor/unknown cases.
workers/vnc_worker/vnc/src/server.rs	Emits debug logs for chosen security type, negotiated pixel format, and offered/agreed encodings.
workers/vnc_worker/src/lib.rs	Drains the dirty-rect channel into batches to measure backlog while preserving per-message broadcast semantics.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

The server logged only a three-flag summary of client encodings (zlib, desktop_resize, cursor) and a partial pixel format line. That is not enough to see what a viewer actually offered or what the server settled on, which matters when comparing viewers and diagnosing compression. Add encoding_name() to decode RFB encoding numbers: the known encodings, the Tight compression and JPEG quality ranges, the common pseudo-encodings, and the VMware and extended-clipboard vendor blocks, with a numeric and hex fallback so nothing a client sends is lost. Use it to log the complete offered encoding list, the selected security type, the full requested pixel format, and what the server will actually use (wire encoding, whether the host-native no-convert fast path applies, and the honored capabilities). The report is on the once-per-connection path and the field expressions stay inside the tracing macros, so it costs nothing when the level is off. The full detail is at debug; the existing info summaries stay.

The device-to-worker dirty-rect channel is an unbounded mesh channel, and whether its queue ever grows was unknown. Drain it fully on each wakeup (receive one message, then try_recv the rest) and record how many were queued: that count is the channel backlog. Track a session high-water mark, log a backlog above one at debug, a backlog of one at trace, and the high-water on channel close. Broadcasting is unchanged: each drained message still goes out individually, so behavior matches receiving them one at a time. This answers the unbounded-memory question for this channel: under normal load the queue stays at one, growing only when the consumer is CPU-starved.

smalis-msft · 2026-06-03T15:25:06Z

 }
+
+#[cfg(test)]
+mod tests {


Do these tests really add much value? Just running a constant through a match statement doesn't feel like it really needs test coverage.

Fair point. I dropped encoding_name_known_and_pseudo, that one really was the match arms written twice.

The other three I'd keep. They aren't constants through a match, they check logic that's easy to get wrong:

encoding_name_ranges pins the offset math at the range edges: -256..=-247 => TightCompressionLevel(enc + 256) has to give 0 at -256 and 9 at -247. An off-by-one there is a silent bug.

encoding_name_vendor_blocks covers the u32 as i32 round-trip for the high-bit encodings like 0x574d5664, which is exactly where it's easy to slip: writing the value as unsigned hex but matching it in the signed space.

encoding_name_unknown_keeps_value makes sure the fallback still prints the raw number and hex, so an unknown encoding stays debuggable.

smalis-msft · 2026-06-03T15:25:29Z

+    let raw = enc as u32;
+    // Vendor blocks reserved as hex ranges, outside the contiguous
+    // signed-number space handled by the match below.
+    if (0x574d5600..=0x574d56ff).contains(&raw) {


Should these ranges move into the match instead of coming first?

Most of them can't fold into the enc match cleanly. Three of the four are high-bit values that are negative as i32 (0xc0a1e5ce is -1063163442, 0xffff0000 is -65536), and a range pattern can't take a 0x… as i32 bound, so they'd turn into opaque negative decimals or a named const per bound, either way losing the hex that documents what the block is. Only VMware is positive and would fit, and pulling just that one in would split the vendor blocks across both matches.

So instead I turned the if-chain into its own match raw. The vendor blocks are now a single dispatch on the value where the hex reads naturally, with the signed encodings still in match enc. Still two passes, but it reads as a match now instead of a stack of ifs.

jstarks · 2026-06-03T16:18:55Z

does the device-to-worker dirty-rect channel ever back up?

I am wondering whether we care whether it backs up in practice--even if it doesn't, we still need to ensure that it's structurally impossible to back up.

github-actions · 2026-06-03T16:59:43Z

2 Petri tests failed (2 unstable)

bitranox · 2026-06-04T01:37:32Z

@jstarks :

In practice it stays at single digits, and the ways it could grow are either self-healing or already stop the producer. It is not a hard constant bound by construction, and I don't think it's worth making it one. Details below.

Measured. I instrumented the channel and left the instrumentation in. The worker drains the whole queue each wakeup, so backlog is the number of messages pulled in one pass: trace at depth 1, debug once it exceeds 1 (carrying a running high-water backlog_max), and a rate-limited error if it ever crosses 1000. The trace lines and the error are in the code, so you can switch them on and watch the channel yourself under any workload rather than relying on my numbers.

1 client, full boot + fullscreen video + guest reboot: peak 5.
16 clients (8 TigerVNC + 8 RealVNC), fullscreen video: steady at 1, peak 7. Fan-out width does not grow the queue.
The only real excursion was ~505, once, at guest boot under heavvy host CPU contention. The worker thread was descheduled and then drained it in a single pass.

Why it stays small. The two sides are asymmetric. Producing a message is a vmbus ring read, a packet parse, and a rect map on the device, paced by the guest's report rate. Draining is moving Vecs out of the channel plus an Arc and a non-blocking try_send per client; the encode/zlib/socket write sits behind the per-client bounded(4), not in the drain loop. So the consumer returns to recv() immediately and outpaces the producer except when it is starved of CPU.

Why it can't run away. Two properties already hold:

The device only forwards dirt while a client needs updates (the updates_needed/FeatureChange gate). With no demand the producer is silent.
If the worker goes away the receiver drops, the channel closes, and the device stops sending (it handles the closed sender). So a dead consumer cannot grow the queue. Only a live-but-descheduled one can, transiently, and it self-heals on the next drain. Each entry is a Vec at 16 bytes per rect, so even the 505 spike was some kilobytes and freed on the next pass.

On "structurally impossible." You're right that it is an unbounded mesh::channel(), so the bound is behavioral (producer rate times scheduling latency), not a constant. Making it constant by construction means a bounded shared buffer: a fixed ring the device copies dirt into on every update, plus overflow and wraparound handling. That puts a copy in the hot path and adds real complexity to defend against a condition that is transient, self-healing, sub-megabyte, and now logged. I don't think that trade is worth it. And in the unlikely case it would bite us, we still have the option to implement the ringbuffer.

The drain reads the whole device dirty-rect channel each wakeup, so the live backlog stays in single digits (measured peak 5 across boot, fullscreen video, and guest reboot; peak 7 with 16 clients). Add a rate-limited error if it ever crosses 1000, so a wedged consumer letting the unbounded mesh queue climb is surfaced loudly instead of buffering silently. No change on the normal path.

Copilot

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

encoding_name_known_and_pseudo just restated the match arms, so it tracked the code rather than checking it. Keep the range/cast/fallback tests, which pin the offset arithmetic, the u32-as-i32 vendor-block handling, and the unknown fallback format.

The vendor blocks are high u32 hex ranges; most are negative as i32 and a range pattern can't take a `0x.. as i32` bound, so they can't fold into the signed enc match without losing the hex or adding const bounds. Group them in their own match raw instead of a chain of ifs, so they read as one dispatch on the value where the hex is natural.

bitranox · 2026-06-04T18:47:55Z

merged into #3665

Copilot AI review requested due to automatic review settings June 2, 2026 13:26

bitranox requested a review from a team as a code owner June 2, 2026 13:26

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread workers/vnc_worker/vnc/src/rfb.rs Outdated

bitranox force-pushed the vnc-diagnostics branch from 25477c3 to e6cad5f Compare June 2, 2026 13:43

Copilot AI review requested due to automatic review settings June 2, 2026 13:58

bitranox force-pushed the vnc-diagnostics branch from e6cad5f to 669af8b Compare June 2, 2026 13:58

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread workers/vnc_worker/vnc/src/rfb.rs Outdated

Comment thread workers/vnc_worker/src/lib.rs

bitranox added 2 commits June 2, 2026 16:09

bitranox force-pushed the vnc-diagnostics branch from 669af8b to f8ecc08 Compare June 2, 2026 14:10

bitranox mentioned this pull request Jun 2, 2026

vnc: stop guest screen/pointer reporting when no client is connected #3614

Closed

smalis-msft reviewed Jun 3, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 4, 2026 01:50

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread workers/vnc_worker/src/lib.rs

Comment thread workers/vnc_worker/src/lib.rs

bitranox added 2 commits June 4, 2026 03:58

bitranox requested a review from smalis-msft June 4, 2026 02:10

bitranox marked this pull request as draft June 4, 2026 14:33

bitranox mentioned this pull request Jun 4, 2026

vnc_worker: add ZRLE and a configurable tile size, reduce redundant screen-update work, identify clients #3665

Open

bitranox closed this Jun 4, 2026

bitranox deleted the vnc-diagnostics branch June 9, 2026 18:08

Conversation

bitranox commented Jun 2, 2026

Client negotiation report (vnc)

Dirty-rect channel backlog metric (vnc_worker)

Cost when off

Enabling

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

smalis-msft Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

bitranox Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

smalis-msft Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

bitranox Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

jstarks commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

bitranox commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

bitranox commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Client negotiation report (`vnc`)

Dirty-rect channel backlog metric (`vnc_worker`)