
AE sink: silent write-drop (CH 200 OK + X-ClickHouse-Summary.written_rows=0) → http volume lost; guard re-loops #58

@entlein


Summary

The AE ClickHouse sink intermittently drops writes without surfacing an HTTP error: ClickHouse returns HTTP 200 with X-ClickHouse-Summary.written_rows=0 even though AE sent N>0 rows. This zeroes out protocol volume (esp. http_events) in the AE-filtered arm, and it is the real cause of the volproof "AE attrition" we chased to dx (it is NOT a dx active-set bug; see entlein/dx#62).

Evidence (live, 2026-06-09, clean3 = vizier-adaptive_export_image:0.14.19-aeprod-clean3, PG 6a2850d0)

streaming.BatchWriter: flush failed error="sink: pixie write to dns_events reported 19 rows_sent but CH summary written_rows=0 (silent drop): {...\"written_rows\":\"0\",\"written_bytes\":\"0\"...}" reason=timer
sink: pixie write completed body_bytes=17304 ch_summary={...\"written_rows\":\"0\",\"written_bytes\":\"0\"...} table=...
streaming.TableScanner: query failed; backing off error="pixieapi: ExecuteScript: rpc error: code = DeadlineExceeded desc = context deadline exceeded" table=tls_events/mysql_events  (WINDOW/REFRESH=20)
  • Live CH read: last 10 min http_events rows=0 / 0 pods, but dns_events rows=54 / 3 pods — i.e. the active-set HAS pods; http writes specifically vanish. (Cross-rep f518: AE http pods 2→1→0 while dns pods 3→3→7 — protocol-divergent, so NOT active-set aging.)
  • CH read errors on AE/vector-written tables: Code 432 UNKNOWN_CODEC: Unknown codec family code: 0 on adaptive_attribution, trigger_watermark, kubescape_logs — possible related part corruption.

Recurring

The guard's own comment cites a prior occurrence: 2026-05-23T20:58Z redis_events: rows_sent=1658, written_rows=0. So this is at least the second sighting.

Code

  • src/vizier/services/adaptive_export/internal/sink/clickhouse.go:241: INSERT INTO db.tbl FORMAT JSONEachRow, Content-Type application/x-ndjson.
  • :85 setFailLoudSettings sets input_format_skip_unknown_fields=0 etc → a column/format mismatch would HTTP-error, NOT silently drop. AE sets no async_insert.
  • :287 the silent-drop guard returns an error → streaming.BatchWriter flush fails → backoff + re-loop (CPU burn; see also the 3.2-core note in the file header comment).
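For readers without the file open, the check described at :287 can be sketched roughly as below. This is not the code in clickhouse.go: the names checkSummary and chSummary are hypothetical, and the sketch assumes only what the log excerpt shows, namely that X-ClickHouse-Summary is a JSON object whose counters are encoded as strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
)

// chSummary holds the one X-ClickHouse-Summary field this sketch inspects.
// The counters arrive as JSON strings, e.g. {"written_rows":"0"}.
type chSummary struct {
	WrittenRows string `json:"written_rows"`
}

// checkSummary is a hypothetical stand-in for the guard at clickhouse.go:287.
// It returns an error when rows were sent but the summary reports none written.
func checkSummary(raw, table string, rowsSent uint64) error {
	if raw == "" {
		return nil // no summary header, nothing to compare against
	}
	var s chSummary
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return fmt.Errorf("sink: parse CH summary for %s: %w", table, err)
	}
	written, err := strconv.ParseUint(s.WrittenRows, 10, 64)
	if err != nil {
		return fmt.Errorf("sink: parse written_rows for %s: %w", table, err)
	}
	if rowsSent > 0 && written == 0 {
		return fmt.Errorf("sink: pixie write to %s reported %d rows_sent but CH summary written_rows=0 (silent drop): %s",
			table, rowsSent, raw)
	}
	return nil
}

func main() {
	// Simulate the response header from the dns_events log line above.
	h := http.Header{}
	h.Set("X-ClickHouse-Summary", `{"written_rows":"0","written_bytes":"0"}`)
	fmt.Println(checkSummary(h.Get("X-ClickHouse-Summary"), "dns_events", 19))
}
```

Because the guard returns an error rather than logging and continuing, every flagged batch goes through the BatchWriter backoff path, which is why a false positive here is expensive.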

Leading hypothesis (to verify CH-side)

async_insert is enabled server-side (CHI user/profile config) → the INSERT response returns before the async buffer flushes → X-ClickHouse-Summary.written_rows=0 at response time even though the rows land later. That would make the silent-drop guard a false positive that triggers re-loops.

  • Verify: SELECT name,value,changed FROM system.settings WHERE name IN ('async_insert','wait_for_async_insert') + CHI users.xml.
  • If async: either set wait_for_async_insert=1 on AE's INSERTs, or read the post-flush summary, or check system.asynchronous_insert_log instead of the response header.
  • If NOT async: capture one rejected body + run the INSERT manually to see why CH parses 0 rows.
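If the server turns out to be async, the first option (wait_for_async_insert=1 on AE's INSERTs) could look roughly like this. It relies on the ClickHouse HTTP interface accepting settings as URL query parameters. The helper name insertURL, the host clickhouse:8123, and the db/table values are placeholders, not AE's actual code, and whether the summary header then reflects the flushed rows still needs confirming CH-side.

```go
package main

import (
	"fmt"
	"net/url"
)

// insertURL is a hypothetical helper that builds the HTTP INSERT URL and,
// when waitForAsync is true, pins wait_for_async_insert=1 for that request.
func insertURL(base, db, table string, waitForAsync bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("query", fmt.Sprintf("INSERT INTO %s.%s FORMAT JSONEachRow", db, table))
	if waitForAsync {
		// Ask ClickHouse to respond only after the async buffer has been flushed.
		q.Set("wait_for_async_insert", "1")
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Placeholder endpoint and table, for illustration only.
	u, err := insertURL("http://clickhouse:8123/", "db", "http_events", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

Setting it per request keeps the change local to AE and leaves the CHI profile untouched for other writers.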

Impact

Any volproof data-volume measurement is invalid while this is live: AE http→0 reads as "adaptive filtering reduced volume to nothing" when it is actually dropped writes. Secondary: DeadlineExceeded at WINDOW=20 adds throughput loss; consider per-refresh single-heavy-table scheduling or larger windows.

Repro

clean3 AE, streaming overflow MAX_WHITELIST=500, WINDOW/REFRESH=20, broad http+dns load → tail AE logs for silent drop + compare rows_sent vs CH written_rows.
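To make the rows_sent vs written_rows comparison less manual during a repro run, a small tally over the tailed AE log could help. The sketch below assumes the flush-failure message keeps the exact wording shown under Evidence; tallySilentDrops and the output format are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// silentDropRe matches the guard's error text as it appears in the AE logs
// and captures the table name and the rows_sent count.
var silentDropRe = regexp.MustCompile(
	`pixie write to (\S+) reported (\d+) rows_sent but CH summary written_rows=0 \(silent drop\)`)

// tallySilentDrops sums rows_sent per table over all flagged log lines.
func tallySilentDrops(lines []string) map[string]uint64 {
	flagged := make(map[string]uint64)
	for _, line := range lines {
		m := silentDropRe.FindStringSubmatch(line)
		if m == nil {
			continue // not a silent-drop line
		}
		n, err := strconv.ParseUint(m[2], 10, 64)
		if err != nil {
			continue // count too large to parse; skip the line
		}
		flagged[m[1]] += n
	}
	return flagged
}

func main() {
	// Read AE log lines from stdin; allow long lines (embedded ch_summary JSON).
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	for table, rows := range tallySilentDrops(lines) {
		fmt.Printf("%s rows_sent_flagged=%d\n", table, rows)
	}
}
```

Comparing the per-table totals against a ClickHouse row count for the same window would show whether the flagged rows really vanished or landed later, which is the distinction the async_insert hypothesis turns on.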
