feat(data-pipeline): add agentless export#2081
Conversation
# Motivation Send traces to the agentless endpoint to replace the JS and python exporters # What changes * Additional JSON agentless encoder * Additional configuration (agentless enablement, endpoint, API key) * Agentless traces need top level computation, but not dropping spans
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf49210179
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Some(ref config) = self.agentless_config { | ||
| // For agentless we want to tag top level spans, but not perform | ||
| // stats aggregation or span drops | ||
| if !self.client_computed_top_level { | ||
| for chunk in traces.iter_mut() { | ||
| libdd_trace_utils::span::trace_utils::compute_top_level_span(chunk); | ||
| } | ||
| } | ||
|
|
||
| return self.send_agentless_traces_inner(traces, config).await; |
There was a problem hiding this comment.
Honor sampling decisions before agentless export
When agentless export is enabled, this early return bypasses stats::process_traces_for_stats and never calls drop_chunks; unlike the agent path there is no downstream trace-agent to enforce _sampling_priority_v1. In the same context I checked, the OTLP branch explicitly calls drop_chunks before sending for this reason, so an unsampled chunk such as one with _sampling_priority_v1 = -1 will now be serialized and POSTed to the public intake instead of being dropped. Please apply the same sampling filter before send_agentless_traces_inner (while preserving single-span/analyzed-span keeps).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
No, not dropping traces and aggregating stats is expected
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
📚 Documentation Check Results📦
|
🔒 Cargo Deny Results📦
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2081 +/- ##
==========================================
+ Coverage 73.58% 73.73% +0.14%
==========================================
Files 470 474 +4
Lines 78449 79241 +792
==========================================
+ Hits 57729 58425 +696
- Misses 20720 20816 +96
🚀 New features to boost your workflow:
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: f208a39 | Docs | Datadog PR Page | Give us feedback! |
| self.agentless_endpoint.as_ref(), | ||
| ) { | ||
| (Some(_), Some(_)) => { | ||
| tracing::warn!( |
There was a problem hiding this comment.
Should the builder fail if it has conflicting configuration? SDK configuration is complex. I'm of the opinion that it's better for the complexity to be concentrated in the SDKs and let them handle things like precedence.
There was a problem hiding this comment.
On the same topic, the build should probably error out if set_url and output_format is used and agentless is set at the same time.
| ), | ||
| }; | ||
|
|
||
| let agentless_config = match (agentless_endpoint, agentless_api_key) { |
There was a problem hiding this comment.
agentless config is going to happen after AgentInfoFetcher is spawned here. Won't that fail every 5 minutes when it polls an agent that doesn't exist?
There was a problem hiding this comment.
I think the same applies to telemetry? It's built against the agent url.
There was a problem hiding this comment.
I think all of this also applies to otlp mode for the exporter.
| }; | ||
| use tracing::error; | ||
|
|
||
| const AGENTLESS_MAX_ATTEMPTS: u32 = 3; |
There was a problem hiding this comment.
After #2047 this should be renamed AGENTLESS_MAX_RETRIES
| for (k, v) in span.meta_struct.iter() { | ||
| let key: &str = k.borrow(); | ||
| let bytes: &[u8] = v.borrow(); | ||
| // Encode as a JSON array of u8 (default serde behavior for &[u8]). |
There was a problem hiding this comment.
Is this the right thing to do? If i'm reading js correctly (a big if) they just send meta_struct values as json, not msgpack encoded bytes.
| let val: &str = v.borrow(); | ||
| meta.serialize_entry(key, val)?; | ||
| } | ||
| if !p_tid_seen && upper_bits != 0 { |
There was a problem hiding this comment.
Is it intentional that _dd.p.tid is the only key that's deduped?
| } | ||
|
|
||
| /// Sends trace chunks to the Datadog agentless intake (`/v1/input`) as JSON. | ||
| async fn send_agentless_traces_inner<T: TraceData>( |
There was a problem hiding this comment.
Where is max payload size handled? In the SDKs? I assume the endpoint has a max size that we have to respect?
There was a problem hiding this comment.
It has a 5MB size limit I believe. I don't think this limit should be enforced in the trace exporter though.
It should probably be at the trace buffer.
Or maybe we could split incoming list of trace chunks in multiple payloads 🤔
There was a problem hiding this comment.
Ah, that's a good point. The trace buffer should handle flushing below the limit.
Although, looking at the trace buffer in data pipeline we don't enforce limits or split up chunks. We check after a chunk is added, and flush above flush_trigger_bytes. It's theoretically possible that if we receive a significantly large chunk we flush something that exceeds the intake limit. In a similar vain, we only drop chunks after max_buffered_bytes is exceeded, which can also trigger a flush that's beyond intake's limit.
What may make this an issue for agentless is that the max size for the agent is significantly larger than intake's limit. And the agent handles splitting up the payloads for intake's limits. So the trace buffers can usually get away with "trigger a flush once we exceed a limit".
I'm not sure this is a practical problem today that we need to deal with in this PR? I think we can tackle it separately?
There was a problem hiding this comment.
I believe llm obs uses the test agent as a stand-in for intake in tests. Could we do the same and add integration tests for agentless payloads?
|
I have some questions:
This PR feels like a lot of re-inventing the wheel. |
Because it is not the same schema. The agentless endpoint expects something like As to why I do the conversion at encoding, creating an
I mean, I agree that the api key should be passed in the |
|
Makes sense, there is quite some overhead. The v1 protocol is much better at being transformed from what agent expects to what intake expects. I would prefer though if this weren't part of data-pipeline, but SendData as well. (i.e. just moving the code there). |
Motivation
Send traces to the agentless endpoint to replace the JS and python exporters
What changes