feat(policy): unified failure accrual and response-penalty load biasing by unleashed · Pull Request #15374 · linkerd/linkerd2

unleashed · 2026-06-10T23:51:23Z

This adds control-plane support for two opt-in outbound-policy features served to proxies: unified failure accrual and response-penalty load biasing. Both are configured through annotations on a Service (and, for failure accrual, an EgressNetwork), and both require a data plane built against linkerd2-proxy-api v0.20.0.

A Service that sets none of the new annotations serializes to a wire policy identical to the prior release, so nothing changes for operators who do not opt in.

Unified failure accrual

Linkerd already supports consecutive-failure accrual via balancer.linkerd.io/failure-accrual: consecutive. This PR adds a second mode, unified, that trips a breaker on either a run of consecutive failures or a low success ratio measured over a trailing window. The window is a ring of fixed-duration buckets.

Unified mode runs both breaker policies. The consecutive dimension stays active at its default of 7 even when only success-rate parameters are set. To run success-rate-only breaking, set failure-accrual-consecutive-max-failures: 0.
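The two dimensions described above can be sketched as a small state machine. This is a minimal illustration, not the proxy's implementation: the type names, the `record` method, and the bucket count are hypothetical, and only the defaults (7 failures, 0.8 ratio, 10s window, 5 requests) come from this PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical parameter set; defaults mirror the annotation defaults in this PR.
#[derive(Clone, Copy)]
pub struct UnifiedParams {
    pub max_consecutive_failures: u32, // 0 disables the consecutive dimension
    pub success_rate_threshold: f64,   // 0.0 disables the success-rate dimension
    pub window: Duration,
    pub min_requests: u64,
}

impl Default for UnifiedParams {
    fn default() -> Self {
        Self {
            max_consecutive_failures: 7,
            success_rate_threshold: 0.8,
            window: Duration::from_secs(10),
            min_requests: 5,
        }
    }
}

#[derive(Clone, Copy, Default)]
struct Bucket {
    successes: u64,
    failures: u64,
}

/// Illustrative breaker (assumption: not how the proxy structures it).
pub struct UnifiedBreaker {
    params: UnifiedParams,
    buckets: Vec<Bucket>,
    bucket_len: Duration,
    head: usize,         // bucket currently being filled
    head_start: Instant, // when the head bucket began
    consecutive: u32,
}

impl UnifiedBreaker {
    pub fn new(params: UnifiedParams, num_buckets: u32, now: Instant) -> Self {
        let n = num_buckets.max(1);
        Self {
            params,
            buckets: vec![Bucket::default(); n as usize],
            bucket_len: params.window / n,
            head: 0,
            head_start: now,
            consecutive: 0,
        }
    }

    /// Rotate the ring until `now` falls in the head bucket, clearing expired buckets.
    fn advance(&mut self, now: Instant) {
        let n = self.buckets.len();
        let mut steps = 0;
        while steps < n && now.duration_since(self.head_start) >= self.bucket_len {
            self.head = (self.head + 1) % n;
            self.buckets[self.head] = Bucket::default();
            self.head_start += self.bucket_len;
            steps += 1;
        }
        // The whole window expired: every bucket is already clear, so resync.
        if now.duration_since(self.head_start) >= self.bucket_len {
            self.head_start = now;
        }
    }

    /// Record one response; returns true when either dimension says to trip.
    pub fn record(&mut self, success: bool, now: Instant) -> bool {
        self.advance(now);
        let bucket = &mut self.buckets[self.head];
        if success {
            bucket.successes += 1;
            self.consecutive = 0;
        } else {
            bucket.failures += 1;
            self.consecutive += 1;
        }
        self.consecutive_tripped() || self.success_rate_tripped()
    }

    fn consecutive_tripped(&self) -> bool {
        let max = self.params.max_consecutive_failures;
        max > 0 && self.consecutive >= max
    }

    fn success_rate_tripped(&self) -> bool {
        let threshold = self.params.success_rate_threshold;
        if threshold <= 0.0 {
            return false;
        }
        let (ok, failed) = self.buckets.iter()
            .fold((0u64, 0u64), |(s, f), b| (s + b.successes, f + b.failures));
        let total = ok + failed;
        total > 0 && total >= self.params.min_requests && (ok as f64 / total as f64) < threshold
    }
}
```

Setting `max_consecutive_failures` to 0 in this sketch leaves only the success-rate check active, which corresponds to the success-rate-only configuration described above.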

Response-penalty load biasing

When enabled with penalize-failures, peak-EWMA load balancing steers traffic away from endpoints that return failures (HTTP 429, 503, every other 5xx, and the gRPC failure trailer codes). The proxy uses a PenaltyPeakEwma load estimator that two annotations tune. This applies to Service backends only. EgressNetwork uses a forwarding backend with no balancer, so the load-biaser annotations have no effect there.
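To make the biasing idea concrete, here is a rough sketch of a peak-EWMA estimator that treats a failure as a large latency sample. The HTTP classification follows the list above. Everything else is an assumption for illustration: the `PenaltyEstimator` type, its fields, and the rule that a failure feeds `max(rtt, penalty, capped Retry-After)` into the RTT estimate are hypothetical and may differ from what `PenaltyPeakEwma` actually does. gRPC trailer codes are omitted because the PR description does not enumerate them.

```rust
/// HTTP statuses the description counts as failures: 429 and every 5xx
/// (which includes 503).
pub fn is_http_failure(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Illustrative estimator; names and update rule are assumptions.
pub struct PenaltyEstimator {
    rtt_estimate_secs: f64,
    decay_secs: f64,
    penalty_secs: f64,         // load-biaser-penalty, default 5s
    max_retry_after_secs: f64, // load-biaser-max-retry-after, default 300s
}

impl PenaltyEstimator {
    pub fn new(initial_rtt_secs: f64, decay_secs: f64) -> Self {
        Self {
            rtt_estimate_secs: initial_rtt_secs,
            decay_secs,
            penalty_secs: 5.0,
            max_retry_after_secs: 300.0,
        }
    }

    /// Standard peak-EWMA update: jump up to a higher sample immediately,
    /// decay toward a lower one exponentially with elapsed time.
    fn observe(&mut self, sample_secs: f64, elapsed_secs: f64) {
        if sample_secs > self.rtt_estimate_secs {
            self.rtt_estimate_secs = sample_secs;
        } else {
            let w = (-elapsed_secs / self.decay_secs).exp();
            self.rtt_estimate_secs = self.rtt_estimate_secs * w + sample_secs * (1.0 - w);
        }
    }

    /// Assumed rule: a failure is recorded as if it took at least the penalty,
    /// or the server's Retry-After hint capped at the configured maximum.
    pub fn on_response(
        &mut self,
        status: u16,
        rtt_secs: f64,
        retry_after_secs: Option<f64>,
        elapsed_secs: f64,
    ) {
        let sample = if is_http_failure(status) {
            let hint = retry_after_secs.unwrap_or(0.0).min(self.max_retry_after_secs);
            rtt_secs.max(self.penalty_secs).max(hint)
        } else {
            rtt_secs
        };
        self.observe(sample, elapsed_secs);
    }

    /// Load as seen by the balancer: estimated latency times in-flight requests plus one.
    pub fn load(&self, pending: u32) -> f64 {
        self.rtt_estimate_secs * (f64::from(pending) + 1.0)
    }
}
```

Because the penalty enters through the same RTT estimate in this sketch, it decays with the ordinary EWMA and needs no separate decay parameter.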

Annotations

| Annotation | Status | Scope | Description | Default |
| --- | --- | --- | --- | --- |
| `balancer.linkerd.io/failure-accrual` | Changed | Service, EgressNetwork | Breaker mode. Existing value `consecutive`; this PR adds `unified`. | unset (no breaker) |
| `balancer.linkerd.io/failure-accrual-consecutive-max-failures` | Existing | Service, EgressNetwork | Consecutive failures that trip the breaker. `0` disables the consecutive dimension. | `7` |
| `balancer.linkerd.io/failure-accrual-consecutive-max-penalty` | Existing | Service, EgressNetwork | Maximum probation backoff before a tripped endpoint is retried. | `60s` |
| `balancer.linkerd.io/failure-accrual-consecutive-min-penalty` | Existing | Service, EgressNetwork | Minimum probation backoff. | `1s` |
| `balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio` | Existing | Service, EgressNetwork | Jitter ratio applied to the probation backoff. | `0.5` |
| `balancer.alpha.linkerd.io/failure-accrual-success-rate-threshold` | Added | Service, EgressNetwork | Success ratio (0.0 to 1.0) below which the breaker trips in `unified` mode. `0` disables the success-rate dimension. | `0.8` |
| `balancer.alpha.linkerd.io/failure-accrual-success-rate-window` | Added | Service, EgressNetwork | Trailing window over which the success ratio is measured. | `10s` |
| `balancer.alpha.linkerd.io/failure-accrual-success-rate-min-requests` | Added | Service, EgressNetwork | Minimum requests in the window before the success-rate dimension can trip. | `5` |
| `balancer.alpha.linkerd.io/failure-accrual-honor-retry-after` | Added | Service, EgressNetwork | Let a tripped endpoint's probe schedule honor a server `Retry-After` or gRPC pushback hint. Stays bounded by the breaker backoff maximum. | `false` |
| `balancer.alpha.linkerd.io/penalize-failures` | Added | Service | Enable response-penalty load biasing (`PenaltyPeakEwma`). | `false` |
| `balancer.alpha.linkerd.io/load-biaser-penalty` | Added | Service | Penalty weight applied to a failing endpoint's load estimate. | `5s` |
| `balancer.alpha.linkerd.io/load-biaser-max-retry-after` | Added | Service | Upper bound on how long the penalty estimator honors a `Retry-After` hint. | `300s` |
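As a reading aid for the table, the following sketch resolves a breaker mode and a few of its defaults from a plain annotation map. The annotation keys and default values are taken from the table. The `Accrual` enum, `resolve_accrual`, and `parse_or` are hypothetical names, not the policy controller's API, and the window is left at its default rather than parsed so the sketch does not have to guess at the duration grammar.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

const MODE_KEY: &str = "balancer.linkerd.io/failure-accrual";
const MAX_FAILURES_KEY: &str = "balancer.linkerd.io/failure-accrual-consecutive-max-failures";
const THRESHOLD_KEY: &str = "balancer.alpha.linkerd.io/failure-accrual-success-rate-threshold";
const MIN_REQUESTS_KEY: &str =
    "balancer.alpha.linkerd.io/failure-accrual-success-rate-min-requests";

/// Hypothetical resolved configuration.
#[derive(Debug, PartialEq)]
pub enum Accrual {
    Consecutive {
        max_failures: u32,
    },
    Unified {
        max_failures: u32,
        success_rate_threshold: f64,
        window: Duration,
        min_requests: u64,
    },
}

/// Read `key` if present, otherwise use the documented default.
fn parse_or<T: std::str::FromStr>(
    annotations: &BTreeMap<String, String>,
    key: &str,
    default: T,
) -> Result<T, String> {
    match annotations.get(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<T>()
            .map_err(|_| format!("invalid value {raw:?} for {key}")),
    }
}

/// Returns Ok(None) when the mode annotation is unset (no breaker).
pub fn resolve_accrual(
    annotations: &BTreeMap<String, String>,
) -> Result<Option<Accrual>, String> {
    let Some(mode) = annotations.get(MODE_KEY) else {
        return Ok(None);
    };
    let max_failures = parse_or(annotations, MAX_FAILURES_KEY, 7u32)?;
    match mode.as_str() {
        "consecutive" => Ok(Some(Accrual::Consecutive { max_failures })),
        "unified" => Ok(Some(Accrual::Unified {
            max_failures,
            success_rate_threshold: parse_or(annotations, THRESHOLD_KEY, 0.8f64)?,
            // Left at the documented default; duration parsing is out of scope here.
            window: Duration::from_secs(10),
            min_requests: parse_or(annotations, MIN_REQUESTS_KEY, 5u64)?,
        })),
        other => Err(format!("unsupported failure-accrual mode: {other}")),
    }
}
```

A Service annotated only with `balancer.linkerd.io/failure-accrual: unified` would resolve, in this sketch, to the unified mode with every default from the table.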

Annotation stability

The new experimental surface (success-rate parameters, penalty load biasing, and the retry-after honoring toggle) uses the balancer.alpha.linkerd.io/ prefix. The inherited consecutive-failure knobs keep the stable balancer.linkerd.io/ prefix, and the unified value extends the existing stable failure-accrual key.

Backwards compatibility

The control plane validates every annotation and rejects any value the proxy would reject, so one bad input never invalidates the whole client policy. With no new annotations set, the emitted outbound-policy proto is identical to the prior release. The new defaults (penalty 5s, max-retry-after 300s, success-rate window 10s, and so on) match what the proxy used before, so enabling a feature without tuning it keeps the previous behavior.

Validation and failure handling

The new annotations are scoped to Services (and EgressNetwork). Malformed values are logged and the affected field falls back to its default. Failure accrual is the exception: a malformed accrual sub-value drops the whole accrual configuration for that Service rather than only the single field.
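The two failure-handling policies can be contrasted in a short sketch. It assumes the per-field fallback applies to a load-biaser value and the all-or-nothing rule applies to the success-rate sub-values, which is one reading of the paragraph above. The function names and the `SuccessRate` struct are hypothetical, `eprintln!` stands in for the controller's logging, and the duration parser accepts whole seconds only (for example `5s`), which is narrower than what the controller accepts.

```rust
use std::collections::BTreeMap;

const PENALTY_KEY: &str = "balancer.alpha.linkerd.io/load-biaser-penalty";
const THRESHOLD_KEY: &str = "balancer.alpha.linkerd.io/failure-accrual-success-rate-threshold";
const MIN_REQUESTS_KEY: &str =
    "balancer.alpha.linkerd.io/failure-accrual-success-rate-min-requests";

/// Simplified duration parser for this sketch: whole seconds such as "5s".
fn parse_secs(raw: &str) -> Option<u64> {
    raw.trim().strip_suffix('s')?.parse().ok()
}

/// Per-field fallback: a malformed value is logged and the default (5s) is used.
pub fn load_biaser_penalty_secs(annotations: &BTreeMap<String, String>) -> u64 {
    match annotations.get(PENALTY_KEY) {
        None => 5,
        Some(raw) => parse_secs(raw).unwrap_or_else(|| {
            eprintln!("invalid {PENALTY_KEY} value {raw:?}; using default 5s");
            5
        }),
    }
}

/// Hypothetical success-rate portion of an accrual configuration.
#[derive(Debug, PartialEq)]
pub struct SuccessRate {
    pub threshold: f64,
    pub min_requests: u64,
}

/// All-or-nothing: any malformed sub-value yields None, so the caller drops
/// the whole accrual configuration instead of mixing defaults with bad input.
pub fn success_rate(annotations: &BTreeMap<String, String>) -> Option<SuccessRate> {
    let threshold = match annotations.get(THRESHOLD_KEY) {
        None => 0.8,
        Some(raw) => raw
            .trim()
            .parse::<f64>()
            .ok()
            .filter(|t| (0.0..=1.0).contains(t))?,
    };
    let min_requests = match annotations.get(MIN_REQUESTS_KEY) {
        None => 5,
        Some(raw) => raw.trim().parse::<u64>().ok()?,
    };
    Some(SuccessRate { threshold, min_requests })
}
```

Dropping the whole accrual configuration avoids running a breaker whose parameters are partly operator-chosen and partly defaulted without the operator knowing which is which.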

Route objects do not hold effective failure-accrual configuration. Accrual is scoped to the parent Service or EgressNetwork. The route admission webhook still rejected invalid-value accrual annotations on routes, implying the setting was meaningful when nothing reads it. Stop validating it.

This is an upstream-visible behavior change. A route object with an invalid-value accrual annotation that apply-time validation previously rejected is now admitted silently.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

Add control-plane support for two opt-in outbound-policy features served to proxies. Both require a data plane built against linkerd2-proxy-api v0.20.0. A Service that sets none of the new annotations serializes to a wire policy identical to the prior release. The control plane validates every annotation and rejects any value the proxy would reject wholesale, so one bad input never invalidates the whole client policy.

Response-penalty load biasing steers peak-EWMA load balancing away from endpoints that return failures: HTTP 429, 503, every other 5xx, and the gRPC failure trailer codes. An operator enables it per Service with the penalize-failures annotation, and the proxy then uses a PenaltyPeakEwma load estimator. Two annotations tune that estimator: load-biaser-penalty sets the penalty weight (default 5s), and load-biaser-max-retry-after caps how long the estimator honors a Retry-After hint (default 300s). Both defaults match what the proxy used before, so an unset Service keeps the prior wire policy. The penalty decay has no annotation, since the proxy folds it into its single RTT EWMA.

The separate honor-retry-after annotation lets a tripped endpoint's probe schedule respect a server Retry-After or gRPC pushback hint. That schedule stays bounded by the breaker's own backoff maximum.

Unified failure accrual adds a breaker that trips on either a run of consecutive failures or a low success ratio measured over a trailing window, selected with the value unified on the existing failure-accrual key. That window is a ring of fixed-duration buckets rather than an exponential decay. The consecutive mode keeps its prior behavior, and the new success-rate parameters take the alpha annotation prefix to mark the surface experimental.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

Add integration coverage for the unified and consecutive accrual modes, the penalize-failures and honor-retry-after annotations, the parent-scoped balancer inheritance in both directions, and the mode-conflict and inert-configuration diagnostics.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

unleashed · 2026-06-11T00:19:14Z

Rebased to fix clippy and add minor fixes.

adleong

Some non-blocking suggestions and questions, but otherwise this looks good.

There are some slight annotation name differences between this and what's in linkerd/website#2126 but I'll update that docs PR to match what's here.

adleong · 2026-06-11T00:42:54Z

-                        jitter_ratio: backoff.jitter,
-                        respect_retry_after_hint: false,
-                    }),
+                    backoff: Some(convert_backoff(backoff, honor_retry_after)),


the consecutive failures accrual doesn't support honor_retry_after so we might as well pass false here.

Done in cb180e0.

adleong · 2026-06-11T00:52:37Z

+            .keys()
+            .any(|k| k.starts_with(success_rate_key!("")))
+        {
+            tracing::warn!(


If there is any invalidly configured Service in the cluster, I believe that this warning will be logged in the policy controller every time the service is updated and every time it is reindexed. This could mean a steady stream of regular repeated warnings for as long as the invalid service exists. This may be more verbose than we want.

This comment also applies to all warnings in the parsing code.

Yes, I agree it will be very noisy. Addressed in 1af90b0 (still using warn level for the rejection/drop cases).

adleong · 2026-06-11T00:57:36Z

+/// parser every boolean control-plane annotation already passes through.
+/// An unrecognized value is rejected. A typo surfaces rather than silently
+/// flipping the feature.
+fn parse_balancer_toggle(


It's great that we're matching the behavior here with the bool parsing that already happens in the Go controller. This isn't balancer-specific, and I expect we'd want to reuse this function for any boolean annotation parsing we add in the future.

Renamed it to parse_bool_annotation in 5202e3d.
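For reference, Go's `strconv.ParseBool` accepts exactly `1`, `t`, `T`, `TRUE`, `true`, `True`, `0`, `f`, `F`, `FALSE`, `false`, and `False`, and returns an error for anything else. A helper matching that token set could look roughly like the sketch below. The signature is a guess for illustration, since the real function's parameters are not shown in this thread.

```rust
/// Parse a boolean annotation value using the same tokens as Go's
/// `strconv.ParseBool`. The signature here is hypothetical.
pub fn parse_bool_annotation(key: &str, raw: &str) -> Result<bool, String> {
    match raw {
        "1" | "t" | "T" | "TRUE" | "true" | "True" => Ok(true),
        "0" | "f" | "F" | "FALSE" | "false" | "False" => Ok(false),
        // Anything else is rejected so that a typo surfaces instead of
        // silently flipping the feature.
        other => Err(format!("invalid boolean {other:?} for annotation {key}")),
    }
}
```

Mixed-case spellings such as `tRuE` and words such as `yes` are errors in this sketch, as they are in `strconv.ParseBool`.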

adleong · 2026-06-11T01:01:38Z

 "linkerd-policy-controller-k8s-api",
 "linkerd2-proxy-api",
 "maplit",
+ "prost-types",


Why does this PR require adding a dependency?

It's adding a test dependency for policy-test because PenaltyPeakEwma exposes prost_types::Duration directly. There's nothing new, because this dependency already exists elsewhere (for runtime, in policy-controller/grpc), and this is just adding the edge policy-test -> test_depends-on -> prost-types.

A small clean-up we can do is to declare the version requirement in the root Cargo.toml and refer to the workspace version from both policy-test and policy-controller, but I thought of doing that in a separate PR.

The helper accepts the same boolean tokens as the Go controller's strconv.ParseBool and reads whatever annotation key the caller passes. This drops the balancer framing to use a more generic, reusable name.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

Use debug level for redundant or no-op annotations to avoid log spamming. Keep warn level for the values that get rejected and dropped.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

unleashed requested review from adleong and cratelyn June 10, 2026 23:51

unleashed requested a review from a team as a code owner June 10, 2026 23:51

unleashed added 3 commits June 11, 2026 02:18

unleashed force-pushed the amr/load-biaser-circuit-breaker-config branch from f981bfe to 0c5c2e8 Compare June 11, 2026 00:18

adleong mentioned this pull request Jun 11, 2026

feat(policy): support load-bias and retry-after balancer annotations #15317

Closed

adleong approved these changes Jun 11, 2026


unleashed added 3 commits June 11, 2026 12:15

fix(policy): use debug level for redundant annotation warnings

1af90b0

Use debug level for redundant or no-op annotations to avoid log spamming. Keep warn level for the values that get rejected and dropped.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

fix(policy): drop Retry-After on consecutive failures breaker

cb180e0

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>

adleong approved these changes Jun 11, 2026


adleong merged commit bd2a1cd into main Jun 11, 2026
97 of 103 checks passed

adleong deleted the amr/load-biaser-circuit-breaker-config branch June 11, 2026 18:05

adleong merged 6 commits into main from amr/load-biaser-circuit-breaker-config
