fix(http): avoid marking nodes down on client errors#360

Open
lemonel wants to merge 1 commit into apolloconfig:develop from lemonel:dev/fix_unexpected_down_mark

@lemonel lemonel commented May 28, 2026

Summary

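  • Wrap 4xx responses in clientRequestInvalidError
  • RequestRecovery() no longer calls SetDownNode when it encounters this kind of error
  • Keep the existing degradation and mark-down logic for 5xx responses, timeouts, and network errors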

Why

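  • Previously, 4xx responses such as 404 were treated as node failures, so a failed request for a single namespace polluted the state of the entire node
  • This affected subsequent fetches of other namespaces, which returned empty configuration without any indication. For example, after starting Apollo and then calling GetConfig(namespace string) *storage.Config, if a nonexistent namespace is requested first, requests for other namespaces that do exist still return a nil config, and the caller has no way to detect this.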
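
The failure mode above can be reproduced in a toy model. The sketch below is not the library's implementation: nodeRegistry, fetch, and simulate are hypothetical names, and the model assumes that a node marked down is simply skipped on later requests.

```go
package main

import "fmt"

// nodeRegistry is a hypothetical, simplified model of per-node health state.
// It is not the library's implementation.
type nodeRegistry struct {
	down map[string]bool
}

// fetch simulates one namespace request against a node. When markDownOn4xx is
// true, a missing namespace (a 404) marks the whole node down, which models
// the behavior before this fix.
func (r *nodeRegistry) fetch(node, namespace string, existing map[string]bool, markDownOn4xx bool) (string, bool) {
	if r.down[node] {
		return "", false // node is skipped, so the caller sees no config
	}
	if !existing[namespace] {
		if markDownOn4xx {
			r.down[node] = true
		}
		return "", false
	}
	return "config for " + namespace, true
}

// simulate requests a missing namespace first and then an existing one, and
// reports whether the second request succeeded.
func simulate(markDownOn4xx bool) bool {
	reg := &nodeRegistry{down: map[string]bool{}}
	existing := map[string]bool{"application": true}
	reg.fetch("node-1", "missing-namespace", existing, markDownOn4xx)
	_, ok := reg.fetch("node-1", "application", existing, markDownOn4xx)
	return ok
}

func main() {
	fmt.Println("old behavior, existing namespace readable:", simulate(true))
	fmt.Println("new behavior, existing namespace readable:", simulate(false))
}
```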

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of client-side HTTP errors (400, 401, 404, 405). The system now correctly distinguishes between client errors and server errors, preventing servers from being incorrectly marked as unavailable when clients submit invalid requests.
  • Tests

    • Enhanced test coverage for HTTP error status codes with additional validation scenarios.

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1700fde7-8280-4e77-9a1e-7be9bc176c85

📥 Commits

Reviewing files that changed from the base of the PR and between d122893 and bf28929.

📒 Files selected for processing (2)
  • protocol/http/request.go
  • protocol/http/request_test.go

📝 Walkthrough

This PR introduces a typed error for client-side HTTP failures (400/401/404/405) to prevent unnecessary server downtime marking. The Request function now returns clientRequestInvalidError for these status codes, and RequestRecovery detects this error and returns immediately instead of marking the server as down. Tests are expanded to validate error classification and expected recovery behavior across multiple status codes.

Changes

Client Request Error Differentiation

| Layer / File(s) | Summary |
| --- | --- |
| Typed client request error and fail-fast detection (protocol/http/request.go) | Adds clientRequestInvalidError type with status code and isClientRequestInvalidError helper. Request returns this typed error for 400/401/404/405 responses. RequestRecovery detects the typed error and returns immediately without marking the server down. |
| Test validation of error classification and recovery behavior (protocol/http/request_test.go) | TestFailFastStatusCode refactored into parameterized table-driven test with expectations for error classification, duration reporting, and server downtime. Helper assertions validate isClientRequestInvalidError match, duration values, and IsDown state. |
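
A table-driven check of this kind might look like the sketch below. It is an illustration under assumptions, not the PR's TestFailFastStatusCode: statusCase, isClientStatus, probe, and runCases are hypothetical names, and the example runs as a plain program using httptest.NewRecorder instead of the testing package so that it stays self-contained.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusCase is one row of the table: a status code and whether it should be
// classified as a client request error.
type statusCase struct {
	name            string
	status          int
	wantClientError bool
}

// isClientStatus mirrors the four status codes named in the PR.
func isClientStatus(code int) bool {
	switch code {
	case http.StatusBadRequest, http.StatusUnauthorized,
		http.StatusNotFound, http.StatusMethodNotAllowed:
		return true
	}
	return false
}

// probe runs a handler that always answers with the given status and returns
// the status code the recorder captured. The "/configs" path is arbitrary.
func probe(status int) int {
	handler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/configs", nil))
	return rec.Code
}

// runCases evaluates every row and returns a description of each mismatch.
func runCases(cases []statusCase) []string {
	var failures []string
	for _, tc := range cases {
		if got := isClientStatus(probe(tc.status)); got != tc.wantClientError {
			failures = append(failures,
				fmt.Sprintf("%s: got %t, want %t", tc.name, got, tc.wantClientError))
		}
	}
	return failures
}

func defaultCases() []statusCase {
	return []statusCase{
		{"bad request", http.StatusBadRequest, true},
		{"unauthorized", http.StatusUnauthorized, true},
		{"not found", http.StatusNotFound, true},
		{"method not allowed", http.StatusMethodNotAllowed, true},
		{"internal server error", http.StatusInternalServerError, false},
		{"bad gateway", http.StatusBadGateway, false},
	}
}

func main() {
	fmt.Println("mismatches:", len(runCases(defaultCases())))
}
```

In a real test file the loop body would typically move into t.Run(tc.name, ...) so that each row is reported as its own subtest.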

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit hops through error states so clear,
Client faults now fail fast without a tear,
No servers marked down for bad requests in sight,
Just typed errors pointing the true path right! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(http): avoid marking nodes down on client errors' directly and accurately describes the main change in the PR - preventing 4xx client errors from marking nodes as down. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


mergify Bot commented May 28, 2026

Contributor

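Thank you for submitting this Pull Request. I will review it as soon as possible. I will take a look or reply within 1-2 days; handling may be slower during holidays, and I appreciate your understanding.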

@nobodyiam

Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf289294da

Comment thread protocol/http/request.go
Comment on lines 187 to +189
  case http.StatusBadRequest, http.StatusUnauthorized, http.StatusNotFound, http.StatusMethodNotAllowed:
  	log.Errorf("Connect Apollo Server Fail, url:%s, StatusCode:%d", requestURL, res.StatusCode)
- 	return nil, errors.New(fmt.Sprintf("Connect Apollo Server Fail, StatusCode:%d", res.StatusCode))
+ 	return nil, &clientRequestInvalidError{statusCode: res.StatusCode}

[P2] Treat forbidden responses as client errors

When Apollo returns another client-side status such as 403 Forbidden for an unauthorized app/namespace, this still falls through to the retrying default case and RequestRecovery will call SetDownNode, marking a healthy node down for a bad request/permission issue. Apollo documents 403 as a client authorization error alongside 400/401/404/405 (https://github.com/apolloconfig/apollo/wiki/Apollo%E5%BC%80%E6%94%BE%E5%B9%B3%E5%8F%B0#%E5%9B%9B%E9%94%99%E8%AF%AF%E7%A0%81%E8%AF%B4%E6%98%8E), so this fix should classify the full non-retryable 4xx range (or at least include 403) before marking nodes down.
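
One way to act on this suggestion is to classify by range instead of by an explicit list. The sketch below is only an illustration: isNonRetryableClientStatus is a hypothetical helper, and treating 408 and 429 as retryable is an assumption of this example, not something the review comment or the PR specifies.

```go
package main

import (
	"fmt"
	"net/http"
)

// isNonRetryableClientStatus is a hypothetical helper that treats the whole
// 4xx range as a client error, including 403 Forbidden.
// Assumption: 408 Request Timeout and 429 Too Many Requests are excluded
// because they usually signal a transient condition worth retrying.
func isNonRetryableClientStatus(code int) bool {
	if code < 400 || code > 499 {
		return false
	}
	switch code {
	case http.StatusRequestTimeout, http.StatusTooManyRequests:
		return false
	}
	return true
}

func main() {
	for _, code := range []int{400, 401, 403, 404, 405, 408, 429, 500} {
		fmt.Printf("status=%d nonRetryableClientError=%t\n", code, isNonRetryableClientStatus(code))
	}
}
```

A narrower alternative is to add http.StatusForbidden to the existing case list, which keeps the change small at the cost of leaving other 4xx codes on the retry path.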

@zouyx zouyx changed the base branch from master to develop June 15, 2026 01:27