
fix(auto-recall): add timeout + fail-open so slow LLMs cannot stall startup or first turn #1673

Open
Sanjays2402 wants to merge 2 commits into MemTensor:main from Sanjays2402:fix/issue-1452


Conversation

@Sanjays2402

Summary

With auto-recall enabled and an existing memory database, a slow LLM on the recall/filter path could block the gateway critical path for 30-40 seconds — long enough to trip health checks and cause restart loops (#1452).

This PR wraps the recall/filter LLM work in a configurable timeout and ensures any exception in the auto-recall path fails open rather than propagating to the gateway top level.

Changes

  • New withTimeout helper that resolves to null on timeout (clean fail-open semantics).
  • Auto-recall LLM filter now races against recall.autoRecallTimeoutMs (default 8000 ms).
  • Top-level try/catch around the auto-recall block; on any error or timeout we log a warning and return an empty memory set so the prompt build proceeds normally.
  • New config key documented: recall.autoRecallTimeoutMs.
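A minimal sketch of what a fail-open timeout helper of this kind might look like. The `(promise, ms, label)` call shape matches the test quoted later in this thread; the warning log and the choice to let rejections propagate are assumptions, not taken from the PR's `with-timeout.ts`.

```typescript
// Sketch only: signature inferred from the test call
// withTimeout(promise, ms, label); the warning log is an assumption.
function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T | null> {
  return new Promise<T | null>((resolve, reject) => {
    // If the timer fires first, resolve to null instead of rejecting.
    const timer = setTimeout(() => {
      console.warn(`[${label}] timed out after ${timeoutMs} ms, failing open`);
      resolve(null);
    }, timeoutMs);

    work.then(
      (value) => {
        clearTimeout(timer); // do not leave a stray timer behind
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error); // errors are left to the caller's try/catch
      },
    );
  });
}
```

Note that a helper like this only abandons the slow call; it does not cancel it. The underlying LLM request keeps running in the background unless the caller aborts it separately.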

Behavior

  • Healthy LLM: indistinguishable from before — recall + filter happen, memories injected.
  • Slow LLM (timeout): warning logged, prompt builds with no auto-injected memories, gateway proceeds.
  • LLM error: same as timeout — warning logged, fail open.
  • Startup readiness is unaffected: auto-recall was never on that path, and the timeout plus fail-open behavior guarantees it stays that way.
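The three runtime cases above (healthy, slow, erroring) could be handled by a wrapper along these lines. This is a hedged sketch: `Memory`, `recallOrEmpty`, and the injected `warn` callback are hypothetical names for illustration, not identifiers from the PR.

```typescript
// Hypothetical memory shape, for illustration only.
type Memory = { id: number; text: string };

// Resolve to null if `work` has not settled within `ms`.
function raceTimeout<T>(work: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve, reject) => {
    const timer = setTimeout(() => resolve(null), ms);
    work.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Best-effort recall: any timeout or error yields an empty memory set.
async function recallOrEmpty(
  recall: () => Promise<Memory[]>,
  timeoutMs: number,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): Promise<Memory[]> {
  try {
    const result = await raceTimeout(recall(), timeoutMs);
    if (result === null) {
      warn(`auto-recall timed out after ${timeoutMs} ms; continuing without memories`);
      return [];
    }
    return result; // healthy path: same result as without the wrapper
  } catch (err) {
    warn(`auto-recall failed: ${String(err)}; continuing without memories`);
    return [];
  }
}
```

The caller never sees an exception or a long wait, so the prompt build can treat the result as an ordinary, possibly empty, list.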

Fixes #1452

…tartup or first turn

When auto-recall was enabled with an existing memory database and a slow
LLM on the recall/filter path, the before-prompt-build hook could block
the critical path for 30-40 seconds — long enough to trip gateway
health checks and contribute to restart loops.

Wrap the recall/filter work in a configurable timeout (default 8s, via
`recall.autoRecallTimeoutMs`) and a top-level try/catch that fails open
to an empty memory set. Auto-recall is best-effort enrichment; it must
never delay readiness or destabilize the gateway.

Fixes MemTensor#1452
Copilot AI review requested due to automatic review settings May 9, 2026 19:29
Contributor

Copilot AI left a comment


Pull request overview

This PR prevents before_prompt_build auto-recall from stalling the gateway critical path by adding a hard timeout and fail-open behavior around the recall search + LLM filter work (addressing #1452).

Changes:

  • Added a withTimeout helper that resolves to null on timeout (fail-open).
  • Wrapped auto-recall Phase 1 (parallel local + hub search) and Phase 2 (LLM filtering) in the configured timeout, with fail-open behavior.
  • Introduced recall.autoRecallTimeoutMs (default documented as 8000ms) and added unit tests for withTimeout.
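To make the two-phase structure concrete, here is a rough sketch of how the search and filter steps might each be bounded. The `RecallDeps` interface and its method names are hypothetical, and applying the same timeout value to each phase separately is an assumption; the PR may budget the time differently.

```typescript
type Hit = { id: number; text: string };

// Hypothetical dependencies standing in for the real search and LLM calls.
interface RecallDeps {
  searchLocal(query: string): Promise<Hit[]>;
  searchHub(query: string): Promise<Hit[]>;
  filterWithLLM(query: string, candidates: Hit[]): Promise<Hit[]>;
}

// Resolve to null if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve, reject) => {
    const timer = setTimeout(() => resolve(null), ms);
    work.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function autoRecall(
  deps: RecallDeps,
  query: string,
  timeoutMs: number,
): Promise<Hit[]> {
  try {
    // Phase 1: local and hub search run in parallel under one deadline.
    const phase1 = await withTimeout(
      Promise.all([deps.searchLocal(query), deps.searchHub(query)]),
      timeoutMs,
    );
    if (phase1 === null) return [];
    const candidates = phase1[0].concat(phase1[1]);
    if (candidates.length === 0) return [];

    // Phase 2: the LLM filter gets its own deadline (assumption).
    const filtered = await withTimeout(
      deps.filterWithLLM(query, candidates),
      timeoutMs,
    );
    return filtered === null ? [] : filtered;
  } catch (err) {
    console.warn(`auto-recall failed: ${String(err)}`);
    return []; // fail open on any error in either phase
  }
}
```

Under this assumption the worst-case added latency is roughly twice the configured timeout, which is worth keeping in mind when choosing a value relative to the health-check interval.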

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| apps/memos-local-openclaw/tests/with-timeout.test.ts | Adds Vitest coverage for the new withTimeout helper and timeout semantics. |
| apps/memos-local-openclaw/src/types.ts | Documents recall.autoRecallTimeoutMs and adds a default value to DEFAULTS. |
| apps/memos-local-openclaw/src/shared/with-timeout.ts | Implements withTimeout to race promises against a timeout and return null on timeout. |
| apps/memos-local-openclaw/index.ts | Applies withTimeout to auto-recall search and filter steps to prevent long stalls. |


Comment on lines +1903 to +1905
const autoRecallTimeoutMs =
  ctx.config.recall?.autoRecallTimeoutMs ?? DEFAULTS.autoRecallTimeoutMs;
const phase1 = await withTimeout(
Comment on lines +37 to +48
it("simulates the auto-recall hang path: a 30s LLM call falls back in 8s", async () => {
  // Mimic a slow recall LLM that would hang the gateway critical path.
  const hangingLLM = new Promise<{ relevant: number[]; sufficient: boolean }>(
    (resolve) => setTimeout(() => resolve({ relevant: [1, 2], sufficient: true }), 30_000),
  );
  const t0 = Date.now();
  const result = await withTimeout(hangingLLM, 50, "auto-recall.filter");
  const elapsed = Date.now() - t0;
  expect(result).toBeNull();
  // Must give up well under the 30s LLM completion time.
  expect(elapsed).toBeLessThan(500);
});
…lake timer tests

Address Copilot review on MemTensor#1673:
- index.ts: resolveConfig now passes through cfg.recall.autoRecallTimeoutMs
  to the resolved recall block. Without this the user-facing config key
  was effectively dead — the default always won.
- with-timeout.test.ts: switch to vi.useFakeTimers() and
  vi.advanceTimersByTimeAsync so the timeout assertions are deterministic
  under CI load. The previous wall-clock 'elapsed < 500ms' check was the
  most likely flake source.
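The config wiring issue can be illustrated with a small sketch. The types and the "before" function below are hypothetical simplifications; only the `?? DEFAULTS.autoRecallTimeoutMs` lookup and the 8000 ms default come from this thread.

```typescript
// Hypothetical shapes; the real config types are assumed to be richer.
interface RawConfig { recall?: { autoRecallTimeoutMs?: number } }
interface ResolvedConfig { recall: { autoRecallTimeoutMs?: number } }

const DEFAULTS = { autoRecallTimeoutMs: 8000 };

// Assumed "before" behavior: the key is dropped during resolution.
function resolveConfigDropping(_cfg: RawConfig): ResolvedConfig {
  return { recall: {} };
}

// "After" behavior: the user's value is passed through untouched.
function resolveConfig(cfg: RawConfig): ResolvedConfig {
  return { recall: { autoRecallTimeoutMs: cfg.recall?.autoRecallTimeoutMs } };
}

// Same lookup pattern as the reviewed lines in index.ts.
function effectiveTimeoutMs(config: ResolvedConfig): number {
  return config.recall?.autoRecallTimeoutMs ?? DEFAULTS.autoRecallTimeoutMs;
}

const userConfig: RawConfig = { recall: { autoRecallTimeoutMs: 2000 } };
console.log(effectiveTimeoutMs(resolveConfigDropping(userConfig))); // 8000
console.log(effectiveTimeoutMs(resolveConfig(userConfig))); // 2000
```

Because the consumer falls back with `??`, a dropped key fails silently: nothing errors, the default simply wins every time.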
@Sanjays2402
Author

Thanks — both addressed in the latest commit:

  • index.ts:1905 (config not wired): resolveConfig() (in src/config.ts) now passes cfg.recall.autoRecallTimeoutMs through to the resolved recall block. Without this the user-facing config key was effectively dead and the default always won — good catch.
  • with-timeout.test.ts:48 (flaky timers): Switched to vi.useFakeTimers() + vi.advanceTimersByTimeAsync() so the timeout behavior is asserted deterministically. The previous wall-clock elapsed < 500ms check was the obvious flake risk under CI load. Test suite now finishes in ~5ms and the 30s-hang case is verified by advancing fake time past 8001ms instead of measuring real elapsed time.
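The idea behind advancing fake time instead of measuring wall-clock time can be shown without a test framework. The sketch below is not Vitest's API: `vi.useFakeTimers()` patches the global timers, whereas this illustration injects a hypothetical hand-rolled `FakeClock` into a timeout helper to show the same principle.

```typescript
// Hand-rolled virtual clock (hypothetical, for illustration only).
class FakeClock {
  private now = 0;
  private nextId = 1;
  private timers: { id: number; at: number; fn: () => void }[] = [];

  setTimeout(fn: () => void, ms: number): number {
    const id = this.nextId++;
    this.timers.push({ id, at: this.now + ms, fn });
    return id;
  }

  clearTimeout(id: number): void {
    this.timers = this.timers.filter((t) => t.id !== id);
  }

  // Move virtual time forward, firing due timers in chronological order.
  advance(ms: number): void {
    const target = this.now + ms;
    for (;;) {
      let next = -1;
      for (let i = 0; i < this.timers.length; i++) {
        const t = this.timers[i];
        if (t.at <= target && (next === -1 || t.at < this.timers[next].at)) next = i;
      }
      if (next === -1) break;
      const [timer] = this.timers.splice(next, 1);
      this.now = timer.at;
      timer.fn();
    }
    this.now = target;
  }
}

// Timeout helper that schedules on the injected clock instead of real timers.
function withTimeoutOn<T>(clock: FakeClock, work: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve, reject) => {
    const id = clock.setTimeout(() => resolve(null), ms);
    work.then(
      (v) => { clock.clearTimeout(id); resolve(v); },
      (e) => { clock.clearTimeout(id); reject(e); },
    );
  });
}

async function simulateHang(): Promise<boolean> {
  const clock = new FakeClock();
  // A "30 s" LLM call that resolves only when virtual time reaches 30_000.
  const hangingLLM = new Promise<string>((resolve) => {
    clock.setTimeout(() => resolve("late"), 30_000);
  });
  const pending = withTimeoutOn(clock, hangingLLM, 8_000);
  clock.advance(8_001); // no real waiting involved
  return (await pending) === null;
}
```

Since no real time passes, the outcome does not depend on how loaded the CI machine is, which is the property the switch to fake timers was after.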



Development

Successfully merging this pull request may close these issues.

auto-recall can block gateway startup / first-turn path long enough to fail health checks

2 participants