feat: report self-bootstrap status via Health RPC by couragehong · Pull Request #6 · CryptoLabInc/runed

couragehong · 2026-05-28T08:35:35Z

Previously, the gRPC UDS opened after self-bootstrap completed, so clients dialing during the multi-minute install window only saw dial failures. They now connect immediately and observe STATUS_LOADING with the current Phase (FETCHING_LLAMA_SERVER / FETCHING_MODEL / STARTING_LLAMA_SERVER) and download bytes_done / bytes_total — these proto fields were already defined; this PR is the matching implementation.
Embed/EmbedBatch return codes.FailedPrecondition before the backend is wired so retry policies don't burn budget against a non-ready daemon. Once SetBackend wires the backend, Health flips to STATUS_OK and embed requests are accepted.
Every termination path (Shutdown RPC, idle timeout, OS signals, serve error, early failure during bootstrap) flips Health to STATUS_SHUTTING_DOWN before the listener closes, so clients no longer see an abrupt connection drop with no signal. A backend.Start failure also reaps a possibly orphaned child via b.Stop.
Client code (rune-mcp) doesn't need to change to remain working — surfacing Phase / bytes_done in a polling UI is a follow-up on the rune-mcp side; the runed surface is now ready for it.
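
The termination ordering described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the three interfaces, the `recorder` type, and the `cleanup` function are hypothetical stand-ins for the server, the gRPC server, and the llama backend, and the error returned by `Stop` is invented for the demo. The sketch only shows the sequence (flip Health, drain, best-effort backend stop) and that a repeated shutdown trigger is a no-op.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the three collaborators in the cleanup path.
type healthSource interface{ TriggerShutdown() }
type grpcServer interface{ GracefulStop() }
type backendStopper interface{ Stop() error }

// recorder implements all three interfaces and logs the call order.
type recorder struct {
	mu    sync.Mutex
	once  sync.Once
	calls []string
}

func (r *recorder) note(s string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.calls = append(r.calls, s)
}

// TriggerShutdown is guarded by sync.Once, so a second call does nothing.
func (r *recorder) TriggerShutdown() { r.once.Do(func() { r.note("TriggerShutdown") }) }
func (r *recorder) GracefulStop()    { r.note("GracefulStop") }
func (r *recorder) Stop() error {
	r.note("Stop")
	return errors.New("no child to reap") // invented error for the demo
}

// cleanup flips Health first, then drains in-flight RPCs, then stops the
// backend on a best-effort basis (its error is deliberately ignored).
func cleanup(h healthSource, gs grpcServer, b backendStopper) {
	h.TriggerShutdown()
	gs.GracefulStop()
	_ = b.Stop()
}

func main() {
	r := &recorder{}
	cleanup(r, r, r)
	r.TriggerShutdown() // a follow-up trigger is a no-op
	fmt.Println(r.calls) // [TriggerShutdown GracefulStop Stop]
}
```

Flipping the Health state before draining is what gives a polling client one last STATUS_SHUTTING_DOWN response instead of a bare connection error.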

Until self-bootstrap completes, Embed/EmbedBatch return codes.FailedPrecondition rather than dialing into a nil backend, and Health reports STATUS_LOADING with the current Phase / bytes_done / bytes_total / message that bootstrap can feed in.

Mechanics:
- backend reference moves to atomic.Pointer[backend.LlamaBackend]; modelIdentity becomes atomic.Value; bootstrapStatus is an atomic.Pointer[bootstrapState] so {phase, bytes, message} publish and observe as one tuple.
- New(version) constructs a Server with nil backend. SetBackend(b, modelID) wires it after bootstrap, writing maxTextLength / modelIdentity / backend in that order so a reader seeing backend necessarily sees the other two.
- SetBootstrapStatus(phase, bytesDone, bytesTotal, message) is the loading-state sink consumed by the next Health call while backend is still nil.

Health priority is SHUTTING_DOWN (shutdownCh closed) → LOADING (backend nil) → DEGRADED (IsHealthy false) → OK. SHUTTING_DOWN outranks LOADING so a drain-in-progress daemon doesn't advertise itself as "still loading" mid-drain.

Tests cover the LOADING / SetBootstrapStatus reflection / FAILED_PRECONDITION before-SetBackend / SHUTTING_DOWN priority cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
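
The publish order and Health priority described in this commit can be sketched roughly as below. This is an illustration under assumptions, not the actual runed server: the `backend` struct, the `Status` constants, the `Bootstrap` accessor, and the literal values passed to `SetBackend` are hypothetical, and the real handler would return a proto response rather than a bare enum.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Status is a hypothetical local mirror of the four Health states.
type Status int

const (
	StatusOK Status = iota
	StatusLoading
	StatusDegraded
	StatusShuttingDown
)

var statusNames = [...]string{"OK", "LOADING", "DEGRADED", "SHUTTING_DOWN"}

func (s Status) String() string { return statusNames[s] }

// backend is a hypothetical stand-in for backend.LlamaBackend.
type backend struct{ healthy bool }

// bootstrapState is swapped as one immutable value, so a reader never pairs
// a phase from one update with byte counts from another.
type bootstrapState struct {
	phase                 string
	bytesDone, bytesTotal int64
	message               string
}

type Server struct {
	maxTextLength atomic.Int64
	modelIdentity atomic.Value
	backend       atomic.Pointer[backend]
	bootstrap     atomic.Pointer[bootstrapState]
	shutdownCh    chan struct{}
	shutdownOnce  sync.Once
}

func New() *Server { return &Server{shutdownCh: make(chan struct{})} }

func (s *Server) SetBootstrapStatus(phase string, done, total int64, msg string) {
	s.bootstrap.Store(&bootstrapState{phase: phase, bytesDone: done, bytesTotal: total, message: msg})
}

func (s *Server) Bootstrap() *bootstrapState { return s.bootstrap.Load() }

// SetBackend stores the backend pointer last: a reader that loads a non-nil
// backend is then guaranteed to observe the two earlier stores as well.
func (s *Server) SetBackend(b *backend, modelID string, maxLen int64) {
	s.maxTextLength.Store(maxLen)
	s.modelIdentity.Store(modelID)
	s.backend.Store(b)
}

func (s *Server) TriggerShutdown() { s.shutdownOnce.Do(func() { close(s.shutdownCh) }) }

// Health checks conditions in priority order:
// SHUTTING_DOWN, then LOADING, then DEGRADED, then OK.
func (s *Server) Health() Status {
	select {
	case <-s.shutdownCh:
		return StatusShuttingDown
	default:
	}
	b := s.backend.Load()
	switch {
	case b == nil:
		return StatusLoading
	case !b.healthy:
		return StatusDegraded
	}
	return StatusOK
}

func main() {
	s := New()
	s.SetBootstrapStatus("FETCHING_MODEL", 10, 100, "downloading model")
	fmt.Println(s.Health()) // LOADING
	s.SetBackend(&backend{healthy: true}, "example-model", 512)
	fmt.Println(s.Health()) // OK
	s.TriggerShutdown()
	fmt.Println(s.Health()) // SHUTTING_DOWN
}
```

Because Go's sync/atomic operations are sequentially consistent, storing the backend pointer last is enough for the "seeing backend implies seeing the other two" property; no mutex is needed on the read path.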

Threads an optional reporter through EnsureAll / EnsureLlamaServer / EnsureModel so download-byte progress and per-stage transitions reach callers without coupling the bootstrap package to a specific status sink (e.g. server.SetBootstrapStatus).

Callback shape:

    type StatusReporter func(stage string, bytesDone, bytesTotal int64)

stage is "llama_server" or "model"; the caller maps that to whichever domain enum it cares about (cmd/runed routes to HealthResponse_Phase).

Reporter calls run inline on the download goroutine and share the existing 2-second throttle inside makeProgress, so the status sink isn't flooded at full chunk cadence.

Stage-transition ticks are emitted by the public entry points *before* AcquireLock so a trailer waiting on the install lock still surfaces the correct stage to clients during the lock-wait window. The internal ensure* helpers no longer emit their own ticks; under the new arrangement they would have produced duplicate transitions, and the EnsureAll path explicitly issues the llama-server → model transition between the two internal calls (still inside the same lock).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
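
A throttled reporter of the kind described above might look like the sketch below. Only the `StatusReporter` type comes from the commit message; `throttled`, `reportInterval`, and `fakeClock` are hypothetical names, and the choice to always forward stage changes and the final tick is an assumption of this sketch, not something the commit states about makeProgress. The wrapper is not goroutine-safe, which matches the stated model of calls running inline on a single download goroutine.

```go
package main

import (
	"fmt"
	"time"
)

// StatusReporter has the callback shape quoted in the commit message.
type StatusReporter func(stage string, bytesDone, bytesTotal int64)

// reportInterval mirrors the 2-second throttle mentioned in the commit.
const reportInterval = 2 * time.Second

// throttled (hypothetical) drops ticks that arrive within interval of the
// last forwarded one. Assumption: a stage change and the final tick
// (bytesDone >= bytesTotal) are always forwarded. A nil reporter yields a
// no-op, which keeps the reporter optional for callers.
func throttled(r StatusReporter, interval time.Duration, now func() time.Time) StatusReporter {
	if r == nil {
		return func(string, int64, int64) {}
	}
	var (
		started   bool
		last      time.Time
		lastStage string
	)
	return func(stage string, done, total int64) {
		t := now()
		final := total > 0 && done >= total
		if started && stage == lastStage && !final && t.Sub(last) < interval {
			return // too soon since the last forwarded tick
		}
		started, last, lastStage = true, t, stage
		r(stage, done, total)
	}
}

// fakeClock makes the demo deterministic without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time      { return c.t }
func (c *fakeClock) Advance(seconds int) { c.t = c.t.Add(time.Duration(seconds) * time.Second) }

func main() {
	clock := &fakeClock{}
	report := throttled(func(stage string, done, total int64) {
		fmt.Printf("%s %d/%d\n", stage, done, total)
	}, reportInterval, clock.Now)

	report("model", 1, 100) // forwarded: first tick
	clock.Advance(1)
	report("model", 2, 100) // dropped: only 1s elapsed
	clock.Advance(2)
	report("model", 3, 100)   // forwarded: 3s since last forwarded tick
	report("model", 100, 100) // forwarded: final tick
}
```

Passing the clock in as a function keeps the throttle testable; production code would pass `time.Now`.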

The gRPC UDS used to open *after* self-bootstrap completed, so clients connecting during the multi-minute install window only saw dial failures. They now connect immediately and observe STATUS_LOADING with the current Phase + bytes_done / bytes_total / message, exactly what the proto envisioned when these fields were originally defined.

Flow rewrite:

    paths → daemon-check → server.New (backend unset)
      → ipc.Listen → grpc.Serve [bg]
      → SetBootstrapStatus(UNSPECIFIED, "fetching manifest")
      → selfBootstrap with reporter (Phase flips per stage tick)
      → SetBootstrapStatus(STARTING_LLAMA_SERVER)
      → backend.Start
      → srv.SetBackend(b, modelID)   ← Health flips to OK
      → idle ticker / signal wait / drain

reporter is a closure that maps each bootstrap stage to its proto Phase + message via stagePhase() and forwards to srv.SetBootstrapStatus. The proto omits PHASE_FETCHING_MANIFEST, so the manifest-fetch interval reports PHASE_UNSPECIFIED with a "fetching manifest" message; clients that surface message render correctly without depending on enum recognition for that brief stage.

A new bailBoot(logger, srv, gs, b) helper centralises early-failure cleanup: any boot-time error (selfBootstrap, sha256File, backend.Start, parseIdleTimeout) drives the same TriggerShutdown + GracefulStop + best-effort b.Stop sequence. Clients see one final STATUS_SHUTTING_DOWN before the listener closes instead of an abrupt connection drop, which matches the experience on the normal exit path.

main's exit-select also calls srv.TriggerShutdown() unconditionally so OS-signal and serve-error exits flip Health to SHUTTING_DOWN too (sync.Once makes a follow-up Shutdown RPC a no-op).

backend.Start failure additionally calls b.Stop, because b.Start may have spawned a child that failed health-probe, leaving an orphan llama-server holding ~470MB; b.Stop is idempotent on never-spawned backends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
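
The stage-to-Phase mapping performed by the reporter closure could look something like this. It is a sketch under assumptions: `Phase` is a hypothetical local mirror of the generated HealthResponse_Phase enum, `statusSink` and `newReporter` are invented names, and the message strings are placeholders rather than the ones runed emits. Only the stage names "llama_server" and "model" and the existence of stagePhase() come from the commit message.

```go
package main

import "fmt"

// Phase is a hypothetical local mirror of the proto Phase enum.
type Phase int

const (
	PhaseUnspecified Phase = iota
	PhaseFetchingLlamaServer
	PhaseFetchingModel
	PhaseStartingLlamaServer
)

// stagePhase maps a bootstrap stage name to a Phase and a human-readable
// message. Unknown stages fall back to PhaseUnspecified and carry the raw
// stage name, so clients that render the message still show something.
func stagePhase(stage string) (Phase, string) {
	switch stage {
	case "llama_server":
		return PhaseFetchingLlamaServer, "fetching llama-server" // placeholder text
	case "model":
		return PhaseFetchingModel, "fetching model" // placeholder text
	default:
		return PhaseUnspecified, stage
	}
}

// statusSink has the parameter shape of SetBootstrapStatus as described in
// the commit message, with Phase swapped for the local mirror type.
type statusSink func(phase Phase, bytesDone, bytesTotal int64, message string)

// newReporter adapts a sink to the bootstrap package's callback shape, so
// the bootstrap code never needs to import the server or proto packages.
func newReporter(sink statusSink) func(stage string, bytesDone, bytesTotal int64) {
	return func(stage string, done, total int64) {
		phase, msg := stagePhase(stage)
		sink(phase, done, total, msg)
	}
}

func main() {
	report := newReporter(func(p Phase, done, total int64, msg string) {
		fmt.Printf("phase=%d %d/%d %q\n", p, done, total, msg)
	})
	report("llama_server", 0, 0) // phase=1 0/0 "fetching llama-server"
	report("model", 512, 1024)   // phase=2 512/1024 "fetching model"
}
```

Keeping the mapping on the caller's side is what lets the bootstrap package stay independent of any particular status sink.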

The three preceding commits accumulated a lot of explanatory prose that restates what well-named identifiers already convey. This pass prunes WHAT-style narration, "added for X" meta notes, and historical context, keeping only the lines that capture a non-obvious WHY: publish order in SetBackend, the FAILED_PRECONDITION-vs-Unavailable rationale in Embed, the SHUTTING_DOWN-outranks-LOADING priority in Health, the chars==tokens conservatism in maxTextLength, the trailer-wait reason for emitting stage ticks before AcquireLock, etc.

No behaviour change; ~145 lines net removed across server / bootstrap / runed plus their tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

couragehong requested review from esifea and jh-lee-cryptolab May 28, 2026 08:35

couragehong and others added 3 commits May 28, 2026 17:40

couragehong force-pushed the feat/bootstrap-status branch from a339f8d to f01512a Compare May 28, 2026 08:42

jh-lee-cryptolab reviewed May 28, 2026

Comment thread cmd/runed/main.go

fix(runed): route serve-error exit through shared shutdown cleanup

c6839d5

jh-lee-cryptolab approved these changes May 29, 2026

esifea merged commit d04f8d3 into CryptoLabInc:main May 31, 2026
7 checks passed

esifea merged 5 commits into CryptoLabInc:main from couragehong:feat/bootstrap-status

3 participants
