feat(backend): add llama.cpp (llama-server) backend wrapper #97
Open
bilersan wants to merge 1 commit into
Add llamacpp as a named backend alongside vllm, openai, anthropic,
ollama, and lmstudio. Follows the established vLLM pattern exactly:
- llamacpp struct embedding *openAICompat with cold-start retry
- Ping() delegates to coldStartRetry (reuses vllm_internal.go helper)
- Default endpoint: http://localhost:8080 (llama-server default)
- No API key required by default (llama-server runs unauthenticated)
.ctxrc usage (once factory wiring + ctx ai are implemented):
backends:
  - name: local
    type: llamacpp
    endpoint: http://localhost:8080
    timeout: 60s
default_backend: local
Note: this commit adds the backend type and tests only. Factory
registration wiring and consumer CLI commands (ctx ai, ctx compact
--emit, ctx ingest) are not yet implemented in the vllm-integration
branch either — they are expected in a future phase once the
backend abstraction layer stabilizes. When that wiring lands,
llamacpp will be available as a backend type with zero additional
work.
Validated against a live llama-server (Qwen3-4B-Q4_K_M):
- Ping: GET /v1/models returns 200
- Complete: POST /v1/chat/completions returns model response
- Cold-start retry: unit-tested via shared coldStartRetry helper
Files:
internal/config/backend/backend.go +2 constants
internal/backend/types.go +1 struct (llamacpp)
internal/backend/llamacpp.go constructor + Ping override
internal/backend/llamacpp_test.go 3 unit tests (mock server)
internal/backend/llamacpp_e2e_test.go 2 e2e tests (build tag: e2e)
Force-pushed from c2a127c to 5210fca
Contributes to #92
Summary
Adds llamacpp as a named backend alongside vllm, openai, anthropic, ollama, and lmstudio. Follows the established vLLM pattern exactly:
- llamacpp struct embedding *openAICompat with cold-start retry on ECONNREFUSED
- Ping() delegates to coldStartRetry (reuses vllm_internal.go helper, no duplication)
- Default endpoint: http://localhost:8080 (llama-server default)
.ctxrc usage (once factory wiring + consumer commands land)
backends:
  - name: local
    type: llamacpp
    endpoint: http://localhost:8080
    timeout: 60s
default_backend: local
What this PR does NOT include
This PR adds the backend type and tests only. It intentionally does not include:
- Factory registration wiring
- Consumer CLI commands (ctx ai, ctx compact --emit, ctx ingest), expected in a future phase
When the factory wiring and consumer commands land, llamacpp will be available as a backend type with zero additional work.
Why llama.cpp needs its own wrapper (vs generic openai-compatible)
llama-server behaves like vLLM during model loading: the TCP listener is not yet bound while weights load, so the OS returns ECONNREFUSED at the socket level (not HTTP 503). The coldStartRetry logic from vllm_internal.go handles exactly this case. A generic openai-compatible backend would fail immediately on connection refused instead of retrying.
Validation
Tested against a live llama-server running Qwen3-4B-Q4_K_M:
- Ping → GET /v1/models
- Complete → POST /v1/chat/completions
Files changed
internal/config/backend/backend.go +2 constants (NameLlamaCpp, DefaultEndpointLlamaCpp)
internal/backend/types.go +1 struct (llamacpp)
internal/backend/llamacpp.go constructor + Ping override
internal/backend/llamacpp_test.go 3 unit tests (mock server)
internal/backend/llamacpp_e2e_test.go 2 e2e tests (build tag: e2e, requires live server)