feat(backend): add llama.cpp (llama-server) backend wrapper #97

Open
bilersan wants to merge 1 commit into ActiveMemory:feat/vllm-integration from bilersan:feat/llamacpp-backend

bilersan commented May 23, 2026

Contributes to #92

Summary

Adds llamacpp as a named backend alongside vllm, openai, anthropic, ollama, and lmstudio. Follows the established vLLM pattern exactly:

  • llamacpp struct embedding *openAICompat with cold-start retry on ECONNREFUSED
  • Ping() delegates to coldStartRetry (reuses vllm_internal.go helper — no duplication)
  • Default endpoint: http://localhost:8080 (llama-server default)
  • No API key required by default (llama-server typically runs unauthenticated)

.ctxrc usage (once factory wiring + consumer commands land)

backends:
  - name: local
    type: llamacpp
    endpoint: http://localhost:8080
    timeout: 60s
default_backend: local

What this PR does NOT include

This PR adds the backend type and tests only. It intentionally does not include:

  • Factory registration wiring — not yet implemented for any backend in this branch
  • Consumer CLI commands (ctx ai, ctx compact --emit, ctx ingest) — expected in a future phase

When the factory wiring and consumer commands land, llamacpp will be available as a backend type with zero additional work.

Why llama.cpp needs its own wrapper (vs generic openai-compatible)

llama-server behaves like vLLM during model loading: the TCP listener is not yet bound while weights load, so the OS returns ECONNREFUSED at the socket level (not HTTP 503). The coldStartRetry logic from vllm_internal.go handles exactly this case. A generic openai-compatible backend would fail immediately on connection refused instead of retrying.

Validation

Tested against a live llama-server running Qwen3-4B-Q4_K_M:

| Test | Result |
|---|---|
| Ping (GET /v1/models) | HTTP 200, model listed |
| Complete (POST /v1/chat/completions) | Model response received |
| Cold-start retry (unit) | Retries ECONNREFUSED, stops on non-dial errors |
| Non-dial error (unit) | HTTP 500 returns immediately, no retry |
| All existing backend tests | PASS (no regressions) |

Files changed

| File | Change |
|---|---|
| internal/config/backend/backend.go | +2 constants (NameLlamaCpp, DefaultEndpointLlamaCpp) |
| internal/backend/types.go | +1 struct (llamacpp) |
| internal/backend/llamacpp.go | Constructor + Ping override (~60 lines) |
| internal/backend/llamacpp_test.go | 3 unit tests (mock httptest server) |
| internal/backend/llamacpp_e2e_test.go | 2 e2e tests (build tag: e2e, requires live server) |

bilersan requested a review from josealekhine as a code owner on May 23, 2026, 14:35

Add llamacpp as a named backend alongside vllm, openai, anthropic,
ollama, and lmstudio. Follows the established vLLM pattern exactly:

- llamacpp struct embedding *openAICompat with cold-start retry
- Ping() delegates to coldStartRetry (reuses vllm_internal.go helper)
- Default endpoint: http://localhost:8080 (llama-server default)
- No API key required by default (llama-server runs unauthenticated)

.ctxrc usage (once factory wiring + ctx ai are implemented):

    backends:
      - name: local
        type: llamacpp
        endpoint: http://localhost:8080
        timeout: 60s
    default_backend: local

Note: this commit adds the backend type and tests only. Factory
registration wiring and consumer CLI commands (ctx ai, ctx compact
--emit, ctx ingest) are not yet implemented in the vllm-integration
branch either — they are expected in a future phase once the
backend abstraction layer stabilizes. When that wiring lands,
llamacpp will be available as a backend type with zero additional
work.

Validated against a live llama-server (Qwen3-4B-Q4_K_M):
- Ping: GET /v1/models returns 200
- Complete: POST /v1/chat/completions returns model response
- Cold-start retry: unit-tested via shared coldStartRetry helper

Files:
  internal/config/backend/backend.go  +2 constants
  internal/backend/types.go           +1 struct (llamacpp)
  internal/backend/llamacpp.go        constructor + Ping override
  internal/backend/llamacpp_test.go   3 unit tests (mock server)
  internal/backend/llamacpp_e2e_test.go  2 e2e tests (build tag: e2e)

bilersan force-pushed the feat/llamacpp-backend branch from c2a127c to 5210fca on May 23, 2026, 15:56