
fix(provider): apply MergeSystemMessages for vLLM provider #3196

Open
tmchow wants to merge 1 commit into tailcallhq:main from tmchow:fix/3128-vllm-system-message-ordering

Conversation

@tmchow
Contributor

@tmchow tmchow commented Apr 29, 2026

Summary

Apply MergeSystemMessages to the vLLM built-in provider in addition to NVIDIA. vLLM rejects requests where the system message is not first, which is exactly what this transformer was built to handle.

Why this matters

Issue #3128 reports that vLLM serving mlx-community/Qwen3.6-27B-bf16 rejects every request with:

ERROR: POST http://1.2.3.4:8000/v1/chat/completions
Caused by: 404 Not Found Reason: {"error": "System message must be at the beginning."}

MergeSystemMessages already exists for this case. The transformer doc comment even says "Some providers (e.g. NVIDIA) reject requests with multiple system messages or system messages that are not positioned at the start". The "e.g." anticipated more cases, but the gating predicate at crates/forge_app/src/dto/openai/transformers/pipeline.rs:86 only matched NVIDIA.
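
For context, here is a self-contained sketch of the behavior this transformer implements. The Role/Message types and the function name are illustrative stand-ins, not the actual forge_app definitions:

// Conceptual sketch of the MergeSystemMessages behavior: collect every system
// message, join their contents, and emit a single system message at the front.
#[derive(Clone, Debug, PartialEq)]
enum Role { System, User, Assistant }

#[derive(Clone, Debug)]
struct Message { role: Role, content: String }

fn merge_system_messages(messages: Vec<Message>) -> Vec<Message> {
    let (systems, others): (Vec<Message>, Vec<Message>) =
        messages.into_iter().partition(|m| m.role == Role::System);
    if systems.is_empty() {
        return others;
    }
    let merged = Message {
        role: Role::System,
        content: systems
            .iter()
            .map(|m| m.content.as_str())
            .collect::<Vec<_>>()
            .join("\n\n"),
    };
    // Example: [system "A", user "hi", system "B"] becomes [system "A\n\nB", user "hi"].
    std::iter::once(merged).chain(others).collect()
}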

vLLM is a built-in provider (registered in crates/forge_repo/src/provider/provider.json as id "vllm"), so it has its own ProviderId and goes through the openai pipeline. The fix adds it to the same predicate.

Changes

  • crates/forge_app/src/dto/openai/transformers/pipeline.rs: predicate now matches ProviderId::NVIDIA or the vLLM provider id (resolved via ProviderId::from_str("vllm") in the same style as the existing kimi_coding binding a few lines below).
  • crates/forge_app/src/dto/openai/transformers/ensure_system_first.rs: doc comment now lists vLLM alongside NVIDIA.

The transformer is a no-op for already-compliant requests (single system message at the start), so no behavior changes for users whose vLLM server already gets correct input.
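
To make the gating change concrete, here is a rough sketch of the predicate shape described above. ProviderId here is a stand-in for the real forge_app type; the constant and FromStr details are assumptions, not the actual API:

use std::str::FromStr;

// Stand-in for the real ProviderId, which is resolved from the ids registered
// in provider.json and exposes constants such as NVIDIA.
#[derive(Clone, Debug, PartialEq)]
struct ProviderId(String);

impl ProviderId {
    // Approximation of the ProviderId::NVIDIA constant.
    fn nvidia() -> Self { ProviderId("nvidia".to_string()) }
}

impl FromStr for ProviderId {
    type Err = std::convert::Infallible;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(ProviderId(s.to_string()))
    }
}

// The predicate that gates MergeSystemMessages: previously NVIDIA only,
// now NVIDIA or the built-in vLLM provider id.
fn needs_system_message_merge(provider: &ProviderId) -> bool {
    let vllm = ProviderId::from_str("vllm").expect("infallible");
    *provider == ProviderId::nvidia() || *provider == vllm
}

In the real pipeline this check only decides whether the MergeSystemMessages transformer runs; the merging behavior itself is unchanged.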

Testing

  • New pipeline-level test test_vllm_provider_merges_system_messages asserts that a vLLM provider with mixed system messages produces a single merged system message at the front (a stand-alone sketch of this assertion follows the list).
  • cargo test -p forge_app: 696 passed, 0 failed.
  • cargo clippy -p forge_app --all-targets --all-features -- -D warnings: clean.
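
As a rough, stand-alone approximation of what that test asserts (the real test drives the openai pipeline with forge_app's request types; here the merge is inlined over plain tuples):

#[test]
fn vllm_mixed_system_messages_are_merged_to_the_front() {
    let input = vec![
        ("system", "rule A"),
        ("user", "hi"),
        ("system", "rule B"),
        ("assistant", "hello"),
    ];

    // Inline stand-in for the MergeSystemMessages transform.
    let (systems, others): (Vec<(&str, &str)>, Vec<(&str, &str)>) =
        input.into_iter().partition(|(role, _)| *role == "system");
    let merged = systems
        .iter()
        .map(|(_, content)| *content)
        .collect::<Vec<_>>()
        .join("\n\n");
    let mut output = vec![("system".to_string(), merged)];
    output.extend(others.into_iter().map(|(r, c)| (r.to_string(), c.to_string())));

    // Exactly one system message, positioned first, with both contents preserved.
    assert_eq!(output.iter().filter(|(r, _)| r == "system").count(), 1);
    assert_eq!(output[0], ("system".to_string(), "rule A\n\nrule B".to_string()));
}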

Closes #3128

Compound Engineering

@CLAassistant

CLAassistant commented Apr 29, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type: fix (Iterations on existing features or infrastructure.) label on Apr 29, 2026
@tmchow tmchow force-pushed the fix/3128-vllm-system-message-ordering branch from 1b5ec06 to 6bb8c21 on April 29, 2026 02:49
Comment thread on crates/forge_app/src/dto/openai/transformers/pipeline.rs (Outdated)
@amitksingh1490
Contributor

vLLM is a built-in provider

@tmchow tmchow force-pushed the fix/3128-vllm-system-message-ordering branch from 6bb8c21 to f91cfa8 on April 29, 2026 15:29
@tmchow tmchow changed the title from "fix(provider): apply MergeSystemMessages for OPENAI_COMPATIBLE custom providers" to "fix(provider): apply MergeSystemMessages for vLLM provider" on Apr 29, 2026
@tmchow
Contributor Author

tmchow commented Apr 29, 2026

Predicate now matches NVIDIA + vLLM (not OPENAI_COMPATIBLE). Updated the PR description.

@YoyoSailer

YoyoSailer commented May 2, 2026

Just jumping in - I have a problem with this approach, as it is hardcoded to vLLM. I have my instance hosted behind an nginx proxy with a little middleware, and I access it via a URL and auth, so I used the OpenAI Compatible provider.

My vLLM setup worked fine with other tools.

I guess I should use the VLLM provider, but for that I do not have a PORT :) Let me open a new ticket for that.

Just to note it here, I have fixed this in my middleware as well:

// Qwen3 chat template requires system message at index 0 only.
// Merge any mid-conversation system messages into a single leading block.
const systems = req.body.messages.filter(m => m.role === 'system');
if (systems.length > 1 || (systems.length === 1 && req.body.messages[0].role !== 'system')) {
  const others = req.body.messages.filter(m => m.role !== 'system');
  const mergedContent = systems
    .map(m => (typeof m.content === 'string' ? m.content : JSON.stringify(m.content)))
    .filter(Boolean)
    .join('\n\n');
  req.body.messages = [{ role: 'system', content: mergedContent }, ...others];
}

It just moves all the system messages into one place.

@tmchow tmchow force-pushed the fix/3128-vllm-system-message-ordering branch from f91cfa8 to cc6be5b on May 5, 2026 03:17
vLLM is a built-in OpenAI-response provider in provider.json, but the
`MergeSystemMessages` predicate in the openai pipeline only matched NVIDIA.
vLLM rejects requests where the system message is not first
(per tailcallhq#3128: 'System message must be at the beginning.'), so add vLLM to
the same predicate using the existing `from_str` pattern that handles
providers without a `ProviderId` constant.

Closes tailcallhq#3128
@tmchow tmchow force-pushed the fix/3128-vllm-system-message-ordering branch from cc6be5b to 34e0e3d on May 11, 2026 00:33
@tmchow
Contributor Author

tmchow commented May 11, 2026

Thanks @YoyoSailer for the context (and the middleware snippet). Sounds like the vLLM path works for you with VLLM_PORT now that #3258 landed.

Rebased onto main in 34e0e3d to clear the merge-behind state. @amitksingh1490 @laststylebender14 anything else on this one?


Labels

type: fix (Iterations on existing features or infrastructure.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: System message must be at the beginning. when using vllm

4 participants