Sampling Proxy

A middleware server for OpenAI-compatible backends with passthrough (OpenAI/Anthropic), Anthropic-to-OpenAI conversion, sampling params override, and mid-generation response validation.

Features

Passthrough Modes: OpenAI, Anthropic, and Anthropic-to-OpenAI conversion
Multiple Upstream Servers: Route different models to different backends with glob pattern matching
Parameter Override: Apply custom sampling parameters per model
Parallel Request Limits: Limit concurrent requests per model with automatic queueing
Request Throttling: Configurable cooldown delays between requests per model
Streaming Support: Both streaming and non-streaming responses
Garbage Detection: Validate responses and auto-retry when garbage output is detected
Mid-Stream Validation: Detect garbage during generation (not just at the end)
Flexible Validator API: Supports Anthropic and OpenAI API formats

Quick Start

# Clone and setup
git clone https://github.com/avtc/sampling-proxy.git
cd sampling-proxy
python -m venv sampling-proxy

# Activate venv and install
source sampling-proxy/bin/activate  # Linux/macOS
sampling-proxy\Scripts\activate     # Windows
pip install -r requirements.txt

# Configure and run
cp config_sample.json config.json
python sampling_proxy.py

One-line Scripts (auto-activate venv)

./sampling_proxy.sh        # Linux/macOS
.\sampling_proxy.ps1       # Windows

Both scripts auto-activate the sampling-proxy venv and run the proxy.

Sample Validator Setup (llama.cpp)

Run a small model as a validator for garbage detection:

llama-server --hf-repo unsloth/Qwen3.5-4B-GGUF --hf-file Qwen3.5-4B-UD-Q6_K_XL.gguf --host 127.0.0.1 --port 1235 -ngl 99 --parallel 2 --jinja -fa on -c 40000 --chat-template-kwargs "{\"enable_thinking\": false}" --temp 1 --min-p 0 --top-p 0.95 --top-k 20 --repeat-penalty 1 --presence-penalty 1.5 --cache-ram 0

Command Line Options

python sampling_proxy.py --help

Option	Description
`--config, -c`	Path to config JSON file
`--host`	Proxy server host
`--port`	Proxy server port
`--target-base-url`	Backend URL (repeatable, use `name=url` to target specific upstream)
`--model-bindings`	Model-to-upstream bindings as JSON string
`--debug-logs, -d`	Enable debug logging
`--enforce-params, -e`	Enforce parameters as JSON

Garbage Detection

Enable validation to detect and retry garbage responses.

Detected issue:

Repetition loops: Same phrase repeated 3+ times

Key Options

Option	Description	Default
`validation.enabled`	Enable response validation	`false`
`validation.validator_url`	Validator endpoint URL	`http://127.0.0.1:1235`
`validation.validator_model`	Model name for validation	`Qwen3.5-4B-UD-Q6_K_XL.gguf`
`validation.max_retries`	Max retry attempts	`1`
`validation.mid_stream_validation_enabled`	Validate during streaming	`true`
`validation.mid_stream_validation_interval_words`	Check every N words	`300`

Mid-Stream Validation

When enabled, validates responses periodically during streaming:

Catches repetition loops at ~300 words (configurable)
Interrupts garbage immediately and retries
Reduces latency by not waiting for full garbage responses

{
  "validation": {
    "enabled": true,
    "mid_stream_validation_enabled": true,
    "mid_stream_validation_interval_words": 300
  }
}

Logs

Failed validation responses saved to ~/.sampling-proxy/logs/

Parallel Request Limits

Limit concurrent requests per model to prevent backend overload:

{
  "parallel_limits": {
    "GLM-5": 2,
    "GLM-4.7": 3
  }
}

When limit is reached, additional requests queue automatically. Logs show queue status:

[INFO] Queueing for GLM-5, 2 requests waiting (limit: 2)
[INFO] Slot acquired GLM-5, used: 2/2
[INFO] Slot released GLM-5, used: 1/2

Request Throttling

Add cooldown delays between requests to prevent backend overload:

{
  "throttle": {
    "enabled": true,
    "global": {
      "start_pause_seconds": null,
      "end_pause_seconds": null
    },
    "per_model": {
      "GLM-5-turbo": {
        "start_pause_seconds": 1.0,
        "end_pause_seconds": 5.0
      }
    }
  }
}

Timers:

start_pause_seconds: Cooldown after sending a request (delays the next request)
end_pause_seconds: Cooldown after response completes (delays the next request)
null disables the timer (default)

Per-model settings override global. Both timers default to null (disabled).

API Endpoints

Endpoint	Description
`/chat/completions`	OpenAI chat completions
`/messages`	Anthropic messages (converted)
`/models`	List available models
`/`	Health check (shows upstreams and bindings)

Multiple Upstream Servers

Route different models to different backend servers. Each upstream has its own URL, timeout, and format support.

Configuration

{
  "listen": {
    "host": "0.0.0.0",
    "port": 8001,
    "base_path": ""
  },
  "upstreams": [
    {
      "name": "local-openai",
      "base_url": "http://127.0.0.1:8000/v1",
      "connect_timeout_seconds": 5.0,
      "timeout_seconds": 1200.0,
      "supports_openai": true,
      "supports_anthropic": false
    },
    {
      "name": "zai-anthropic",
      "base_url": "https://api.z.ai/api/anthropic",
      "connect_timeout_seconds": 5.0,
      "timeout_seconds": 1200.0,
      "supports_openai": false,
      "supports_anthropic": true
    }
  ],
  "model_upstream_binding": {
    "glm-*": "zai-anthropic",
    "*": "local-openai"
  }
}

Model Binding Rules

Exact match takes priority: "glm-5": "zai-anthropic"
Glob patterns via fnmatch: "glm-*": "zai-anthropic" matches glm-4.7, glm-5-Turbo, etc.
Wildcard "*" catches everything not matched above
Patterns are evaluated in config order — first match wins
If no pattern matches, the proxy returns a 400 error

Legacy Single-Upstream Config

The old server.target_base_url format still works — it auto-wraps as a single upstream:

{
  "server": {
    "target_base_url": "http://127.0.0.1:8000",
    "supports_openai": true,
    "supports_anthropic": false
  }
}

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.bg-shell		.bg-shell
docs/plans		docs/plans
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config_sample.json		config_sample.json
config_zai_sample.json		config_zai_sample.json
requirements.txt		requirements.txt
sampling_proxy.ps1		sampling_proxy.ps1
sampling_proxy.py		sampling_proxy.py
sampling_proxy.sh		sampling_proxy.sh
throttle_manager.py		throttle_manager.py
validator.py		validator.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sampling Proxy

Features

Quick Start

One-line Scripts (auto-activate venv)

Sample Validator Setup (llama.cpp)

Command Line Options

Garbage Detection

Key Options

Mid-Stream Validation

Logs

Parallel Request Limits

Request Throttling

API Endpoints

Multiple Upstream Servers

Configuration

Model Binding Rules

Legacy Single-Upstream Config

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sampling Proxy

Features

Quick Start

One-line Scripts (auto-activate venv)

Sample Validator Setup (llama.cpp)

Command Line Options

Garbage Detection

Key Options

Mid-Stream Validation

Logs

Parallel Request Limits

Request Throttling

API Endpoints

Multiple Upstream Servers

Configuration

Model Binding Rules

Legacy Single-Upstream Config

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages