Smart Embedder is a lightweight, self-hosted embedding server built for hybrid search pipelines. It runs entirely on your own hardware — on an NVIDIA GPU for high throughput, or on CPU when no GPU is available — with no cloud dependency and no data leaving your machine.
Hybrid search combines dense vector similarity, sparse lexical matching (BM25-style), and optional ColBERT late-interaction scoring into a single retrieval pipeline. Smart Embedder exposes all three vector types from a single endpoint, plus a reranking endpoint to re-score candidate passages after retrieval — everything a hybrid search stack needs in one lightweight service.
The server is built on FastAPI and wraps BAAI/bge-m3, a single model that produces dense, sparse, and ColBERT vectors simultaneously. The default stack uses BGE-M3 for all vector types and bge-reranker-v2-m3 for reranking — a solid baseline that runs on 8 GB VRAM or CPU with no extra configuration. For higher retrieval quality, Smart Embedder offers two optional upgrades selectable independently at startup:
- Dense embedding → Qwen3-Embedding-0.6B: replaces only the dense vector path; sparse and ColBERT vectors still come from BGE-M3, keeping the full hybrid signal intact.
- Reranking → Qwen3-Reranker-0.6B: replaces the cross-encoder reranker with a stronger model at the cost of higher inference time.
Both Qwen models are still compact (0.6B parameters) but benefit from a dedicated GPU — a machine with 8 GB+ VRAM will see the best results. CPU execution remains supported for both, with conservative batch sizes applied automatically.
Key properties at a glance:
| Property | Detail |
|---|---|
| Deployment | Local — GPU (NVIDIA CUDA) or CPU, Docker or Python venv |
| Hybrid search vectors | Dense + sparse lexical + ColBERT from one model, one endpoint |
| Reranking | Cross-encoder passage reranking, same service |
| Footprint | BGE-M3 + reranker fit in 8 GB VRAM; CPU mode needs no GPU |
| Qdrant-ready | Sparse vectors in native {indices, values} format via sparse_as_indices |
High-performance FastAPI server for BGE-M3 embeddings and selectable reranking:
| Feature | Detail |
|---|---|
| Embeddings | BGE-M3 dense/sparse/ColBERT by default; optional Qwen3 dense with BGE-M3 sparse and ColBERT |
| Reranking | Interactive startup choice: BAAI/bge-reranker-v2-m3 or Qwen/Qwen3-Reranker-0.6B |
| Authentication | Optional Bearer token on non-public endpoints |
| Rate Limiting | Token bucket, 3600 req/min per IP, burst 120 |
| Backpressure | Embedding queue max 200, rerank slots max 32, HTTP 503 on overflow |
| Graceful Shutdown | 30s drain for in-flight requests |
| Prometheus Metrics | Counters, histograms, gauges for both models |
| Dynamic Batching | Embedding batch size adapts to GPU VRAM and max input length at startup |
```bat
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements-gpu.txt
REM CPU-only machine instead: pip install -r requirements-cpu.txt
```

start_server.bat and start_server.sh are parameterized and choose the execution target and device:

```text
start_server.bat [local|docker] [cpu|gpu|auto]
./start_server.sh [local|docker] [cpu|gpu|auto]
```

| Command | What it does |
|---|---|
| `start_server.bat` / `./start_server.sh` | Default: `docker auto` |
| `start_server.bat docker auto` | Docker startup with device auto-detection |
| `start_server.bat local gpu` | Local venv, CUDA auto-detect |
| `start_server.bat local cpu` | Local venv, forces CPU (`CUDA_VISIBLE_DEVICES=-1`) |
| `start_server.bat docker gpu` | `docker compose -f docker-compose.gpu.yml build && up -d` with NVIDIA runtime |
| `start_server.bat docker cpu` | Compose with override `docker-compose.cpu.yml` (no GPU) |
Arguments are case-insensitive. Built-in validation: unrecognized parameters print usage and exit with code 1.
Startup asks two independent model questions:
- Dense embedding backend: choose BGE for current all-BGE embeddings, or QWEN to return Qwen dense vectors while keeping BGE sparse and ColBERT vectors.
- Reranker backend: choose BGE or QWEN for /rerank.
The interactive model selection is provided by the launcher scripts. Direct
uvicorn or docker compose startup does not prompt; it uses environment
variables or .env, falling back to BGE defaults.
local mode: requires .venv already created (see step 1). The script activates the venv, checks for uvicorn, installs dependencies if missing, then starts the server.

docker mode: requires Docker Desktop / Engine in PATH. The script builds the image and starts the container in the background. For logs:

```bash
docker compose -f docker-compose.gpu.yml logs -f embedder
```

In Docker Desktop the project appears as smart-embedder (containers smart-embedder-gpu / smart-embedder-cpu).

Or directly without the wrapper:

```bash
uvicorn bge-m3_server:app --host 0.0.0.0 --port 8000
```

Wait for these log lines:

```text
INFO - Reranker ready.
INFO - Server ready to accept requests
```

In a second terminal (with the server running):

```bash
python test_server.py
```

Expected output: 17/17 tests passed. With --token and API_TOKEN configured, the authentication check is included and the expected output is 18/18 tests passed.
test_server.py accepts --url to point to a different host and --token when API_TOKEN is configured:

```bash
python test_server.py --url http://localhost:8000
python test_server.py --token <token>
```

benchmark.py measures latency (avg/p50/p95/p99) and throughput on the embed_dense, embed_full, and rerank scenarios:

```bash
python benchmark.py --concurrency 8 --requests 100 --batch-size 4
```

| Flag | Default | Description |
|---|---|---|
| `--url` | `http://localhost:8000` | Server target |
| `--token` | `API_TOKEN` env or empty | Bearer token if server requires auth |
| `--concurrency` | 8 | Concurrent requests in flight |
| `--requests` | 100 | Requests per scenario |
| `--batch-size` | 4 | Sentences/passages per request |
| `--warmup` | 5 | Warmup requests (excluded from metrics) |
| `--timeout` | 60 | Timeout for a single request |
| `--max-batch-size` | 128 | Local guardrail on payload limits; 0 disables |
| `--scenarios` | all | CSV: embed_dense,embed_full,rerank |
| `--sleep-between` | 0 | Pause between scenarios (use 65 if rate limiting is active) |

Note: Default rate limits (3600 req/min, burst 120) are tuned for benchmarks on a single client at conc<=16. For extreme stress testing:

```bash
RATE_LIMIT_REQUESTS_PER_MINUTE=1000000 docker compose -f docker-compose.gpu.yml up -d
```

Output: ASCII table with Reqs / OK / Fail / Conc / Wall / Req/s / Units/s / Avg / P50 / P95 / P99 / Min / Max.
Latest measured run (RTX 4060 Laptop 8GB, batch=4, conc=8, transformers==4.57.3):
| Scenario | Req/s | Units/s | P50 | P95 | P99 |
|---|---|---|---|---|---|
| embed_dense | 44.5 | 178 | 176.9ms | 185.4ms | 187.3ms |
| embed_full (dense+sparse+colbert) | 28.7 | 115 | 294.1ms | 350.6ms | 498.9ms |
| rerank | 37.8 | 151 | 205.7ms | 250.3ms | 263.8ms |
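The Units/s figures in this run are consistent with Req/s multiplied by the per-request batch size (batch=4). A minimal sketch of that arithmetic follows; the function name `units_per_second` is illustrative and is not taken from benchmark.py.

```python
def units_per_second(requests_per_second: float, batch_size: int) -> float:
    """Sentences or passages per second for a fixed per-request batch size."""
    return requests_per_second * batch_size


# Req/s values from the measured run, taken with --batch-size 4.
for scenario, req_s in [("embed_dense", 44.5), ("embed_full", 28.7), ("rerank", 37.8)]:
    print(f"{scenario}: {units_per_second(req_s, 4):.0f} units/s")
```

This is handy when comparing runs with different --batch-size values, since Req/s alone hides how much text each request carries.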
- Docker Desktop / Docker Engine with Compose v2+
- NVIDIA Container Toolkit
```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

```bash
# Verify CUDA tag exists before building
docker pull nvidia/cuda:12.6.3-runtime-ubuntu22.04
# Build and GPU startup (first time: downloads selected embedding and reranker models)
docker compose -f docker-compose.gpu.yml up --build
# Or via bat wrapper (Windows)
start_server.bat docker gpu
# CPU execution (compose override)
docker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up --build
# Equivalent:
start_server.bat docker cpu
```

Wait for these log lines:

```text
INFO - Reranker ready.
INFO - Server ready to accept requests
```
Server available at http://localhost:8000.
Models are saved in the default named volume
smart-embedder-hf-cache and mounted at /app/model_cache;
subsequent restarts do not re-download them.
The first Docker startup with QWEN selected downloads the selected Qwen model into the Hugging Face cache volume. Startup can take longer than BGE on an empty cache; later runs reuse the cached model.
```bash
# Startup in background
docker compose -f docker-compose.gpu.yml up -d
# Real-time logs
docker compose -f docker-compose.gpu.yml logs -f embedder
# Stop
docker compose -f docker-compose.gpu.yml down
# Rebuild after code changes (deps cached if requirements-gpu.txt unchanged)
docker compose -f docker-compose.gpu.yml up --build
# Complete rebuild from scratch
docker compose -f docker-compose.gpu.yml build --no-cache
```

```bash
docker compose -f docker-compose.gpu.yml run --rm embedder python3 -c "
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
"
```

By default the server is bound to 127.0.0.1:8000 (localhost only).
To change the exposed port (e.g. 8000 already in use), set PORT in .env or in the
shell before startup. Docker remaps the published host port; the container stays
on 8000 internally:

```bash
PORT=8001 docker compose -f docker-compose.gpu.yml up -d
# or persistent: add PORT=8001 to .env
```

In local mode the launcher passes PORT to uvicorn --port.

For LAN access, modify docker-compose.gpu.yml (remove the 127.0.0.1: bind prefix):

```yaml
ports:
  - "${PORT:-8000}:8000"
```

Warning: If exposed on a network, add a reverse proxy with authentication (nginx, Traefik).
| File | Description |
|---|---|
| `bge-m3_server.py` | Main server |
| `requirements-gpu.txt` | Python dependencies (GPU / CUDA PyTorch wheel) |
| `requirements-cpu.txt` | Python dependencies (CPU-only PyTorch wheel) |
| `Dockerfile.gpu` | GPU image build (CUDA 12.6, non-root, hardened) |
| `Dockerfile.cpu` | CPU-only image build (slim Python base, no CUDA) |
| `docker-compose.gpu.yml` | Container orchestration with GPU and model volume |
| `docker-compose.cpu.yml` | Compose override: slim CPU image, removes GPU reservation |
| `.env.example` | Environment variables template (copy to .env for local override) |
| `.dockerignore` | Excludes .venv, cache, docs from build context |
| `start_server.bat` | Windows startup script, parameterized (local\|docker x cpu\|gpu\|auto) |
| `start_server.sh` | Unix shell startup script, parameterized (local\|docker x cpu\|gpu\|auto) |
| `test_server.py` | Runtime test suite (17 checks, 18 with --token) |
| `benchmark.py` | Latency/throughput benchmark with summary table |
Generates embeddings for a list of texts.
Request:

```json
{
  "sentences": ["Hello world!", "Ciao mondo!"],
  "return_dense": true,
  "return_sparse": true,
  "return_colbert": true,
  "normalize_dense": false,
  "sparse_as_indices": false
}
```

sparse_as_indices (default: false): When true, sparse vectors are returned
in Qdrant-compatible format instead of the default token-id dict:

```json
"sparse": {"indices": [10, 1389, 2349], "values": [0.277, 0.292, 0.313]}
```

Use with SparseVector(indices=..., values=...) when upserting to Qdrant.
The active embedding backends are selected at server startup. With the default
BGE dense backend, dense, sparse, and colbert all come from BAAI/bge-m3.
With Qwen dense selected, only dense changes to Qwen/Qwen3-Embedding-0.6B;
sparse and colbert still come from BAAI/bge-m3.
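If a client already holds sparse vectors in the default token-id dict format, the same pairs can be rearranged into the {indices, values} shape on the client side. The sketch below shows one way to do that; the helper name `sparse_dict_to_indices_values` is hypothetical, and sorting by index is a choice of this example rather than documented server behavior.

```python
def sparse_dict_to_indices_values(sparse: dict[str, float]) -> dict[str, list]:
    """Convert {"token_id": weight} into {"indices": [...], "values": [...]}."""
    # JSON object keys are strings, so token ids are cast back to int.
    pairs = sorted((int(token_id), float(weight)) for token_id, weight in sparse.items())
    return {
        "indices": [token_id for token_id, _ in pairs],
        "values": [weight for _, weight in pairs],
    }


print(sparse_dict_to_indices_values({"435": 0.12, "12": 0.08}))
```

Requesting `sparse_as_indices: true` avoids this step entirely; the conversion is mainly useful for vectors that were stored before the flag was used.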
Response:

```json
{
  "data": [
    {
      "id": 0,
      "text": "Hello world!",
      "embeddings": {
        "dense": [0.021, -0.013, ...],
        "sparse": {"12": 0.08, "435": 0.12, ...},
        "colbert": [[0.01, ...], ...]
      }
    }
  ],
  "model_name": "Qwen/Qwen3-Embedding-0.6B",
  "dense_model_name": "Qwen/Qwen3-Embedding-0.6B",
  "sparse_model_name": "BAAI/bge-m3",
  "colbert_model_name": "BAAI/bge-m3",
  "processing_time_ms": 104.5,
  "warnings": [
    {
      "code": "input_truncated",
      "severity": "warning",
      "message": "Input text was truncated to the model token limit.",
      "target": {
        "field": "sentences",
        "index": 0,
        "pointer": "/sentences/0"
      },
      "details": {
        "model": "BAAI/bge-m3",
        "max_tokens": 8192,
        "original_tokens": 9000,
        "truncated_tokens": 808,
        "truncation_side": "end"
      }
    }
  ]
}
```

cURL:

```bash
curl -X POST "http://localhost:8000/embeddings/" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'
```

If API_TOKEN is set:

```bash
curl -X POST "http://localhost:8000/embeddings/" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'
```

Ranks a list of passages by relevance to a query.
Request:

```json
{
  "query": "What is machine learning?",
  "passages": [
    "Machine learning is a subset of AI.",
    "The weather is nice today.",
    "Deep learning uses neural networks."
  ],
  "normalize": true
}
```

Response:

```json
{
  "results": [
    {"index": 0, "passage": "Machine learning is a subset of AI.", "score": 0.987},
    {"index": 2, "passage": "Deep learning uses neural networks.", "score": 0.821},
    {"index": 1, "passage": "The weather is nice today.", "score": 0.003}
  ],
  "model_name": "BAAI/bge-reranker-v2-m3",
  "processing_time_ms": 52.2,
  "warnings": []
}
```

- `normalize: true` returns a score in `[0, 1]` (sigmoid)
- `normalize: false` returns a raw score (negative values possible)
- With QWEN selected, scores are yes-probabilities and `normalize` is kept as an API-compatible no-op
- Do not compare BGE `normalize: false` raw logits directly with QWEN scores
- Passages are returned sorted by descending score
- The `index` field returns the original position in the input list
- `model_name` reports the reranker selected at startup
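For reference, the sigmoid mapping mentioned for `normalize: true` squashes any real-valued logit into the range [0, 1]. The sketch below is a plain-Python illustration of that function, not the server's own code; the name `sigmoid` is illustrative.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw relevance logit into [0, 1], avoiding overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative inputs, exp(logit) stays small, so this branch cannot overflow.
    e = math.exp(logit)
    return e / (1.0 + e)


for raw in (-4.0, 0.0, 4.0):
    print(raw, round(sigmoid(raw), 3))
```

A raw score of 0 maps to 0.5, which is why negative raw scores from `normalize: false` correspond to normalized scores below 0.5.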
For over-token query-passage pairs, the server preserves the query where
possible, truncates passages from the end, returns 200 OK, and includes
query_truncated or passage_truncated entries in warnings.
Warning token counts are computed during server-side preparation. Rerank inputs
are then decoded back to text and tokenized again by the model backend, so
original_tokens, max_tokens, and truncated_tokens should be treated as
diagnostic metadata rather than exact proof of final backend tokenization.
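Because results arrive sorted by score, a client typically maps each entry back to its own records through `index` and checks `warnings` for truncation. The following is a small sketch over an already-decoded response dict; the function name `top_passages` and the `min_score` parameter are assumptions of this example, not part of the API.

```python
from typing import Any


def top_passages(response: dict[str, Any], min_score: float = 0.0) -> tuple[list[tuple[int, float]], bool]:
    """Return (original_index, score) pairs at or above min_score, plus a truncation flag."""
    ranked = [
        (result["index"], result["score"])
        for result in response["results"]
        if result["score"] >= min_score
    ]
    # The warning codes are the ones named in the text above.
    truncated = any(
        warning.get("code") in ("query_truncated", "passage_truncated")
        for warning in response.get("warnings", [])
    )
    return ranked, truncated
```

The order of the returned pairs follows the response, so the first pair is the best match when the server has sorted by descending score.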
cURL:

```bash
curl -X POST "http://localhost:8000/rerank" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'
```

If API_TOKEN is set:

```bash
curl -X POST "http://localhost:8000/rerank" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'
```

```bash
curl "http://localhost:8000/health"
```

Returns server status, GPU info, active embedding/reranker models, and batch size.
Relevant model fields:

```json
{
  "version": "1.2.0",
  "model": "BAAI/bge-m3",
  "dense_embedding_model": "Qwen/Qwen3-Embedding-0.6B",
  "reranker_model": "BAAI/bge-reranker-v2-m3"
}
```

```bash
curl "http://localhost:8000/stats"
```

Returns uptime, total requests, total sentences, total batches, rejected requests, and hardware.

```bash
curl "http://localhost:8000/metrics"
```

Prometheus scraping endpoint in text/plain format.
Interactive Swagger documentation: http://localhost:8000/docs
If API_TOKEN is configured, Swagger shows the lock on POST endpoints.
Use the Authorize button and enter only the token, without Bearer prefix.
Limits are tunable via environment variables (override in docker-compose.gpu.yml or in the shell before startup):

| Env var | Default | Description |
|---|---|---|
| `PORT` | 8000 | Host port to expose. Docker: published host port (container stays on 8000). Local: uvicorn --port. Set in .env or shell if 8000 is taken |
| `BGE_EMBED_MAX_LENGTH` | MAX_INPUT_LENGTH fallback / 8192 | Max tokens for BGE-M3 embedding input; applies to dense, sparse, and ColBERT outputs |
| `QWEN_EMBED_MAX_LENGTH` | 32768 | Max tokens for Qwen dense embedding input |
| `BGE_RERANK_MAX_LENGTH` | 8192 | Max query+passage tokens for BGE rerank; BAAI notes this reranker was fine-tuned at 1024 and recommends 1024 for practical use |
| `QWEN_RERANK_MAX_LENGTH` | 32768 | Max query+passage tokens for Qwen reranker when QWEN is selected |
| `MAX_INPUT_LENGTH` | 8192 | Legacy fallback for BGE_EMBED_MAX_LENGTH; prefer the backend-specific variables above |
| `REQUEST_TIMEOUT` | 90 | Global HTTP timeout (sec); keep above RERANK_GPU_TIMEOUT |
| `DENSE_EMBEDDING_MODEL` | BAAI/bge-m3 | Dense embedding backend selected by launcher (BAAI/bge-m3 or Qwen/Qwen3-Embedding-0.6B) |
| `RERANKER_MODEL` | BAAI/bge-reranker-v2-m3 | Reranker selected by launcher (BAAI/bge-reranker-v2-m3 or Qwen/Qwen3-Reranker-0.6B) |
| `QWEN_RERANK_BATCH_SIZE` | launcher-tuned / 16 fallback | Max query-passage pairs per Qwen reranker micro-batch |
| `API_TOKEN` | empty | Optional bearer token for non-public endpoints; empty disables authentication |
| `MAX_QUEUE_SIZE` | 200 | Max requests in the /embeddings/ queue (backpressure) |
| `RERANK_MAX_QUEUE` | 32 | Max concurrent slots for /rerank (backpressure) |
| `RERANK_GPU_TIMEOUT` | 60 | Hard timeout for a single rerank inference (sec); keep below REQUEST_TIMEOUT |
| `RATE_LIMIT_REQUESTS_PER_MINUTE` | 3600 | Rate limit per IP (60 req/s) |
| `RATE_LIMIT_BURST_SIZE` | 120 | Token bucket burst (~2s of traffic) |
| `PYTORCH_CUDA_ALLOC_CONF` | expandable_segments:True | CUDA caching-allocator config; reduces fragmentation OOM on variable-length batches (single-GPU, no NCCL). Set in Dockerfile.gpu and docker-compose.gpu.yml |
Texts longer than the active backend-specific token limit are truncated and
reported in the response warnings array. The server no longer rejects requests
based on character-count payload limits.
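Since truncation is reported rather than rejected, clients that care about it need to inspect `warnings` themselves. A minimal sketch, assuming the warning shape shown in the /embeddings/ response example; the helper name `truncation_report` and the derived `kept_tokens` field are hypothetical.

```python
from typing import Any


def truncation_report(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Summarize input_truncated warnings from a decoded /embeddings/ response."""
    report = []
    for warning in response.get("warnings", []):
        if warning.get("code") != "input_truncated":
            continue
        target = warning.get("target", {})
        details = warning.get("details", {})
        report.append({
            "index": target.get("index"),
            "model": details.get("model"),
            # Tokens that survived truncation: original minus removed.
            "kept_tokens": details["original_tokens"] - details["truncated_tokens"],
        })
    return report
```

For the documented example (9000 original tokens, 808 truncated), `kept_tokens` comes out at 8192, which matches `max_tokens`.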
With API_TOKEN set, all non-public endpoints require:

```text
Authorization: Bearer <token>
```

Service endpoints (/health, /stats, /metrics, /docs, /redoc, /openapi.json) remain accessible without a token.
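From Python, the same header can be attached only when a token is configured. The sketch below uses the standard library and builds the request without sending it; the helper name `build_post_request` is an assumption of this example.

```python
import json
import urllib.request


def build_post_request(base_url: str, path: str, payload: dict, token: str = "") -> urllib.request.Request:
    """Build a JSON POST request, adding the Bearer header only if a token is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_post_request(
    "http://localhost:8000", "/embeddings/",
    {"sentences": ["Hello world!"], "return_dense": True},
    token="example-token",  # placeholder value, not a real credential
)
# To send it: urllib.request.urlopen(request, timeout=60)
```

Leaving `token` empty mirrors a server started without API_TOKEN, where no Authorization header is needed.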
When DENSE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B, only dense vectors change. Sparse lexical weights and ColBERT vectors still come from BAAI/bge-m3, so mixed requests are supported through the same /embeddings/ endpoint.
The Qwen dense path intentionally does not add query/document instruction prefixes. This keeps the existing /embeddings/ API transparent, but deployments optimizing retrieval quality should benchmark task-specific Qwen formatting separately before changing request semantics.
When QWEN reranking is selected and GPU mode is used, the launch scripts can
auto-tune Qwen rerank limits from detected GPU VRAM if QWEN_RERANK_BATCH_SIZE
or QWEN_RERANK_MAX_LENGTH are not set in the environment or .env:
| GPU VRAM | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |
|---|---|---|
| <= 6 GB | 4 | 4096 |
| <= 8 GB | 8 | 8192 |
| > 8 GB | 16 | 8192 |
When QWEN reranking is selected in CPU mode, the launch scripts use conservative defaults unless overridden:
| Device | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |
|---|---|---|
| CPU | 1 | 2048 |
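The two tables can be read as a single lookup with explicit settings taking precedence. The sketch below restates them in Python; it is not the launcher's code, the function name `qwen_rerank_limits` is hypothetical, and treating each variable's override independently is an assumption of this example.

```python
from typing import Mapping, Optional


def qwen_rerank_limits(vram_gb: Optional[float], env: Mapping[str, str]) -> tuple[int, int]:
    """Return (batch_size, max_length); vram_gb=None stands for CPU mode."""
    if vram_gb is None:
        batch_size, max_length = 1, 2048
    elif vram_gb <= 6:
        batch_size, max_length = 4, 4096
    elif vram_gb <= 8:
        batch_size, max_length = 8, 8192
    else:
        batch_size, max_length = 16, 8192
    # Assumption: a value already present in the environment or .env wins over tuning.
    batch_size = int(env.get("QWEN_RERANK_BATCH_SIZE", batch_size))
    max_length = int(env.get("QWEN_RERANK_MAX_LENGTH", max_length))
    return batch_size, max_length


print(qwen_rerank_limits(8.0, {}))
print(qwen_rerank_limits(8.0, {"QWEN_RERANK_MAX_LENGTH": "32768"}))
```

The second call illustrates why a pinned `QWEN_RERANK_MAX_LENGTH` prevents the VRAM-tuned length from being applied.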
.env.example pins QWEN_RERANK_MAX_LENGTH=32768, the documented model
maximum. To use the launcher VRAM-tuned values above, leave
QWEN_RERANK_MAX_LENGTH unset in your shell and remove or comment it from
your local .env.
Benchmark defaults are tuned for NVIDIA RTX 4060 Laptop 8GB: observed
throughput ~29-45 req/s at conc=8 depending on scenario (see benchmark
table above).
Override:

```bash
# Ad-hoc (shell env)
RATE_LIMIT_REQUESTS_PER_MINUTE=10000 docker compose -f docker-compose.gpu.yml up -d
# Persistent - copy .env.example to .env and modify
cp .env.example .env
docker compose -f docker-compose.gpu.yml up -d
```

Compose automatically loads .env in the same directory. .env is in .gitignore; .env.example is the versioned template.
MULTI_GPU_DEVICES = None (in bge-m3_server.py) can be changed to
['cuda:0', 'cuda:1'] for multi-GPU.
Embedding batch size is automatically calculated from available VRAM and the
active embedding max length. With default BGE embeddings, this is
BGE_EMBED_MAX_LENGTH=8192. If Qwen dense embeddings are selected, the tuning
uses the larger of BGE_EMBED_MAX_LENGTH and QWEN_EMBED_MAX_LENGTH, because
mixed dense+sparse/ColBERT requests can exercise both tokenizers:
| Condition | batch_size | MAX_REQUESTS_IN_BATCH |
|---|---|---|
| GPU > 8 GB | 12 | 16 |
| GPU <= 8 GB and > 4 GB | 6 | 16 |
| GPU <= 4 GB | 3 | 16 |
| CPU | 1 | 8 |
If the active embedding tuning length is <=512, the server switches to the
short-sequence profile:
| VRAM | batch_size | MAX_REQUESTS_IN_BATCH |
|---|---|---|
| > 8 GB | 128 | 64 |
| > 6 GB | 64 | 32 |
| > 4 GB | 32 | 16 |
| <= 4 GB | 16 | 16 |
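The selection rules in the two profiles can be summarized as a small lookup. This is a sketch of the tables, not the server's startup code; `tuning_length` and `batch_profile` are hypothetical names, and keeping the CPU row unchanged for short sequences is an assumption, since the short-sequence table lists only VRAM tiers.

```python
from typing import Optional


def tuning_length(bge_len: int, qwen_len: int, qwen_dense: bool) -> int:
    """Length used for tuning: the larger limit when Qwen dense is active."""
    return max(bge_len, qwen_len) if qwen_dense else bge_len


def batch_profile(vram_gb: Optional[float], length: int) -> tuple[int, int]:
    """Return (batch_size, MAX_REQUESTS_IN_BATCH); vram_gb=None stands for CPU."""
    if vram_gb is None:
        return 1, 8  # assumption: CPU uses the same row in both profiles
    if length <= 512:  # short-sequence profile
        if vram_gb > 8:
            return 128, 64
        if vram_gb > 6:
            return 64, 32
        if vram_gb > 4:
            return 32, 16
        return 16, 16
    if vram_gb > 8:
        return 12, 16
    if vram_gb > 4:
        return 6, 16
    return 3, 16


print(batch_profile(8.0, tuning_length(8192, 32768, qwen_dense=False)))
```

On an 8 GB GPU with the default 8192-token limit this yields a batch size of 6, while dropping the limit to 512 moves the same GPU to the short-sequence tier.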
| Metric | Type | Labels / notes |
|---|---|---|
| `embedding_requests_total` | Counter | status, endpoint |
| `embedding_requests_rejected_total` | Counter | reason |
| `embedding_sentences_processed_total` | Counter | - |
| `embedding_request_duration_seconds` | Histogram | endpoint |
| `embedding_batch_size` | Histogram | - |
| `embedding_gpu_inference_duration_seconds` | Histogram | - |
| `embedding_queue_size` | Gauge | - |
| `embedding_active_requests` | Gauge | - |
| `embedding_gpu_memory_allocated_bytes` | Gauge | Legacy process GPU allocated memory, kept for existing dashboards |
| `embedding_gpu_memory_reserved_bytes` | Gauge | Legacy process GPU reserved memory, kept for existing dashboards |
| `embedding_server_info` | Info | model, dense_embedding_model, bge_embed_max_length, qwen_embed_max_length, bge_rerank_max_length, qwen_rerank_max_length, version, gpu_available, device |

These gauges are process-level CUDA readings updated after embedding and rerank inference. They include all loaded models and both endpoint paths.

| Metric | Type | Description |
|---|---|---|
| `gpu_memory_allocated_bytes` | Gauge | Process GPU tensor memory allocated by PyTorch |
| `gpu_memory_reserved_bytes` | Gauge | Process GPU memory reserved by the PyTorch caching allocator |
| `gpu_memory_free_bytes` | Gauge | CUDA device free memory from torch.cuda.mem_get_info() |
| `gpu_memory_total_bytes` | Gauge | CUDA device total memory from torch.cuda.mem_get_info() |

| Metric | Type | Labels |
|---|---|---|
| `rerank_requests_total` | Counter | status |
| `rerank_requests_rejected_total` | Counter | reason |
| `rerank_pairs_processed_total` | Counter | - |
| `rerank_request_duration_seconds` | Histogram | - |
| `rerank_inference_duration_seconds` | Histogram | - |
| `rerank_active_requests` | Gauge | - |
```promql
# Embedding throughput (req/sec)
rate(embedding_requests_total[1m])
# Latency P95
histogram_quantile(0.95, rate(embedding_request_duration_seconds_bucket[5m]))
# Error rate (%), summed across labels so errors are compared with all requests
sum(rate(embedding_requests_total{status="error"}[5m])) / sum(rate(embedding_requests_total[5m])) * 100
# GPU tensor memory allocated by PyTorch (GB)
gpu_memory_allocated_bytes / 1024 / 1024 / 1024
# GPU memory reserved by PyTorch caching allocator (GB)
gpu_memory_reserved_bytes / 1024 / 1024 / 1024
# CUDA device free memory (GB)
gpu_memory_free_bytes / 1024 / 1024 / 1024
# Reranker throughput (pairs/sec)
rate(rerank_pairs_processed_total[1m])
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'smart-embedder'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
```

- Embedding Request Rate - `rate(embedding_requests_total[1m])`
- Latency P50/P95/P99 - `histogram_quantile(0.X, ...)`
- Queue Size - `embedding_queue_size`
- GPU Memory - `gpu_memory_allocated_bytes`, `gpu_memory_reserved_bytes`, `gpu_memory_free_bytes`
- Rerank Request Rate - `rate(rerank_requests_total[1m])`
- Batch Size Distribution - `embedding_batch_size`
- Algorithm: Token Bucket per IP
- Limit: `RATE_LIMIT_REQUESTS_PER_MINUTE=3600` req/min, `RATE_LIMIT_BURST_SIZE=120`
- Response: HTTP `429` with header `Retry-After: 60`
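To make the interaction between the per-minute rate and the burst size concrete, here is a minimal token bucket sketch using the default values. It is an illustration of the general algorithm, not the server's middleware; the class names, the injected `now` clock, and starting each bucket full are assumptions of this example.

```python
class TokenBucket:
    """Refills at rate_per_minute / 60 tokens per second, capped at burst."""

    def __init__(self, rate_per_minute: float = 3600, burst: int = 120, now: float = 0.0) -> None:
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # assumption: a new client starts with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; False corresponds to an HTTP 429."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerIpRateLimiter:
    """Keeps one bucket per client IP."""

    def __init__(self, rate_per_minute: float = 3600, burst: int = 120) -> None:
        self.rate_per_minute = rate_per_minute
        self.burst = burst
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, ip: str, now: float) -> bool:
        bucket = self.buckets.setdefault(ip, TokenBucket(self.rate_per_minute, self.burst, now))
        return bucket.allow(now)
```

With the defaults, a client can send 120 requests at once and then sustains 60 requests per second, which is where the "~2s of traffic" burst figure comes from.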
- Embedding and rerank inference share a single-worker GPU executor, so the two paths never run forward passes concurrently. This bounds peak VRAM to the larger resident model instead of the sum, preventing concurrency-driven CUDA OOM on small GPUs. The CUDA default stream already serializes kernels, so this costs effectively no throughput.
- /embeddings/ queue max: `MAX_QUEUE_SIZE=200`
- /rerank slots max: `RERANK_MAX_QUEUE=32` (admission bound on the shared single-worker GPU executor)
- Acquire timeout: 0.5s
- Rejections are reflected in both `/stats` (`rejected_requests`) and Prometheus (`embedding_requests_rejected_total` or `rerank_requests_rejected_total`, depending on endpoint).
- Rate limit uses direct connection IP (`request.client.host`). If the server is behind a trusted reverse proxy, update the middleware to extract IP from `X-Forwarded-For`.
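The slot limit with an acquire timeout can be pictured as a semaphore that callers wait on briefly before giving up. The asyncio sketch below shows that pattern in isolation; it is not the server's implementation, and `AdmissionGate`, `demo`, and the (status, result) return shape are assumptions of this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class AdmissionGate:
    """Admit at most max_slots concurrent jobs; wait up to acquire_timeout for a slot."""

    def __init__(self, max_slots: int = 32, acquire_timeout: float = 0.5) -> None:
        self._slots = asyncio.Semaphore(max_slots)
        self._acquire_timeout = acquire_timeout

    async def run(self, work: Callable[[], Awaitable[Any]]) -> tuple[int, Optional[Any]]:
        try:
            await asyncio.wait_for(self._slots.acquire(), timeout=self._acquire_timeout)
        except asyncio.TimeoutError:
            return 503, None  # no slot became free in time: reject instead of queueing
        try:
            return 200, await work()
        finally:
            self._slots.release()


async def demo() -> tuple[tuple[int, Any], tuple[int, Any]]:
    """One slot, one long job holding it: the second caller is rejected."""
    gate = AdmissionGate(max_slots=1, acquire_timeout=0.05)
    release = asyncio.Event()

    async def slow_inference() -> str:
        await release.wait()
        return "done"

    first = asyncio.create_task(gate.run(slow_inference))
    await asyncio.sleep(0)  # let the first job take the only slot
    second = await gate.run(slow_inference)
    release.set()
    return await first, second


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Rejecting quickly keeps latency bounded for callers and lets a load balancer or client retry elsewhere instead of piling up behind a busy GPU.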
- `REQUEST_TIMEOUT=90s` is the global HTTP timeout (504 to the caller).
- `GPU_PROCESS_TIMEOUT=15s` (CUDA) / `30s` (CPU) limits embedding batch inference on the thread pool.
- `RERANK_GPU_TIMEOUT=60s` limits rerank inference and should stay below `REQUEST_TIMEOUT`.
- Timeouts are tracked in Prometheus as `embedding_requests_total{status="timeout"}` or `rerank_requests_total{status="timeout"}`.
- Blocks new requests (middleware)
- Waits for queue drain
- Completes in-flight requests (max 30s)
- Cancels processing loop and closes the shared GPU executor
```bash
python -c "import torch; print(torch.cuda.is_available())"
pip install -r requirements-gpu.txt --upgrade
```

HTTP 429: the client exceeded the rate limit. Increase RATE_LIMIT_REQUESTS_PER_MINUTE or reduce call frequency.

HTTP 503: the queue is full. Increase MAX_QUEUE_SIZE or scale horizontally with a load balancer.

Timeouts: embedding inference exceeded GPU_PROCESS_TIMEOUT (15s on CUDA, 30s on CPU)
or rerank inference exceeded RERANK_GPU_TIMEOUT. Reduce batch size or check
GPU availability.

```bash
curl http://localhost:8000/metrics
```

Verify that the target in prometheus.yml is reachable and that port 8000 is not blocked by a firewall.

```bash
# Verify NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.6.3-runtime-ubuntu22.04 nvidia-smi
```

If it fails: reinstall NVIDIA Container Toolkit and restart Docker.

```text
Error: manifest for nvidia/cuda:12.6.3-runtime-ubuntu22.04 not found
```

Search for the correct tag on hub.docker.com/r/nvidia/cuda/tags and update the first line of Dockerfile.gpu.

Default Compose and Dockerfile healthchecks allow a 300s startup period for first-run model downloads. On slow networks or empty caches, raise the healthcheck start period above the default in your custom Compose override. The default is:

```yaml
start_period: 300s
```

- BAAI/bge-m3 - Hugging Face
- Qwen/Qwen3-Embedding-0.6B - Hugging Face
- BAAI/bge-reranker-v2-m3 - Hugging Face
- Qwen/Qwen3-Reranker-0.6B - Hugging Face
- FlagEmbedding - GitHub
- FastAPI Documentation
- Prometheus Python Client
Follows the selected model licenses (BAAI/bge-m3, BAAI/bge-reranker-v2-m3,
and optionally Qwen/Qwen3-Embedding-0.6B and Qwen/Qwen3-Reranker-0.6B).