SMART EMBEDDER

Version 1.2.0 | GPU Accelerated | CPU Support | CUDA 12.6 | Python 3.10+ | FastAPI | Docker

Smart Embedder is a lightweight, self-hosted embedding server built for hybrid search pipelines. It runs entirely on your own hardware — on an NVIDIA GPU for high throughput, or on CPU when no GPU is available — with no cloud dependency and no data leaving your machine.

Hybrid search combines dense vector similarity, sparse lexical matching (BM25-style), and optional ColBERT late-interaction scoring into a single retrieval pipeline. Smart Embedder exposes all three vector types from a single endpoint, plus a reranking endpoint to re-score candidate passages after retrieval — everything a hybrid search stack needs in one lightweight service.

The server is built on FastAPI and wraps BAAI/bge-m3, a single model that produces dense, sparse, and ColBERT vectors simultaneously. The default stack uses BGE-M3 for all vector types and bge-reranker-v2-m3 for reranking — a solid baseline that runs on 8 GB VRAM or CPU with no extra configuration. For higher retrieval quality, Smart Embedder offers two optional upgrades selectable independently at startup:

  • Dense embedding → Qwen3-Embedding-0.6B: replaces only the dense vector path; sparse and ColBERT vectors still come from BGE-M3, keeping the full hybrid signal intact.
  • Reranking → Qwen3-Reranker-0.6B: replaces the cross-encoder reranker with a stronger model at the cost of higher inference time.

Both Qwen models are still compact (0.6B parameters) but benefit from a dedicated GPU — a machine with 8 GB+ VRAM will see the best results. CPU execution remains supported for both, with conservative batch sizes applied automatically.

Key properties at a glance:

| Property | Detail |
| --- | --- |
| Deployment | Local: GPU (NVIDIA CUDA) or CPU, Docker or Python venv |
| Hybrid search vectors | Dense + sparse lexical + ColBERT from one model, one endpoint |
| Reranking | Cross-encoder passage reranking, same service |
| Footprint | BGE-M3 + reranker fit in 8 GB VRAM; CPU mode needs no GPU |
| Qdrant-ready | Sparse vectors in native {indices, values} format via sparse_as_indices |

High-performance FastAPI server for BGE-M3 embeddings and selectable reranking:

| Feature | Detail |
| --- | --- |
| Embeddings | BGE-M3 dense/sparse/ColBERT by default; optional Qwen3 dense with BGE-M3 sparse and ColBERT |
| Reranking | Interactive startup choice: BAAI/bge-reranker-v2-m3 or Qwen/Qwen3-Reranker-0.6B |
| Authentication | Optional Bearer token on non-public endpoints |
| Rate Limiting | Token bucket, 3600 req/min per IP, burst 120 |
| Backpressure | Embedding queue max 200, rerank slots max 32, HTTP 503 on overflow |
| Graceful Shutdown | 30s drain for in-flight requests |
| Prometheus Metrics | Counters, histograms, gauges for both models |
| Dynamic Batching | Embedding batch size adapts to GPU VRAM and max input length at startup |
Dynamic Batching Embedding batch size adapts to GPU VRAM and max input length at startup

Quick Start

1. Setup

python -m venv .venv
.venv\Scripts\activate
REM Linux/macOS shells instead: source .venv/bin/activate
pip install -r requirements-gpu.txt
REM CPU-only machine instead: pip install -r requirements-cpu.txt

2. Start the Server

start_server.bat and start_server.sh take two optional arguments, the execution target and the device:

start_server.bat [local|docker] [cpu|gpu|auto]
./start_server.sh [local|docker] [cpu|gpu|auto]

| Command | What it does |
| --- | --- |
| start_server.bat / ./start_server.sh | Default: docker auto |
| start_server.bat docker auto | Docker startup with device auto-detection |
| start_server.bat local gpu | Local venv, CUDA auto-detect |
| start_server.bat local cpu | Local venv, forces CPU (CUDA_VISIBLE_DEVICES=-1) |
| start_server.bat docker gpu | Runs docker compose -f docker-compose.gpu.yml build, then up -d, with the NVIDIA runtime |
| start_server.bat docker cpu | Compose with override docker-compose.cpu.yml (no GPU) |

Arguments are case-insensitive. Built-in validation: unrecognized parameters print usage and exit with code 1.

Startup asks two independent model questions:

  1. Dense embedding backend: choose BGE for current all-BGE embeddings, or QWEN to return Qwen dense vectors while keeping BGE sparse and ColBERT vectors.
  2. Reranker backend: choose BGE or QWEN for /rerank.

The interactive model selection is provided by the launcher scripts. Direct uvicorn or docker compose startup does not prompt; it uses environment variables or .env, falling back to BGE defaults.

local mode: requires .venv already created (see step 1). The script activates venv, checks for uvicorn, installs dependencies if missing, then starts the server.

docker mode: requires Docker Desktop / Engine in PATH. The script builds the image and starts the container in background. For logs:

docker compose -f docker-compose.gpu.yml logs -f embedder

In Docker Desktop the project appears as smart-embedder (containers smart-embedder-gpu / smart-embedder-cpu).

Or directly without wrapper:

uvicorn bge-m3_server:app --host 0.0.0.0 --port 8000

Wait for these log lines:

INFO - Reranker ready.
INFO - Server ready to accept requests

3. Automatic Test

In a second terminal (with server running):

python test_server.py

Expected output: 17/17 tests passed. With --token and API_TOKEN configured, the authentication check is included and the expected output is 18/18 tests passed.

test_server.py accepts --url to point to a different host and --token when API_TOKEN is configured:

python test_server.py --url http://localhost:8000
python test_server.py --token <token>

4. Benchmark

Measures latency (avg/p50/p95/p99) and throughput on embed_dense, embed_full, rerank scenarios:

python benchmark.py --concurrency 8 --requests 100 --batch-size 4

| Flag | Default | Description |
| --- | --- | --- |
| --url | http://localhost:8000 | Server target |
| --token | API_TOKEN env or empty | Bearer token if server requires auth |
| --concurrency | 8 | Concurrent requests in-flight |
| --requests | 100 | Requests per scenario |
| --batch-size | 4 | Sentences/passages per request |
| --warmup | 5 | Warmup requests (excluded from metrics) |
| --timeout | 60 | Timeout for a single request |
| --max-batch-size | 128 | Local guardrail on payload limits; 0 disables |
| --scenarios | all | CSV: embed_dense,embed_full,rerank |
| --sleep-between | 0 | Pause between scenarios (use 65 if rate-limit active) |

Note: The default rate limits (3600 req/min, burst 120) are tuned for benchmarks from a single client at concurrency <= 16. For extreme stress testing, raise the limit at startup: RATE_LIMIT_REQUESTS_PER_MINUTE=1000000 docker compose -f docker-compose.gpu.yml up -d.

Output: ASCII table with Reqs / OK / Fail / Conc / Wall / Req/s / Units/s / Avg / P50 / P95 / P99 / Min / Max.

Latest measured run (RTX 4060 Laptop 8GB, batch=4, conc=8, transformers==4.57.3):

| Scenario | Req/s | Units/s | P50 | P95 | P99 |
| --- | --- | --- | --- | --- | --- |
| embed_dense | 44.5 | 178 | 176.9ms | 185.4ms | 187.3ms |
| embed_full (dense+sparse+colbert) | 28.7 | 115 | 294.1ms | 350.6ms | 498.9ms |
| rerank | 37.8 | 151 | 205.7ms | 250.3ms | 263.8ms |

Docker

Prerequisites

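GPU containers need the NVIDIA Container Toolkit. On a Linux host, register its runtime with Docker and restart the daemon:

sudo nvidia-ctk runtime configure --runtime=docker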
sudo systemctl restart docker

First Startup

# Verify CUDA tag exists before building
docker pull nvidia/cuda:12.6.3-runtime-ubuntu22.04

# Build and GPU startup (first time: downloads selected embedding and reranker models)
docker compose -f docker-compose.gpu.yml up --build

# Or via bat wrapper (Windows)
start_server.bat docker gpu

# CPU execution (compose override)
docker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up --build
# Equivalent:
start_server.bat docker cpu

Wait for these log lines:

INFO - Reranker ready.
INFO - Server ready to accept requests

Server available at http://localhost:8000.
Models are saved in the default named volume smart-embedder-hf-cache and mounted at /app/model_cache; subsequent restarts do not re-download them.

The first Docker startup with QWEN selected downloads the selected Qwen model into the Hugging Face cache volume. Startup can take longer than BGE on an empty cache; later runs reuse the cached model.

Useful Commands

# Startup in background
docker compose -f docker-compose.gpu.yml up -d

# Real-time logs
docker compose -f docker-compose.gpu.yml logs -f embedder

# Stop
docker compose -f docker-compose.gpu.yml down

# Rebuild after code changes (deps cached if requirements-gpu.txt unchanged)
docker compose -f docker-compose.gpu.yml up --build

# Complete rebuild from scratch
docker compose -f docker-compose.gpu.yml build --no-cache

Verify GPU in Container

docker compose -f docker-compose.gpu.yml run --rm embedder python3 -c "
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
"

Exposure on Local Network

By default the Docker published port is bound to 127.0.0.1:8000 (localhost only).

To change the exposed port (e.g. 8000 already in use), set PORT in .env or shell before startup. Docker remaps the published host port; the container stays on 8000 internally:

PORT=8001 docker compose -f docker-compose.gpu.yml up -d
# or persistent: add PORT=8001 to .env

In local mode the launcher passes PORT to uvicorn --port.

For LAN access modify docker-compose.gpu.yml (remove the 127.0.0.1: bind prefix):

ports:
  - "${PORT:-8000}:8000"

Warning: If exposed on network, add a reverse proxy with authentication (nginx, Traefik).


Project Files

| File | Description |
| --- | --- |
| bge-m3_server.py | Main server |
| requirements-gpu.txt | Python dependencies (GPU / CUDA PyTorch wheel) |
| requirements-cpu.txt | Python dependencies (CPU-only PyTorch wheel) |
| Dockerfile.gpu | GPU image build (CUDA 12.6, non-root, hardened) |
| Dockerfile.cpu | CPU-only image build (slim Python base, no CUDA) |
| docker-compose.gpu.yml | Container orchestration with GPU and model volume |
| docker-compose.cpu.yml | Compose override: slim CPU image, removes GPU reservation |
| .env.example | Environment variables template (copy to .env for local override) |
| .dockerignore | Excludes .venv, cache, docs from build context |
| start_server.bat | Parameterized Windows startup script (local|docker x cpu|gpu|auto) |
| start_server.sh | Parameterized Unix shell startup script (local|docker x cpu|gpu|auto) |
| test_server.py | Runtime test suite (17 checks, 18 with --token) |
| benchmark.py | Latency/throughput benchmark with summary table |

API Endpoints

POST /embeddings/

Generates embeddings for a list of texts.

Request:

{
  "sentences": ["Hello world!", "Ciao mondo!"],
  "return_dense": true,
  "return_sparse": true,
  "return_colbert": true,
  "normalize_dense": false,
  "sparse_as_indices": false
}

sparse_as_indices (default: false): When true, sparse vectors are returned in Qdrant-compatible format instead of the default token-id dict:

"sparse": {"indices": [10, 1389, 2349], "values": [0.277, 0.292, 0.313]}

Use with SparseVector(indices=..., values=...) when upserting to Qdrant.
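
If a response was fetched with sparse_as_indices left at false, the token-id dict can still be converted on the client. The following is a minimal sketch, assuming the dict shape shown in the response example; sparse_to_indices is a hypothetical helper name, not part of the server, and the sorting by token id is a choice made here for deterministic output.

```python
def sparse_to_indices(sparse: dict) -> dict:
    """Convert a token-id dict such as {"12": 0.08} into {indices, values}."""
    # Token ids arrive as JSON object keys (strings); cast them to int
    # and sort so the output order is deterministic.
    pairs = sorted((int(token_id), float(weight)) for token_id, weight in sparse.items())
    return {
        "indices": [token_id for token_id, _ in pairs],
        "values": [weight for _, weight in pairs],
    }


converted = sparse_to_indices({"435": 0.12, "12": 0.08})
print(converted)  # {'indices': [12, 435], 'values': [0.08, 0.12]}
```

The two lists can then be passed as the indices and values arguments of SparseVector.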

The active embedding backends are selected at server startup. With the default BGE dense backend, dense, sparse, and colbert all come from BAAI/bge-m3. With Qwen dense selected, only dense changes to Qwen/Qwen3-Embedding-0.6B; sparse and colbert still come from BAAI/bge-m3.

Response:

{
  "data": [
    {
      "id": 0,
      "text": "Hello world!",
      "embeddings": {
        "dense": [0.021, -0.013, ...],
        "sparse": {"12": 0.08, "435": 0.12, ...},
        "colbert": [[0.01, ...], ...]
      }
    }
  ],
  "model_name": "Qwen/Qwen3-Embedding-0.6B",
  "dense_model_name": "Qwen/Qwen3-Embedding-0.6B",
  "sparse_model_name": "BAAI/bge-m3",
  "colbert_model_name": "BAAI/bge-m3",
  "processing_time_ms": 104.5,
  "warnings": [
    {
      "code": "input_truncated",
      "severity": "warning",
      "message": "Input text was truncated to the model token limit.",
      "target": {
        "field": "sentences",
        "index": 0,
        "pointer": "/sentences/0"
      },
      "details": {
        "model": "BAAI/bge-m3",
        "max_tokens": 8192,
        "original_tokens": 9000,
        "truncated_tokens": 808,
        "truncation_side": "end"
      }
    }
  ]
}
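
Since truncated inputs still return 200 OK, a client may want to surface the warnings array explicitly. This is a small sketch that assumes the warning fields shown in the example response above; truncation_report is a hypothetical helper name.

```python
def truncation_report(response: dict) -> list:
    """Return one human-readable line per input_truncated warning."""
    lines = []
    for warning in response.get("warnings", []):
        if warning.get("code") != "input_truncated":
            continue  # other warning codes are ignored in this sketch
        details = warning.get("details", {})
        pointer = warning.get("target", {}).get("pointer", "?")
        lines.append(
            f"{pointer}: {details.get('original_tokens')} -> "
            f"{details.get('max_tokens')} tokens "
            f"({details.get('truncated_tokens')} dropped from {details.get('truncation_side')})"
        )
    return lines


sample = {
    "warnings": [
        {
            "code": "input_truncated",
            "target": {"field": "sentences", "index": 0, "pointer": "/sentences/0"},
            "details": {
                "max_tokens": 8192,
                "original_tokens": 9000,
                "truncated_tokens": 808,
                "truncation_side": "end",
            },
        }
    ]
}
report = truncation_report(sample)
print(report)  # ['/sentences/0: 9000 -> 8192 tokens (808 dropped from end)']
```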

cURL:

curl -X POST "http://localhost:8000/embeddings/" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'

If API_TOKEN is set:

curl -X POST "http://localhost:8000/embeddings/" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'
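
The same call can be made from Python with only the standard library. The sketch below builds the request without sending it; build_embed_request is a hypothetical helper name, and the base URL and token values are placeholders.

```python
import json
import urllib.request


def build_embed_request(sentences, base_url="http://localhost:8000", token=None, **options):
    """Build a POST request for /embeddings/; options become JSON body fields."""
    payload = {"sentences": list(sentences), **options}
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when the server was started with API_TOKEN set.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/embeddings/",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_embed_request(["Hello world!"], token="secret", return_dense=True)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     body = json.load(resp)
```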

POST /rerank

Ranks a list of passages by relevance to a query.

Request:

{
  "query": "What is machine learning?",
  "passages": [
    "Machine learning is a subset of AI.",
    "The weather is nice today.",
    "Deep learning uses neural networks."
  ],
  "normalize": true
}

Response:

{
  "results": [
    {"index": 0, "passage": "Machine learning is a subset of AI.", "score": 0.987},
    {"index": 2, "passage": "Deep learning uses neural networks.", "score": 0.821},
    {"index": 1, "passage": "The weather is nice today.", "score": 0.003}
  ],
  "model_name": "BAAI/bge-reranker-v2-m3",
  "processing_time_ms": 52.2,
  "warnings": []
}
  • normalize: true returns a score in [0, 1] (sigmoid)
  • normalize: false returns a raw score (negative values possible)
  • With QWEN selected, scores are yes-probabilities and normalize is kept as an API-compatible no-op
  • Do not compare BGE normalize: false raw logits directly with QWEN scores
  • Passages are returned sorted by descending score
  • The index field returns the original position in the input list
  • model_name reports the reranker selected at startup
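
Because results come back sorted by score, a caller that needs scores aligned with its own passage list can use index to restore input order. A minimal sketch, assuming the response shape shown above; scores_in_input_order is a hypothetical helper, not part of the server.

```python
def scores_in_input_order(response: dict, n_passages: int) -> list:
    """Map reranked results back to the order of the submitted passages."""
    scores = [None] * n_passages  # None marks a passage missing from results
    for item in response["results"]:
        scores[item["index"]] = item["score"]
    return scores


sample = {
    "results": [
        {"index": 0, "passage": "Machine learning is a subset of AI.", "score": 0.987},
        {"index": 2, "passage": "Deep learning uses neural networks.", "score": 0.821},
        {"index": 1, "passage": "The weather is nice today.", "score": 0.003},
    ]
}
ordered = scores_in_input_order(sample, 3)
print(ordered)  # [0.987, 0.003, 0.821]
```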

For over-token query-passage pairs, the server preserves the query where possible, truncates passages from the end, returns 200 OK, and includes query_truncated or passage_truncated entries in warnings.

Warning token counts are computed during server-side preparation. Rerank inputs are then decoded back to text and tokenized again by the model backend, so original_tokens, max_tokens, and truncated_tokens should be treated as diagnostic metadata rather than exact proof of final backend tokenization.

cURL:

curl -X POST "http://localhost:8000/rerank" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'

If API_TOKEN is set:

curl -X POST "http://localhost:8000/rerank" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'

GET /health

curl "http://localhost:8000/health"

Returns server status, GPU info, active embedding/reranker models, batch size.

Relevant model fields:

{
  "version": "1.2.0",
  "model": "BAAI/bge-m3",
  "dense_embedding_model": "Qwen/Qwen3-Embedding-0.6B",
  "reranker_model": "BAAI/bge-reranker-v2-m3"
}
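
Scripts that start the server and then send traffic can poll /health instead of watching the log. The sketch below treats any successful JSON response as ready, which is an assumption: the README does not state whether /health answers before the models finish loading. wait_until_ready and fetch_health are hypothetical names.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """Return the parsed /health payload, raising OSError on connection failure."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until fetch() succeeds; fetch and sleep are injectable for testing."""
    for _ in range(attempts):
        try:
            return fetch()
        except OSError:
            # URLError, ConnectionError and TimeoutError are all OSError subclasses.
            sleep(delay)
    raise TimeoutError(f"server not ready after {attempts} attempts")


# Usage against a running server:
# health = wait_until_ready()
# print(health["reranker_model"])
```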

GET /stats

curl "http://localhost:8000/stats"

Returns uptime, total requests, total sentences, total batches, rejected requests, hardware.


GET /metrics

curl "http://localhost:8000/metrics"

Prometheus scraping endpoint in text/plain format.


GET /docs

Interactive Swagger documentation: http://localhost:8000/docs

If API_TOKEN is configured, Swagger shows the lock on POST endpoints. Use the Authorize button and enter only the token, without Bearer prefix.


Configuration

Limits are tunable via environment variable (override in docker-compose.gpu.yml or shell before startup):

| Env var | Default | Description |
| --- | --- | --- |
| PORT | 8000 | Host port to expose. Docker: published host port (container stays on 8000). Local: uvicorn --port. Set in .env or shell if 8000 is taken |
| BGE_EMBED_MAX_LENGTH | MAX_INPUT_LENGTH fallback / 8192 | Max tokens for BGE-M3 embedding input; applies to dense, sparse, and ColBERT outputs |
| QWEN_EMBED_MAX_LENGTH | 32768 | Max tokens for Qwen dense embedding input |
| BGE_RERANK_MAX_LENGTH | 8192 | Max query+passage tokens for BGE rerank; BAAI notes this reranker was fine-tuned at 1024 and recommends 1024 for practical use |
| QWEN_RERANK_MAX_LENGTH | 32768 | Max query+passage tokens for Qwen reranker when QWEN is selected |
| MAX_INPUT_LENGTH | 8192 | Legacy fallback for BGE_EMBED_MAX_LENGTH; prefer the backend-specific variables above |
| REQUEST_TIMEOUT | 90 | Global HTTP timeout (sec); keep above RERANK_GPU_TIMEOUT |
| DENSE_EMBEDDING_MODEL | BAAI/bge-m3 | Dense embedding backend selected by launcher (BAAI/bge-m3 or Qwen/Qwen3-Embedding-0.6B) |
| RERANKER_MODEL | BAAI/bge-reranker-v2-m3 | Reranker selected by launcher (BAAI/bge-reranker-v2-m3 or Qwen/Qwen3-Reranker-0.6B) |
| QWEN_RERANK_BATCH_SIZE | launcher-tuned / 16 fallback | Max query-passage pairs per Qwen reranker micro-batch |
| API_TOKEN | empty | Optional bearer token for non-public endpoints; empty disables authentication |
| MAX_QUEUE_SIZE | 200 | Max queued requests for /embeddings/ (backpressure) |
| RERANK_MAX_QUEUE | 32 | Max concurrent slots for /rerank (backpressure) |
| RERANK_GPU_TIMEOUT | 60 | Hard timeout for a single rerank inference (sec); keep below REQUEST_TIMEOUT |
| RATE_LIMIT_REQUESTS_PER_MINUTE | 3600 | Rate limit per IP (60 req/s) |
| RATE_LIMIT_BURST_SIZE | 120 | Token bucket burst (~2s of traffic) |
| PYTORCH_CUDA_ALLOC_CONF | expandable_segments:True | CUDA caching-allocator config; reduces fragmentation OOM on variable-length batches (single-GPU, no NCCL). Set in Dockerfile.gpu and docker-compose.gpu.yml |

Texts longer than the active backend-specific token limit are truncated and reported in the response warnings array. The server no longer rejects requests based on character-count payload limits.

With API_TOKEN set, all non-public endpoints require:

Authorization: Bearer <token>

Service endpoints (/health, /stats, /metrics, /docs, /redoc, /openapi.json) remain accessible without token.

When DENSE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B, only dense vectors change. Sparse lexical weights and ColBERT vectors still come from BAAI/bge-m3, so mixed requests are supported through the same /embeddings/ endpoint.

The Qwen dense path intentionally does not add query/document instruction prefixes. This keeps the existing /embeddings/ API transparent, but deployments optimizing retrieval quality should benchmark task-specific Qwen formatting separately before changing request semantics.

When QWEN reranking is selected and GPU mode is used, the launch scripts can auto-tune Qwen rerank limits from detected GPU VRAM if QWEN_RERANK_BATCH_SIZE or QWEN_RERANK_MAX_LENGTH are not set in the environment or .env:

| GPU VRAM | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |
| --- | --- | --- |
| <= 6 GB | 4 | 4096 |
| <= 8 GB | 8 | 8192 |
| > 8 GB | 16 | 8192 |

When QWEN reranking is selected in CPU mode, the launch scripts use conservative defaults unless overridden:

| Device | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |
| --- | --- | --- |
| CPU | 1 | 2048 |

.env.example pins QWEN_RERANK_MAX_LENGTH=32768, the documented model maximum. To use the launcher VRAM-tuned values above, leave QWEN_RERANK_MAX_LENGTH unset in your shell and remove or comment it from your local .env.
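
The precedence described above (explicit setting first, launcher-tuned value otherwise) can be sketched as follows. This is not the launcher's code: resolve_qwen_rerank_limits is a hypothetical name, vram_gb=None stands for CPU mode, and resolving each variable independently is an assumption, since the README does not say whether the two are tuned together.

```python
def tuned_qwen_rerank_limits(vram_gb=None):
    """Return (batch_size, max_length) from the VRAM tables; None means CPU."""
    if vram_gb is None:
        return (1, 2048)
    if vram_gb <= 6:
        return (4, 4096)
    if vram_gb <= 8:
        return (8, 8192)
    return (16, 8192)


def resolve_qwen_rerank_limits(env: dict, vram_gb=None) -> dict:
    """Prefer values already present in env; fall back to the tuned ones."""
    batch_size, max_length = tuned_qwen_rerank_limits(vram_gb)
    return {
        "QWEN_RERANK_BATCH_SIZE": int(env.get("QWEN_RERANK_BATCH_SIZE", batch_size)),
        "QWEN_RERANK_MAX_LENGTH": int(env.get("QWEN_RERANK_MAX_LENGTH", max_length)),
    }


# An 8 GB GPU with nothing set picks the tuned row.
print(resolve_qwen_rerank_limits({}, vram_gb=8))
# A .env that pins the max length overrides only that value.
print(resolve_qwen_rerank_limits({"QWEN_RERANK_MAX_LENGTH": "32768"}, vram_gb=6))
```

The second call illustrates why the pinned value in .env.example has to be removed before the VRAM-tuned length can take effect.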

The default limits are tuned for an NVIDIA RTX 4060 Laptop 8GB, where observed throughput is ~29-45 req/s at conc=8 depending on scenario (see benchmark table above). To override them:

# Ad-hoc (shell env)
RATE_LIMIT_REQUESTS_PER_MINUTE=10000 docker compose -f docker-compose.gpu.yml up -d

# Persistent - copy .env.example to .env and modify
cp .env.example .env
docker compose -f docker-compose.gpu.yml up -d

Compose automatically loads .env in the same directory. .env is in .gitignore; .env.example is the versioned template.

MULTI_GPU_DEVICES = None (in bge-m3_server.py) can be changed to ['cuda:0', 'cuda:1'] for multi-GPU.

Embedding batch size is automatically calculated from available VRAM and the active embedding max length. With default BGE embeddings, this is BGE_EMBED_MAX_LENGTH=8192. If Qwen dense embeddings are selected, the tuning uses the larger of BGE_EMBED_MAX_LENGTH and QWEN_EMBED_MAX_LENGTH, because mixed dense+sparse/ColBERT requests can exercise both tokenizers:

| Condition | batch_size | MAX_REQUESTS_IN_BATCH |
| --- | --- | --- |
| GPU > 8 GB | 12 | 16 |
| GPU <= 8 GB and > 4 GB | 6 | 16 |
| GPU <= 4 GB | 3 | 16 |
| CPU | 1 | 8 |

If the active embedding tuning length is <=512, the server switches to the short-sequence profile:

| VRAM | batch_size | MAX_REQUESTS_IN_BATCH |
| --- | --- | --- |
| > 8 GB | 128 | 64 |
| > 6 GB | 64 | 32 |
| > 4 GB | 32 | 16 |
| <= 4 GB | 16 | 16 |
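
Read together, the two tables amount to a lookup on VRAM and tuning length. The sketch below restates them as a function; it is not the server's code, the names are hypothetical, and it assumes the CPU row applies regardless of sequence length, which the README does not state.

```python
def tuning_length(bge_max_length: int, qwen_max_length=None) -> int:
    """Use the larger limit when Qwen dense embeddings are selected."""
    if qwen_max_length is None:
        return bge_max_length
    return max(bge_max_length, qwen_max_length)


def embedding_batch_profile(vram_gb, length: int) -> tuple:
    """Return (batch_size, MAX_REQUESTS_IN_BATCH); vram_gb=None means CPU."""
    if vram_gb is None:
        return (1, 8)  # assumption: CPU row is used for any length
    if length <= 512:
        # Short-sequence profile
        if vram_gb > 8:
            return (128, 64)
        if vram_gb > 6:
            return (64, 32)
        if vram_gb > 4:
            return (32, 16)
        return (16, 16)
    # Default long-sequence profile
    if vram_gb > 8:
        return (12, 16)
    if vram_gb > 4:
        return (6, 16)
    return (3, 16)


print(embedding_batch_profile(8, tuning_length(8192)))         # (6, 16)
print(embedding_batch_profile(8, tuning_length(512)))          # (64, 32)
print(embedding_batch_profile(12, tuning_length(8192, 32768))) # (12, 16)
```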

Prometheus Metrics

Embedding

| Metric | Type | Labels / notes |
| --- | --- | --- |
| embedding_requests_total | Counter | status, endpoint |
| embedding_requests_rejected_total | Counter | reason |
| embedding_sentences_processed_total | Counter | - |
| embedding_request_duration_seconds | Histogram | endpoint |
| embedding_batch_size | Histogram | - |
| embedding_gpu_inference_duration_seconds | Histogram | - |
| embedding_queue_size | Gauge | - |
| embedding_active_requests | Gauge | - |
| embedding_gpu_memory_allocated_bytes | Gauge | Legacy process GPU allocated memory, kept for existing dashboards |
| embedding_gpu_memory_reserved_bytes | Gauge | Legacy process GPU reserved memory, kept for existing dashboards |
| embedding_server_info | Info | model, dense_embedding_model, bge_embed_max_length, qwen_embed_max_length, bge_rerank_max_length, qwen_rerank_max_length, version, gpu_available, device |

GPU Process

These gauges are process-level CUDA readings updated after embedding and rerank inference. They include all loaded models and both endpoint paths.

| Metric | Type | Description |
| --- | --- | --- |
| gpu_memory_allocated_bytes | Gauge | Process GPU tensor memory allocated by PyTorch |
| gpu_memory_reserved_bytes | Gauge | Process GPU memory reserved by the PyTorch caching allocator |
| gpu_memory_free_bytes | Gauge | CUDA device free memory from torch.cuda.mem_get_info() |
| gpu_memory_total_bytes | Gauge | CUDA device total memory from torch.cuda.mem_get_info() |

Reranker

| Metric | Type | Label |
| --- | --- | --- |
| rerank_requests_total | Counter | status |
| rerank_requests_rejected_total | Counter | reason |
| rerank_pairs_processed_total | Counter | - |
| rerank_request_duration_seconds | Histogram | - |
| rerank_inference_duration_seconds | Histogram | - |
| rerank_active_requests | Gauge | - |

Useful PromQL Queries

# Throughput embedding (req/sec)
rate(embedding_requests_total[1m])

# Latency P95
histogram_quantile(0.95, rate(embedding_request_duration_seconds_bucket[5m]))

# Error rate (%)
sum(rate(embedding_requests_total{status="error"}[5m])) / sum(rate(embedding_requests_total[5m])) * 100

# GPU tensor memory allocated by PyTorch (GB)
gpu_memory_allocated_bytes / 1024 / 1024 / 1024

# GPU memory reserved by PyTorch caching allocator (GB)
gpu_memory_reserved_bytes / 1024 / 1024 / 1024

# CUDA device free memory (GB)
gpu_memory_free_bytes / 1024 / 1024 / 1024

# Reranker throughput (pairs/sec)
rate(rerank_pairs_processed_total[1m])

Setup Prometheus

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'smart-embedder'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Grafana Dashboard - Recommended Panels

  1. Embedding Request Rate - rate(embedding_requests_total[1m])
  2. Latency P50/P95/P99 - histogram_quantile(0.X, ...)
  3. Queue Size - embedding_queue_size
  4. GPU Memory - gpu_memory_allocated_bytes, gpu_memory_reserved_bytes, gpu_memory_free_bytes
  5. Rerank Request Rate - rate(rerank_requests_total[1m])
  6. Batch Size Distribution - embedding_batch_size

Security and Limits

Rate Limiting

  • Algorithm: Token Bucket per IP
  • Limit: RATE_LIMIT_REQUESTS_PER_MINUTE=3600 req/min, RATE_LIMIT_BURST_SIZE=120
  • Response: HTTP 429 with header Retry-After: 60
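
For readers unfamiliar with the algorithm, a token bucket with these defaults refills at 60 tokens per second up to a cap of 120. The class below is a generic sketch of that technique, not the server's middleware; TokenBucket and its injectable clock are illustrative names, and a per-IP limiter would keep one bucket per client address.

```python
import time


class TokenBucket:
    """Generic token bucket: refill continuously, spend one token per request."""

    def __init__(self, rate_per_minute=3600, burst=120, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)          # start full so bursts are allowed
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a server would answer HTTP 429 here


# Demonstration with a fake clock so the result does not depend on timing.
fake_now = [0.0]
bucket = TokenBucket(clock=lambda: fake_now[0])
burst_allowed = sum(bucket.allow() for _ in range(130))
fake_now[0] = 1.0  # one second later: 60 tokens have been refilled
after_one_second = sum(bucket.allow() for _ in range(100))
print(burst_allowed, after_one_second)  # 120 60
```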

GPU Execution

  • Embedding and rerank inference share a single-worker GPU executor, so the two paths never run forward passes concurrently. This bounds peak VRAM to the larger resident model instead of the sum, preventing concurrency-driven CUDA OOM on small GPUs. The CUDA default stream already serializes kernels, so this costs effectively no throughput.
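
The serialization property can be illustrated with a standard-library executor limited to one worker. This is a sketch of the pattern under the assumption that inference is dispatched from asyncio handlers; fake_forward stands in for a real model forward pass and all names are hypothetical.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# One worker means submitted jobs run strictly one at a time.
gpu_executor = ThreadPoolExecutor(max_workers=1)

_lock = threading.Lock()
_active = 0
peak_active = 0


def fake_forward(name: str) -> str:
    """Stand-in for a forward pass; records how many run concurrently."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.05)  # simulated inference time
    with _lock:
        _active -= 1
    return name


async def main() -> list:
    loop = asyncio.get_running_loop()
    jobs = [
        loop.run_in_executor(gpu_executor, fake_forward, name)
        for name in ("embed", "rerank", "embed")
    ]
    return await asyncio.gather(*jobs)


results = asyncio.run(main())
print(results, peak_active)  # ['embed', 'rerank', 'embed'] 1
```

Even though three jobs are submitted at once, peak_active never exceeds 1, which is the behavior that keeps embedding and rerank passes from overlapping.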

Backpressure

  • /embeddings/ queue max: MAX_QUEUE_SIZE=200
  • /rerank slots max: RERANK_MAX_QUEUE=32 (admission bound on the shared single-worker GPU executor)
  • Acquire timeout: 0.5s
  • Rejections are reflected in both /stats (rejected_requests) and Prometheus (embedding_requests_rejected_total or rerank_requests_rejected_total, depending on endpoint).
  • Rate limit uses direct connection IP (request.client.host). If the server is behind a trusted reverse proxy, update the middleware to extract IP from X-Forwarded-For.
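
Slot-based admission with an acquire timeout can be sketched with an asyncio semaphore. This illustrates the general technique only; the server's actual mechanism is not shown in this README, and Overloaded and run_with_slot are hypothetical names.

```python
import asyncio


class Overloaded(Exception):
    """Raised when no slot frees up in time; a handler would return HTTP 503."""


async def run_with_slot(slots: asyncio.Semaphore, work, acquire_timeout: float = 0.5):
    """Run work() only after a slot is acquired within acquire_timeout seconds."""
    try:
        await asyncio.wait_for(slots.acquire(), timeout=acquire_timeout)
    except asyncio.TimeoutError:
        raise Overloaded("no free slot") from None
    try:
        return await work()
    finally:
        slots.release()  # always free the slot, even if work() fails


async def demo():
    slots = asyncio.Semaphore(1)  # the README default for /rerank is 32

    async def slow():
        await asyncio.sleep(0.2)
        return "done"

    first = asyncio.create_task(run_with_slot(slots, slow, acquire_timeout=0.05))
    await asyncio.sleep(0.01)  # let the first request take the only slot
    try:
        await run_with_slot(slots, slow, acquire_timeout=0.05)
        rejected = False
    except Overloaded:
        rejected = True
    return rejected, await first


outcome = asyncio.run(demo())
print(outcome)  # (True, 'done')
```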

Timeout

  • REQUEST_TIMEOUT=90s is the global HTTP timeout (504 to the caller).
  • GPU_PROCESS_TIMEOUT=15s (CUDA) / 30s (CPU) limits embedding batch inference on the thread pool.
  • RERANK_GPU_TIMEOUT=60s limits rerank inference and should stay below REQUEST_TIMEOUT.
  • Timeouts are tracked in Prometheus as embedding_requests_total{status="timeout"} or rerank_requests_total{status="timeout"}.

Graceful Shutdown

  • Blocks new requests (middleware)
  • Waits for queue drain
  • Completes in-flight requests (max 30s)
  • Cancels processing loop and closes the shared GPU executor

Troubleshooting

Server Won't Start

python -c "import torch; print(torch.cuda.is_available())"
pip install -r requirements-gpu.txt --upgrade

429 Too Many Requests Errors

Client exceeds rate limit. Increase RATE_LIMIT_REQUESTS_PER_MINUTE or reduce call frequency.

503 Service Unavailable Errors

The embedding queue (MAX_QUEUE_SIZE) or the rerank slots (RERANK_MAX_QUEUE) are full. Increase the relevant limit or scale horizontally with a load balancer.

504 Gateway Timeout Errors

Embedding inference exceeded GPU_PROCESS_TIMEOUT (15s on CUDA, 30s on CPU) or rerank inference exceeded RERANK_GPU_TIMEOUT. Reduce batch size or check GPU availability.

Prometheus Metrics Not Visible

curl http://localhost:8000/metrics

Verify that the target in prometheus.yml is reachable and that port 8000 is not blocked by a firewall.

Docker: GPU Not Detected in Container

# Verify NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.6.3-runtime-ubuntu22.04 nvidia-smi

If it fails: reinstall NVIDIA Container Toolkit and restart Docker.

Docker: CUDA Tag Not Found

Error: manifest for nvidia/cuda:12.6.3-runtime-ubuntu22.04 not found

Find a valid tag on hub.docker.com/r/nvidia/cuda/tags and update the first line of Dockerfile.gpu.

Docker: Container Unhealthy on First Startup

The default Compose and Dockerfile healthchecks allow a 300s startup period for first-run model downloads. On slow networks or empty caches, raise the healthcheck start period above 300s in a custom Compose override, for example:

start_period: 600s

License

Follows the selected model licenses (BAAI/bge-m3, BAAI/bge-reranker-v2-m3, and optionally Qwen/Qwen3-Embedding-0.6B and Qwen/Qwen3-Reranker-0.6B).
