SMART EMBEDDER

Smart Embedder is a lightweight, self-hosted embedding server built for hybrid search pipelines. It runs entirely on your own hardware — on an NVIDIA GPU for high throughput, or on CPU when no GPU is available — with no cloud dependency and no data leaving your machine.

Hybrid search combines dense vector similarity, sparse lexical matching (BM25-style), and optional ColBERT late-interaction scoring into a single retrieval pipeline. Smart Embedder exposes all three vector types from a single endpoint, plus a reranking endpoint to re-score candidate passages after retrieval — everything a hybrid search stack needs in one lightweight service.

The server is built on FastAPI and wraps BAAI/bge-m3, a single model that produces dense, sparse, and ColBERT vectors simultaneously. The default stack uses BGE-M3 for all vector types and bge-reranker-v2-m3 for reranking — a solid baseline that runs on 8 GB VRAM or CPU with no extra configuration. For higher retrieval quality, Smart Embedder offers two optional upgrades selectable independently at startup:

Dense embedding → Qwen3-Embedding-0.6B: replaces only the dense vector path; sparse and ColBERT vectors still come from BGE-M3, keeping the full hybrid signal intact.
Reranking → Qwen3-Reranker-0.6B: replaces the cross-encoder reranker with a stronger model at the cost of higher inference time.

Both Qwen models are still compact (0.6B parameters) but benefit from a dedicated GPU — a machine with 8 GB+ VRAM will see the best results. CPU execution remains supported for both, with conservative batch sizes applied automatically.

Key properties at a glance:

Property	Detail
Deployment	Local — GPU (NVIDIA CUDA) or CPU, Docker or Python venv
Hybrid search vectors	Dense + sparse lexical + ColBERT from one model, one endpoint
Reranking	Cross-encoder passage reranking, same service
Footprint	BGE-M3 + reranker fit in 8 GB VRAM; CPU mode needs no GPU
Qdrant-ready	Sparse vectors in native `{indices, values}` format via `sparse_as_indices`

High-performance FastAPI server for BGE-M3 embeddings and selectable reranking:

Feature	Detail
Embeddings	BGE-M3 dense/sparse/ColBERT by default; optional Qwen3 dense with BGE-M3 sparse and ColBERT
Reranking	Interactive startup choice: `BAAI/bge-reranker-v2-m3` or `Qwen/Qwen3-Reranker-0.6B`
Authentication	Optional Bearer token on non-public endpoints
Rate Limiting	Token bucket, 3600 req/min per IP, burst 120
Backpressure	Embedding queue max 200, rerank slots max 32, HTTP 503 on overflow
Graceful Shutdown	30s drain for in-flight requests
Prometheus Metrics	Counters, histograms, gauges for both models
Dynamic Batching	Embedding batch size adapts to GPU VRAM and max input length at startup

Quick Start

1. Setup

python -m venv .venv
.venv\Scripts\activate
REM Linux/macOS instead: source .venv/bin/activate
pip install -r requirements-gpu.txt
REM CPU-only machine instead: pip install -r requirements-cpu.txt

2. Start the Server

start_server.bat and start_server.sh take two optional parameters that select the execution target and the device:

start_server.bat [local|docker] [cpu|gpu|auto]

./start_server.sh [local|docker] [cpu|gpu|auto]

Command	What it does
`start_server.bat` / `./start_server.sh`	Default: `docker auto`
`start_server.bat docker auto`	Docker startup with device auto-detection
`start_server.bat local gpu`	Local venv, CUDA auto-detect
`start_server.bat local cpu`	Local venv, forces CPU (`CUDA_VISIBLE_DEVICES=-1`)
`start_server.bat docker gpu`	Runs `docker compose -f docker-compose.gpu.yml build`, then `up -d`, with NVIDIA runtime
`start_server.bat docker cpu`	Compose with override `docker-compose.cpu.yml` (no GPU)

Arguments are case-insensitive. Built-in validation: unrecognized parameters print usage and exit with code 1.

Startup asks two independent model questions:

Dense embedding backend: choose BGE for current all-BGE embeddings, or QWEN to return Qwen dense vectors while keeping BGE sparse and ColBERT vectors.
Reranker backend: choose BGE or QWEN for /rerank.

The interactive model selection is provided by the launcher scripts. Direct uvicorn or docker compose startup does not prompt; it uses environment variables or .env, falling back to BGE defaults.
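As a rough illustration of that fallback order, the sketch below resolves both backends from a mapping of environment variables. The helper name `resolve_backends` is hypothetical and not taken from bge-m3_server.py; only the variable names and default values come from the Configuration table. Reading `.env` is left out here.

```python
import os

# Defaults documented in the Configuration table.
DEFAULT_DENSE = "BAAI/bge-m3"
DEFAULT_RERANKER = "BAAI/bge-reranker-v2-m3"


def resolve_backends(env=None):
    """Return (dense_model, reranker_model), falling back to the BGE defaults.

    Hypothetical helper: the real server may resolve these differently.
    """
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the BGE default.
    dense = env.get("DENSE_EMBEDDING_MODEL") or DEFAULT_DENSE
    reranker = env.get("RERANKER_MODEL") or DEFAULT_RERANKER
    return dense, reranker
```

For example, `resolve_backends({"RERANKER_MODEL": "Qwen/Qwen3-Reranker-0.6B"})` keeps the BGE dense backend and switches only the reranker.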

local mode: requires an existing .venv (see step 1). The script activates the venv, checks for uvicorn, installs dependencies if they are missing, then starts the server.

docker mode: requires Docker Desktop / Engine in PATH. The script builds the image and starts the container in the background. To follow the logs:

docker compose -f docker-compose.gpu.yml logs -f embedder

In Docker Desktop the project appears as smart-embedder (containers smart-embedder-gpu / smart-embedder-cpu).

Or start the server directly, without the wrapper:

uvicorn bge-m3_server:app --host 0.0.0.0 --port 8000

Wait for these log lines:

INFO - Reranker ready.
INFO - Server ready to accept requests

3. Automatic Test

In a second terminal (with server running):

python test_server.py

Expected output: 17/17 tests passed. With --token and API_TOKEN configured, the authentication check is included and the expected output is 18/18 tests passed.

test_server.py accepts --url to point to a different host and --token when API_TOKEN is configured:

python test_server.py --url http://localhost:8000
python test_server.py --token <token>

4. Benchmark

Measures latency (avg/p50/p95/p99) and throughput on embed_dense, embed_full, rerank scenarios:

python benchmark.py --concurrency 8 --requests 100 --batch-size 4

Flag	Default	Description
`--url`	`http://localhost:8000`	Server target
`--token`	`API_TOKEN` env or empty	Bearer token if server requires auth
`--concurrency`	`8`	Concurrent requests in-flight
`--requests`	`100`	Requests per scenario
`--batch-size`	`4`	Sentences/passages per request
`--warmup`	`5`	Warmup requests (excluded from metrics)
`--timeout`	`60`	Timeout for a single request (sec)
`--max-batch-size`	`128`	Local guardrail on payload limits; `0` disables
`--scenarios`	all	CSV: `embed_dense,embed_full,rerank`
`--sleep-between`	`0`	Pause between scenarios in seconds (use `65` if rate limiting is active)

Note: Default rate limits (3600 req/min, burst 120) are tuned for benchmarks on a single client at conc<=16. For extreme stress testing: RATE_LIMIT_REQUESTS_PER_MINUTE=1000000 docker compose -f docker-compose.gpu.yml up -d.

Output: ASCII table with Reqs / OK / Fail / Conc / Wall / Req/s / Units/s / Avg / P50 / P95 / P99 / Min / Max.

Latest measured run (RTX 4060 Laptop 8GB, batch=4, conc=8, transformers==4.57.3):

Scenario	Req/s	Units/s	P50	P95	P99
`embed_dense`	44.5	178	176.9ms	185.4ms	187.3ms
`embed_full` (dense+sparse+colbert)	28.7	115	294.1ms	350.6ms	498.9ms
`rerank`	37.8	151	205.7ms	250.3ms	263.8ms
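The Units/s column in this table is consistent with Req/s multiplied by the batch size (4 sentences or passages per request). The sketch below reproduces that arithmetic and shows one common way (nearest-rank) to compute latency percentiles. Whether benchmark.py uses the same percentile method is an assumption, and both function names are illustrative.

```python
import math


def units_per_second(req_per_s, batch_size):
    """Units/s = requests per second times items per request."""
    return req_per_s * batch_size


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# embed_dense row: 44.5 req/s at batch size 4
print(round(units_per_second(44.5, 4)))  # 178
```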

Docker

Prerequisites

Docker Desktop / Docker Engine with Compose v2+
NVIDIA Container Toolkit

nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

First Startup

# Verify CUDA tag exists before building
docker pull nvidia/cuda:12.6.3-runtime-ubuntu22.04

# Build and GPU startup (first time: downloads selected embedding and reranker models)
docker compose -f docker-compose.gpu.yml up --build

# Or via bat wrapper (Windows)
start_server.bat docker gpu

# CPU execution (compose override)
docker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up --build
# Equivalent:
start_server.bat docker cpu

Wait for these log lines:

INFO - Reranker ready.
INFO - Server ready to accept requests

Server available at http://localhost:8000.
Models are saved in the default named volume smart-embedder-hf-cache and mounted at /app/model_cache; subsequent restarts do not re-download them.

The first Docker startup with QWEN selected downloads the selected Qwen model into the Hugging Face cache volume. Startup can take longer than BGE on an empty cache; later runs reuse the cached model.

Useful Commands

# Startup in background
docker compose -f docker-compose.gpu.yml up -d

# Real-time logs
docker compose -f docker-compose.gpu.yml logs -f embedder

# Stop
docker compose -f docker-compose.gpu.yml down

# Rebuild after code changes (deps cached if requirements-gpu.txt unchanged)
docker compose -f docker-compose.gpu.yml up --build

# Complete rebuild from scratch
docker compose -f docker-compose.gpu.yml build --no-cache

Verify GPU in Container

docker compose -f docker-compose.gpu.yml run --rm embedder python3 -c "
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
"

Exposure on Local Network

By default the Docker setup publishes the server on 127.0.0.1:8000 (localhost only).

To change the exposed port (e.g. 8000 already in use), set PORT in .env or shell before startup. Docker remaps the published host port; the container stays on 8000 internally:

PORT=8001 docker compose -f docker-compose.gpu.yml up -d
# or persistent: add PORT=8001 to .env

In local mode the launcher passes PORT to uvicorn --port.

For LAN access modify docker-compose.gpu.yml (remove the 127.0.0.1: bind prefix):

ports:
  - "${PORT:-8000}:8000"

Warning: If exposed on network, add a reverse proxy with authentication (nginx, Traefik).

Project Files

File	Description
`bge-m3_server.py`	Main server
`requirements-gpu.txt`	Python dependencies (GPU / CUDA PyTorch wheel)
`requirements-cpu.txt`	Python dependencies (CPU-only PyTorch wheel)
`Dockerfile.gpu`	GPU image build (CUDA 12.6, non-root, hardened)
`Dockerfile.cpu`	CPU-only image build (slim Python base, no CUDA)
`docker-compose.gpu.yml`	Container orchestration with GPU and model volume
`docker-compose.cpu.yml`	Compose override: slim CPU image, removes GPU reservation
`.env.example`	Environment variables template (copy to `.env` for local override)
`.dockerignore`	Excludes `.venv`, cache, docs from build context
`start_server.bat`	Windows startup script parameterized (`local|docker` x `cpu|gpu|auto`)
`start_server.sh`	Unix shell startup script parameterized (`local|docker` x `cpu|gpu|auto`)
`test_server.py`	Runtime test suite (17 checks, 18 with `--token`)
`benchmark.py`	Benchmark latency/throughput with summary table

API Endpoints

`POST /embeddings/`

Generates embeddings for a list of texts.

Request:

{
  "sentences": ["Hello world!", "Ciao mondo!"],
  "return_dense": true,
  "return_sparse": true,
  "return_colbert": true,
  "normalize_dense": false,
  "sparse_as_indices": false
}

sparse_as_indices (default: false): When true, sparse vectors are returned in Qdrant-compatible format instead of the default token-id dict:

"sparse": {"indices": [10, 1389, 2349], "values": [0.277, 0.292, 0.313]}

Use with SparseVector(indices=..., values=...) when upserting to Qdrant.
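If a client has already stored vectors in the default token-id dict format, the same shape can be produced on the client side. The sketch below is a minimal conversion; the function name is hypothetical, and sorting by token id is a choice made here, not a documented server behavior.

```python
def sparse_dict_to_indices(sparse):
    """Convert {"token_id": weight} into {"indices": [...], "values": [...]}."""
    # JSON object keys arrive as strings, so cast them back to integer token ids.
    pairs = sorted((int(token_id), float(weight)) for token_id, weight in sparse.items())
    return {
        "indices": [token_id for token_id, _ in pairs],
        "values": [weight for _, weight in pairs],
    }


print(sparse_dict_to_indices({"435": 0.12, "12": 0.08}))
# {'indices': [12, 435], 'values': [0.08, 0.12]}
```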

The active embedding backends are selected at server startup. With the default BGE dense backend, dense, sparse, and colbert all come from BAAI/bge-m3. With Qwen dense selected, only dense changes to Qwen/Qwen3-Embedding-0.6B; sparse and colbert still come from BAAI/bge-m3.

Response:

{
  "data": [
    {
      "id": 0,
      "text": "Hello world!",
      "embeddings": {
        "dense": [0.021, -0.013, ...],
        "sparse": {"12": 0.08, "435": 0.12, ...},
        "colbert": [[0.01, ...], ...]
      }
    }
  ],
  "model_name": "Qwen/Qwen3-Embedding-0.6B",
  "dense_model_name": "Qwen/Qwen3-Embedding-0.6B",
  "sparse_model_name": "BAAI/bge-m3",
  "colbert_model_name": "BAAI/bge-m3",
  "processing_time_ms": 104.5,
  "warnings": [
    {
      "code": "input_truncated",
      "severity": "warning",
      "message": "Input text was truncated to the model token limit.",
      "target": {
        "field": "sentences",
        "index": 0,
        "pointer": "/sentences/0"
      },
      "details": {
        "model": "BAAI/bge-m3",
        "max_tokens": 8192,
        "original_tokens": 9000,
        "truncated_tokens": 808,
        "truncation_side": "end"
      }
    }
  ]
}
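Because truncation is reported in warnings rather than as an error, clients that care about it need to inspect the array themselves. A minimal sketch, assuming the response has already been decoded into a Python dict with the fields shown above (the function name is hypothetical):

```python
def truncated_inputs(response):
    """Return (index, original_tokens, max_tokens) for each truncated sentence."""
    found = []
    for warning in response.get("warnings", []):
        if warning.get("code") != "input_truncated":
            continue
        target = warning.get("target", {})
        details = warning.get("details", {})
        found.append(
            (target.get("index"), details.get("original_tokens"), details.get("max_tokens"))
        )
    return found
```

For the example response above this returns `[(0, 9000, 8192)]`.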

cURL:

curl -X POST "http://localhost:8000/embeddings/" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'

If API_TOKEN is set:

curl -X POST "http://localhost:8000/embeddings/" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"sentences": ["Hello world!"], "return_dense": true}'
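The same calls can be made from Python with only the standard library. This is a sketch rather than an official client: `build_embeddings_request` and `send` are hypothetical helper names, and the base URL is the local default used throughout this README.

```python
import json
import urllib.request


def build_embeddings_request(base_url, sentences, token=None, **options):
    """Build (but do not send) a POST /embeddings/ request."""
    payload = {"sentences": list(sentences), **options}
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when API_TOKEN is set on the server.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings/",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_embeddings_request("http://localhost:8000", ["Hello world!"], return_dense=True))` returns the decoded response dict.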

`POST /rerank`

Ranks a list of passages by relevance to a query.

Request:

{
  "query": "What is machine learning?",
  "passages": [
    "Machine learning is a subset of AI.",
    "The weather is nice today.",
    "Deep learning uses neural networks."
  ],
  "normalize": true
}

Response:

{
  "results": [
    {"index": 0, "passage": "Machine learning is a subset of AI.", "score": 0.987},
    {"index": 2, "passage": "Deep learning uses neural networks.", "score": 0.821},
    {"index": 1, "passage": "The weather is nice today.", "score": 0.003}
  ],
  "model_name": "BAAI/bge-reranker-v2-m3",
  "processing_time_ms": 52.2,
  "warnings": []
}

normalize: true returns a score in [0, 1] (sigmoid)
normalize: false returns a raw score (negative values possible)
With QWEN selected, scores are yes-probabilities and normalize is kept as an API-compatible no-op
Do not compare BGE normalize: false raw logits directly with QWEN scores
Passages are returned sorted by descending score
The index field returns the original position in the input list
model_name reports the reranker selected at startup
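The BGE scoring and ordering rules above can be sketched as follows. This illustrates the documented response semantics, not the server's own code, and the raw scores in the usage note are made up.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a raw logit to [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_passages(passages, raw_scores, normalize=True):
    """Attach scores, keep the original index, and sort by descending score."""
    scores = [sigmoid(s) for s in raw_scores] if normalize else list(raw_scores)
    results = [
        {"index": i, "passage": p, "score": s}
        for i, (p, s) in enumerate(zip(passages, scores))
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Calling `rank_passages(["a", "b", "c"], [4.0, -6.0, 1.5])` yields the indices in the order 0, 2, 1; the ordering is the same with `normalize=False` because the sigmoid is monotonic.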

For over-token query-passage pairs, the server preserves the query where possible, truncates passages from the end, returns 200 OK, and includes query_truncated or passage_truncated entries in warnings.

Warning token counts are computed during server-side preparation. Rerank inputs are then decoded back to text and tokenized again by the model backend, so original_tokens, max_tokens, and truncated_tokens should be treated as diagnostic metadata rather than exact proof of final backend tokenization.

cURL:

curl -X POST "http://localhost:8000/rerank" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'

If API_TOKEN is set:

curl -X POST "http://localhost:8000/rerank" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["ML is AI", "Nice weather"], "normalize": true}'

`GET /health`

curl "http://localhost:8000/health"

Returns server status, GPU info, active embedding/reranker models, batch size.

Relevant model fields:

{
  "version": "1.2.0",
  "model": "BAAI/bge-m3",
  "dense_embedding_model": "Qwen/Qwen3-Embedding-0.6B",
  "reranker_model": "BAAI/bge-reranker-v2-m3"
}

`GET /stats`

curl "http://localhost:8000/stats"

Returns uptime, total requests, total sentences, total batches, rejected requests, hardware.

`GET /metrics`

curl "http://localhost:8000/metrics"

Prometheus scraping endpoint in text/plain format.
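For quick checks without a Prometheus server, the text output can be read directly. The parser below is a deliberately small sketch: it assumes sample lines of the form `name{labels} value` with no trailing timestamp, and the function name is hypothetical.

```python
def read_metric(metrics_text, name):
    """Return {label_string: value} for every sample of one metric."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric, _, labels = series.partition("{")
        if metric == name:
            samples[labels.rstrip("}")] = float(value)
    return samples
```

A metric without labels is returned under the empty-string key, for example `{"": 3.0}` for `embedding_queue_size 3.0`.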

`GET /docs`

Interactive Swagger documentation: http://localhost:8000/docs

If API_TOKEN is configured, Swagger shows the lock on POST endpoints. Use the Authorize button and enter only the token, without Bearer prefix.

Configuration

Limits are tunable via environment variables (override in docker-compose.gpu.yml or shell before startup):

Env var	Default	Description
`PORT`	`8000`	Host port to expose. Docker: published host port (container stays on 8000). Local: `uvicorn --port`. Set in `.env` or shell if 8000 is taken
`BGE_EMBED_MAX_LENGTH`	`MAX_INPUT_LENGTH` fallback / `8192`	Max tokens for BGE-M3 embedding input; applies to dense, sparse, and ColBERT outputs
`QWEN_EMBED_MAX_LENGTH`	`32768`	Max tokens for Qwen dense embedding input
`BGE_RERANK_MAX_LENGTH`	`8192`	Max query+passage tokens for BGE rerank; BAAI notes this reranker was fine-tuned at 1024 and recommends 1024 for practical use
`QWEN_RERANK_MAX_LENGTH`	`32768`	Max query+passage tokens for Qwen reranker when QWEN is selected
`MAX_INPUT_LENGTH`	`8192`	Legacy fallback for `BGE_EMBED_MAX_LENGTH`; prefer the backend-specific variables above
`REQUEST_TIMEOUT`	`90`	Global HTTP timeout (sec); keep above `RERANK_GPU_TIMEOUT`
`DENSE_EMBEDDING_MODEL`	`BAAI/bge-m3`	Dense embedding backend selected by launcher (`BAAI/bge-m3` or `Qwen/Qwen3-Embedding-0.6B`)
`RERANKER_MODEL`	`BAAI/bge-reranker-v2-m3`	Reranker selected by launcher (`BAAI/bge-reranker-v2-m3` or `Qwen/Qwen3-Reranker-0.6B`)
`QWEN_RERANK_BATCH_SIZE`	launcher-tuned / `16` fallback	Max query-passage pairs per Qwen reranker micro-batch
`API_TOKEN`	empty	Optional bearer token for non-public endpoints; empty disables authentication
`MAX_QUEUE_SIZE`	`200`	Max requests in queue `/embeddings/` (backpressure)
`RERANK_MAX_QUEUE`	`32`	Max concurrent slots for `/rerank` (backpressure)
`RERANK_GPU_TIMEOUT`	`60`	Hard timeout for a single rerank inference (sec); keep below `REQUEST_TIMEOUT`
`RATE_LIMIT_REQUESTS_PER_MINUTE`	`3600`	Rate limit per IP (60 req/s)
`RATE_LIMIT_BURST_SIZE`	`120`	Token bucket burst (~2s of traffic)
`PYTORCH_CUDA_ALLOC_CONF`	`expandable_segments:True`	CUDA caching-allocator config; reduces fragmentation OOM on variable-length batches (single-GPU, no NCCL). Set in `Dockerfile.gpu` and `docker-compose.gpu.yml`

Texts longer than the active backend-specific token limit are truncated and reported in the response warnings array. The server no longer rejects requests based on character-count payload limits.

With API_TOKEN set, all non-public endpoints require:

Authorization: Bearer <token>

Service endpoints (/health, /stats, /metrics, /docs, /redoc, /openapi.json) remain accessible without token.

When DENSE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B, only dense vectors change. Sparse lexical weights and ColBERT vectors still come from BAAI/bge-m3, so mixed requests are supported through the same /embeddings/ endpoint.

The Qwen dense path intentionally does not add query/document instruction prefixes. This keeps the existing /embeddings/ API transparent, but deployments optimizing retrieval quality should benchmark task-specific Qwen formatting separately before changing request semantics.

When QWEN reranking is selected and GPU mode is used, the launch scripts can auto-tune Qwen rerank limits from detected GPU VRAM if QWEN_RERANK_BATCH_SIZE or QWEN_RERANK_MAX_LENGTH are not set in the environment or .env:

GPU VRAM	QWEN_RERANK_BATCH_SIZE	QWEN_RERANK_MAX_LENGTH
<= 6 GB	4	4096
<= 8 GB	8	8192
> 8 GB	16	8192

When QWEN reranking is selected in CPU mode, the launch scripts use conservative defaults unless overridden:

Device	QWEN_RERANK_BATCH_SIZE	QWEN_RERANK_MAX_LENGTH
CPU	1	2048
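Read together, the two tables amount to a small lookup. The function below restates them in Python for clarity; the real tuning lives in the launch scripts, and the function name is hypothetical.

```python
def qwen_rerank_limits(vram_gb=None):
    """Return (QWEN_RERANK_BATCH_SIZE, QWEN_RERANK_MAX_LENGTH) for a device.

    vram_gb=None stands for CPU mode.
    """
    if vram_gb is None:
        return 1, 2048
    if vram_gb <= 6:
        return 4, 4096
    if vram_gb <= 8:
        return 8, 8192
    return 16, 8192
```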

.env.example pins QWEN_RERANK_MAX_LENGTH=32768, the documented model maximum. To use the launcher VRAM-tuned values above, leave QWEN_RERANK_MAX_LENGTH unset in your shell and remove or comment it from your local .env.

The default limits are tuned for an NVIDIA RTX 4060 Laptop 8GB, where observed throughput is ~29-45 req/s at conc=8 depending on scenario (see benchmark table above). To override them:

# Ad-hoc (shell env)
RATE_LIMIT_REQUESTS_PER_MINUTE=10000 docker compose -f docker-compose.gpu.yml up -d

# Persistent - copy .env.example to .env and modify
cp .env.example .env
docker compose -f docker-compose.gpu.yml up -d

Compose automatically loads .env in the same directory. .env is in .gitignore; .env.example is the versioned template.

MULTI_GPU_DEVICES = None (in bge-m3_server.py) can be changed to ['cuda:0', 'cuda:1'] for multi-GPU.

Embedding batch size is automatically calculated from available VRAM and the active embedding max length. With default BGE embeddings, this is BGE_EMBED_MAX_LENGTH=8192. If Qwen dense embeddings are selected, the tuning uses the larger of BGE_EMBED_MAX_LENGTH and QWEN_EMBED_MAX_LENGTH, because mixed dense+sparse/ColBERT requests can exercise both tokenizers:

Condition	batch_size	MAX_REQUESTS_IN_BATCH
GPU > 8 GB	12	16
GPU <= 8 GB and > 4 GB	6	16
GPU <= 4 GB	3	16
CPU	1	8

If the active embedding tuning length is <=512, the server switches to the short-sequence profile:

VRAM	batch_size	MAX_REQUESTS_IN_BATCH
> 8 GB	128	64
> 6 GB	64	32
> 4 GB	32	16
<= 4 GB	16	16
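The selection logic in the two tables can be written out as a sketch. Function names are hypothetical, and applying the CPU row at every sequence length is an assumption, since the short-sequence table lists only VRAM tiers.

```python
def tuning_length(bge_max_length=8192, qwen_max_length=32768, qwen_dense=False):
    """Length used for batch tuning: the larger limit when Qwen dense is active."""
    return max(bge_max_length, qwen_max_length) if qwen_dense else bge_max_length


def embedding_batch_profile(vram_gb, length):
    """Return (batch_size, MAX_REQUESTS_IN_BATCH); vram_gb=None means CPU."""
    if vram_gb is None:
        # Assumption: the CPU row applies regardless of sequence length.
        return 1, 8
    if length <= 512:  # short-sequence profile
        if vram_gb > 8:
            return 128, 64
        if vram_gb > 6:
            return 64, 32
        if vram_gb > 4:
            return 32, 16
        return 16, 16
    if vram_gb > 8:
        return 12, 16
    if vram_gb > 4:
        return 6, 16
    return 3, 16
```

On an 8 GB GPU with the default 8192-token limit this gives `(6, 16)`; lowering the limit to 512 moves the same GPU to `(64, 32)`.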

Prometheus Metrics

Embedding

Metric	Type	Labels / notes
`embedding_requests_total`	Counter	`status`, `endpoint`
`embedding_requests_rejected_total`	Counter	`reason`
`embedding_sentences_processed_total`	Counter	-
`embedding_request_duration_seconds`	Histogram	`endpoint`
`embedding_batch_size`	Histogram	-
`embedding_gpu_inference_duration_seconds`	Histogram	-
`embedding_queue_size`	Gauge	-
`embedding_active_requests`	Gauge	-
`embedding_gpu_memory_allocated_bytes`	Gauge	Legacy process GPU allocated memory, kept for existing dashboards
`embedding_gpu_memory_reserved_bytes`	Gauge	Legacy process GPU reserved memory, kept for existing dashboards
`embedding_server_info`	Info	`model`, `dense_embedding_model`, `bge_embed_max_length`, `qwen_embed_max_length`, `bge_rerank_max_length`, `qwen_rerank_max_length`, `version`, `gpu_available`, `device`

GPU Process

These gauges are process-level CUDA readings updated after embedding and rerank inference. They include all loaded models and both endpoint paths.

Metric	Type	Description
`gpu_memory_allocated_bytes`	Gauge	Process GPU tensor memory allocated by PyTorch
`gpu_memory_reserved_bytes`	Gauge	Process GPU memory reserved by the PyTorch caching allocator
`gpu_memory_free_bytes`	Gauge	CUDA device free memory from `torch.cuda.mem_get_info()`
`gpu_memory_total_bytes`	Gauge	CUDA device total memory from `torch.cuda.mem_get_info()`

Reranker

Metric	Type	Label
`rerank_requests_total`	Counter	`status`
`rerank_requests_rejected_total`	Counter	`reason`
`rerank_pairs_processed_total`	Counter	-
`rerank_request_duration_seconds`	Histogram	-
`rerank_inference_duration_seconds`	Histogram	-
`rerank_active_requests`	Gauge	-

Useful PromQL Queries

# Throughput embedding (req/sec)
rate(embedding_requests_total[1m])

# Latency P95
histogram_quantile(0.95, rate(embedding_request_duration_seconds_bucket[5m]))

# Error rate (%)
sum(rate(embedding_requests_total{status="error"}[5m])) / sum(rate(embedding_requests_total[5m])) * 100

# GPU tensor memory allocated by PyTorch (GB)
gpu_memory_allocated_bytes / 1024 / 1024 / 1024

# GPU memory reserved by PyTorch caching allocator (GB)
gpu_memory_reserved_bytes / 1024 / 1024 / 1024

# Free CUDA device memory (GB)
gpu_memory_free_bytes / 1024 / 1024 / 1024

# Reranker throughput (pairs/sec)
rate(rerank_pairs_processed_total[1m])

Setup Prometheus

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'smart-embedder'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Grafana Dashboard - Recommended Panels

Embedding Request Rate - rate(embedding_requests_total[1m])
Latency P50/P95/P99 - histogram_quantile(0.X, ...)
Queue Size - embedding_queue_size
GPU Memory - gpu_memory_allocated_bytes, gpu_memory_reserved_bytes, gpu_memory_free_bytes
Rerank Request Rate - rate(rerank_requests_total[1m])
Batch Size Distribution - embedding_batch_size

Security and Limits

Rate Limiting

Algorithm: Token Bucket per IP
Limit: RATE_LIMIT_REQUESTS_PER_MINUTE=3600 req/min, RATE_LIMIT_BURST_SIZE=120
Response: HTTP 429 with header Retry-After: 60
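For readers unfamiliar with the algorithm, here is a minimal token bucket using the default numbers. It is an illustration of the technique, not the server's middleware; the class name and the injectable `clock` parameter are choices made here. A per-IP limiter would keep one such bucket per client address, for example in a dict.

```python
import time


class TokenBucket:
    """Token bucket: refills at a steady rate, allows bursts up to capacity."""

    def __init__(self, requests_per_minute=3600, burst=120, clock=time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens refilled per second
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def allow(self):
        """Consume one token if available; False means respond with HTTP 429."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the defaults, a client can send 120 requests at once and then sustain 60 requests per second.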

GPU Execution

Embedding and rerank inference share a single-worker GPU executor, so the two paths never run forward passes concurrently. This bounds peak VRAM to the larger resident model instead of the sum, preventing concurrency-driven CUDA OOM on small GPUs. The CUDA default stream already serializes kernels, so this costs effectively no throughput.

Backpressure

/embeddings/ queue max: MAX_QUEUE_SIZE=200
/rerank slots max: RERANK_MAX_QUEUE=32 (admission bound on the shared single-worker GPU executor)
Acquire timeout: 0.5s
Rejections are reflected in both /stats (rejected_requests) and Prometheus (embedding_requests_rejected_total or rerank_requests_rejected_total, depending on endpoint).
Rate limit uses direct connection IP (request.client.host). If the server is behind a trusted reverse proxy, update the middleware to extract IP from X-Forwarded-For.
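One way to picture the slot-based admission is a semaphore with a bounded wait. The sketch below assumes the 0.5s acquire timeout applies to slot acquisition and models rejection as a 503 status; the class name and return shape are hypothetical, not the server's API.

```python
import asyncio


class RerankSlots:
    """Admit at most max_slots concurrent jobs; reject when none frees up in time."""

    def __init__(self, max_slots=32, acquire_timeout=0.5):
        self._sem = asyncio.Semaphore(max_slots)
        self._timeout = acquire_timeout

    async def run(self, job):
        """Run job() in a slot, or return (503, None) if no slot is available."""
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout=self._timeout)
        except asyncio.TimeoutError:
            return 503, None
        try:
            return 200, await job()
        finally:
            self._sem.release()
```

Bounding the wait keeps overloaded clients from piling up behind the GPU: they get a fast 503 and can retry or back off.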

Timeout

REQUEST_TIMEOUT=90s is the global HTTP timeout (504 to the caller).
GPU_PROCESS_TIMEOUT=15s (CUDA) / 30s (CPU) limits embedding batch inference on the thread pool.
RERANK_GPU_TIMEOUT=60s limits rerank inference and should stay below REQUEST_TIMEOUT.
Timeouts are tracked in Prometheus as embedding_requests_total{status="timeout"} or rerank_requests_total{status="timeout"}.
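The ordering constraint between the two configurable timeouts can be checked before startup. A small sketch, assuming the values are read from a mapping of environment variables (the function name is hypothetical):

```python
def check_timeouts(env):
    """Return a list of problems with the timeout configuration (empty if fine)."""
    request_timeout = float(env.get("REQUEST_TIMEOUT", 90))
    rerank_timeout = float(env.get("RERANK_GPU_TIMEOUT", 60))
    problems = []
    if rerank_timeout >= request_timeout:
        # Otherwise the global HTTP timeout fires before the rerank timeout can.
        problems.append("RERANK_GPU_TIMEOUT should stay below REQUEST_TIMEOUT")
    return problems
```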

Graceful Shutdown

Blocks new requests (middleware)
Waits for queue drain
Completes in-flight requests (max 30s)
Cancels processing loop and closes the shared GPU executor

Troubleshooting

Server Won't Start

python -c "import torch; print(torch.cuda.is_available())"
pip install -r requirements-gpu.txt --upgrade

`429 Too Many Requests` Errors

Client exceeds rate limit. Increase RATE_LIMIT_REQUESTS_PER_MINUTE or reduce call frequency.

`503 Service Unavailable` Errors

The embedding queue or the rerank slots are full. Increase MAX_QUEUE_SIZE (or RERANK_MAX_QUEUE for /rerank), or scale horizontally behind a load balancer.

`504 Gateway Timeout` Errors

Embedding inference exceeded GPU_PROCESS_TIMEOUT (15s on CUDA, 30s on CPU) or rerank inference exceeded RERANK_GPU_TIMEOUT. Reduce batch size or check GPU availability.

Prometheus Metrics Not Visible

curl http://localhost:8000/metrics

Verify that the target in prometheus.yml is reachable and that port 8000 is not blocked by a firewall.

Docker: GPU Not Detected in Container

# Verify NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.6.3-runtime-ubuntu22.04 nvidia-smi

If it fails: reinstall NVIDIA Container Toolkit and restart Docker.

Docker: CUDA Tag Not Found

Error: manifest for nvidia/cuda:12.6.3-runtime-ubuntu22.04 not found

Search for the correct tag on hub.docker.com/r/nvidia/cuda/tags and update the first line of Dockerfile.gpu.

Docker: Container Unhealthy on First Startup

Default Compose and Dockerfile healthchecks allow a 300s startup period for first-run model downloads. On slow networks or empty caches, raise the healthcheck start period above 300s in your custom Compose override, for example:

start_period: 600s

References

License

Follows the selected model licenses (BAAI/bge-m3, BAAI/bge-reranker-v2-m3, and optionally Qwen/Qwen3-Embedding-0.6B and Qwen/Qwen3-Reranker-0.6B).
