BlitzEmbedding - Building a Production-Ready Multi-Model Embedding Server

Introduction

Text embeddings have become critical infrastructure for NLP applications, but serving them at scale presents several engineering challenges. Models consume significant memory (1-8GB each), inference latency varies dramatically between CPU and GPU, and applications often need multiple specialized models available simultaneously.

After running embedding workloads in production and hitting the typical scaling bottlenecks, I built BlitzEmbedding to address these specific problems. This post covers the architecture decisions, implementation details, and lessons learned from building a multi-model embedding server that handles thousands of requests per second.

Problem Space

The core challenges I needed to solve:

  1. Memory Management: Transformer models are large. A typical sentence-transformer model consumes 1-2GB of VRAM. Loading/unloading models on every request kills performance.

  2. Hardware Heterogeneity: CPU inference works for small batches, but GPU acceleration becomes essential for larger workloads. The system needs to route intelligently.

  3. Model Diversity: Different use cases require different models. Search needs dense retrievers, reranking needs cross-encoders, and domain-specific tasks need specialized models.

  4. Cost Optimization: GPU compute is expensive. Keeping instances spinning when idle burns money.

Architecture Deep Dive

The system implements a distributed architecture built on Azure Container Apps, with intelligent request routing based on batch characteristics.

Request Flow and Routing Logic

def route_request(batch_size: int, model_type: str) -> str:
    if model_type in ["cross-encoder", "reranker"]:
        return "gpu_cluster"  # Cross-encoders are compute-intensive regardless of batch size
    # Small batches don't benefit enough from the GPU to justify routing there
    return "gpu_cluster" if batch_size >= 10 else "cpu_cluster"

The routing decision happens at the API Management layer. Small batches (fewer than 10 samples) go to CPU containers because the GPU warm-up overhead outweighs the throughput gain. Larger batches and compute-intensive models (cross-encoders, rerankers) route to GPU containers.
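
A few illustrative calls show the resulting placement; the model_type strings here are placeholders rather than the server's actual identifiers:

route_request(batch_size=4, model_type="dense-retriever")    # -> "cpu_cluster"
route_request(batch_size=4, model_type="cross-encoder")      # -> "gpu_cluster"
route_request(batch_size=256, model_type="dense-retriever")  # -> "gpu_cluster"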

Container Infrastructure

Each container runs a FastAPI server with the following structure:

import torch


class BlitzEmbeddingServer:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model_cache = ModelCache(max_models=4, device=self.device)
        self.batch_size = 1024 if self.device.type == "cuda" else 128

    async def embed(self, request: EmbedRequest) -> EmbedResponse:
        model = await self.model_cache.get_model(request.model_name)
        embeddings = await self._generate_embeddings(model, request.texts)
        return EmbedResponse(embeddings=embeddings)

CPU containers are sized with 4 vCPUs and 16GB RAM. GPU containers use the same 4 vCPU / 16GB RAM footprint, plus a T4 GPU. The GPU containers can scale to zero when unused, which typically saves up to 90% on compute costs for workloads with variable traffic.

Model Management with Ray

The most complex part of the system is model caching. I initially tried a simple LRU implementation but ran into issues with memory fragmentation and inconsistent eviction behavior under load.
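
For context, here is a minimal sketch of the kind of manual LRU cache I started with. It mirrors the ModelCache interface used by BlitzEmbeddingServer above, but the loading path and locking are simplified illustrations, not the production code:

import asyncio
from collections import OrderedDict

from sentence_transformers import SentenceTransformer


class ModelCache:
    """Naive LRU cache: evicts the least recently used model when full."""

    def __init__(self, max_models: int, device):
        self.max_models = max_models
        self.device = device
        self._models = OrderedDict()
        self._lock = asyncio.Lock()

    async def get_model(self, model_name: str) -> SentenceTransformer:
        async with self._lock:
            if model_name in self._models:
                self._models.move_to_end(model_name)  # mark as most recently used
                return self._models[model_name]
            if len(self._models) >= self.max_models:
                # Dropping the reference does not reliably free VRAM right away,
                # which is one source of the fragmentation issues mentioned above
                self._models.popitem(last=False)
            model = SentenceTransformer(model_name, device=str(self.device))
            self._models[model_name] = model
            return model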

Ray-Based Model Multiplexing

Ray’s model multiplexing solved these problems:

import starlette.requests
from ray import serve
from sentence_transformers import SentenceTransformer


@serve.deployment
class ModelInferencer:
    def __init__(self):
        self.model_path = "<path_to_model_in_blob_storage>"

    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Ray caches the return value per model_id and evicts via LRU,
        # so no manual cache bookkeeping is needed here
        model_weights = download_model(self.model_path)  # helper that fetches weights from blob storage
        return SentenceTransformer(model_weights)

    async def __call__(self, request: starlette.requests.Request):
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        payload = await request.json()  # assumes a JSON body like {"text": [...]}
        return model.encode(payload["text"])


entry = ModelInferencer.bind()

Ray’s LRU cache provides several advantages over a manual implementation:

  1. Automatic Memory Management: Ray tracks object references and handles garbage collection properly
  2. Distributed Caching: Models can be shared across multiple worker processes
  3. Memory Pressure Handling: Ray’s object store can spill to disk under memory pressure
  4. Consistent Eviction: The LRU policy works correctly even under concurrent access
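
On the client side, Ray Serve selects which model a request targets from a header. A minimal sketch of a call, where the URL and model ID are placeholders and the JSON body matches what the __call__ handler above expects:

import requests

# Ray Serve routes to (and lazily loads) the model named in this header
resp = requests.post(
    "http://localhost:8000/",
    json={"text": ["what is an embedding?"]},
    headers={"serve_multiplexed_model_id": "bge-small-en"},
)
embeddings = resp.json()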

Model Optimization Pipeline

Models undergo preprocessing before deployment:

import torch
from onnxruntime.quantization import QuantType, quantize_dynamic
from sentence_transformers import SentenceTransformer


def optimize_model_for_inference(model_path: str, target_device: str):
    if target_device == "cpu":
        # Convert to ONNX and quantize
        model = SentenceTransformer(model_path)
        model.save("optimized-model")
        # Quantize with ONNX Runtime (the export step that produces model.onnx is omitted here)
        quantize_dynamic(
            "model.onnx",
            "model_quantized.onnx",
            weight_type=QuantType.QInt8,
        )
    elif target_device == "gpu":
        # Load with FP16 and TensorRT optimization
        model = SentenceTransformer(model_path, device="cuda")
        model.half()  # Convert to FP16
        # TensorRT optimization happens at runtime
        model = torch.jit.script(model)

CPU models use INT8 quantization through ONNX Runtime, typically reducing memory usage by 75% with minimal accuracy loss. GPU models use FP16 precision, halving memory requirements with negligible accuracy impact for most use cases.
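
These savings follow directly from bytes per weight, which a quick back-of-the-envelope helper makes concrete (the parameter count below is an example, not a specific model):

def estimated_weight_memory_mb(num_params: int, bytes_per_weight: int) -> float:
    # Weight-only estimate; ignores activations and framework overhead
    return num_params * bytes_per_weight / 1024 ** 2


params = 110_000_000  # e.g. a ~110M-parameter encoder
print(estimated_weight_memory_mb(params, 4))  # FP32 -> ~420 MB
print(estimated_weight_memory_mb(params, 2))  # FP16 -> ~210 MB (half)
print(estimated_weight_memory_mb(params, 1))  # INT8 -> ~105 MB (75% smaller)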

Performance Engineering

Batching Strategy

The batching implementation varies by hardware target:

from typing import List

import numpy as np


class BatchProcessor:
    def __init__(self, device_type: str):
        self.batch_size = 1024 if device_type == "cuda" else 128
        self.max_wait_time = 50  # milliseconds to wait while accumulating a batch

    async def process_batch(self, texts: List[str]) -> np.ndarray:
        # Dynamic batching based on input size
        if len(texts) > self.batch_size:
            # Process in chunks
            results = []
            for i in range(0, len(texts), self.batch_size):
                chunk = texts[i:i + self.batch_size]
                embeddings = await self._embed_chunk(chunk)
                results.append(embeddings)
            return np.vstack(results)
        else:
            return await self._embed_chunk(texts)

GPU instances use large batch sizes (1024) to maximize throughput. The T4 has 16GB of memory, so we can fit large batches of most models. CPU instances use smaller batches (128) optimized for latency rather than throughput.
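
The max_wait_time field above belongs to the other half of the batching story: accumulating individual requests into a micro-batch before hitting the model. The production implementation isn't shown here, but a minimal sketch of that pattern looks like this (embed_fn stands in for the actual chunk-embedding call):

import asyncio
from typing import Tuple

import numpy as np


class MicroBatcher:
    """Collects single texts for up to max_wait_ms, then embeds them as one batch.

    Must be constructed inside a running event loop; embed_fn is an async
    callable mapping a list of strings to an np.ndarray with one row per text.
    """

    def __init__(self, embed_fn, max_wait_ms: int = 50, max_batch: int = 1024):
        self.embed_fn = embed_fn
        self.max_wait = max_wait_ms / 1000
        self.max_batch = max_batch
        self._queue: "asyncio.Queue[Tuple[str, asyncio.Future]]" = asyncio.Queue()
        self._task = asyncio.create_task(self._worker())

    async def embed(self, text: str) -> np.ndarray:
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def _worker(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]            # wait for the first item
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:           # fill until the window closes
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            embeddings = await self.embed_fn([text for text, _ in batch])
            for (_, fut), emb in zip(batch, embeddings):
                fut.set_result(emb)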

Data Serialization

I tested several serialization formats and MessagePack performed best:

Format        Size (MB)   Serialize (ms)   Deserialize (ms)
JSON          12.4        45               38
Pickle        8.7         23               19
MessagePack   8.2         15               12

import msgpack
import msgpack_numpy as m
import numpy as np

# Configure msgpack for numpy arrays
m.patch()


def serialize_embeddings(embeddings: np.ndarray) -> bytes:
    return msgpack.packb({
        'embeddings': embeddings,
        'shape': embeddings.shape,
        'dtype': str(embeddings.dtype)
    })


def deserialize_embeddings(data: bytes) -> np.ndarray:
    unpacked = msgpack.unpackb(data, raw=False)
    return unpacked['embeddings']

MessagePack with numpy extensions reduces payload size by 30-40% compared to JSON and serializes 3x faster.
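
A benchmark along these lines is enough to reproduce a comparison like the table above; the array shape is an assumption rather than the production payload:

import json
import pickle
import time

import msgpack
import msgpack_numpy as m
import numpy as np

m.patch()

# Example payload: 4096 embeddings of dimension 768
embeddings = np.random.rand(4096, 768).astype(np.float32)

serializers = {
    "json": lambda a: json.dumps(a.tolist()).encode(),
    "pickle": lambda a: pickle.dumps(a),
    "msgpack": lambda a: msgpack.packb(a),
}

for name, serialize in serializers.items():
    start = time.perf_counter()
    payload = serialize(embeddings)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name:8s} {len(payload) / 1e6:6.1f} MB  {elapsed_ms:6.1f} ms")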

Async Implementation

The server uses FastAPI with async handlers throughout:

@app.post("/embed")
async def embed_endpoint(request: EmbedRequest) -> EmbedResponse:
async with asyncio.Semaphore(4): # Limit concurrent requests
model = await model_cache.get_model(request.model_name)
# Process in parallel for large requests
if len(request.texts) > 1000:
chunks = [request.texts[i:i+500] for i in range(0, len(request.texts), 500)]
tasks = [process_chunk(model, chunk) for chunk in chunks]
results = await asyncio.gather(*tasks)
embeddings = np.vstack(results)
else:
embeddings = await process_texts(model, request.texts)
return EmbedResponse(
embeddings=embeddings.tolist(),
model=request.model_name,
usage={"tokens": sum(len(text.split()) for text in request.texts)}
)

The semaphore prevents memory exhaustion under high load. For large requests, we process chunks in parallel to maximize GPU utilization.
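
The process_chunk and process_texts helpers referenced in the handler aren't shown above. A plausible sketch, assuming a SentenceTransformer-style model whose synchronous encode() call is pushed off the event loop into a worker thread:

import asyncio
from typing import List

import numpy as np


async def process_chunk(model, texts: List[str]) -> np.ndarray:
    # encode() is CPU/GPU-bound and synchronous, so run it in a worker thread
    return await asyncio.to_thread(model.encode, texts, batch_size=128)


async def process_texts(model, texts: List[str]) -> np.ndarray:
    return await process_chunk(model, texts)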

Auto-scaling Configuration

Azure Container Apps scaling rules:

resources:
  cpu: 4
  memory: 16Gi
scale:
  minReplicas: 0
  maxReplicas: 10
  rules:
    - name: "http-requests"
      http:
        metadata:
          concurrentRequests: "10"
    - name: "cpu-utilization"
      custom:
        type: "cpu"
        metadata:
          type: "Utilization"
          value: "70"

GPU containers scale to zero after 5 minutes of no requests. Scale-up takes 20-30 seconds due to model loading, so we use predictive scaling based on traffic patterns.
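
In the absence of a native predictive scaler, even a crude scheduled warm-up request sent just before known traffic peaks keeps a GPU replica loaded when real requests arrive. A rough sketch; the peak hours, endpoint, and model name are all placeholders:

import datetime
import time

import requests

PEAK_HOURS_UTC = {8, 13, 17}                      # hours when traffic historically ramps up
WARMUP_URL = "https://<gpu-container-app>/embed"  # placeholder endpoint


def maybe_warm_up():
    now = datetime.datetime.now(datetime.timezone.utc)
    # Fire a tiny request ~5 minutes before each peak hour to trigger scale-from-zero
    if now.minute == 55 and (now.hour + 1) % 24 in PEAK_HOURS_UTC:
        requests.post(WARMUP_URL, json={"model_name": "<model>", "texts": ["warm up"]}, timeout=60)


while True:
    maybe_warm_up()
    time.sleep(60)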

Monitoring and Observability

Metrics are collected with Prometheus, application logs are shipped to Loki, and both are surfaced in Grafana dashboards.
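
A minimal sketch of the FastAPI-side instrumentation using prometheus_client; the metric names are examples rather than the server's actual ones, and GPU memory is exposed as a gauge because (see lesson 5 below) it's the number you want on a dashboard when chasing CUDA leaks:

import time

import torch
from fastapi import FastAPI
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrapes this endpoint

REQUESTS = Counter("embed_requests_total", "Embedding requests served")
LATENCY = Histogram("embed_latency_seconds", "End-to-end request latency")
GPU_MEMORY = Gauge("gpu_memory_allocated_bytes", "CUDA memory currently allocated")


@app.middleware("http")
async def track_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUESTS.inc()
    LATENCY.observe(time.perf_counter() - start)
    if torch.cuda.is_available():
        GPU_MEMORY.set(torch.cuda.memory_allocated())
    return response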

Performance Results

In production, the system delivers:

  • Round-trip P95 latency: 800-900 ms for small batches, 1000-2000 ms for batches of more than 1024 texts
  • Model switching: ~50 ms on a cache hit, 15-20 s on a cache miss (model loading)

The Ray-based caching reduced model loading overhead by 40% compared to the previous implementation, primarily due to better memory management and reduced GC pressure.

Lessons Learned

  1. Ray’s LRU cache is significantly more robust than manual implementations when dealing with large objects and concurrent access patterns.

  2. GPU auto-scaling works well for batch workloads but requires careful tuning of scale-up triggers to avoid cold starts during traffic spikes.

  3. MessagePack serialization is worth the complexity for high-throughput embedding workloads where network I/O becomes a bottleneck.

  4. Pre-loading models into container images dramatically improves cold start performance but increases image size (4-6GB per image); a minimal pre-download sketch follows this list.

  5. Monitoring GPU memory usage is critical - CUDA memory leaks can crash containers and are difficult to debug in production.
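
For lesson 4, the pre-loading is just a download step executed at image build time (for example from a Dockerfile RUN instruction) so the weights end up cached inside the image layer. A minimal sketch; the model names are placeholders:

# predownload.py - executed at image build time, e.g. RUN python predownload.py
from sentence_transformers import SentenceTransformer

MODELS = ["<dense-retriever-model>", "<cross-encoder-model>"]  # placeholder model list

for name in MODELS:
    # Instantiating the model downloads its weights into the image's local cache
    SentenceTransformer(name)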

The system has been running in production for 12 months and has processed millions of embedding requests with 99.99% uptime. The architecture scales well, and the cost optimization through auto-scaling has proven effective for variable workloads.