BlitzEmbedding - Building a Production-Ready Multi-Model Embedding Server
Introduction
Text embeddings have become critical infrastructure for NLP applications, but serving them at scale presents several engineering challenges. Models consume significant memory (1-8GB each), inference latency varies dramatically between CPU and GPU, and applications often need multiple specialized models available simultaneously.
After running embedding workloads in production and hitting the typical scaling bottlenecks, I built BlitzEmbedding to address these specific problems. This post covers the architecture decisions, implementation details, and lessons learned from building a multi-model embedding server that handles thousands of requests per second.
Problem Space
The core challenges I needed to solve:
- Memory Management: Transformer models are large. A typical sentence-transformer model consumes 1-2GB of VRAM. Loading/unloading models on every request kills performance.
- Hardware Heterogeneity: CPU inference works for small batches, but GPU acceleration becomes essential for larger workloads. The system needs to route intelligently.
- Model Diversity: Different use cases require different models. Search needs dense retrievers, reranking needs cross-encoders, and domain-specific tasks need specialized models.
- Cost Optimization: GPU compute is expensive. Keeping instances spinning when idle burns money.
Architecture Deep Dive
The system implements a distributed architecture built on Azure Container Apps, with intelligent request routing based on batch characteristics.
Request Flow and Routing Logic
```python
def route_request(batch_size: int, model_type: str) -> str:
    if batch_size < 10:
        return "cpu_cluster"
    elif model_type in ["cross-encoder", "reranker"]:
        return "gpu_cluster"  # Cross-encoders are compute-intensive
    else:
        return "gpu_cluster" if batch_size >= 10 else "cpu_cluster"
```

The routing decision happens at the API Management layer. Small batches (< 10 samples) go to CPU containers because the GPU warm-up overhead doesn't justify the compute benefit. Large batches and compute-intensive models (cross-encoders) route to GPU containers.
Container Infrastructure
Each container runs a FastAPI server with the following structure:
```python
class BlitzEmbeddingServer:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model_cache = ModelCache(max_models=4, device=self.device)
        self.batch_size = 1024 if self.device.type == "cuda" else 128

    async def embed(self, request: EmbedRequest) -> EmbedResponse:
        model = await self.model_cache.get_model(request.model_name)
        embeddings = await self._generate_embeddings(model, request.texts)
        return EmbedResponse(embeddings=embeddings)
```

CPU containers are sized with 4 vCPUs and 16GB RAM. GPU containers use the same 4 vCPUs and 16GB RAM, but add a T4 GPU. The GPU containers can scale to zero when unused, which typically saves up to 90% on compute costs for workloads with variable traffic.
Model Management with Ray
The most complex part of the system is model caching. I initially tried a simple LRU implementation but ran into issues with memory fragmentation and inconsistent eviction behavior under load.
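For context, the manual approach looked roughly like the sketch below. This is a simplified illustration rather than the original code; the `OrderedDict`-based eviction and the `max_models` parameter are assumptions that mirror the `ModelCache` interface used earlier.

```python
from collections import OrderedDict

from sentence_transformers import SentenceTransformer

class NaiveLRUModelCache:
    """Simplified sketch of a manual LRU model cache (illustrative only)."""

    def __init__(self, max_models: int = 4, device: str = "cuda"):
        self.max_models = max_models
        self.device = device
        self._cache: "OrderedDict[str, SentenceTransformer]" = OrderedDict()

    def get_model(self, model_name: str) -> SentenceTransformer:
        if model_name in self._cache:
            # Mark as most recently used
            self._cache.move_to_end(model_name)
            return self._cache[model_name]

        # Evict the least recently used model when the cache is full
        if len(self._cache) >= self.max_models:
            self._cache.popitem(last=False)

        model = SentenceTransformer(model_name, device=self.device)
        self._cache[model_name] = model
        return model
```

Dropping the Python reference to an evicted model does not guarantee the CUDA allocator releases that memory right away, which is one reason this style of cache can exhibit the fragmentation and inconsistent eviction behavior described above.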
Ray-Based Model Multiplexing
Ray’s model multiplexing solved these problems:
```python
import starlette.requests
from ray import serve
from sentence_transformers import SentenceTransformer

@serve.deployment
class ModelInferencer:
    def __init__(self):
        self.model_path = "<path_to_model_in_blob_storage>"

    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Ray's multiplexing decorator handles the LRU caching; this method
        # only runs on a cache miss and just needs to load the model.
        model_weights = download_model(self.model_path)  # helper for pulling weights from blob storage
        return SentenceTransformer(model_weights)

    async def __call__(self, request: starlette.requests.Request):
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        payload = await request.json()
        return model.encode(payload["text"])

entry = ModelInferencer.bind()
```

Ray's LRU cache provides several advantages over a manual implementation:
- Automatic Memory Management: Ray tracks object references and handles garbage collection properly
- Distributed Caching: Models can be shared across multiple worker processes
- Memory Pressure Handling: Ray’s object store can spill to disk under memory pressure
- Consistent Eviction: The LRU policy works correctly even under concurrent access
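On the caller side, Ray Serve routes a request to the correct cached model based on the `serve_multiplexed_model_id` request header. Below is a minimal sketch of a client call, assuming the deployment above is exposed at a hypothetical `http://localhost:8000/`; the model id `all-MiniLM-L6-v2` is only an example.

```python
import requests

# The header tells Ray Serve which model this request targets; the replica
# loads it on first use and serves it from the LRU cache afterwards.
response = requests.post(
    "http://localhost:8000/",
    json={"text": "How do I reset my password?"},
    headers={"serve_multiplexed_model_id": "all-MiniLM-L6-v2"},
)
print(response.json())
```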
Model Optimization Pipeline
Models undergo preprocessing before deployment:
```python
import torch
from onnxruntime.quantization import QuantType, quantize_dynamic
from sentence_transformers import SentenceTransformer

def optimize_model_for_inference(model_path: str, target_device: str):
    if target_device == "cpu":
        # Convert to ONNX and quantize
        model = SentenceTransformer(model_path)
        model.save_to_hub("optimized-model", create_pr=False)

        # Quantize with ONNX Runtime (INT8 weights)
        quantize_dynamic(
            model_input="model.onnx",
            model_output="model_quantized.onnx",
            weight_type=QuantType.QInt8,
        )

    elif target_device == "gpu":
        # Load with FP16 and TensorRT optimization
        model = SentenceTransformer(model_path, device="cuda")
        model.half()  # Convert to FP16

        # TensorRT optimization happens at runtime
        model = torch.jit.script(model)
```

CPU models use INT8 quantization through ONNX Runtime, typically reducing memory usage by 75% with minimal accuracy loss. GPU models use FP16 precision, halving memory requirements with accuracy that is effectively indistinguishable from full precision for most use cases.
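The quantized ONNX graph is then served with ONNX Runtime on the CPU containers. The sketch below shows roughly what inference against it looks like, assuming the exported graph takes `input_ids` and `attention_mask` and returns token-level embeddings that still need mean pooling; the model name and file paths are illustrative.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

def embed_cpu(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # First output: token-level embeddings of shape (batch, seq_len, hidden)
    token_embeddings = session.run(
        None,
        {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]},
    )[0]
    # Mean pooling over non-padding tokens
    mask = encoded["attention_mask"][:, :, None].astype(np.float32)
    return (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
```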
Performance Engineering
Batching Strategy
The batching implementation varies by hardware target:
```python
from typing import List

import numpy as np

class BatchProcessor:
    def __init__(self, device_type: str):
        self.batch_size = 1024 if device_type == "cuda" else 128
        self.max_wait_time = 50  # milliseconds

    async def process_batch(self, texts: List[str]) -> np.ndarray:
        # Dynamic batching based on input size
        if len(texts) > self.batch_size:
            # Process in chunks
            results = []
            for i in range(0, len(texts), self.batch_size):
                chunk = texts[i:i + self.batch_size]
                embeddings = await self._embed_chunk(chunk)
                results.append(embeddings)
            return np.vstack(results)
        else:
            return await self._embed_chunk(texts)
```

GPU instances use large batch sizes (1024) to maximize throughput; the T4 has 16GB of memory, so large batches fit for most models. CPU instances use smaller batches (128), optimized for latency rather than throughput.
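The `max_wait_time` field supports request coalescing: rather than embedding each request on its own, the processor can wait up to 50 ms to fill a batch from concurrent callers. The loop below is a simplified sketch of that idea; the queue of `(text, future)` pairs and the future-based handoff are assumptions about the surrounding code, not the exact production implementation.

```python
import asyncio

async def batching_loop(processor: "BatchProcessor", queue: asyncio.Queue):
    """Coalesce individual requests into batches of up to processor.batch_size."""
    while True:
        texts, futures = [], []
        text, future = await queue.get()  # Block until at least one request arrives
        texts.append(text)
        futures.append(future)

        deadline = asyncio.get_running_loop().time() + processor.max_wait_time / 1000
        while len(texts) < processor.batch_size:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                text, future = await asyncio.wait_for(queue.get(), timeout=timeout)
            except asyncio.TimeoutError:
                break
            texts.append(text)
            futures.append(future)

        # Embed the whole batch once, then hand each caller its row
        embeddings = await processor.process_batch(texts)
        for i, fut in enumerate(futures):
            fut.set_result(embeddings[i])
```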
Data Serialization
I tested several serialization formats and MessagePack performed best:
| Format | Size (MB) | Serialize (ms) | Deserialize (ms) |
|---|---|---|---|
| JSON | 12.4 | 45 | 38 |
| Pickle | 8.7 | 23 | 19 |
| MessagePack | 8.2 | 15 | 12 |
```python
import msgpack
import msgpack_numpy as m
import numpy as np

# Configure msgpack to handle numpy arrays
m.patch()

def serialize_embeddings(embeddings: np.ndarray) -> bytes:
    return msgpack.packb({
        'embeddings': embeddings,
        'shape': embeddings.shape,
        'dtype': str(embeddings.dtype),
    })

def deserialize_embeddings(data: bytes) -> np.ndarray:
    unpacked = msgpack.unpackb(data, raw=False)
    return unpacked['embeddings']
```

MessagePack with numpy extensions reduces payload size by 30-40% compared to JSON and serializes 3x faster.
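On the wire, the serialized bytes are returned directly instead of a JSON body. A minimal usage sketch with FastAPI follows; the demo endpoint, the placeholder embeddings, and the `application/x-msgpack` content type are illustrative, not the production route.

```python
import numpy as np
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/embed/demo")
async def embed_demo() -> Response:
    # Placeholder embeddings; in the real endpoint these come from the model
    embeddings = np.random.rand(4, 384).astype(np.float32)
    return Response(
        content=serialize_embeddings(embeddings),
        media_type="application/x-msgpack",
    )
```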
Async Implementation
The server uses FastAPI with async handlers throughout:
@app.post("/embed")async def embed_endpoint(request: EmbedRequest) -> EmbedResponse: async with asyncio.Semaphore(4): # Limit concurrent requests model = await model_cache.get_model(request.model_name)
# Process in parallel for large requests if len(request.texts) > 1000: chunks = [request.texts[i:i+500] for i in range(0, len(request.texts), 500)] tasks = [process_chunk(model, chunk) for chunk in chunks] results = await asyncio.gather(*tasks) embeddings = np.vstack(results) else: embeddings = await process_texts(model, request.texts)
return EmbedResponse( embeddings=embeddings.tolist(), model=request.model_name, usage={"tokens": sum(len(text.split()) for text in request.texts)} )The semaphore prevents memory exhaustion under high load. For large requests, we process chunks in parallel to maximize GPU utilization.
Auto-scaling Configuration
Azure Container Apps scaling rules:
```yaml
resources:
  cpu: 4
  memory: 16Gi
scale:
  minReplicas: 0
  maxReplicas: 10
  rules:
    - name: "http-requests"
      http:
        metadata:
          concurrentRequests: "10"
    - name: "cpu-utilization"
      custom:
        type: "cpu"
        metadata:
          type: "Utilization"
          value: "70"
```

GPU containers scale to zero after 5 minutes of no requests. Scale-up takes 20-30 seconds due to model loading, so we use predictive scaling based on traffic patterns.
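One simple way to implement that kind of predictive scaling is a scheduled warm-up request shortly before known traffic windows, so the first real request does not pay the 20-30 second cold start. The sketch below shows such a warm-up script; the endpoint URL, model name, and the choice of scheduler (cron, an Azure Functions timer, etc.) are assumptions for illustration.

```python
import requests

EMBED_ENDPOINT = "https://<your-container-app>/embed"  # illustrative URL

def prewarm(model_name: str = "all-MiniLM-L6-v2") -> None:
    """Send a minimal request so the GPU container scales up and loads the model."""
    payload = {"model_name": model_name, "texts": ["warm-up"]}
    response = requests.post(EMBED_ENDPOINT, json=payload, timeout=120)
    response.raise_for_status()

if __name__ == "__main__":
    prewarm()
```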
Monitoring and Observability
Metrics are collected with Prometheus, application logs are shipped to Loki, and both are visualized in Grafana.
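For Prometheus, the FastAPI process exposes a `/metrics` endpoint and records per-request latency and volume. Below is a minimal sketch using the `prometheus_client` library; the metric names and the `instrumented_embed` wrapper are illustrative rather than the exact production instrumentation.

```python
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrapes this endpoint

EMBED_LATENCY = Histogram(
    "embed_request_latency_seconds", "End-to-end embedding latency", ["model", "device"]
)
EMBED_TEXTS = Counter("embed_texts_total", "Number of texts embedded", ["model"])

async def instrumented_embed(model_name: str, device: str, texts, embed_fn):
    """Wrap an embedding call with latency and volume metrics."""
    start = time.perf_counter()
    try:
        return await embed_fn(texts)
    finally:
        EMBED_LATENCY.labels(model=model_name, device=device).observe(time.perf_counter() - start)
        EMBED_TEXTS.labels(model=model_name).inc(len(texts))
```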
Performance Results
In production, the system shows the following performance characteristics:
- Round-trip P95 latency: 800-900 ms for small batches, 1000-2000 ms for batches of more than 1024 texts
- Model switching: 50 ms on a cache hit, 15-20 s on a cache miss (model loading)
The Ray-based caching reduced model loading overhead by 40% compared to the previous implementation, primarily due to better memory management and reduced GC pressure.
Lessons Learned
- Ray's LRU cache is significantly more robust than manual implementations when dealing with large objects and concurrent access patterns.
- GPU auto-scaling works well for batch workloads but requires careful tuning of scale-up triggers to avoid cold starts during traffic spikes.
- MessagePack serialization is worth the complexity for high-throughput embedding workloads where network I/O becomes a bottleneck.
- Pre-loading models into container images dramatically improves cold start performance but increases image size (4-6GB per image).
- Monitoring GPU memory usage is critical: CUDA memory leaks can crash containers and are difficult to debug in production. A lightweight way to track this is sketched after this list.
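One lightweight approach is to log PyTorch's allocator statistics alongside each batch and warn as usage approaches capacity. The sketch below uses standard `torch.cuda` counters; the logger name and warning threshold are assumptions.

```python
import logging

import torch

logger = logging.getLogger("blitzembedding.gpu")

def log_gpu_memory(tag: str, warn_fraction: float = 0.9) -> None:
    """Log allocated/reserved CUDA memory and warn when usage approaches the limit."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    total = torch.cuda.get_device_properties(0).total_memory
    logger.info("%s: allocated=%.2fGB reserved=%.2fGB total=%.2fGB",
                tag, allocated / 1e9, reserved / 1e9, total / 1e9)
    if reserved / total > warn_fraction:
        logger.warning("%s: GPU memory usage above %.0f%% of capacity", tag, warn_fraction * 100)
```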
The system has been running in production for 12 months and has processed millions of embedding requests with 99.99% uptime. The architecture scales well, and the cost optimization through auto-scaling has proven effective for variable workloads.