FastAPI: Optimizing for AI Workloads
Tips and tricks to make FastAPI shine with AI models.
1️⃣ Understand the AI bottlenecks first (before tuning FastAPI)
FastAPI itself is rarely the slow part. For AI APIs, the usual killers are:
- Model inference time (CPU/GPU)
- Memory pressure (loading models per request)
- Serialization (big tensors → JSON)
- Blocking code inside async routes
So tuning FastAPI = removing friction around inference, not micro-optimizing Python.
2️⃣ Load models ONCE (never inside requests)
❌ Wrong

@app.post("/predict")
async def predict():
    model = load_model()  # reloaded on every request
    return model.run()
✅ Correct

from contextlib import asynccontextmanager

from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.model = load_model()  # loaded once at startup
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict(request: Request):
    return request.app.state.model.run()
💡 Rule:
Model load = startup
Model run = request
This alone can yield a 10–100× latency improvement.
3️⃣ Async FastAPI ≠ async AI code (important)
Most AI libs are blocking:
- PyTorch
- TensorFlow
- NumPy
- OpenCV
❌ This blocks the event loop

@app.post("/predict")
async def predict():
    return model.run(input)  # blocking call stalls the event loop
✅ Run inference in a threadpool

from fastapi.concurrency import run_in_threadpool

@app.post("/predict")
async def predict():
    return await run_in_threadpool(model.run, input)
Why?
- Keeps FastAPI responsive
- Allows concurrent requests
- Prevents tail latency explosions
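Outside FastAPI, the same offloading pattern exists in the standard library as `asyncio.to_thread` (Python 3.9+). A minimal sketch, with `blocking_inference` as a stand-in for a real `model.run` call:

```python
import asyncio
import time

def blocking_inference(x):
    # stand-in for model.run(): blocking, CPU-bound work
    time.sleep(0.2)
    return x * 2

async def main():
    start = time.perf_counter()
    # both calls run in the default thread pool and overlap
    results = await asyncio.gather(
        asyncio.to_thread(blocking_inference, 1),
        asyncio.to_thread(blocking_inference, 2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)  # [2, 4]
print(round(elapsed, 1))  # ~0.2s, not 0.4s: the calls overlapped
```

Had the two calls run on the event loop directly, the total would be the sum of both sleeps.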
4️⃣ Batch requests (HUGE for AI APIs)
Instead of:
1 request → 1 inference
Do: N requests → 1 batched inference
Pattern
- Queue inputs
- Run inference every X ms or when batch is full
- Fan-out results
This gives:
- Better GPU utilization
- Lower infra cost
- Lower average latency
Used by:
- OpenAI-style APIs
- Recommendation systems
- Embedding services
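The queue/flush/fan-out pattern above can be sketched as a small asyncio batching worker. Everything here (`Batcher`, the `batch_fn` stand-in, the timings) is illustrative, not a production implementation:

```python
import asyncio

class Batcher:
    """Queue inputs, run one batched inference when max_batch inputs
    are waiting or max_wait seconds pass, then fan results back out."""

    def __init__(self, batch_fn, max_batch=8, max_wait=0.01):
        self.batch_fn = batch_fn    # fn(list of inputs) -> list of outputs
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue = asyncio.Queue()

    async def predict(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut            # resolved by the worker

    async def worker(self):
        while True:
            batch = [await self.queue.get()]          # block for first item
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])  # one batched call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def demo():
    # batch_fn is a stand-in for a real batched model call
    batcher = Batcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(5)))
    worker.cancel()
    return results

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8]
```

In a FastAPI app, the worker task would be started in the lifespan handler and each route would simply `await batcher.predict(item)`.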
5️⃣ Use the right server config (don't ignore this)
Production command (baseline)
uvicorn app:app \
--host 0.0.0.0 \
--port 8000 \
--workers 4
AI-specific rules
- CPU inference → more workers
- GPU inference → usually workers=1 per GPU
- Avoid Gunicorn workers fighting for GPU memory
💡 For GPU:
1 worker = 1 GPU
6️⃣ Avoid JSON for large tensors
JSON is slow + huge.
Better options
- MsgPack
- Protobuf
- Raw binary (for images/audio)
- Base64 only if unavoidable
Example:

from fastapi.responses import Response

@app.post("/embedding")
def embed(text: str):
    vector = model.embed(text)  # e.g. a NumPy float32 array
    return Response(
        content=vector.tobytes(),
        media_type="application/octet-stream",
    )
Latency drops massively for embeddings.
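To see why, compare payload sizes for a hypothetical 1024-dimensional float32 embedding, using only the standard library:

```python
import array
import json

# hypothetical 1024-dimensional float32 embedding
vector = array.array("f", [0.1] * 1024)

raw = vector.tobytes()
as_json = json.dumps(list(vector)).encode()

print(len(raw))                 # 4096 bytes: 4 bytes per float32
print(len(as_json) > len(raw))  # True: JSON text is several times larger
```

On the client side, the raw bytes decode back with `array.array("f").frombytes(raw)` (or `numpy.frombuffer`), with no text parsing at all.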
7️⃣ Cache aggressively (AI loves caching)
Cache levels
✅ Input → output cache (exact prompts)
✅ Embedding cache
✅ Feature cache
✅ Prompt template cache
Tools:
- Redis
- In-memory LRU
- Disk cache for embeddings
Even a 10–20% cache hit rate = big infra savings.
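An in-memory LRU for the embedding cache can be sketched with the standard library's `functools.lru_cache`; the embedding function here is a hashable stand-in, not a real model:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=10_000)
def cached_embed(text: str):
    # stand-in for a real model.embed(text) call (assumption)
    global calls
    calls += 1
    return tuple(float(ord(c)) for c in text)

cached_embed("hello world")
cached_embed("hello world")  # repeat prompt: served from the cache
print(calls)                           # 1: the "model" ran only once
print(cached_embed.cache_info().hits)  # 1 cache hit
```

For multi-process or multi-host deployments the same idea moves to Redis, since `lru_cache` is per-process only.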
8️⃣ Memory tuning (silent killer)
Watch for:
- Multiple model copies
- Tensor leaks
- Unreleased GPU memory
# PyTorch tips
import torch

with torch.no_grad():        # no autograd graph kept during inference
    output = model(x)

torch.cuda.empty_cache()     # release cached GPU blocks

Use:
- tracemalloc (CPU-side allocations)
- torch.cuda.memory_summary() (GPU-side breakdown)
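For the CPU side, `tracemalloc` from the standard library is enough to spot leaking handlers; a sketch with `bytearray` buffers standing in for tensors:

```python
import tracemalloc

tracemalloc.start()

# stand-in for a request handler allocating large tensors
buffers = [bytearray(1024 * 1024) for _ in range(8)]

current, peak = tracemalloc.get_traced_memory()
print(peak >= 8 * 1024 * 1024)  # True: peak tracks the ~8 MB allocated
tracemalloc.stop()
```

Snapshot the peak before and after a batch of requests: if it only ever grows, something is holding references to tensors.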
9️⃣ Observability (or you'll tune blind)
Must-have metrics:
- p50 / p95 / p99 latency
- Inference time vs total request time
- Queue wait time
- GPU utilization
- Memory usage
Stack:
- Prometheus
- Grafana
- OpenTelemetry
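If you have no metrics stack yet, latency percentiles can be computed from raw samples with the standard library; the sample latencies below are made up:

```python
import statistics

# made-up per-request latencies in milliseconds, with two slow outliers
latencies_ms = [12, 13, 13, 14, 14, 15, 15, 16, 180, 200]

# quantiles(n=100) returns the 99 percentile cut points
q = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = q[49], q[94], q[98]

print(p50)            # median: barely affected by the outliers
print(p95 > 10 * p50) # True: tail percentiles expose the slow requests
```

This is exactly why averages lie for AI APIs: the mean hides the requests stuck behind a cold model or a busy GPU, while p95/p99 surface them.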
Final words
FastAPI is a fantastic framework for building AI APIs, but the real performance wins come from understanding and optimizing around the unique challenges of AI workloads. Focus on model loading, async handling, batching, and proper serialization to get the best results.
Happy coding!