FastAPI: Optimizing for AI Workloads

Tips and tricks to make FastAPI shine with AI models.

1️⃣ Understand the AI bottlenecks first (before tuning FastAPI)

FastAPI itself is rarely the slow part. For AI APIs, the usual killers are:

  1. πŸ”₯ Model inference time (CPU/GPU)
  2. 🧠 Memory pressure (loading models per request 🀦)
  3. πŸ“¦ Serialization (big tensors β†’ JSON)
  4. 🧡 Blocking code inside async routes

So tuning FastAPI = removing friction around inference, not micro-optimizing Python.

2️⃣ Load models ONCE (never inside requests)

❌ Wrong
@app.post("/predict")
async def predict():
    model = load_model()  # πŸ’€ every request
    return model.run()

βœ… Correct
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.model = load_model()
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict():
    return app.state.model.run()

πŸ’‘ Rule:

Model load = startup
Model run = request

For heavyweight models, this alone can cut per-request latency by one or two orders of magnitude.

3️⃣ Async FastAPI β‰  async AI code (important)

Most AI libs are blocking:

  1. PyTorch
  2. TensorFlow
  3. NumPy
  4. OpenCV

❌ This blocks the event loop

@app.post("/predict")
async def predict():
    return model.run(input)  # blocking

βœ… Run inference in threadpool

from fastapi.concurrency import run_in_threadpool

@app.post("/predict")
async def predict():
    return await run_in_threadpool(model.run, input)

πŸ“Œ Why?

  • Keeps FastAPI responsive
  • Allows concurrent requests
  • Prevents tail latency explosions

4️⃣ Batch requests (HUGE for AI APIs)

Instead of: 1 request β†’ 1 inference

Do: N requests β†’ 1 batched inference

Pattern

  1. Queue inputs
  2. Run inference every X ms or when batch is full
  3. Fan-out results

This gives:

  • πŸš€ Better GPU utilization
  • πŸ’Έ Lower infra cost
  • ⏱ Lower average latency

Used by:

  • OpenAI-style APIs
  • Recommendation systems
  • Embedding services

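The queue-and-batch pattern above can be sketched with plain asyncio. This is a minimal, framework-free sketch: `MicroBatcher`, `batched_infer`, and the `max_batch`/`max_wait` parameters are illustrative names, not a real library API. In a FastAPI app, the route handler would simply `await batcher.submit(item)`.

```python
import asyncio

# Stand-in for real batched inference (doubles each input).
def batched_infer(inputs):
    return [x * 2 for x in inputs]

class MicroBatcher:
    """Collects requests and runs one batched inference when the batch
    is full or `max_wait` seconds have passed (assumed parameters)."""

    def __init__(self, max_batch=8, max_wait=0.01):
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        # Each caller gets a future that resolves when its result is ready.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        while True:
            # Block until at least one item arrives, then drain more
            # items until the batch is full or the deadline passes.
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            # In FastAPI, blocking inference would go through run_in_threadpool.
            outputs = batched_infer(inputs)
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    batcher = MicroBatcher()
    worker_task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    worker_task.cancel()
    return results

print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

The key design choice: callers never see the batching. Each request awaits its own future, so fan-out back to N responses is automatic.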

5️⃣ Use the right server config (don’t ignore this)

Production command (baseline)

uvicorn app:app \
  --host 0.0.0.0 \
  --port 8000 \
  --workers 4

AI-specific rules

  • CPU inference β†’ more workers
  • GPU inference β†’ usually workers=1 per GPU
  • Avoid Gunicorn workers fighting for GPU memory

πŸ’‘ For GPU:

1 worker = 1 GPU

6️⃣ Avoid JSON for large tensors

JSON is slow + huge.

Better options

  1. 🧬 MsgPack
  2. πŸ§ͺ Protobuf
  3. πŸ–ΌοΈ Binary (for images/audio)
  4. πŸ“¦ Base64 only if unavoidable

Example:

from fastapi.responses import Response

@app.post("/embedding")
def embed(text: str):
    vector = model.embed(text)  # e.g. a float32 NumPy array
    return Response(
        content=vector.tobytes(),
        media_type="application/octet-stream",
    )

Latency drops massively for embeddings.
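On the receiving end, the raw bytes can be turned back into an array with NumPy. A sketch assuming the server sends a float32 vector; the dtype (and shape, for multi-dimensional tensors) must be agreed on out of band:

```python
import numpy as np

# Server side: a float32 embedding serialized to raw bytes,
# i.e. what Response(content=...) sends over the wire.
vector = np.arange(4, dtype=np.float32)
payload = vector.tobytes()

# Client side: reconstruct the array from the response body.
# The dtype is an assumption shared between client and server.
restored = np.frombuffer(payload, dtype=np.float32)
print(restored.tolist())  # [0.0, 1.0, 2.0, 3.0]
```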

7️⃣ Cache aggressively (AI loves caching)

Cache levels

βœ… Input β†’ Output cache (exact prompts)

βœ… Embedding cache

βœ… Feature cache

βœ… Prompt template cache

Tools:

  • Redis
  • In-memory LRU
  • Disk cache for embeddings

Even a 10–20% cache hit rate means big infra savings.

8️⃣ Memory tuning (silent killer)

Watch for:

  • Multiple model copies
  • Tensor leaks
  • Unreleased GPU memory

# PyTorch tips: skip autograd bookkeeping during inference
with torch.no_grad():
    output = model(x)

# Release cached GPU memory back to the driver
torch.cuda.empty_cache()

To diagnose, use tracemalloc for the Python heap and:

torch.cuda.memory_summary()
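A quick sketch of how tracemalloc catches a Python-side leak between two snapshots (the allocation here is a simulated leak, not model code):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulated leak: allocations that are never released.
leak = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
# Largest allocation growth since the first snapshot, grouped by line.
top = after.compare_to(before, "lineno")[0]
print(top.size_diff > 0)  # True: the leak shows up in the diff
```

GPU-side leaks won't appear here; for those, diff `torch.cuda.memory_allocated()` across requests instead.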

9️⃣ Observability (or you’ll tune blind)

Must-have metrics:

  • p50 / p95 / p99 latency
  • Inference time vs total request time
  • Queue wait time
  • GPU utilization
  • Memory usage

Stack:

  1. Prometheus
  2. Grafana
  3. OpenTelemetry
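The "inference time vs total request time" split is the one most stacks miss. A minimal, dependency-free sketch of tracking it (`LatencyTracker` is an illustrative name; in production these samples would feed a Prometheus histogram):

```python
import statistics

class LatencyTracker:
    """Records total request time and inference time separately,
    so queue/serialization overhead is visible as the difference."""

    def __init__(self):
        self.total_ms = []
        self.inference_ms = []

    def record(self, total_ms, inference_ms):
        self.total_ms.append(total_ms)
        self.inference_ms.append(inference_ms)

    def percentile(self, samples, p):
        # statistics.quantiles with n=100 yields 99 percentile cut points.
        return statistics.quantiles(samples, n=100)[p - 1]

tracker = LatencyTracker()
for total in [10, 12, 11, 50, 13]:    # simulated request timings (ms)
    tracker.record(total, total - 5)  # assume 5 ms non-inference overhead

print(round(tracker.percentile(tracker.total_ms, 50)))  # 12 (median)
```

Note how the single 50 ms outlier barely moves the median but would dominate p99, which is exactly why you track p50, p95, and p99 separately.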

πŸ”Ÿ Final words

FastAPI is a fantastic framework for building AI APIs, but the real performance wins come from understanding and optimizing around the unique challenges of AI workloads. Focus on model loading, async handling, batching, and proper serialization to get the best results.

Happy coding! πŸš€