ml-platform / inference-server / Merge requests / !342
Open

feat: add streaming response support for LLM inference endpoint !342

priya_backend wants to merge feature/streaming-llm into main  ·  opened 3 days ago  ·  3 commits  ·  12 files changed
Pipeline #8821 passed  ·  4 jobs  ·  view details
✓ Ready to merge
priya_backend Author 3 days ago

This MR adds Server-Sent Events (SSE) based streaming to the /v1/inference endpoint, enabling low-latency, token-by-token delivery of LLM completions.

Changes:

  • New StreamManager class in inference/streaming.py — manages async generator lifecycle and client disconnects
  • Route handler updated to detect Accept: text/event-stream and switch response mode
  • Added X-Stream-Chunk-Size header support (default 512 tokens)
  • Integration tests for both streaming and non-streaming paths

Closes #318. Tested against llama-3-8b and mixtral-8x7b locally.
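For reviewers unfamiliar with SSE framing, here's a rough sketch of the event shape the endpoint emits. Names here (`format_sse_event`, `token_stream`) are illustrative only, not the actual StreamManager API:

```python
import asyncio
import json
from typing import AsyncIterator


def format_sse_event(payload: dict) -> str:
    """Frame one JSON payload as a server-sent event ("data: ...\n\n")."""
    return f"data: {json.dumps(payload)}\n\n"


async def token_stream(tokens: list[str]) -> AsyncIterator[str]:
    """Yield SSE-framed events, one per generated token."""
    for i, tok in enumerate(tokens):
        yield format_sse_event({"index": i, "token": tok})
    yield "data: [DONE]\n\n"  # common sentinel marking end-of-stream


async def collect(tokens: list[str]) -> list[str]:
    """Drain the stream into a list (test/demo helper)."""
    return [event async for event in token_stream(tokens)]
```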

Activity
sanjay_ml Maintainer 2 days ago

Architecture looks solid. LGTM overall — two minor nits before I approve:

  • Should we default chunk_size to 512 or 1024? Most of our clients are on 1G links; 1024 might reduce round-trips meaningfully.
  • The async generator cleanup in streaming.py:89 might leak if the client disconnects mid-stream before the generator raises StopAsyncIteration. Worth adding a try/finally around the yield.

Otherwise this is well-structured. Once these are addressed I'll approve.

wei_code_review 2 days ago

How are you handling backpressure when the consumer reads slower than the model generates? I don't see any buffer depth limiting in StreamManager. If we get a slow client we could accumulate unbounded memory in the async queue.

Suggest a bounded asyncio.Queue(maxsize=N) with a configurable max — maybe STREAM_BUFFER_DEPTH env var defaulting to 32?
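Rough sketch of the bounded-queue pattern (names hypothetical): the producer's `put()` awaits once the queue is full, so a slow consumer naturally throttles generation instead of growing memory without bound.

```python
import asyncio


async def producer(queue: asyncio.Queue, items: list) -> None:
    """Model side: put() blocks once the queue is full (backpressure)."""
    for item in items:
        await queue.put(item)
    await queue.put(None)  # sentinel: end of stream


async def slow_consumer(queue: asyncio.Queue) -> list:
    """Client side: drains items one at a time."""
    out = []
    while True:
        item = await queue.get()
        if item is None:
            break
        out.append(item)
        await asyncio.sleep(0)  # stand-in for a slow client write
    return out


async def run(depth: int, n: int) -> list:
    # Bounded: in-flight memory is capped at `depth` items regardless of n.
    queue = asyncio.Queue(maxsize=depth)
    prod = asyncio.create_task(producer(queue, list(range(n))))
    result = await slow_consumer(queue)
    await prod
    return result
```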

priya_backend Author 1 day ago

@wei_code_review — great catch, added a bounded asyncio.Queue(maxsize=32) in abc1234. Also added STREAM_BUFFER_DEPTH env var as you suggested.

@sanjay_ml defaulting to 512 per your recommendation — matches our existing chunk_size config pattern. The try/finally fix is in the same commit.
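For reference, the env-var parsing follows this shape (simplified sketch; the helper name is hypothetical, not the actual config code):

```python
import os


def stream_buffer_depth(default: int = 32) -> int:
    """Read STREAM_BUFFER_DEPTH from the environment, defaulting to 32."""
    raw = os.environ.get("STREAM_BUFFER_DEPTH")
    try:
        depth = int(raw) if raw is not None else default
    except ValueError:
        depth = default  # fall back on a non-numeric value
    return max(1, depth)  # guard against zero or negative depths
```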

tanisha_infra 14 hours ago

Heads up on resource limits: the streaming endpoint will need updated memory limits in the k8s deployment manifest. Current limit is 512Mi — with concurrent streams and the new buffer we risk OOMKilled pods under load.

Suggesting 1Gi minimum. I can raise a separate infra MR for the manifest update if that's easier to track.
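Something like this for the container's resources block (a sketch of the manifest fragment; the request value is a placeholder, not from the current deployment):

```yaml
# Deployment manifest fragment (illustrative); raises the memory ceiling
# to absorb concurrent streams plus the new per-stream buffer.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```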