This MR adds server-sent events (SSE) streaming to the /v1/inference endpoint, enabling low-latency token-by-token delivery for LLM completions.
Changes:
- `StreamManager` class in `inference/streaming.py` — manages async generator lifecycle and client disconnects
- `Accept: text/event-stream` detection to switch the response into streaming mode
- `X-Stream-Chunk-Size` header support (default 512 tokens)

Closes #318. Tested against llama-3-8b and mixtral-8x7b locally.
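For reference, the header handling described above could look roughly like this — a hedged sketch, not the MR's actual code; the helper names and the `headers` dict shape are assumptions:

```python
# Sketch of the Accept-header switch and X-Stream-Chunk-Size parsing
# described in the MR summary. Function names are illustrative only.

DEFAULT_CHUNK_SIZE = 512  # tokens, per the MR default


def wants_stream(accept_header: str) -> bool:
    """Return True when the client asked for text/event-stream."""
    return "text/event-stream" in (accept_header or "")


def chunk_size_from_headers(headers: dict) -> int:
    """Parse X-Stream-Chunk-Size, falling back to the 512-token default."""
    try:
        return int(headers.get("X-Stream-Chunk-Size", DEFAULT_CHUNK_SIZE))
    except ValueError:
        return DEFAULT_CHUNK_SIZE
```

A request without the header would fall back to 512; a malformed value is ignored rather than rejected, though a 400 response would also be a reasonable choice.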
Architecture looks solid. LGTM overall — two minor nits before I approve:
- Should `chunk_size` default to 512 or 1024? Most of our clients are on 1G links; 1024 might reduce round-trips meaningfully.
- `streaming.py:89` might leak if the client disconnects mid-stream before the generator raises `StopAsyncIteration`. Worth adding a try/finally around the yield.

Otherwise this is well-structured. Once these are addressed I'll approve.
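The try/finally fix suggested for `streaming.py:89` could be sketched like this — `source` and `release_resources` are placeholders for whatever the real `StreamManager` wires in:

```python
# Hedged sketch of the cleanup fix: when the client disconnects,
# GeneratorExit is thrown into the generator at the suspended yield,
# and the finally block guarantees teardown still runs.
async def stream_tokens(source, release_resources):
    """Yield tokens, guaranteeing cleanup even if the consumer
    closes the generator mid-stream."""
    try:
        async for token in source:
            yield token
    finally:
        # Runs on normal exhaustion, GeneratorExit (client disconnect),
        # or any exception raised into the generator.
        release_resources()
```

Without the try/finally, a disconnect mid-stream exits via `GeneratorExit` and any cleanup placed after the loop never executes.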
How are you handling backpressure when the consumer reads slower than the model generates? I don't see any buffer depth limiting in StreamManager. If we get a slow client we could accumulate unbounded memory in the async queue.
Suggest a bounded asyncio.Queue(maxsize=N) with a configurable max — maybe STREAM_BUFFER_DEPTH env var defaulting to 32?
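The bounded-buffer idea could look something like the sketch below — the producer/consumer names are assumptions, but the backpressure mechanism is standard asyncio: `queue.put` blocks once the buffer is full, so a slow reader throttles generation instead of growing memory without bound.

```python
# Sketch of the STREAM_BUFFER_DEPTH suggestion: a bounded asyncio.Queue
# between the model (producer) and the SSE writer (consumer).
import asyncio
import os

BUFFER_DEPTH = int(os.environ.get("STREAM_BUFFER_DEPTH", "32"))


async def produce(queue: asyncio.Queue, tokens) -> None:
    """Model side: put() blocks once the buffer holds BUFFER_DEPTH
    items, applying backpressure to generation."""
    for token in tokens:
        await queue.put(token)
    await queue.put(None)  # sentinel: stream finished


async def consume(queue: asyncio.Queue) -> list:
    """Client side: drain tokens until the sentinel arrives."""
    out = []
    while (token := await queue.get()) is not None:
        out.append(token)
    return out


async def stream(tokens) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_DEPTH)
    producer = asyncio.create_task(produce(queue, tokens))
    result = await consume(queue)
    await producer
    return result
```

With `maxsize=32` and ~4-byte tokens, per-stream buffer memory stays in the kilobyte range regardless of client read speed.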
@wei_code_review — great catch, added a bounded asyncio.Queue(maxsize=32) in abc1234. Also added STREAM_BUFFER_DEPTH env var as you suggested.
@sanjay_ml — keeping the default at 512 to match our existing chunk_size config pattern; happy to revisit 1024 if round-trip overhead shows up in practice. The try/finally fix is in the same commit.
Heads up on resource limits: the streaming endpoint will need updated memory limits in the k8s deployment manifest. Current limit is 512Mi — with concurrent streams and the new buffer we risk OOMKilled pods under load.
Suggesting 1Gi minimum. I can raise a separate infra MR for the manifest update if that's easier to track.
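For the manifest update, the change would be roughly this fragment — illustrative only, the actual deployment spec and request values are assumptions:

```yaml
# Container resources for the inference deployment (sketch).
# limits.memory raised from 512Mi per the suggestion above.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```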