This MR adds server-sent events (SSE) streaming to the /v1/inference endpoint, enabling low-latency token-by-token delivery for LLM completions.
Changes:
- `StreamManager` class in `inference/streaming.py` — manages async generator lifecycle and client disconnects
- `Accept: text/event-stream` detection to switch the response into streaming mode
- `X-Stream-Chunk-Size` header support (default 512 tokens)

Closes #318. Tested against llama-3-8b and mixtral-8x7b locally.
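For reference, the header handling described above could look roughly like this — a hedged sketch, not the MR's actual code; the helper names and the `headers` dict shape are assumptions:

```python
# Sketch of the Accept-header switch and X-Stream-Chunk-Size parsing
# described in the MR summary. Function names are illustrative only.

DEFAULT_CHUNK_SIZE = 512  # tokens, per the MR default


def wants_stream(accept_header: str) -> bool:
    """Return True when the client asked for text/event-stream."""
    return "text/event-stream" in (accept_header or "")


def chunk_size_from_headers(headers: dict) -> int:
    """Parse X-Stream-Chunk-Size, falling back to the 512-token default."""
    try:
        return int(headers.get("X-Stream-Chunk-Size", DEFAULT_CHUNK_SIZE))
    except ValueError:
        return DEFAULT_CHUNK_SIZE
```

A request without the header would fall back to 512; a malformed value is ignored rather than rejected, though a 400 response would also be a reasonable choice.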
Architecture looks solid. LGTM overall — two minor nits before I approve:
- Should `chunk_size` default to 512 or 1024? Most of our clients are on 1G links; 1024 might reduce round-trips meaningfully.
- `streaming.py:89` might leak if the client disconnects mid-stream before the generator raises `StopAsyncIteration`. Worth adding a try/finally around the yield.

Otherwise this is well-structured. Once these are addressed I'll approve.
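The try/finally fix suggested for `streaming.py:89` could be sketched like this — `source` and `release_resources` are placeholders for whatever the real `StreamManager` wires in:

```python
# Hedged sketch of the cleanup fix: when the client disconnects,
# GeneratorExit is thrown into the generator at the suspended yield,
# and the finally block guarantees teardown still runs.
async def stream_tokens(source, release_resources):
    """Yield tokens, guaranteeing cleanup even if the consumer
    closes the generator mid-stream."""
    try:
        async for token in source:
            yield token
    finally:
        # Runs on normal exhaustion, GeneratorExit (client disconnect),
        # or any exception raised into the generator.
        release_resources()
```

Without the try/finally, a disconnect mid-stream exits via `GeneratorExit` and any cleanup placed after the loop never executes.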
How are you handling backpressure when the consumer reads slower than the model generates? I don't see any buffer depth limiting in StreamManager. If we get a slow client we could accumulate unbounded memory in the async queue.
Suggest a bounded asyncio.Queue(maxsize=N) with a configurable max — maybe STREAM_BUFFER_DEPTH env var defaulting to 32?
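The bounded-buffer idea could look something like the sketch below — the producer/consumer names are assumptions, but the backpressure mechanism is standard asyncio: `queue.put` blocks once the buffer is full, so a slow reader throttles generation instead of growing memory without bound.

```python
# Sketch of the STREAM_BUFFER_DEPTH suggestion: a bounded asyncio.Queue
# between the model (producer) and the SSE writer (consumer).
import asyncio
import os

BUFFER_DEPTH = int(os.environ.get("STREAM_BUFFER_DEPTH", "32"))


async def produce(queue: asyncio.Queue, tokens) -> None:
    """Model side: put() blocks once the buffer holds BUFFER_DEPTH
    items, applying backpressure to generation."""
    for token in tokens:
        await queue.put(token)
    await queue.put(None)  # sentinel: stream finished


async def consume(queue: asyncio.Queue) -> list:
    """Client side: drain tokens until the sentinel arrives."""
    out = []
    while (token := await queue.get()) is not None:
        out.append(token)
    return out


async def stream(tokens) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_DEPTH)
    producer = asyncio.create_task(produce(queue, tokens))
    result = await consume(queue)
    await producer
    return result
```

With `maxsize=32` and ~4-byte tokens, per-stream buffer memory stays in the kilobyte range regardless of client read speed.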
@wei_code_review — great catch, added a bounded asyncio.Queue(maxsize=32) in abc1234. Also added STREAM_BUFFER_DEPTH env var as you suggested.
@sanjay_ml — keeping the default at 512 to match our existing chunk_size config pattern; happy to revisit 1024 if round-trip overhead shows up in practice. The try/finally fix is in the same commit.
Heads up on resource limits: the streaming endpoint will need updated memory limits in the k8s deployment manifest. Current limit is 512Mi — with concurrent streams and the new buffer we risk OOMKilled pods under load.
Suggesting 1Gi minimum. I can raise a separate infra MR for the manifest update if that's easier to track.
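For the manifest update, the change would be roughly this fragment — illustrative only, the actual deployment spec and request values are assumptions:

```yaml
# Container resources for the inference deployment (sketch).
# limits.memory raised from 512Mi per the suggestion above.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```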