Nov 28, 2025

Latency is UX: what I learned cutting long-chat latency 50%+

In AI products, performance isn't just a technical metric: it's the core of the experience. A 3-second wait for a chat response kills the flow. As we scaled to longer conversations, context-management overhead started bloating our Time To First Token (TTFT).

### The Strategy

We focused on three core areas (a rough sketch follows the list):

- **Context Pruning**: Instead of sending the last 50 messages, we send the last 10 plus a dynamically updated "working memory" summary of the preceding 40.
- **Streaming-First Architecture**: We moved all LLM calls to a streaming pattern, showing the user the response as it's generated, which drops perceived latency to near zero.
- **KV Cache Optimization**: On the backend, we ensured that shared system prompts were cached effectively to reduce prefill time.
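To make the first two points concrete, here's a minimal sketch of the pruning-plus-streaming pattern, assuming the OpenAI Python SDK. The `summarize_history` helper, the model names, and the 10-message window (`RECENT_WINDOW`) are illustrative assumptions, not our production code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RECENT_WINDOW = 10  # keep the last N messages verbatim


def summarize_history(older_messages: list[dict]) -> str:
    """Hypothetical helper: collapse older turns into a short 'working memory' summary.
    In practice this could be a cheap model call or an incrementally maintained summary."""
    joined = "\n".join(f"{m['role']}: {m['content']}" for m in older_messages)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Summarize the conversation so far in under 150 words."},
            {"role": "user", "content": joined},
        ],
    )
    return resp.choices[0].message.content


def build_pruned_context(system_prompt: str, history: list[dict]) -> list[dict]:
    """Context pruning: last RECENT_WINDOW messages verbatim, plus a summary of everything before them."""
    older, recent = history[:-RECENT_WINDOW], history[-RECENT_WINDOW:]
    messages = [{"role": "system", "content": system_prompt}]
    if older:
        messages.append({
            "role": "system",
            "content": f"Working memory (summary of earlier turns): {summarize_history(older)}",
        })
    return messages + recent


def stream_reply(system_prompt: str, history: list[dict]):
    """Streaming-first: yield tokens as they arrive so the user sees output immediately."""
    stream = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=build_pruned_context(system_prompt, history),
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

The KV cache piece lives in the serving layer rather than application code: in practice it means keeping the shared system-prompt prefix byte-identical across requests so the inference engine's prefix caching (e.g. vLLM's automatic prefix caching) can skip re-prefilling it.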

### Impact

These changes cut our average TTFT by over 50%, directly impacting user retention and session length.

Enjoyed this note?

I regularly share thoughts on building AI products and scaling engineering loops. Let's connect if you're shipping something interesting.