Why Your LLM Inference Is Slow: vLLM, KV Cache, and GPU Fit

AI Infrastructure | 9 min read | 2026-06-15

The most common LLM serving mistake is assuming that a model "fits" because the weights loaded once. Production inference is not just model weights. It is weights, KV cache, batch shape, context length, decode length, runtime overhead, and the way your traffic arrives.

That is why a demo can feel fast on a rented GPU and then slow down the moment real users arrive. The model did not suddenly get worse. The memory math changed.

The model fitting is only step one

Weights are the easy part to estimate. A 7B model in FP16 uses 2 bytes per parameter, so it needs roughly 14GB before overhead. Quantization reduces that: at 4 bits per parameter, the same weights come to roughly 3.5GB before quantization metadata. But serving is not done after weights load. Every active request also needs KV cache so the model can reuse attention state while generating tokens.

Longer prompts, longer outputs, and more concurrent sequences all increase KV cache pressure. If you size the GPU only around weight memory, you leave no room for real traffic.
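The arithmetic behind that warning can be sketched in a few lines. This is a rough estimate, not a measurement: the layer count, KV head count, and head dimension below assume a Llama-2-7B-style shape with an FP16 cache, and the function names are hypothetical. Read the real values from your model's config before trusting the result.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, before runtime overhead."""
    return num_params * bytes_per_param


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token of one sequence.

    The leading 2 counts one key vector and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 2 ** 30

# 7B parameters at 2 bytes each (FP16).
weights_gb = weight_bytes(7e9, 2) / 1e9

# ASSUMPTION: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# One sequence holding 4096 tokens of context, then ten of them at once.
per_sequence_gib = per_token * 4096 / GIB
ten_sequences_gib = 10 * per_sequence_gib

print(f"weights: {weights_gb:.1f} GB")
print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache per 4096-token sequence: {per_sequence_gib:.1f} GiB")
print(f"KV cache for ten such sequences: {ten_sequences_gib:.1f} GiB")
```

Under these assumptions, ten long conversations need more cache than the weights themselves, which is why sizing around weight memory alone falls short.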

KV cache is where latency surprises start

KV cache grows linearly with context length and with the number of active sequences. A single short prompt may look fine. Ten users with long chat history can push the same endpoint into memory pressure. The visible symptom is not always an obvious crash. Sometimes it is worse: lower tokens per second, higher time-to-first-token, queue buildup, and random-looking slow requests.

Modern runtimes such as vLLM are built to manage this more efficiently, but they still need a realistic budget. vLLM documentation calls out GPU memory utilization (gpu_memory_utilization), maximum batched tokens (max_num_batched_tokens), and maximum sequences (max_num_seqs) as important tuning levers. Those settings decide how aggressively the server uses memory for throughput versus safety.
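As a minimal sketch, the three levers can be collected into one dictionary of engine arguments. The helper name and every numeric value here are illustrative assumptions, not recommendations. The keys are meant to match vLLM's engine arguments; check them against the vLLM version you run.

```python
def engine_kwargs(gpu_memory_utilization: float,
                  max_num_seqs: int,
                  max_num_batched_tokens: int) -> dict:
    """Bundle the three tuning levers into keyword arguments (hypothetical helper)."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_num_seqs < 1 or max_num_batched_tokens < 1:
        raise ValueError("sequence and token limits must be positive")
    return {
        # Fraction of VRAM the engine may claim for weights plus KV cache.
        "gpu_memory_utilization": gpu_memory_utilization,
        # Upper bound on sequences scheduled together.
        "max_num_seqs": max_num_seqs,
        # Upper bound on tokens processed in one scheduler step.
        "max_num_batched_tokens": max_num_batched_tokens,
    }


# ASSUMPTION: illustrative values only; tune against your own traffic.
kwargs = engine_kwargs(gpu_memory_utilization=0.90,
                       max_num_seqs=64,
                       max_num_batched_tokens=8192)
print(kwargs)
```

With vLLM installed, a dictionary like this could be passed along the lines of `LLM(model="<your-model>", **kwargs)`. The server command line has matching flags: `--gpu-memory-utilization`, `--max-num-seqs`, and `--max-num-batched-tokens`.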

Time-to-first-token and decode speed are different problems

When users complain that the model is slow, ask which part is slow. Time-to-first-token usually points to prefill cost: prompt length, queueing, batching, and context. Slow generation after the first token points more toward decode throughput, output length, and how many active sequences are sharing the GPU.

A 16K prompt with a short answer and a 1K prompt with a 4K answer stress the system differently. Treating both as "one request" hides the real bottleneck.
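One way to keep the two numbers apart is to record when each streamed token arrives and compute them separately. The sketch below assumes you already have a send timestamp and a list of per-token arrival times from your client. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft_s: float               # time from request sent to first token
    decode_tokens_per_s: float  # generation rate after the first token


def summarize_stream(request_sent_s: float, token_times_s: list) -> StreamTiming:
    """Split one streamed response into prefill-side and decode-side timing."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    # Decode rate excludes the first token, whose latency is dominated
    # by queueing and prefill.
    decode_window = token_times_s[-1] - token_times_s[0]
    decoded = len(token_times_s) - 1
    rate = decoded / decode_window if decode_window > 0 else 0.0
    return StreamTiming(ttft_s=ttft, decode_tokens_per_s=rate)


# Hypothetical timestamps in seconds: a slow first token, then steady decode.
timing = summarize_stream(0.0, [2.0, 2.25, 2.5, 2.75, 3.0])
print(timing)
```

In this made-up trace the request waits two seconds for its first token but then decodes at four tokens per second, so the fix would target prefill or queueing rather than decode throughput.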

The wrong GPU can look cheap until it spills

If the model, runtime, and KV cache do not comfortably fit in VRAM, the server may preempt requests, offload to CPU memory, or thrash. That is when a cheaper GPU becomes expensive. You pay for time while tokens crawl. A larger GPU can be cheaper for the same workload if it finishes faster and avoids failed retries.
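The comparison becomes concrete once hourly price is converted into cost per token. The prices and throughput figures below are made up purely for illustration and do not describe any real GPU or provider. The function name is hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and sustained throughput into cost per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# ASSUMPTION: invented numbers. A small GPU under memory pressure
# versus a larger GPU with room for the KV cache.
small_gpu = cost_per_million_tokens(price_per_hour=0.50, tokens_per_second=100)
large_gpu = cost_per_million_tokens(price_per_hour=2.00, tokens_per_second=1000)

print(f"small GPU: ${small_gpu:.2f} per 1M tokens")
print(f"large GPU: ${large_gpu:.2f} per 1M tokens")
```

With these invented figures, the GPU that costs four times as much per hour is the cheaper one per token. Measure throughput under your own request shape before drawing that conclusion for real hardware.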

This is the practical reason to test with realistic context length before choosing between L4, RTX 4090, A100, H100, or a hosted model API. The right answer depends on the request shape, not just the model name.

Batching helps until it does not

Batching improves utilization by letting the GPU serve multiple requests together. But every extra active sequence consumes memory. Push batch settings too high and you can create exactly the memory pressure you were trying to avoid.
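A rough upper bound on concurrency follows from the same memory math. The sketch below is a worst-case estimate in which every sequence fills its full token budget. The overhead figure, the per-token cache size, and the function name are all assumptions, and runtimes that allocate cache in blocks on demand will often fit more sequences in practice.

```python
def max_concurrent_sequences(vram_gib: float,
                             gpu_memory_utilization: float,
                             weights_gib: float,
                             overhead_gib: float,
                             kv_mib_per_token: float,
                             tokens_per_sequence: int) -> int:
    """Worst-case count of full-length sequences that fit in the KV cache budget."""
    budget_gib = vram_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    per_sequence_gib = kv_mib_per_token * tokens_per_sequence / 1024
    return int(budget_gib // per_sequence_gib)


# ASSUMPTIONS: 24 GiB GPU, 90% usable, about 13 GiB of FP16 7B weights,
# 1.5 GiB of runtime overhead, 0.5 MiB of KV cache per token.
common = dict(vram_gib=24, gpu_memory_utilization=0.9,
              weights_gib=13, overhead_gib=1.5, kv_mib_per_token=0.5)

long_context = max_concurrent_sequences(tokens_per_sequence=4096, **common)
short_context = max_concurrent_sequences(tokens_per_sequence=1024, **common)

print(f"4096-token sequences that fit: {long_context}")
print(f"1024-token sequences that fit: {short_context}")
```

Setting the batch limit far above a number like this does not add throughput. It only moves the pressure into preemption and queueing.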

For latency-sensitive chat, smaller controlled batches may feel better. For offline batch jobs, higher throughput can matter more than first-token latency. The same GPU can be configured differently depending on whether you are serving a chatbot, a coding agent, or a document processing queue.

How to diagnose slow inference

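  • Check VRAM headroom: If usage is pinned near the limit, KV cache has no room to grow. Runtimes such as vLLM preallocate their cache pool up front, so also check the cache usage the runtime itself reports.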
  • Separate TTFT from decode: Measure first token latency and tokens per second separately.
  • Log prompt and output length: Requests to the same endpoint can have wildly different cost.
  • Watch queue depth: Slow inference may be queueing, not pure model speed.
  • Test realistic concurrency: One-user demos do not prove production fit.
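The first check in the list can be scripted. This sketch parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` into a free-memory fraction per GPU. The function names are hypothetical. If the runtime preallocates its cache pool, this number shows what the process has reserved, not how full the cache is.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_headroom(csv_text: str) -> list:
    """Return the free-memory fraction for each GPU line of nvidia-smi CSV output."""
    fractions = []
    for line in csv_text.strip().splitlines():
        used_mib, total_mib = (float(field) for field in line.split(","))
        fractions.append(1.0 - used_mib / total_mib)
    return fractions


def read_headroom() -> list:
    """Run nvidia-smi and parse its output. Requires an NVIDIA driver on the host."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_headroom(result.stdout)


# Parsing a sample line, so the sketch runs without a GPU present.
sample = "18000, 24000\n"
print(parse_headroom(sample))
```

Logging this value alongside queue depth, prompt length, and output length for each request makes it easier to tell memory pressure apart from queueing.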

When a hosted API is better

If you do not need custom weights, direct SSH access, or a private runtime, a hosted OpenAI-compatible API is often simpler. You avoid sizing GPUs, tuning KV cache, restarting runtimes, and keeping machines warm. You pay per request and move faster.

Rent a GPU when you need control: custom Docker images, ComfyUI, vLLM servers, private weights, fine-tuning, notebooks, or long experiments. Use a hosted API when you need stateless inference without operating the machine.

The practical takeaway

If your LLM endpoint is slow, do not just swap the model or blame the provider. Check GPU fit, KV cache pressure, context length, batch settings, and output length. The fastest fix is often not a bigger GPU. It is choosing the right serving mode for the workload.

For stateless chat and model API workflows, start with Lumino hosted models. For custom runtimes, notebooks, and vLLM servers, use Lumino GPU Cloud.
