Inference runtime

vLLM

High-throughput inference server with PagedAttention

Linux + NVIDIA GPU primary; AMD ROCm + CPU also supportedApache 2.0

vLLM is the production-leaning local server — PagedAttention for memory efficiency, continuous batching for high throughput, OpenAI-compatible REST API. Common pairing with on-prem deployments serving a small team's chatbot or coding assistant.

Report an issue with vLLM

Posts to your status feed

Pick the closest match below, edit the body, and post. Your report carries the #vllm tag automatically so it surfaces here + in the trending-tags rail. Reporting also follows vLLM so you’ll get status updates.

Down Very Slow Hallucinating Refusing Prompts Rate-Limited Other

vLLM

Report an issue with vLLM

Recent coverage

Tag aliases

Tags1

Community tags

Edit history

Community status