vLLM
Register a local vLLM server as a BitRouter provider — high-throughput GPU serving behind an OpenAI-compatible API.
vLLM is a high-throughput inference engine for serving models on your own GPUs. Its vllm serve command exposes an OpenAI-compatible API at http://localhost:8000/v1, which BitRouter fronts as one provider block.
Prerequisites
- BitRouter installed, with a bitrouter.yaml (scaffold one with bitrouter init).
- vLLM serving a model:

  vllm serve meta-llama/Llama-3.1-8B-Instruct # default port 8000
Add vLLM to BitRouter
# bitrouter.yaml
providers:
  vllm:
    api_base: http://localhost:8000/v1
    api_protocol:
      - "*": chat_completions
    models:
      - id: meta-llama/Llama-3.1-8B-Instruct

The models id must match the name vLLM serves. By default that's the full Hugging Face repo id; pass --served-model-name my-model to vllm serve to alias it to something shorter, then use that alias here.
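One way to confirm the id before editing bitrouter.yaml is to ask vLLM what it is serving. The sketch below assumes the server from the prerequisites is reachable at http://localhost:8000/v1 and reads its OpenAI-compatible /v1/models listing. The helper names (served_model_ids, check_model_id, fetch_models) are illustrative and not part of vLLM or BitRouter.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # assumption: same api_base as in bitrouter.yaml


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_model_id(configured: str, payload: dict) -> bool:
    """Return True if the id written in bitrouter.yaml is one the server lists."""
    return configured in served_model_ids(payload)


def fetch_models(api_base: str = API_BASE) -> dict:
    """GET {api_base}/models and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{api_base}/models", timeout=5) as resp:
        return json.load(resp)
```

With the server running, check_model_id("meta-llama/Llama-3.1-8B-Instruct", fetch_models()) should return True for the default setup. If you launched vLLM with --served-model-name, pass that alias instead.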
Optional auth. vLLM is keyless by default. If you launched it with --api-key <token> (or VLLM_API_KEY), add api_key: ${VLLM_API_KEY} to the provider block — it resolves from the environment at load time.
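To check that a key is accepted before wiring it into BitRouter, you can call vLLM directly. This is a minimal sketch, assuming vLLM expects the token as a bearer token in the Authorization header (the OpenAI client convention) and that the key is exported as VLLM_API_KEY. The function names models_request and probe are hypothetical.

```python
import os
import urllib.request
from typing import Optional


def models_request(api_base: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for {api_base}/models, adding a bearer token if a key is given."""
    req = urllib.request.Request(f"{api_base}/models")
    if api_key:
        # Assumption: the server checks "Authorization: Bearer <token>".
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


def probe(api_base: str = "http://localhost:8000/v1") -> int:
    """Send the request using VLLM_API_KEY from the environment; return the HTTP status."""
    req = models_request(api_base, os.environ.get("VLLM_API_KEY"))
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A return value of 200 from probe() suggests the key is valid. urllib raises urllib.error.HTTPError when the server rejects the request, which usually means the key is missing or wrong.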
Port clash with Unsloth. vLLM and Unsloth Studio both default to :8000. If you run both, start one on another port (vllm serve … --port 8001) and update api_base.
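If you are unsure whether :8000 is already taken, a quick TCP probe can tell you before you start a second server. The sketch below uses only the Python standard library. port_in_use is an illustrative helper, and it reports only that something is listening, not which program it is.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, if port_in_use(8000) is True and port_in_use(8001) is False, starting vLLM with --port 8001 and pointing api_base at http://localhost:8001/v1 avoids the clash.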
Route to it
bitrouter route vllm:meta-llama/Llama-3.1-8B-Instruct

Then start BitRouter and send a request.