Model Fallback

Pass an ordered list of models and BitRouter walks the list when the primary fails.

LLM endpoints fail. Rate limits, model outages, context-overflow errors, and content filters all surface as request errors that would otherwise stall an agent loop. Model fallback lets you pass a ranked list of models in a single request — BitRouter walks the list until one succeeds, then returns that response.

This is a body-level extension to the OpenAI, Anthropic, and Google protocol surfaces. No SDK required — set one field.

Quick example

curl http://127.0.0.1:4356/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "models": [
      "openai/gpt-4o",
      "anthropic/claude-sonnet-4-6",
      "google/gemini-2.5-pro"
    ],
    "messages": [{"role": "user", "content": "Summarize the Iliad in one sentence."}]
  }'

The model field stays the primary for billing and routing semantics. The models array overrides it as an ordered preference list — first one that returns successfully wins.

You can omit model entirely when models is set. If both are present, the first entry of models is the primary and model is ignored. We recommend always passing model so error logs in your app stay readable.
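As a rough illustration, the request body from the quick example can be assembled in Python with only the standard library. The helper name build_fallback_body is hypothetical and not part of BitRouter; the field names and endpoint are the ones shown above. The sketch sets model to the first entry of models, following the recommendation to always pass it.

```python
import json


def build_fallback_body(models, messages):
    """Return a chat-completions body with an ordered fallback list.

    `model` mirrors the first entry of `models` so application logs
    stay readable (hypothetical helper, not a BitRouter API).
    """
    if not models:
        raise ValueError("models must contain at least one model ID")
    return {
        "model": models[0],
        "models": list(models),
        "messages": list(messages),
    }


body = build_fallback_body(
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-6", "google/gemini-2.5-pro"],
    [{"role": "user", "content": "Summarize the Iliad in one sentence."}],
)
payload = json.dumps(body)  # POST this string to /v1/chat/completions
```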

What triggers a fallback

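BitRouter falls through to the next model on errors that are upstream-side and likely transient. Errors caused by your own request, such as authentication, quota, or validation failures, are surfaced directly to the caller. Note that not every 4xx is surfaced: 429 and 408 fall through.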

| Outcome | Signal | Behavior |
| --- | --- | --- |
| Rate limited | 429 | Fall through |
| Server error | 5xx | Fall through |
| Timeout / connection drop | 408, network error | Fall through |
| Context window exceeded | provider-specific code | Fall through |
| Content filter / refusal | provider-specific code | Fall through |
| Mid-stream failure (no tokens emitted) | stream aborted before first token | Fall through |
| Mid-stream failure (after first token) | stream aborted mid-response | Surfaced (partial output already sent) |
| Authentication error | 401 | Surfaced |
| Forbidden / quota exhausted | 402, 403 | Surfaced |
| Validation / bad request | 400, 422 | Surfaced |
| Explicit cancel | client disconnect | Surfaced |
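The status-code rows of this table can be mirrored in a few lines of Python, for example to predict in a test harness whether a simulated failure would move on to the next model. This is a sketch of the documented rules only, not BitRouter's implementation; the function name falls_through is hypothetical, and the provider-specific codes for context overflow and content filtering are deliberately left out because their values are not listed here.

```python
# Status codes taken from the table above.
FALL_THROUGH_STATUSES = {408, 429}
SURFACED_STATUSES = {400, 401, 402, 403, 422}


def falls_through(status, tokens_emitted=False):
    """Return True if the table says the next model would be tried."""
    if tokens_emitted:
        # Partial output already reached the caller, so the error is surfaced.
        return False
    if status in FALL_THROUGH_STATUSES or 500 <= status <= 599:
        return True
    if status in SURFACED_STATUSES:
        return False
    raise ValueError(f"status {status} is not covered by the table")
```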

Fallback is single-pass. BitRouter attempts each model at most once per request, in order. There is no exponential backoff between attempts — the assumption is that you'd rather retry on a different model immediately than wait on a failing one.
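The single-pass behavior can be pictured as a plain loop with no sleep between attempts. The sketch below is a local simulation, not BitRouter code: walk_models, RetryableError, AllModelsFailed, and fake_attempt are hypothetical names, and raising when every model fails is an assumption, since this page does not state what is returned in that case.

```python
class RetryableError(Exception):
    """Hypothetical marker for a fall-through outcome (e.g. rate_limit)."""


class AllModelsFailed(Exception):
    """Hypothetical error raised when every model has been tried once."""


def walk_models(models, attempt):
    """Try each model at most once, in order; return (model, response, trace)."""
    trace = []
    for model in models:
        try:
            response = attempt(model)
        except RetryableError as exc:
            trace.append((model, str(exc)))
            continue  # no backoff: go straight to the next model
        trace.append((model, "served"))
        return model, response, trace
    raise AllModelsFailed(trace)


def fake_attempt(model):
    # Simulated upstream: the first model is rate limited, the rest succeed.
    if model == "openai/gpt-4o":
        raise RetryableError("rate_limit")
    return f"response from {model}"


served_by, response, trace = walk_models(
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-6", "google/gemini-2.5-pro"],
    fake_attempt,
)
```

In this simulation the third model is never attempted, because the second one succeeds.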

Fallback runs per request, not per token. If a stream succeeds on model A and disconnects after token 50, BitRouter does not silently restart on model B — that would emit a discontinuous response. Wrap fallback at the request boundary; checkpoint and resume in your agent for stream-level resilience.

Inspecting which model answered

Three breadcrumbs:

  • Response body model field — set to the model that actually generated the response, not the one you requested first. (OpenAI convention.)
  • Response header bitrouter-served-by, set to <provider-id>/<model-id>, e.g. anthropic-direct/anthropic/claude-sonnet-4-6.
  • Response header bitrouter-fallback-trace — comma-separated list of attempts and outcomes, e.g. openai/gpt-4o:rate_limit,anthropic/claude-sonnet-4-6:served. Only emitted when at least one fallback fired.
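A small parser for the bitrouter-fallback-trace value might look like the following. It assumes the format is exactly as in the example above (comma-separated model:outcome entries) and that outcome labels never contain a colon; parse_fallback_trace is a hypothetical helper name.

```python
def parse_fallback_trace(header_value):
    """Split 'model:outcome,model:outcome' into (model, outcome) pairs."""
    attempts = []
    for entry in header_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon so the '/' inside model IDs is untouched.
        model, _, outcome = entry.rpartition(":")
        attempts.append((model, outcome))
    return attempts


attempts = parse_fallback_trace(
    "openai/gpt-4o:rate_limit,anthropic/claude-sonnet-4-6:served"
)
```

Because the header is only emitted when at least one fallback fired, callers should treat a missing header as "the first model answered".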

Cost and latency tradeoffs

Each fallback attempt is a fresh upstream request. Practical advice:

  • Lowest expected cost: order by cheapest first, accept higher tail latency under load.
  • Lowest expected latency: order by most reliable first, accept higher per-token cost.
  • For long-running agent loops: bias toward reliability. The cost of a stalled loop is much higher than the marginal cost difference between two frontier models.
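The ordering tradeoff can be made concrete with a small expected-value calculation. Everything in this sketch is an assumption for illustration: the success probabilities, costs, and latencies are invented numbers, attempts are treated as independent, a failed attempt is assumed to cost nothing but to consume its full latency, and the function name is hypothetical.

```python
def expected_cost_and_latency(chain):
    """chain: list of (success_probability, cost_if_served, latency_per_attempt)."""
    reach = 1.0  # probability that this model is attempted at all
    cost = 0.0
    latency = 0.0
    for p_success, cost_if_served, attempt_latency in chain:
        cost += reach * p_success * cost_if_served
        latency += reach * attempt_latency
        reach *= 1.0 - p_success  # later models run only if this one fails
    return cost, latency


# Hypothetical models: (success probability, relative cost, latency in seconds)
cheap = (0.90, 1.0, 2.0)
reliable = (0.99, 3.0, 1.0)

cheap_first = expected_cost_and_latency([cheap, reliable])
reliable_first = expected_cost_and_latency([reliable, cheap])
```

With these made-up inputs, putting the cheap model first gives the lower expected cost and putting the reliable model first gives the lower expected latency, which matches the advice in the list above.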

For declarative cost or latency optimization across providers of a single model, see Provider Selection. Fallback and provider selection compose: BitRouter picks the best provider for each model in your models array, falling through to the next model only after the chosen provider for the current one has exhausted its retry budget.

Anthropic and Google surfaces

The models field works identically on /v1/messages (Anthropic Messages) and /v1beta/models/{model}:generateContent (Google Generative AI). On Anthropic, the field is read alongside the existing model field; on Google, BitRouter accepts it as an extension to the request body — the upstream :generateContent path is rewritten per attempt.
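For comparison, here is a sketch of what the two request shapes could look like when built in Python. The helper names are hypothetical. The Anthropic body uses the standard Messages fields plus models; the Google body uses the standard contents/parts layout plus models. How the model ID should be written inside the :generateContent path is not specified on this page, so path_model is left as a caller-supplied assumption.

```python
def anthropic_request(models, prompt, max_tokens=256):
    """Return (path, body) for the Anthropic Messages surface."""
    body = {
        "model": models[0],        # read alongside `models`
        "models": list(models),    # ordered fallback list
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return "/v1/messages", body


def google_request(path_model, models, prompt):
    """Return (path, body) for the Google Generative AI surface.

    `path_model` is whatever ID form the path expects (an assumption here).
    """
    path = f"/v1beta/models/{path_model}:generateContent"
    body = {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "models": list(models),    # body-level extension
    }
    return path, body


chain = ["anthropic/claude-sonnet-4-6", "google/gemini-2.5-pro"]
a_path, a_body = anthropic_request(chain, "Summarize the Iliad in one sentence.")
g_path, g_body = google_request("gemini-2.5-pro", chain, "Summarize the Iliad in one sentence.")
```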

Limits

  • Maximum 8 entries in models.
  • Each model ID must resolve to a registered model in the registry. Unknown IDs return 400 before any upstream attempt.
  • Streaming is supported. The first model that begins emitting tokens wins; later models are not attempted even if the stream fails after the first token.
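The first two limits can be checked client-side before sending a request, which avoids a round trip that would end in a 400. This is a sketch under assumptions: validate_models is a hypothetical helper, and known_models stands in for a local copy of the registry, since this page does not describe how to list registered models.

```python
MAX_MODELS = 8  # documented maximum number of entries in `models`


def validate_models(models, known_models):
    """Raise ValueError if `models` would be rejected by the limits above."""
    if len(models) > MAX_MODELS:
        raise ValueError(f"models has {len(models)} entries; maximum is {MAX_MODELS}")
    unknown = [m for m in models if m not in known_models]
    if unknown:
        raise ValueError(f"unknown model IDs: {', '.join(unknown)}")


known_models = {"openai/gpt-4o", "anthropic/claude-sonnet-4-6", "google/gemini-2.5-pro"}
validate_models(["openai/gpt-4o", "google/gemini-2.5-pro"], known_models)  # passes
```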
