Retries

Retry transport problems, not contract truth

A good retry policy helps the runtime survive temporary failures without teaching the model to loop around validation or permission problems.

Format: AI-first docs

Source: DocsAgentsRetriesPage

Warning

Recovery should branch on machine-readable types

When the contract tells you what kind of failure occurred, the runtime should decide whether to retry, reconfigure, or stop.

Retry Design Pillars

Keep retries finite, typed, and outside the model loop.

Classification

Decide quickly whether the failure is transient or terminal.

Retry timeouts, some 5xx, and rate-limit cases carefully.
Do not retry malformed requests unchanged.
Do not retry auth failures unless credentials changed.

Backoff

Retry with a bounded and observable delay strategy.

Use exponential backoff with jitter.
Respect `Retry-After` when present.
Cap attempt counts so failure stays visible.

Isolation

Keep retries in the runtime adapter instead of the reasoning loop.

The model should not guess whether to replay the same request.
Return a classified failure after retries are exhausted.
Log attempt count and branch reason.

Separate retryable failures from terminal contract failures

Not every failed call deserves another try. The runtime should classify transport, rate-limit, validation, and permission failures before it decides anything else.

Treat validation and missing-field failures as payload work.
Treat permission and auth failures as capability work.
Reserve retries for failures that may succeed later without semantic changes.

Keep retry behavior bounded and inspectable

A retry policy should make the system safer, not more mysterious. That means explicit limits, logs, and typed branch rules.

Record the last type and status seen across attempts.
Expose terminal failure to operators after the cap is hit.
Keep the policy consistent across tools that hit the same API.

Retry Comparison

Use this split to decide whether another attempt is responsible.

Failure class	Default move	Reason
Timeout / transient 5xx	Retry with bounded backoff.	The request may succeed later without changes.
Validation / missing input	Stop and repair the request.	Replaying the same payload will fail the same way.
Auth / forbidden	Stop and reconfigure credentials or scopes.	The runtime needs new capability, not another attempt.

Decision Matrix

Use this when a live request fails and the runtime must choose a recovery path.

Situation

You receive `rate_limit.exceeded`

Action

Wait, respect `Retry-After` if present, then retry with bounded backoff.

Why

The request may succeed later without changing input.

Situation

You receive a validation-specific type

Action

Stop and change the payload or missing fields first.

Why

Transport-level retries cannot repair a bad request.

Situation

You receive an auth or forbidden type

Action

Treat it as runtime configuration work.

Why

Credentials and scopes belong to the runtime boundary, not the model.

Representative failure branches

Use typed envelopes like these to control retries safely.

Retryable branch

1{

2 "error": "Rate limit exceeded",

3 "type": "rate_limit.exceeded"

4}

Terminal branch

1{

2 "error": "Missing required field: targetRole",

3 "type": "career_prediction.missing_target_role"

4}

Next surfaces

After retries are clear, design continuation and endpoint-level error coverage.

Open Agent Webhooks

Open Error Types

Open OpenAPI