Retries
Retry transport problems, not contract truth
A good retry policy helps the runtime survive temporary failures without teaching the model to loop around validation or permission problems.
Recovery should branch on machine-readable types
When the contract tells you what kind of failure occurred, the runtime should decide whether to retry, reconfigure, or stop.
Retry Design Pillars
Keep retries finite, typed, and outside the model loop.
Classification
Decide quickly whether the failure is transient or terminal.
- Retry timeouts, transient 5xx responses, and rate-limit errors, and only with bounded backoff.
- Do not retry malformed requests unchanged.
- Do not retry auth failures unless credentials changed.
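The classification rules above can be sketched as a small status-based classifier. This is a minimal illustration, not a definitive policy: the names `FailureClass`, `TRANSIENT_STATUSES`, and `classify_status` are hypothetical, and the set of status codes treated as transient is an assumption to adjust per API.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    TRANSIENT = "transient"  # may succeed later without changes
    TERMINAL = "terminal"    # needs a different request or new credentials


# Assumed set for illustration: timeout, rate limit, and gateway-style 5xx.
# A plain 500 is left out on purpose; whether it is transient depends on the API.
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}


def classify_status(status: Optional[int], timed_out: bool = False) -> FailureClass:
    """Decide whether a failed call is worth another attempt."""
    if timed_out or status is None:
        # No response arrived, so this is a transport problem.
        return FailureClass.TRANSIENT
    if status in TRANSIENT_STATUSES:
        return FailureClass.TRANSIENT
    # Malformed requests (400, 422), auth failures (401, 403), and anything
    # unrecognized stop here instead of being replayed unchanged.
    return FailureClass.TERMINAL
```

Defaulting unknown statuses to terminal keeps an unclassified failure visible rather than silently retried.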
Backoff
Retry with a bounded and observable delay strategy.
- Use exponential backoff with jitter.
- Respect `Retry-After` when present.
- Cap attempt counts so failure stays visible.
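One way to compute the delay is exponential backoff with "full jitter", where the wait is drawn uniformly between zero and a capped exponential ceiling. The sketch below assumes `Retry-After` has already been parsed into seconds (the header can also carry an HTTP date, which is not handled here). The constants and the `backoff_delay` name are hypothetical.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 4     # assumed cap; the caller stops after this many attempts
BASE_DELAY_S = 0.5   # assumed ceiling for the first retry
MAX_DELAY_S = 30.0   # assumed upper bound on any computed delay


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Return seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        # The server's hint wins over the locally computed delay.
        return max(0.0, retry_after)
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    # Full jitter: spread callers across the whole interval.
    return (rng or random).uniform(0.0, ceiling)
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests while production code uses the module-level generator.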
Isolation
Keep retries in the runtime adapter instead of the reasoning loop.
- The model should not guess whether to replay the same request.
- Return a classified failure after retries are exhausted.
- Log attempt count and branch reason.
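A retry loop that lives in the runtime adapter might look like the following sketch. The model only ever sees the final `CallResult`; it never decides whether to replay. Everything here is an assumption for illustration: `CallResult`, `call_with_retries`, the `send` and `is_retryable` callables, and the log format are hypothetical names, and the delay is a plain exponential without jitter to keep the example short.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("retry_adapter")


@dataclass
class CallResult:
    ok: bool
    status: Optional[int] = None
    error_type: Optional[str] = None  # e.g. "rate_limit.exceeded"
    attempts: int = 0


def call_with_retries(
    send: Callable[[], CallResult],
    is_retryable: Callable[[CallResult], bool],
    max_attempts: int = 3,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> CallResult:
    """Run `send` up to `max_attempts` times and return the last result."""
    result = CallResult(ok=False)
    for attempt in range(1, max_attempts + 1):
        result = send()
        result.attempts = attempt
        if result.ok:
            return result
        retryable = is_retryable(result)
        # Log the attempt count and the reason for the branch taken.
        logger.info(
            "attempt=%d type=%s status=%s retryable=%s",
            attempt, result.error_type, result.status, retryable,
        )
        if not retryable or attempt == max_attempts:
            break
        sleep(delay_s * 2 ** (attempt - 1))
    # Exhausted or terminal: hand back the classified failure as-is.
    return result
```

Injecting `sleep` keeps the loop testable without real waiting, and the returned `attempts` field carries the count that operators need.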
Separate retryable failures from terminal contract failures
Not every failed call deserves another try. The runtime should classify transport, rate-limit, validation, and permission failures before it decides anything else.
- Treat validation and missing-field failures as payload work.
- Treat permission and auth failures as capability work.
- Reserve retries for failures that may succeed later without semantic changes.
Keep retry behavior bounded and inspectable
A retry policy should make the system safer, not more mysterious. That means explicit limits, logs, and typed branch rules.
- Record the last error type and HTTP status seen across attempts.
- Expose terminal failure to operators after the cap is hit.
- Keep the policy consistent across tools that hit the same API.
Retry Comparison
Use this split to decide whether another attempt is responsible.
| Failure class | Default move | Reason |
|---|---|---|
| Timeout / transient 5xx | Retry with bounded backoff. | The request may succeed later without changes. |
| Validation / missing input | Stop and repair the request. | Replaying the same payload will fail the same way. |
| Auth / forbidden | Stop and reconfigure credentials or scopes. | The runtime needs new capability, not another attempt. |
Decision Matrix
Use this when a live request fails and the runtime must choose a recovery path.
| Situation | Action | Why |
|---|---|---|
| You receive `rate_limit.exceeded`. | Wait, respect `Retry-After` if present, then retry with bounded backoff. | The request may succeed later without changing input. |
| You receive a validation-specific type. | Stop and change the payload or missing fields first. | Transport-level retries cannot repair a bad request. |
| You receive an auth or forbidden type. | Treat it as runtime configuration work. | Credentials and scopes belong to the runtime boundary, not the model. |
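The decision paths above can be expressed as a branch on the envelope's `type` field. This sketch assumes error envelopes are JSON objects with a `type` string. Only `rate_limit.exceeded` and `career_prediction.missing_target_role` come from this page; the `auth.` and `forbidden.` prefixes, the `validation.` and `.missing_` markers, and the `Recovery` and `recovery_for` names are hypothetical.

```python
import json
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"                    # wait, then try again with backoff
    REPAIR_PAYLOAD = "repair_payload"  # change the request first
    RECONFIGURE = "reconfigure"        # fix credentials or scopes
    STOP = "stop"                      # unknown type: surface to operators


RETRYABLE_TYPES = {"rate_limit.exceeded"}
AUTH_PREFIXES = ("auth.", "forbidden.")          # assumed naming scheme
VALIDATION_MARKERS = ("validation.", ".missing_")  # assumed naming scheme


def recovery_for(envelope_json: str) -> Recovery:
    """Map a typed error envelope to a recovery path."""
    envelope = json.loads(envelope_json)
    error_type = envelope.get("type") or ""
    if error_type in RETRYABLE_TYPES:
        return Recovery.RETRY
    if error_type.startswith(AUTH_PREFIXES):
        return Recovery.RECONFIGURE
    if any(marker in error_type for marker in VALIDATION_MARKERS):
        return Recovery.REPAIR_PAYLOAD
    # Never retry a failure that cannot be classified.
    return Recovery.STOP
```

Matching on the machine-readable `type` rather than the human-readable `error` string keeps the branch stable when message wording changes.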
Representative failure branches
Use typed envelopes like these to control retries safely.
{ "error": "Rate limit exceeded", "type": "rate_limit.exceeded" }
{ "error": "Missing required field: targetRole", "type": "career_prediction.missing_target_role" }
Next surfaces
After retries are clear, design continuation and endpoint-level error coverage.
