Resilience Policies
BLOGE makes resilience a first-class graph concern instead of burying it inside operator implementations. Each node can declare retry, timeout, and fallback behavior so runtime failure handling remains visible, reviewable, and measurable.
The resilience envelope
BLOGE applies resilience in this order:
- Retry wraps execution from the outside
- Timeout applies to each attempt
- Fallback handles the final failure if one is configured
That ordering means every retry attempt gets its own timeout budget and the fallback only activates after retries are exhausted or a non-retryable failure escapes.
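The ordering above can be sketched in plain Java. This is an illustrative model of the envelope, not the BLOGE implementation; the names `execute`, `maxAttempts`, and `perAttemptTimeout` are invented for the sketch.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch of the envelope ordering: retry wraps from the outside,
// each attempt gets a fresh timeout budget, fallback runs last.
public class ResilienceEnvelope {
    static <T> T execute(Supplier<T> operator,
                         int maxAttempts,
                         Duration perAttemptTimeout,
                         Function<Exception, T> fallback) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ExecutorService ex = Executors.newSingleThreadExecutor();
            try {
                // The timeout applies to this single attempt only.
                return ex.submit((Callable<T>) operator::get)
                         .get(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                last = e; // failed or timed-out attempt: loop for the next retry
            } finally {
                ex.shutdownNow();
            }
        }
        // Fallback activates only after all retries are exhausted.
        return fallback.apply(last);
    }

    public static void main(String[] args) {
        String result = execute(
            () -> { throw new RuntimeException("downstream unavailable"); },
            3,
            Duration.ofMillis(50),
            e -> "degraded-default");
        System.out.println(result); // prints "degraded-default"
    }
}
```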
Retry
Retry is useful for transient infrastructure failures such as flaky network calls, short-lived downstream overload, or temporary locks.
Java API
```java
.node("checkCredit", creditCheckOperator)
.retry(3, Duration.ofMillis(100), BackoffStrategy.EXPONENTIAL)
```

DSL

```
node checkCredit : CreditCheckOperator {
    retry = { attempts: 3, backoff: 100ms, strategy: exponential }
}
```

BLOGE supports fixed, exponential, and jitter-oriented retry backoff strategies. Use them only on failures that are safe and meaningful to retry.
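The delay math behind the three strategies can be sketched as follows. The doubling factor and the jitter range are common conventions, not confirmed BLOGE defaults.

```java
import java.time.Duration;

// Illustrative backoff arithmetic only; BLOGE's exact parameters may differ.
public class Backoff {
    // Fixed: the same delay before every attempt.
    static Duration fixed(Duration base, int attempt) {
        return base;
    }

    // Exponential: base * 2^(attempt-1), e.g. 100ms, 200ms, 400ms.
    static Duration exponential(Duration base, int attempt) {
        return base.multipliedBy(1L << (attempt - 1));
    }

    // Jittered: a random delay in [0, exponential) to de-synchronize retrying clients.
    static Duration jittered(Duration base, int attempt) {
        long capMillis = exponential(base, attempt).toMillis();
        return Duration.ofMillis((long) (Math.random() * capMillis));
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                + exponential(Duration.ofMillis(100), attempt));
        }
        // attempt 1 -> PT0.1S, attempt 2 -> PT0.2S, attempt 3 -> PT0.4S
    }
}
```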
Timeout
Timeout places an upper bound on one operator attempt. It is the simplest way to stop a slow dependency from stalling the whole graph.
Java API
```java
.node("fetchProducts", fetchProductsOperator)
.timeout(Duration.ofSeconds(5))
```

DSL

```
node fetchProducts : FetchProductsOperator {
    timeout = 5s
}
```

In virtual-thread execution, a timed-out attempt can be interrupted without burning an operating system thread for the duration of the wait.
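The virtual-thread behavior can be illustrated with standard Java 21 APIs. This is a generic sketch of bounding one attempt, not BLOGE's actual scheduler wiring; `runWithTimeout` is an invented name.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound a single attempt with a timeout on a virtual thread (Java 21+).
public class AttemptTimeout {
    static String runWithTimeout(Callable<String> attempt, Duration limit) throws Exception {
        ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor();
        try {
            Future<String> f = vt.submit(attempt);
            try {
                return f.get(limit.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupt the parked virtual thread; no OS thread was pinned while waiting.
                f.cancel(true);
                throw e;
            }
        } finally {
            vt.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            runWithTimeout(() -> { Thread.sleep(10_000); return "never"; }, Duration.ofMillis(100));
        } catch (TimeoutException e) {
            System.out.println("attempt timed out"); // the slow dependency no longer stalls the graph
        }
    }
}
```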
Fallback
Fallback lets a graph degrade intentionally when a dependency is unavailable but the workflow can still continue with a substitute value.
Java API
```java
.node("checkCredit", creditCheckOperator)
.fallback(ex -> new CreditResult(false, "service unavailable"))
```

DSL

```
node checkCredit : CreditCheckOperator {
    fallback = { approved: false, reason: "credit service unavailable" }
}
```

Fallback should produce a value that downstream nodes can understand clearly. It should not hide a broken contract or silently convert every exception into “success.”
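One way to keep a fallback honest is to make the degraded state explicit in the value's type. The sketch below mirrors the `CreditResult` from the example; the `degraded` flag and the simulated outage are illustrative additions, not BLOGE API.

```java
import java.util.function.Function;

// Sketch: a fallback should produce an explicit, typed degraded value
// rather than masquerading as success.
public class FallbackShape {
    record CreditResult(boolean approved, String reason, boolean degraded) {}

    static CreditResult checkCredit(Function<Exception, CreditResult> fallback) {
        try {
            // Simulated dependency outage for the sketch.
            throw new RuntimeException("credit service unavailable");
        } catch (Exception e) {
            return fallback.apply(e);
        }
    }

    public static void main(String[] args) {
        CreditResult r = checkCredit(e -> new CreditResult(false, e.getMessage(), true));
        // Downstream code can branch on r.degraded() instead of guessing why the value looks odd.
        System.out.println(r.approved() + " / " + r.reason() + " / degraded=" + r.degraded());
    }
}
```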
What to retry and what not to retry
| Good retry candidates | Poor retry candidates |
|---|---|
| HTTP 503 / connection reset | Validation failures |
| Temporary lock contention | Business rule rejection |
| Short-lived downstream unavailability | Bad request payloads |
| Remote service warm-up failures | Deterministic programmer errors |
If the operator is not idempotent, add business safeguards such as idempotency keys or request deduplication before enabling aggressive retries.
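An idempotency-key safeguard can be sketched like this: the key is fixed once per logical request, so a retried call collapses onto the first result instead of repeating the side effect. The names `IdempotentCharge` and `charge` are invented for illustration.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotency-key safeguard that makes a charge safe to retry.
public class IdempotentCharge {
    private final Map<String, String> processed = new ConcurrentHashMap<>();
    int executions = 0;

    String charge(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey, k -> {
            executions++; // the side effect runs at most once per key
            return "receipt-" + k + "-" + amountCents;
        });
    }

    public static void main(String[] args) {
        IdempotentCharge svc = new IdempotentCharge();
        String key = UUID.randomUUID().toString();
        String first = svc.charge(key, 999);
        String retried = svc.charge(key, 999); // a retry with the same key is a no-op
        System.out.println(first.equals(retried) + " executions=" + svc.executions);
        // prints "true executions=1"
    }
}
```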
Common resilience patterns
Parallel fan-out with differentiated policies
In a BFF graph, one branch may be mandatory while others can degrade:
- fetchProfile: timeout + no fallback
- fetchNotifications: timeout + retry + empty-list fallback
- fetchLoyalty: short timeout + default fallback
BLOGE keeps these policies attached to the nodes that own the risk.
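The fan-out above can be sketched with plain futures; in BLOGE the policies would live on the graph nodes instead. The three service calls, the `View` record, and `withPolicy` are stand-ins invented for the sketch.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: one mandatory branch fails the whole view; degradable branches substitute.
public class FanOut {
    record View(String profile, List<String> notifications, String loyaltyTier) {}

    static <T> T withPolicy(Future<T> branch, Duration timeout, T fallback) {
        try {
            return branch.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            if (fallback == null) {
                throw new CompletionException(e); // mandatory branch: fail loudly
            }
            return fallback; // degradable branch: substitute and continue
        }
    }

    public static void main(String[] args) {
        ExecutorService ex = Executors.newFixedThreadPool(3);
        try {
            Future<String> profile = ex.submit(() -> "alice");
            Future<List<String>> notifications =
                ex.submit((Callable<List<String>>) () -> { throw new RuntimeException("down"); });
            Future<String> loyalty = ex.submit(() -> { Thread.sleep(5_000); return "gold"; });

            View view = new View(
                withPolicy(profile, Duration.ofSeconds(2), null),            // mandatory, no fallback
                withPolicy(notifications, Duration.ofSeconds(1), List.of()), // empty-list fallback
                withPolicy(loyalty, Duration.ofMillis(100), "standard"));    // short timeout + default
            System.out.println(view);
        } finally {
            ex.shutdownNow();
        }
    }
}
```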
Decision flows with explicit degraded paths
A credit-check node may fall back to a rejected result. That makes the business consequence visible instead of forcing downstream code to infer why a value is missing.
Long-running flows with operational time bounds
Wait-like operators can still use timeouts and explicit timeout actions when the business process should move to escalation or compensation instead of waiting forever.
Telemetry and diagnostics
When bloge-metrics-otel is enabled, resilience behavior shows up directly in metrics:
- bloge.node.retries
- bloge.node.timeouts
- bloge.node.fallbacks
- bloge.node.errors
That makes it possible to distinguish healthy traffic from degraded-but-successful traffic.
Anti-patterns to avoid
- Over-broad fallback: swallowing every exception and pretending the node succeeded
- Missing timeout: allowing a slow dependency to hold the graph hostage indefinitely
- Retrying business rejections: turning domain errors into noisy repeated failures
- Hidden resilience: putting all retry logic inside operator code so the graph definition lies about runtime behavior
Recommended checklist
Before enabling a resilience policy, ask:
- Is the operator idempotent or otherwise safe to retry?
- What is the maximum acceptable latency contribution of this node?
- If a fallback is used, does downstream code know it is degraded?
- Should failure cancel the subtree, choose another branch, or emit a compensating result?
Next steps
- See how node scheduling reacts to these policies in Execution Model
- Explore anti-pattern examples in Example Catalog
- Instrument production behavior in Observability