
Resilience Policies

BLOGE makes resilience a first-class graph concern instead of burying it inside operator implementations. Each node can declare retry, timeout, and fallback behavior so runtime failure handling remains visible, reviewable, and measurable.

The resilience envelope

BLOGE applies resilience in this order:

  1. Retry wraps execution from the outside
  2. Timeout applies to each attempt
  3. Fallback handles the final failure if one is configured

That ordering means every retry attempt gets its own timeout budget and the fallback only activates after retries are exhausted or a non-retryable failure escapes.
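The nesting above can be sketched in plain Java. This is an illustrative model of the envelope, not the BLOGE runtime: retry is the outermost loop, each attempt gets its own timeout, and the fallback fires only after the loop is exhausted.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch of the resilience envelope: fallback(retry(timeout(attempt))).
public final class ResilienceEnvelope {
    public static <T> T execute(Supplier<T> attempt,
                                int maxAttempts,
                                Duration perAttemptTimeout,
                                Function<Exception, T> fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int i = 0; i < maxAttempts; i++) {           // retry: outermost loop
                Future<T> f = pool.submit(attempt::get);
                try {
                    // timeout: bounds this single attempt, not the whole loop
                    return f.get(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (Exception e) {
                    f.cancel(true);
                    last = e;
                }
            }
            return fallback.apply(last);                      // fallback: after exhaustion
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the timeout sits inside the loop, three attempts with a five-second timeout can consume up to fifteen seconds before the fallback runs; budget accordingly.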

Retry

Retry is useful for transient infrastructure failures such as flaky network calls, short-lived downstream overload, or temporary locks.

Java API

```java
.node("checkCredit", creditCheckOperator)
    .retry(3, Duration.ofMillis(100), BackoffStrategy.EXPONENTIAL)
```

DSL

```bloge
node checkCredit : CreditCheckOperator {
  retry = { attempts: 3, backoff: 100ms, strategy: exponential }
}
```

BLOGE supports fixed, exponential, and jitter-oriented retry backoff strategies. Use them only on failures that are safe and meaningful to retry.
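The delay math behind these strategies is conventional. A hedged sketch of typical formulas follows; BLOGE's exact backoff arithmetic may differ, and "full jitter" here is one common variant, not necessarily the one BLOGE implements.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Typical backoff delay calculations (attempt is zero-based).
public final class Backoff {
    public static Duration fixed(Duration base, int attempt) {
        return base;                                   // same delay every time
    }
    public static Duration exponential(Duration base, int attempt) {
        // base, 2*base, 4*base, ... capped to avoid overflow
        return base.multipliedBy(1L << Math.min(attempt, 30));
    }
    public static Duration fullJitter(Duration base, int attempt) {
        // random delay in [0, exponential cap] to desynchronize retrying clients
        long capMillis = exponential(base, attempt).toMillis();
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(capMillis + 1));
    }
}
```

Jitter matters when many callers fail at once: without it, they all retry on the same schedule and hammer the recovering dependency in lockstep.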

Timeout

Timeout places an upper bound on one operator attempt. It is the simplest way to stop a slow dependency from stalling the whole graph.

Java API

```java
.node("fetchProducts", fetchProductsOperator)
    .timeout(Duration.ofSeconds(5))
```

DSL

```bloge
node fetchProducts : FetchProductsOperator {
  timeout = 5s
}
```

In virtual-thread execution, a timed-out attempt can be interrupted without burning an operating system thread for the duration of the wait.
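The mechanics can be sketched with standard `java.util.concurrent` primitives (Java 21+ for virtual threads). This is an illustration of the interruption pattern, not BLOGE's scheduler: the timed-out wait blocks a cheap virtual thread, and cancellation interrupts the attempt so it unwinds promptly.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: one attempt on a virtual thread, bounded by a timeout.
public final class VirtualThreadTimeout {
    public static String attemptWithTimeout(Callable<String> attempt,
                                            Duration timeout) throws Exception {
        try (ExecutorService vexec = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> f = vexec.submit(attempt);
            try {
                return f.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);   // interrupts the virtual thread running the attempt
                throw e;
            }
        }
    }
}
```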

Fallback

Fallback lets a graph degrade intentionally when a dependency is unavailable but the workflow can still continue with a substitute value.

Java API

```java
.node("checkCredit", creditCheckOperator)
    .fallback(ex -> new CreditResult(false, "service unavailable"))
```

DSL

```bloge
node checkCredit : CreditCheckOperator {
  fallback = { approved: false, reason: "credit service unavailable" }
}
```

Fallback should produce a value that downstream nodes can understand clearly. It should not hide a broken contract or silently convert every exception into “success.”
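One way to keep the contract honest is to make degradation explicit in the result type. A minimal sketch, assuming a stand-in `CreditResult` with a `degraded` flag (this field is hypothetical, not part of the examples above):

```java
import java.util.function.Supplier;

// Sketch: the fallback value announces that it is degraded instead of
// silently impersonating a normal success.
public final class FallbackExample {
    public record CreditResult(boolean approved, String reason, boolean degraded) {}

    public static CreditResult checkCredit(Supplier<CreditResult> call) {
        try {
            return call.get();
        } catch (RuntimeException ex) {
            // Degrade intentionally: a rejection downstream code can reason about.
            return new CreditResult(false, "credit service unavailable", true);
        }
    }
}
```

Downstream nodes can then branch on `degraded()` rather than guessing why an approval is missing.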

What to retry and what not to retry

| Good retry candidates | Poor retry candidates |
| --- | --- |
| HTTP 503 / connection reset | Validation failures |
| Temporary lock contention | Business rule rejection |
| Short-lived downstream unavailability | Bad request payloads |
| Remote service warm-up failures | Deterministic programmer errors |

If the operator is not idempotent, add the right business safeguards before enabling aggressive retries.
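The table above amounts to a classification function. An illustrative sketch follows; the exception types are examples of how a codebase might map transport failures versus domain failures, not a list BLOGE prescribes.

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

// Sketch: retry transport-level transients, never domain rejections.
public final class RetryPolicy {
    public static boolean isRetryable(Throwable t) {
        if (t instanceof ConnectException) return true;        // connection reset / refused
        if (t instanceof TimeoutException) return true;        // short-lived overload
        if (t instanceof IllegalArgumentException) return false; // bad request payload
        if (t instanceof IllegalStateException) return false;    // business/contract error
        return false; // default: don't retry what you can't classify
    }
}
```

Defaulting unknown failures to "don't retry" is the conservative choice: a surprise exception is more likely a programming error than a transient.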

Common resilience patterns

Parallel fan-out with differentiated policies

In a BFF graph, one branch may be mandatory while others can degrade:

  • fetchProfile: timeout + no fallback
  • fetchNotifications: timeout + retry + empty-list fallback
  • fetchLoyalty: short timeout + default fallback

BLOGE keeps these policies attached to the nodes that own the risk.
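The three branches above could be declared with differentiated policies in the DSL. This sketch combines only constructs shown earlier; the operator names and fallback payloads are hypothetical:

```bloge
node fetchProfile : FetchProfileOperator {
  timeout = 2s
}

node fetchNotifications : FetchNotificationsOperator {
  timeout = 1s
  retry = { attempts: 2, backoff: 50ms, strategy: exponential }
  fallback = { notifications: [] }
}

node fetchLoyalty : FetchLoyaltyOperator {
  timeout = 500ms
  fallback = { tier: "standard" }
}
```

A reviewer can see at a glance that a profile failure fails the request, while notifications and loyalty degrade to safe defaults.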

Decision flows with explicit degraded paths

A credit-check node may fall back to a rejected result. That makes the business consequence visible instead of forcing downstream code to infer why a value is missing.

Long-running flows with operational time bounds

Wait-like operators can still use timeouts and explicit timeout actions when the business process should move to escalation or compensation instead of waiting forever.

Telemetry and diagnostics

When bloge-metrics-otel is enabled, resilience behavior shows up directly in metrics:

  • bloge.node.retries
  • bloge.node.timeouts
  • bloge.node.fallbacks
  • bloge.node.errors

That makes it possible to distinguish healthy traffic from degraded-but-successful traffic.

Anti-patterns to avoid

  • Over-broad fallback: swallowing every exception and pretending the node succeeded
  • Missing timeout: allowing a slow dependency to hold the graph hostage indefinitely
  • Retrying business rejections: turning domain errors into noisy repeated failures
  • Hidden resilience: putting all retry logic inside operator code so the graph definition lies about runtime behavior

Before enabling a resilience policy, ask:

  1. Is the operator idempotent or otherwise safe to retry?
  2. What is the maximum acceptable latency contribution of this node?
  3. If a fallback is used, does downstream code know it is degraded?
  4. Should failure cancel the subtree, choose another branch, or emit a compensating result?

Next steps