
Resilience Policies

BLOGE makes resilience a first-class graph concern instead of burying it inside operator implementations. Each node can declare retry, timeout, and fallback behavior so runtime failure handling remains visible, reviewable, and measurable.

The resilience envelope

BLOGE applies resilience in this order:

  1. Retry wraps execution from the outside
  2. Timeout applies to each attempt
  3. Fallback handles the final failure if one is configured

That ordering means every retry attempt gets its own timeout budget and the fallback only activates after retries are exhausted or a non-retryable failure escapes.
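The nesting above can be sketched in plain Java. This is an illustrative model of the envelope, not the BLOGE runtime: retry is the outermost loop, each attempt gets its own timeout, and the fallback fires only after the loop is exhausted.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch of the resilience envelope: fallback(retry(timeout(attempt))).
public final class ResilienceEnvelope {
    public static <T> T execute(Supplier<T> attempt,
                                int maxAttempts,
                                Duration perAttemptTimeout,
                                Function<Exception, T> fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int i = 0; i < maxAttempts; i++) {           // retry: outermost loop
                Future<T> f = pool.submit(attempt::get);
                try {
                    // timeout: bounds this single attempt, not the whole loop
                    return f.get(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (Exception e) {
                    f.cancel(true);
                    last = e;
                }
            }
            return fallback.apply(last);                      // fallback: after exhaustion
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the timeout sits inside the loop, three attempts with a five-second timeout can consume up to fifteen seconds before the fallback runs; budget accordingly.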

Retry

Retry is useful for transient infrastructure failures such as flaky network calls, short-lived downstream overload, or temporary locks.

Java API

```java
.node("checkCredit", creditCheckOperator)
    .retry(3, Duration.ofMillis(100), BackoffStrategy.EXPONENTIAL)
```

DSL

```bloge
node checkCredit : CreditCheckOperator {
  retry = { attempts: 3, backoff: 100ms, strategy: exponential }
}
```

BLOGE supports fixed, exponential, and jitter-oriented retry backoff strategies. Use them only on failures that are safe and meaningful to retry.
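The delay math behind these strategies is conventional. A hedged sketch of typical formulas follows; BLOGE's exact backoff arithmetic may differ, and "full jitter" here is one common variant, not necessarily the one BLOGE implements.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Typical backoff delay calculations (attempt is zero-based).
public final class Backoff {
    public static Duration fixed(Duration base, int attempt) {
        return base;                                   // same delay every time
    }
    public static Duration exponential(Duration base, int attempt) {
        // base, 2*base, 4*base, ... capped to avoid overflow
        return base.multipliedBy(1L << Math.min(attempt, 30));
    }
    public static Duration fullJitter(Duration base, int attempt) {
        // random delay in [0, exponential cap] to desynchronize retrying clients
        long capMillis = exponential(base, attempt).toMillis();
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(capMillis + 1));
    }
}
```

Jitter matters when many callers fail at once: without it, they all retry on the same schedule and hammer the recovering dependency in lockstep.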

Timeout

Timeout places an upper bound on one operator attempt. It is the simplest way to stop a slow dependency from stalling the whole graph.

Java API

```java
.node("fetchProducts", fetchProductsOperator)
    .timeout(Duration.ofSeconds(5))
```

DSL

```bloge
node fetchProducts : FetchProductsOperator {
  timeout = 5s
}
```

In virtual-thread execution, a timed-out attempt can be interrupted without burning an operating system thread for the duration of the wait.
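The mechanics can be sketched with standard `java.util.concurrent` primitives (Java 21+ for virtual threads). This is an illustration of the interruption pattern, not BLOGE's scheduler: the timed-out wait blocks a cheap virtual thread, and cancellation interrupts the attempt so it unwinds promptly.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: one attempt on a virtual thread, bounded by a timeout.
public final class VirtualThreadTimeout {
    public static String attemptWithTimeout(Callable<String> attempt,
                                            Duration timeout) throws Exception {
        try (ExecutorService vexec = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> f = vexec.submit(attempt);
            try {
                return f.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);   // interrupts the virtual thread running the attempt
                throw e;
            }
        }
    }
}
```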

Fallback

Fallback lets a graph degrade intentionally when a dependency is unavailable but the workflow can still continue with a substitute value.

Java API

```java
.node("checkCredit", creditCheckOperator)
    .fallback(ex -> new CreditResult(false, "service unavailable"))
```

DSL

```bloge
node checkCredit : CreditCheckOperator {
  fallback = { approved: false, reason: "credit service unavailable" }
}
```

Fallback should produce a value that downstream nodes can understand clearly. It should not hide a broken contract or silently convert every exception into “success.”
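One way to keep the contract honest is to make degradation explicit in the result type. A minimal sketch, assuming a stand-in `CreditResult` with a `degraded` flag (this field is hypothetical, not part of the examples above):

```java
import java.util.function.Supplier;

// Sketch: the fallback value announces that it is degraded instead of
// silently impersonating a normal success.
public final class FallbackExample {
    public record CreditResult(boolean approved, String reason, boolean degraded) {}

    public static CreditResult checkCredit(Supplier<CreditResult> call) {
        try {
            return call.get();
        } catch (RuntimeException ex) {
            // Degrade intentionally: a rejection downstream code can reason about.
            return new CreditResult(false, "credit service unavailable", true);
        }
    }
}
```

Downstream nodes can then branch on `degraded()` rather than guessing why an approval is missing.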

What to retry and what not to retry

| Good retry candidates | Poor retry candidates |
| --- | --- |
| HTTP 503 / connection reset | Validation failures |
| Temporary lock contention | Business rule rejection |
| Short-lived downstream unavailability | Bad request payloads |
| Remote service warm-up failures | Deterministic programmer errors |

If the operator is not idempotent, add the right business safeguards before enabling aggressive retries.
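The table above amounts to a classification function. An illustrative sketch follows; the exception types are examples of how a codebase might map transport failures versus domain failures, not a list BLOGE prescribes.

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

// Sketch: retry transport-level transients, never domain rejections.
public final class RetryPolicy {
    public static boolean isRetryable(Throwable t) {
        if (t instanceof ConnectException) return true;        // connection reset / refused
        if (t instanceof TimeoutException) return true;        // short-lived overload
        if (t instanceof IllegalArgumentException) return false; // bad request payload
        if (t instanceof IllegalStateException) return false;    // business/contract error
        return false; // default: don't retry what you can't classify
    }
}
```

Defaulting unknown failures to "don't retry" is the conservative choice: a surprise exception is more likely a programming error than a transient.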

Common resilience patterns

Parallel fan-out with differentiated policies

In a BFF graph, one branch may be mandatory while others can degrade:

  • fetchProfile: timeout + no fallback
  • fetchNotifications: timeout + retry + empty-list fallback
  • fetchLoyalty: short timeout + default fallback

BLOGE keeps these policies attached to the nodes that own the risk.
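The three branches above could be declared with differentiated policies in the DSL. This sketch combines only constructs shown earlier; the operator names and fallback payloads are hypothetical:

```bloge
node fetchProfile : FetchProfileOperator {
  timeout = 2s
}

node fetchNotifications : FetchNotificationsOperator {
  timeout = 1s
  retry = { attempts: 2, backoff: 50ms, strategy: exponential }
  fallback = { notifications: [] }
}

node fetchLoyalty : FetchLoyaltyOperator {
  timeout = 500ms
  fallback = { tier: "standard" }
}
```

A reviewer can see at a glance that a profile failure fails the request, while notifications and loyalty degrade to safe defaults.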

Decision flows with explicit degraded paths

A credit-check node may fall back to a rejected result. That makes the business consequence visible instead of forcing downstream code to infer why a value is missing.

Long-running flows with operational time bounds

Wait-like operators can still use timeouts and explicit timeout actions when the business process should move to escalation or compensation instead of waiting forever.

Telemetry and diagnostics

When bloge-metrics-otel is enabled, resilience behavior shows up directly in metrics:

  • bloge.node.retries
  • bloge.node.timeouts
  • bloge.node.fallbacks
  • bloge.node.errors

That makes it possible to distinguish healthy traffic from degraded-but-successful traffic.

Anti-patterns to avoid

  • Over-broad fallback: swallowing every exception and pretending the node succeeded
  • Missing timeout: allowing a slow dependency to hold the graph hostage indefinitely
  • Retrying business rejections: turning domain errors into noisy repeated failures
  • Hidden resilience: putting all retry logic inside operator code so the graph definition lies about runtime behavior

Before enabling a resilience policy, ask:

  1. Is the operator idempotent or otherwise safe to retry?
  2. What is the maximum acceptable latency contribution of this node?
  3. If a fallback is used, does downstream code know it is degraded?
  4. Should failure cancel the subtree, choose another branch, or emit a compensating result?

Next steps