Observability (OTel)

bloge-metrics-otel adds production-facing observability integrations to BLOGE. It emits metrics, traces, and structured logs that line up with the graph execution model, including retries, timeouts, and fallback behavior.

Components

| Component | Role |
| --- | --- |
| MetricsExecutionListener | Emits graph, node, retry, timeout, fallback, and stream metrics |
| TracingOperatorInterceptor | Creates graph-level and node-level spans |
| LoggingExecutionListener | Writes structured lifecycle logs with BLOGE-specific MDC keys |
| OtelContextCarrier | Propagates OpenTelemetry context into engine virtual threads |
| MdcContextCarrier | Propagates SLF4J MDC values such as traceId and requestId |
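
The two carriers listed above exist because thread-bound context, which is typically held in a ThreadLocal, does not follow work onto another thread by itself. The sketch below illustrates the general capture-and-restore pattern using only the JDK. It is not BLOGE's implementation: `ContextCarrierSketch`, `CONTEXT`, `wrap`, and `traceIdSeenByWorker` are hypothetical names, and a plain `ThreadLocal<Map<String, String>>` stands in for the SLF4J MDC or the OpenTelemetry context.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: these names are hypothetical and are not part of BLOGE.
final class ContextCarrierSketch {

    // Stand-in for thread-bound context such as the SLF4J MDC map.
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(() -> Map.<String, String>of());

    // Snapshots the caller's context immediately, then returns a task that installs
    // the snapshot on whichever thread runs it and restores the previous value after.
    static <T> Callable<T> wrap(Callable<T> task) {
        Map<String, String> captured = CONTEXT.get();      // read on the submitting thread
        return () -> {
            Map<String, String> previous = CONTEXT.get();  // read on the worker thread
            CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                CONTEXT.set(previous);
            }
        };
    }

    // Runs a task on a separate thread and returns the traceId that thread observed.
    static String traceIdSeenByWorker(boolean propagate) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> CONTEXT.get().getOrDefault("traceId", "missing");
            Callable<String> toRun = propagate ? wrap(task) : task;
            return pool.submit(toRun).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        CONTEXT.set(Map.of("traceId", "abc123", "requestId", "req-7"));
        System.out.println(traceIdSeenByWorker(false)); // prints "missing"
        System.out.println(traceIdSeenByWorker(true));  // prints "abc123"
    }
}
```

Without the wrapper, the worker thread sees an empty context and its log lines and spans lose their correlation to the originating request. That gap is what a context carrier is meant to close.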

Manual wiring

```java
// Assumes tracer, registry, and meterRegistry are already configured by the application.
TracingOperatorInterceptor tracing = new TracingOperatorInterceptor(tracer);

GraphEngine engine = GraphEngine.builder()
    .registry(registry)
    .interceptors(List.of(tracing))
    .listeners(List.of(
        tracing,  // same instance that is registered as an interceptor above
        new MetricsExecutionListener(meterRegistry, "bloge"),
        new LoggingExecutionListener(false, false)
    ))
    .contextCarriers(List.of(
        new OtelContextCarrier(),
        new MdcContextCarrier()
    ))
    .build();
```

The same tracing component is registered as both an interceptor and a listener so node spans are nested under the active graph span.
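
One way to picture why a single instance must fill both roles is the toy model below. It is a sketch under stated assumptions, not BLOGE's interfaces or the OpenTelemetry API: `Span`, `Tracing`, `onGraphStart`, `aroundNode`, and `onGraphEnd` are hypothetical names, and the model tracks only one in-flight graph. The point it shows is that the node hook can only parent its span correctly if it shares state with the graph hook.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; not BLOGE's interfaces or the OpenTelemetry API.
final class SpanNestingSketch {

    // Minimal span: a name plus an optional parent.
    static final class Span {
        final String name;
        final Span parent;

        Span(String name, Span parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    // One object plays both roles, so the node hook can see the span opened by the graph hook.
    // Simplification: tracks a single in-flight graph and is not thread-safe.
    static final class Tracing {
        private Span activeGraphSpan;
        final List<Span> finished = new ArrayList<>();

        // Listener role: a graph execution started.
        void onGraphStart(String graphName) {
            activeGraphSpan = new Span("graph:" + graphName, null);
        }

        // Interceptor role: wraps one node invocation in a span parented to the active graph span.
        void aroundNode(String nodeName, Runnable node) {
            Span nodeSpan = new Span("node:" + nodeName, activeGraphSpan);
            try {
                node.run();
            } finally {
                finished.add(nodeSpan);
            }
        }

        // Listener role: the graph execution finished.
        void onGraphEnd() {
            finished.add(activeGraphSpan);
            activeGraphSpan = null;
        }
    }

    // Returns the parent name of the node span, or "none" when the node span is a root.
    static String parentOfNodeSpan(boolean alsoRegisteredAsListener) {
        Tracing tracing = new Tracing();
        if (alsoRegisteredAsListener) {
            tracing.onGraphStart("checkout");
        }
        tracing.aroundNode("price", () -> {});
        Span nodeSpan = tracing.finished.get(0);
        return nodeSpan.parent == null ? "none" : nodeSpan.parent.name;
    }

    public static void main(String[] args) {
        System.out.println(parentOfNodeSpan(true));   // prints "graph:checkout"
        System.out.println(parentOfNodeSpan(false));  // prints "none"
    }
}
```

In this model, skipping the listener registration leaves every node span without a parent, so a trace viewer would show disconnected node spans instead of one tree per graph execution.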

Metrics emitted

MetricsExecutionListener can emit:

  • bloge.graph.duration
  • bloge.node.duration
  • bloge.node.errors
  • bloge.node.retries
  • bloge.node.timeouts
  • bloge.node.fallbacks
  • bloge.node.skipped
  • bloge.stream.chunk.count
  • bloge.stream.duration
  • bloge.stream.errors

Durable integrations can add checkpoint, work-item, and lease metrics on top of these signals.

Spring Boot properties

When used with bloge-spring, observability can be configured through Spring Boot application properties:

```yaml
spring:
  bloge:
    observability:
      metrics:
        enabled: true
        prefix: bloge
      tracing:
        enabled: true
      logging:
        enabled: true
      context-propagation: true
      mdc-propagation: true
```

Production dashboards

The example project ships a Grafana dashboard with queries such as:

  • graph duration p95
  • retry count by graph
  • fallback count by graph

These align well with the questions operators ask in production:

  • is this graph slower than usual?
  • are we succeeding because of fallback instead of primary service health?
  • which node is accumulating retries?

Logging and data sensitivity

LoggingExecutionListener can optionally include node input and output payloads. Leave both disabled unless you have reviewed the payloads for sensitive business or personal data.

Why telemetry matters in BLOGE

BLOGE's graph model makes it possible to instrument execution at the orchestration boundary instead of at arbitrary call sites. That means telemetry answers graph-level questions directly:

  • which branch was taken?
  • which node timed out?
  • how often are we degrading through fallback?
  • how much latency is added by a specific fan-out or retrying dependency?

A reasonable order for adopting these integrations:

  1. enable metrics first
  2. add tracing where graph executions must appear in distributed traces
  3. enable structured logs for lifecycle visibility
  4. propagate MDC and OTel context when your platform already uses them elsewhere

Next steps