Observability Capability

Purpose: Define one end-to-end contract for metrics, traces, and logs across platform and applications.

Why this capability exists

Observability currently spans multiple repositories and document sets. This page is the single agreed reference for what platform teams provide and what application teams must implement.

Architecture scope

At a high level:

  1. App emits telemetry
  2. OTEL Collector receives traces and metrics
  3. Logs are emitted as structured JSON to stdout and shipped by the GKE logging pipeline
  4. Telemetry lands in GCP backends and is visualized in Grafana
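
Step 3 could look like the following in a Python service. This is a minimal sketch, not the platform's prescribed schema. The `severity`, `message`, `logging.googleapis.com/trace`, and `logging.googleapis.com/spanId` keys are fields that Cloud Logging recognizes in structured payloads; `JsonFormatter`, `build_logger`, the `checkout` logger name, the `trace_id` and `span_id` record attributes, and the project ID are hypothetical names chosen for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, project_id: str) -> None:
        super().__init__()
        self.project_id = project_id  # assumption: the GCP project that owns the traces

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Correlation fields are optional: they are added only when the caller
        # attached trace_id / span_id to the record via `extra=`.
        trace_id = getattr(record, "trace_id", None)
        span_id = getattr(record, "span_id", None)
        if trace_id:
            entry["logging.googleapis.com/trace"] = (
                f"projects/{self.project_id}/traces/{trace_id}"
            )
        if span_id:
            entry["logging.googleapis.com/spanId"] = span_id
        return json.dumps(entry)


def build_logger(project_id: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes structured JSON lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(project_id))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("demo-project")
    log.info(
        "order placed",
        extra={
            "trace_id": "0af7651916cd43dd8448eb211c80319c",
            "span_id": "b7ad6b7169203331",
        },
    )
```

In a real service the trace and span IDs would normally be read from the active span context of the tracing SDK rather than passed by hand.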

Platform contract (infrastructure and manifests)

Platform provides:

  • Collector deployment, routing, and exporter configuration
  • Required IAM roles for collector write paths
  • Baseline dashboards and query patterns

Source starting points:

  • iot-manifests/docs/concepts/opentelemetry.md
  • iot-manifests/docs/features/observability.md
  • iot-manifests/docs/applications/vendor/otel-collector.md
  • iot-infrastructure/docs/concepts/observability.md

Application contract

Application teams are responsible for:

  • Instrumentation strategy for their runtime
  • Structured logging schema and severity usage
  • Required environment variables for telemetry export
  • Verifying trace-log correlation for at least one request path
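
A service can fail fast at startup when its telemetry export configuration is incomplete. The sketch below is illustrative only: this page does not list the required variables, so `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` (standard names from the OpenTelemetry SDK environment variable specification) are an assumption, and `REQUIRED_VARS` and `missing_telemetry_env` are hypothetical names.

```python
import os
from typing import List, Mapping

# Assumption: the platform's actual required set may differ. These two names
# are standard OpenTelemetry SDK environment variables.
REQUIRED_VARS = ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT")


def missing_telemetry_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank, in declared order."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_telemetry_env()
    if missing:
        raise SystemExit("missing telemetry configuration: " + ", ".join(missing))
    print("telemetry environment looks complete")
```

Passing the environment as a mapping keeps the check testable without mutating the process environment.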

Definition of done checklist

  • [ ] Service emits traces with a stable service name
  • [ ] Service emits structured logs with correlation fields
  • [ ] A golden-path request can be traced end to end
  • [ ] Dashboard and alert queries are documented
  • [ ] Failure modes and runbook links are documented
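
The correlation items in the checklist can be verified mechanically for one request. The following is a rough sketch under stated assumptions: the request carries a W3C `traceparent` header, and logs store the trace under the `logging.googleapis.com/trace` key (an assumption about the logging schema, which this page does not specify). `trace_id_from_traceparent` and `logs_correlate` are hypothetical helper names.

```python
import json
import re
from typing import Iterable, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}")

# Assumption: logs carry the trace reference under this key.
TRACE_FIELD = "logging.googleapis.com/trace"


def trace_id_from_traceparent(header: str) -> Optional[str]:
    """Extract the 32-character trace ID, or None if the header is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    return match.group(1) if match else None


def logs_correlate(traceparent: str, log_lines: Iterable[str]) -> bool:
    """Return True if at least one JSON log line references the request's trace ID."""
    trace_id = trace_id_from_traceparent(traceparent)
    if trace_id is None:
        return False
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output such as startup banners
        if not isinstance(entry, dict):
            continue
        value = str(entry.get(TRACE_FIELD, ""))
        # Accept either a bare trace ID or a ".../traces/<id>" resource path.
        if value == trace_id or value.endswith("/traces/" + trace_id):
            return True
    return False
```

A check like this fits naturally into an integration test that sends one golden-path request and then inspects the captured stdout of the service.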

Open questions

  • Should we standardize on one OTEL starter kit per language runtime?
  • Should logs ever flow through OTLP for selected services?