IoT Documentation

Observability Capability

Purpose: Define one end-to-end contract for metrics, traces, and logs across platform and applications.

Why this capability exists

Observability currently spans multiple repositories and document sets. This page is the canonical reference for what platform teams provide and what application teams are responsible for.

Architecture scope

At a high level:

App emits telemetry
OTEL Collector receives traces and metrics
Logs are emitted as structured JSON to stdout and shipped by the GKE logging pipeline
Telemetry lands in GCP backends and is visualized in Grafana
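
The log path above can be sketched with the Python standard library alone. This is a minimal sketch, not the platform's logging library: `JsonFormatter` and `build_logger` are hypothetical helpers, and the sketch assumes the GKE logging pipeline follows Cloud Logging's structured-logging convention of reading the `severity` and `message` keys from each JSON line.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Assumed field names, following Cloud Logging's structured-logging keys.
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(payload)


def build_logger(name: str = "demo", stream=None) -> logging.Logger:
    """Return a logger that writes one JSON line per record, to stdout by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.propagate = False  # keep the root logger from re-emitting as plain text
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Calling `build_logger().info("device registered")` would write a single JSON line to stdout. Traces and metrics are not covered by this sketch, since they travel through the collector rather than stdout.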

Platform contract (infrastructure and manifests)

Platform provides:

Collector deployment, routing, and exporter configuration
Required IAM roles for collector write paths
Baseline dashboards and query patterns

Source starting points:

iot-manifests/docs/concepts/opentelemetry.md
iot-manifests/docs/features/observability.md
iot-manifests/docs/applications/vendor/otel-collector.md
iot-infrastructure/docs/concepts/observability.md

Application contract

Application teams are responsible for:

Instrumentation strategy for their runtime
Structured logging schema and severity usage
Required environment variables for telemetry export
Verifying trace-log correlation for at least one request path
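
As an illustration of the environment-variable item, a service could fail fast at startup when export settings are missing. The sketch below is an assumption, not the platform's required list: `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` are standard OpenTelemetry SDK variables, but whether this platform requires exactly these is not stated here, and `missing_telemetry_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Sequence

# Assumed set: standard OpenTelemetry SDK variables; the platform's real list may differ.
REQUIRED_ENV = ("OTEL_SERVICE_NAME", "OTEL_EXPORTER_OTLP_ENDPOINT")


def missing_telemetry_env(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_ENV,
) -> list[str]:
    """Return the required variables that are unset or blank, in declaration order."""
    return [name for name in required if not env.get(name, "").strip()]
```

A service entry point might call `missing_telemetry_env()` before initializing its SDK and exit with a clear message if the returned list is non-empty, so that a misconfigured deployment is caught at startup rather than discovered later as missing traces.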

Definition of done checklist

[ ] Service emits traces with a stable service name
[ ] Service emits structured logs with correlation fields
[ ] Golden path request can be traced end to end
[ ] Dashboard and alert queries are documented
[ ] Failure modes and runbook links are documented
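
The correlation item in the checklist could be spot-checked by parsing a captured log line. This sketch assumes logs carry Cloud Logging's `logging.googleapis.com/trace` and `logging.googleapis.com/spanId` keys and W3C-format IDs (32 and 16 lowercase hex characters); a service's actual schema may use different field names, and `has_correlation` is a hypothetical helper.

```python
import json
import re

# Assumed field names (Cloud Logging special keys); adjust to the service's schema.
TRACE_KEY = "logging.googleapis.com/trace"
SPAN_KEY = "logging.googleapis.com/spanId"

# W3C Trace Context: trace IDs are 32 hex chars, span IDs are 16 hex chars.
# The trace value may be bare or prefixed with "projects/<project>/traces/".
TRACE_RE = re.compile(r"^(?:projects/[^/]+/traces/)?([0-9a-f]{32})$")
SPAN_RE = re.compile(r"^[0-9a-f]{16}$")


def has_correlation(log_line: str) -> bool:
    """Return True if a JSON log line carries well-formed, non-zero trace and span IDs."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    trace = TRACE_RE.match(str(record.get(TRACE_KEY, "")))
    span = SPAN_RE.match(str(record.get(SPAN_KEY, "")))
    if not trace or not span:
        return False
    # All-zero IDs are invalid under W3C Trace Context.
    return set(trace.group(1)) != {"0"} and set(span.group(0)) != {"0"}
```

Running this over the log lines produced by one golden path request gives a quick signal that the IDs are present and well formed; it does not confirm that they match the trace actually recorded in the backend.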

Open questions

Should we standardize on one OTEL starter kit per runtime?
Should logs ever flow through OTLP for selected services?