Choosing Your Observability Stack Starts With Scope
Observability tooling often gets discussed as a shopping list: Prometheus vs OpenTelemetry, Grafana vs Elastic, metrics vs traces.
In practice, the right choice depends far less on tools and far more on scope.
This post walks through a practical way to think about observability as systems grow, and why starting simple is often the correct engineering decision.
Observability is about questions, not signals
Before choosing tools, it helps to ask:
What questions do I need to answer right now?
Different scopes produce different questions:
- Is the service healthy?
- Is something slower than usual?
- Why did this request fail?
- Where is time actually being spent across services?
Metrics, logs, and traces answer different classes of questions. You don’t need all of them at once.
Scope 1: Metrics only — start here
For many systems, metrics are enough.
A Prometheus + Grafana setup gives you:
- latency
- throughput
- error rates
- saturation
With:
- low overhead
- simple mental model
- battle-tested tooling
- powerful querying via PromQL
Service -> Prometheus -> Grafana
This combination is hard to beat for:
- quick visibility
- alerting
- operational confidence
If you can answer “Is the system healthy?” you’re already ahead.
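To make this concrete, here is a minimal sketch of what that looks like in a Python service, using the official prometheus_client library (the metric names, labels, and port are illustrative, not prescriptive):

```python
# Minimal sketch: instrumenting a service with prometheus_client.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                       # records duration into the histogram
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```

From there, rate(http_requests_total[5m]) in Grafana gives you throughput, and histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) gives you p95 latency, all in plain PromQL.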
Scope 2: Metrics + logs — adding context
Eventually, metrics tell you that something is wrong — but not why.
That’s where logs come in.
Pairing Prometheus with a log system such as:
- OpenSearch
- Loki
- Elastic
- …
lets you:
- correlate spikes with concrete events
- inspect error paths
- understand failures in detail
Service -> Prometheus -> Grafana
        -> Logs       -> Search UI
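The correlation works best when logs are structured. Here is a minimal sketch using Python's stdlib logging, assuming a JSON-per-line format that backends like Loki or OpenSearch can index (the field names and values are hypothetical):

```python
# Minimal sketch: JSON-per-line structured logging with the stdlib.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
            **getattr(record, "fields", {}),  # extra context attached per event
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# When a metric spike fires, these fields are what you search for.
log.error("payment failed", extra={"fields": {"order_id": "A-123", "status": 502}})
```

The payoff is that fields in your logs line up with labels on your metrics, so a spike on a dashboard turns into a concrete search query.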
At this stage:
- metrics give you the signal
- logs give you the explanation
This is often enough for:
- monoliths
- small service meshes
- teams with clear ownership boundaries
Scope 3: Metrics + logs + traces — when systems grow
As systems become more distributed, new questions appear:
- Which service caused the slowdown?
- Where did latency accumulate?
- How did a single request flow across boundaries?
At this point, tracing becomes essential.
This is where OpenTelemetry (OTel) fits naturally.
OpenTelemetry as an evolution, not a replacement
OpenTelemetry isn’t “better Prometheus”.
It’s a unifying instrumentation layer for:
- metrics
- logs
- traces
Using a single SDK, you can instrument once and export to multiple backends.
A common, practical setup looks like this:
Service
↓
OpenTelemetry SDK
├─ metrics -> Prometheus
├─ traces -> Tempo / Jaeger / OTLP backend
└─ logs -> OpenSearch / Loki
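For illustration, here is a minimal tracing sketch with the OpenTelemetry Python SDK, assuming an OTLP-capable backend (Tempo, Jaeger, or similar) listening on the default gRPC endpoint; the span and service names are made up:

```python
# Minimal sketch: tracing with the OpenTelemetry Python SDK over OTLP/gRPC
# (default endpoint localhost:4317).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_order") as span:
    span.set_attribute("order.id", "A-123")
    with tracer.start_as_current_span("charge_card"):
        pass  # a downstream call; nested spans show where latency accumulates

provider.shutdown()  # flush pending spans before exit
```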
Important detail:
You can keep Prometheus.
OTel metrics can be exported to Prometheus, which means:
- existing dashboards continue to work
- PromQL remains your alerting language
- migration is incremental, not disruptive
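As a sketch of what "keeping Prometheus" can look like in practice, the opentelemetry-exporter-prometheus package serves OTel metrics from the same /metrics endpoint Prometheus already scrapes (the metric and meter names here are illustrative):

```python
# Minimal sketch: exposing OTel metrics in Prometheus format, so existing
# scrape configs, dashboards, and PromQL alerts keep working.
from prometheus_client import start_http_server

from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

start_http_server(8000)  # same /metrics endpoint Prometheus already scrapes
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

meter = metrics.get_meter("checkout-service")
orders = meter.create_counter("orders_total", description="Orders processed")
orders.add(1, {"status": "ok"})  # shows up as orders_total{status="ok"}
```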
Why scope matters more than tooling
Jumping straight to “full observability” too early has costs:
- higher cognitive load
- more moving parts
- more to maintain
- more places to look during incidents
If your system doesn’t need cross-service tracing yet, traces add noise, not clarity.
Observability should grow with the questions your system asks.
A simple decision guide
- Single service, clear ownership -> Prometheus + Grafana
- Need root-cause analysis -> Add logs
- Distributed workflows, async flows, service meshes -> Introduce tracing via OpenTelemetry
Each step builds on the previous one.
None of them invalidates earlier choices.
The takeaway
Choosing an observability stack isn’t about picking the “best” tool.
It’s about matching tooling to scope:
- start with metrics
- add logs for context
- grow into unified telemetry when system boundaries matter
Start simple. Evolve deliberately. Let the system’s questions guide the tooling — not the other way around.
Related reading
If you’re interested in building observable and reliable systems, you may also like:
- Async Doesn’t Make Your System Fast — It Makes It Honest
  Why async systems expose real behavior — and why observability matters more because of it.
- Evolving a FastAPI Backend: From REST to Messaging, WebSockets, and Event-Driven UX
  How architecture changes introduce new observability requirements.
- Docker Build Speed Isn’t Magic — It’s Cache Discipline
  Understanding fundamentals to avoid chasing the wrong optimizations.