Skip to main content
Ultralytics YOLO in Production·Scale and Observe·Lesson 8/9
Lessonintermediate

Observability and Drift

What to log, what to dashboard, and how to spot accuracy regression before users do.

A computer vision system in production has three failure modes that traditional monitoring misses: silent accuracy drops, distribution drift, and per-class regressions. The fix is to instrument the model the same way you instrument an HTTP service — with explicit, alertable signals — and to put model monitoring on the same footing as the rest of your MLOps discipline.

Outcome

Define and emit the metrics that catch accuracy and drift regressions, and set thresholds that page on real problems.

Fast Track
If you already know your way around, here's the short version.
  1. Per-frame: latency, queue depth, decode failures.

  2. Per-stream: detection volume, mean confidence, mean object size.

  3. Per-deployment: holdout set mAP, weekly.

  4. Distribution drift: KS-test on confidence histograms, mean object area.

Hands-on

The three layers of observability

Ultralytics Platform deployment logs with severity filter

LayerQuestionExamples
Infrastructure"Is the service up?"CPU, GPU, memory, throughput, dropped frames
Model behavior"Is the model doing what it did yesterday?"Detection volume, mean confidence, class distribution
Model accuracy"Is the model still right?"Holdout mAP (performance metrics), sampled human labels, alarm rates

Most teams cover the first layer with off-the-shelf tools, then forget the other two. The other two are where production CV breaks silently; model monitoring is the discipline that keeps them visible.

Metrics worth emitting per frame

import time
import logging

log = logging.getLogger("yolo.inference")

for r in model.track("rtsp://camera-1.local/stream", stream=True, persist=True):
    inf_ms = r.speed["inference"]
    n_boxes = len(r.boxes)
    mean_conf = float(r.boxes.conf.mean()) if n_boxes else None

    log.info(
        "inference",
        extra={
            "camera": "camera-1",
            "inference_ms": inf_ms,
            "detections": n_boxes,
            "mean_confidence": mean_conf,
        },
    )

Per camera, send to your metric system:

  • inference_msinference latency. Alert on p95 above your budget.
  • detections — count of returned boxes. Alert on a sudden drop or surge — a cheap anomaly detection signal.
  • mean_confidence — distributional signal. A drift down often means domain change before mAP does.
  • dropped_frames — capacity signal. > 0 means the system is overloaded.

The weekly holdout check

Define a small holdout set — 200–500 production-realistic labeled images, sampled before launch — and validate against it weekly:

yolo val model=production.engine data=holdout.yaml > weekly_metrics.txt

Pipe the mAP into your metrics system. Alert when it drops by more than 1 mAP point from the previous week. The most common cause is a model swap that was supposed to be a no-op; the second most common is data drift (next section). A confusion matrix on the same holdout makes per-class regressions obvious.

Holdout sampling matters

The holdout has to look like current production. A holdout sampled from January's data won't catch September's drift. Refresh the holdout every quarter with newly labeled production frames.

Distribution drift

The model's behavior changes when its inputs change — the textbook definition of data drift. Two cheap distributional signals:

  1. Confidence histogram. Plot a histogram of detection confidence weekly. If the distribution shifts left (fewer high-confidence detections, more uncertain ones), the input domain has drifted.
  2. Object size distribution. Plot box width × height. New camera mounts, new optics, weather conditions all show up here.

A Kolmogorov-Smirnov test compares two distributions and gives a p-value. A p-value < 0.01 means "these distributions are reliably different" — a useful trigger for investigation, and often the input to an active learning loop or a continual learning refresh.

from scipy.stats import ks_2samp
import numpy as np

baseline = np.load("baseline_confidences.npy")     # collected at deploy
recent = np.load("recent_confidences.npy")         # last 24 hours

statistic, p_value = ks_2samp(baseline, recent)
if p_value < 0.01:
    print(f"DRIFT: KS={statistic:.3f}, p={p_value:.4f}")

What to alert on

Be selective. Alerts that fire weekly without action become noise.

SignalThresholdRationale
p95 inference_ms > budget3 consecutive minutesReal latency regression, not blip
Detection rate drops > 50%5 minutesModel returning empty / something broke
Holdout mAP drops > 1.0WeeklyReal accuracy regression
Confidence KS p < 0.01DailyDistribution drift
Dropped frames > 5%10 minutesCapacity bottleneck

The discipline: every alert should map to a concrete next action. If you can't say what an on-call would do when the alert fires, the alert is noise.

Try It

Add per-frame logging (inference_ms, detection count, mean confidence) to your tracking loop. Stand up a small Grafana / Datadog / whatever you use — and watch the graphs for an hour. The shape of the lines tells you more than any individual number.

Done When
You've finished the lesson when all of these are true.
  • You log inference latency, detection volume, and mean confidence per frame.

  • You have a weekly holdout validation job and an alert if mAP drops > 1.

  • You have a distribution-drift signal (confidence histogram or KS test).

What's next

Last lesson: cost and latency tuning — the levers you'll actually pull when the dashboards say something's off.