
Reading Detection Metrics Honestly

Precision, recall, mAP — what they hide and how to look at them together.

A single number is comforting. mAP@0.5 = 0.83. Done. Except the same mAP can hide wildly different model behavior, and the wrong choice between two models can be invisible to mAP. The fix is to look at precision, recall, and the confusion matrix together — not the headline number.

Outcome

Read precision, recall, and mAP together to decide whether a model is ready, and identify what to fix when it isn't.

Fast Track
If you already know your way around, here's the short version.
  1. Precision — of the boxes the model drew, how many were correct?

  2. Recall — of the boxes that should exist, how many did the model draw?

  3. mAP — average precision (the area under the precision-recall curve), averaged across classes.

  4. mAP@0.5 is forgiving. mAP@0.5:0.95 is honest.

Hands-on

Precision and recall


For a detector, every detection is either a true positive (TP) or a false positive (FP). Every ground-truth object is either detected (TP) or missed (FN). The full YOLO performance metrics guide goes deeper once these definitions feel natural.

  • Precision = TP / (TP + FP) — when the model draws a box, how often is it right?
  • Recall = TP / (TP + FN) — when there's a real object, how often does the model find it?

Both range from 0 to 1. Both matter. Optimizing only one is almost always a mistake — and the F1-score is the standard way to summarize the two when you need one number.
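To make the definitions concrete, here is a minimal sketch in plain Python. The counts are hypothetical; a real evaluator produces TP/FP/FN per class after IoU matching.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Turn raw detection tallies into precision, recall, and F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical tallies from one validation run:
# 90 correct boxes, 10 spurious boxes, 30 missed objects.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
# precision=0.90  recall=0.75  f1=0.82
```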

                          PR space
       ▲
   1.0 │   ◀── ideal corner
       │  ╲
       │   ╲     ◇ a tight model (high precision, lower recall)
   pre │    ╲╲
   ci  │   ╳   ◇  ◇ a recall-y model (lots of detections, some wrong)
   sion│ ╳ ╳ ╳
       │ ╳     ╳
       │ ╳        ╳
   0.0 └──────────────▶
       0          recall          1.0

What thresholds do to the curve

Lower the confidence threshold → more detections → recall up, precision down. Raise the confidence threshold → fewer detections → precision up, recall down.

A useful detector has a wide stretch of high precision and high recall. The tradeoff at any single threshold is what the PR curve shows — and the area under that curve is average precision (AP).
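Here is a toy version of that computation, assuming you already have per-detection confidences and TP/FP flags from IoU matching. It sweeps the threshold implicitly by sorting, then integrates precision over recall. Real evaluators add per-class matching and interpolation (COCO uses a 101-point scheme), so treat this as a sketch of the idea, not a reference implementation.

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """Toy AP: area under the precision-recall curve for one class."""
    order = np.argsort(confidences)[::-1]    # sweep threshold from high to low
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Anchor the curve at recall 0 so the first segment is counted.
    precision = np.concatenate([[precision[0]], precision])
    recall = np.concatenate([[0.0], recall])
    # Trapezoid rule: integrate precision over recall.
    return float(np.sum((recall[1:] - recall[:-1]) * (precision[1:] + precision[:-1]) / 2))

# Five hypothetical detections of one class, four ground-truth objects:
ap = average_precision(
    confidences=[0.95, 0.90, 0.80, 0.60, 0.40],
    is_tp=[1, 1, 0, 1, 0],
    num_gt=4,
)
print(f"AP = {ap:.2f}")   # 0.68 for these toy numbers
```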

mAP, the way most people quote it

mAP@0.5 is the mean of average precisions across all classes, computed at an IoU threshold of 0.5 — meaning a detection counts as correct if its IoU with a ground-truth box is at least 0.5.

That threshold is forgiving: a box can be substantially oversized and still count. And just as a ROC curve tells you more about a classifier than a single accuracy number, the PR curve tells you more about a detector than a single mAP.

mAP@0.5:0.95 averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Tighter overlap requirements at the high end mean this metric is much harder to game by drawing big sloppy boxes.
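To see how forgiving 0.5 is, here's a small IoU function with hypothetical boxes. The sloppy prediction is 60% wider than the ground truth, yet it still clears the 0.5 bar; it would fail every threshold from 0.65 up, which is exactly what mAP@0.5:0.95 punishes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (100, 100, 200, 200)       # 100x100 ground-truth box
sloppy = (100, 100, 260, 200)   # same box, stretched 60 px to the right
print(iou(gt, sloppy))          # 0.625: a TP at IoU 0.5, an FP at 0.65+
```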

  Metric            Use when
  mAP@0.5           Quick sanity check, leaderboard comparison
  mAP@0.5:0.95      What you actually report when shipping
  AP per class      Diagnosis — which classes are dragging down the average

Look at per-class AP, not just mAP

A 0.83 mAP can hide a class with AP 0.40. If that class is the one that drives your business value, you don't have a 0.83 model — you have a broken model with a flattering average. Always look at per-class AP.

Confusion matrix for classes

Per-class AP tells you which class is bad. The confusion matrix tells you what it's getting confused with — a workflow detailed in the model evaluation insights guide.

If forklift AP is low and the confusion matrix shows forklift rows leaking into pallet_jack, you have a class-confusion problem — not a localization problem. Two fixes:

  • Add training examples that disambiguate the two classes (especially the visual cues that distinguish them).
  • Merge the classes if downstream code doesn't actually need to distinguish them.
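To make that reading concrete, here is a sketch with a hypothetical confusion matrix (rows are true classes, columns are predictions; check your tool's convention, since some plot it transposed). It finds the largest off-diagonal entry in each row, which is the class each true class leaks into most:

```python
import numpy as np

# Hypothetical counts: rows = true class, columns = predicted class.
names = ["forklift", "pallet_jack", "person"]
cm = np.array([
    [55, 30,  2],   # a third of true forklifts predicted as pallet_jack
    [ 4, 88,  1],
    [ 0,  2, 97],
])

for i, name in enumerate(names):
    row = cm[i].copy()
    row[i] = 0                         # mask the correct predictions
    j = int(np.argmax(row))
    if row[j]:
        print(f"{name:>12s} leaks most into {names[j]} ({row[j]} instances)")
```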

The threshold sweep

When you ship, you ship with a single threshold (or a per-class set of thresholds). Run Validation mode to compute the curves and pick deliberately (a minimal sweep sketch follows the list):

  1. Compute precision and recall at every threshold from 0 to 1 in steps of 0.01.
  2. Plot precision and recall against threshold.
  3. Pick the threshold that satisfies your business constraint:
    • "I can tolerate 5% false positives" → highest threshold where precision ≥ 0.95.
    • "I cannot miss more than 1% of events" → lowest threshold where recall ≥ 0.99.

Documenting the threshold and the constraint that produced it is more important than the mAP number.

Try It

Run validation on an Ultralytics YOLO model and look at three things: per-class AP, the PR curve for your hardest class, and the confusion matrix. Find one class whose AP is misleadingly close to the mAP, and one class that's the real bottleneck.
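A starting point, assuming an Ultralytics install; the weights path and dataset YAML are placeholders, and the metrics attributes shown here (`box.map50`, `box.map`, `box.maps`) match recent Ultralytics releases, so verify against your installed version:

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # placeholder weights path
metrics = model.val(data="your_dataset.yaml")       # placeholder dataset YAML

print(f"mAP@0.5      = {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95 = {metrics.box.map:.3f}")

# Per-class AP (mAP@0.5:0.95), indexed by class id: find the bottleneck.
for class_id, ap in enumerate(metrics.box.maps):
    print(f"{model.names[class_id]:>15s}: {ap:.3f}")

# Val mode also saves PR curves and a confusion matrix plot
# (e.g. confusion_matrix.png) into the run directory.
```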

Done When
You've finished the lesson when all of these are true.
  • You can describe what precision and recall measure without looking it up.

  • You report mAP@0.5:0.95 (or your project's tighter equivalent), not just mAP@0.5.

  • You can name your project's bottleneck class from a confusion matrix.

What's next

We've gone from problem to task to data to metrics. Last lesson: putting it all together with a real model.