Skip to main content
Build with Ultralytics Platform·Build the Dataset·Lesson 3/10
Lessonintermediate

Auto-Annotation as an Accelerator

Let Ultralytics YOLO label first, let humans review — and the rules that keep this from poisoning your dataset.

Hand-labeling is the slowest step in computer vision. Auto-annotation runs a model first and asks humans to review instead of draw — typically a YOLO detector for bounding boxes plus SAM 2 for masks. Done well, it 5–10× annotator throughput. Done badly, it teaches your model exactly what your auto-annotator already knew — and nothing more.

Outcome

Run auto-annotation on a candidate set, set up the human review loop, and avoid the most common failure mode.

Fast Track
If you already know your way around, here's the short version.
  1. Run a pretrained Ultralytics YOLO model (or your latest fine-tune) on candidate frames.

  2. Confidence threshold ≈ 0.5 — high enough to be mostly right, low enough to surface borderline cases.

  3. Humans review and accept/edit/reject. Don't auto-accept silently.

  4. The frames the model is least sure about are the most valuable to label.

Hands-on

How auto-annotation works

Ultralytics Platform YOLO smart annotation

Auto-annotation is two passes:

  1. Inference pass — run a model over every candidate frame; produce candidate boxes/masks/keypoints.
  2. Human review pass — annotators see the predictions overlaid on each image and accept, edit, or reject them.
graph LR
    A[candidate frames] --> B[YOLO predict]
    B --> C[boxes overlay]
    C --> D{human review}
    D -- accept --> E[labeled dataset]
    D -- edit --> E
    D -- reject --> F[hand-label]
    F --> E

If the inference model is decent, the human is mostly clicking "accept" — which is 5–10× faster than drawing.

Pick the right inference model

Use the best model you have access to for your task. In order of preference:

  1. Your latest fine-tune. If you already have a v1 model on similar data, this is the best auto-annotator for v2.
  2. A YOLO trained on a similar dataset. Pose datasets, agricultural datasets, security datasets — pick one with overlapping classes. Open-vocabulary detectors like YOLO World can also bootstrap brand-new classes from text prompts.
  3. A pretrained Ultralytics YOLO model on COCO/etc. Fine if your classes overlap with the 80 COCO classes. Useless otherwise.

The confidence threshold

Auto-annotation's confidence threshold is a tradeoff between overdraw (boxes the model wasn't sure about, that humans have to delete) and misses (real objects without a box, that humans have to draw).

ThresholdWhat humans do most
0.7+Draw missing boxes
0.5Mostly accept; some edits
0.3Delete extra boxes; accept some
0.1Spend most time deleting

0.5 is usually the sweet spot. You can tune it for your model and dataset by sampling 20 frames and counting average click cost per frame.

Don't auto-accept silently

If you skip human review on high-confidence detections, you'll bake in every systematic error your auto-annotator has — biased classes, wrong rules at edges, spurious detections of look-alikes. The dataset becomes a fixed point of the auto-annotator's mistakes. Always review.

Active learning: route the right frames to humans

Once you have an auto-annotator and reviewers, the next optimization: route the uncertain frames to humans first. The model is most likely to be wrong where it's least confident.

A useful sampling order for review:

  1. Frames where the model has detections at confidence 0.4–0.6 (borderline).
  2. Frames with very high count of detections (likely to have NMS edge cases).
  3. Frames where two classes had similar confidence on the same object.
  4. Random sample of the rest, for coverage.

Platform's review queue can be sorted by these criteria.

Class-by-class fallback

Auto-annotation is great for classes the model knows. For new classes you're introducing, the model has nothing to offer:

  • Bootstrap by labeling 100–200 examples by hand.
  • Train a v0 model on those (it'll be bad — that's fine).
  • Use v0 to auto-annotate the next batch; the bad-but-existing predictions still beat starting from scratch.
  • Iterate. By v1 the auto-annotator is useful.

Reviewer metrics

Beyond throughput:

MetricWhat it tells you
Edit rate% of auto-annotations the reviewer changed. Higher = bad auto-annotator
Reject rate% of frames sent back to hand-label entirely
Time per frameThroughput
Inter-reviewer agreementAre reviewers converging on the same calls?

Edit rate trending up over time is the early warning that your auto-annotator's domain has drifted. If you'd rather offload the labeling UI entirely, the Roboflow integration and other tools listed in the integrations index plug straight back into Platform datasets, while the broader data collection and annotation guide covers the underlying class-imbalance tradeoffs.

Try It

Run auto-annotation on 100 frames with a pretrained Ultralytics YOLO model. Review them. Time how long the review takes. Compare to your estimate for hand-labeling those same frames. The ratio is your auto-annotation ROI.

Done When
You've finished the lesson when all of these are true.
  • You've run auto-annotation on a candidate set.

  • You've reviewed every frame — accepting, editing, or rejecting.

  • You can name your edit rate, and have a hypothesis for why it's high or low.

What's next

Labels in hand. Next: curating, deduplicating, and splitting before training.