Lesson · Beginner

Dataset Quality Control

Find missing labels, wrong classes, sloppy boxes, duplicates, and class imbalance — before you train.

Fixing labels usually improves performance more than tuning hyperparameters. The QC pass is where you find what the labeling guide missed and where you confirm the dataset's distribution matches your spec. Skipping QC is how teams burn a week of GPU time on a fixable label bug.

Outcome

Run a structured QC pass on the labeled dataset, fix what's broken, and document the result.

Fast Track
If you already know your way around, here's the short version.
  1. Visual review — open the dataset gallery and skim 100+ random samples.

  2. Coverage check — every relevant object labeled in every image.

  3. Class histogram — flag imbalanced classes for collection or oversampling.

  4. Leakage check — same scene / camera / day shouldn't span multiple splits.

Hands-on

The seven-check QC pass

[Image: Ultralytics Platform dataset charts and statistics]

Run all seven before any training run; none of them takes more than 10 minutes. The preprocessing annotated data guide covers many of these in detail:

  1. Missing labels: images where annotators forgot some objects. Tools: visual review; low-detection-count outliers.

  2. Wrong classes: class indices swapped, mislabeled instances. Tools: confusions in the validation confusion matrix; random sample audits.

  3. Loose / tight boxes: boxes that don't closely enclose objects. Tools: visual review; small box-to-object ratio anomalies.

  4. Duplicate images: near-identical frames split across train/val. Tools: perceptual hash + content hash (Platform does this automatically).

  5. Class imbalance: one class dominates, another has < 100 instances. Tools: Charts tab class histogram.

  6. Split leakage: same scene / camera / day appears in train AND val/test. Tools: group-by metadata before splitting (lesson 6).

  7. Background images: 0–10% of the dataset should be unlabeled "no objects here" frames. Tools: count images with zero labels.

Spot-check by sampling

You don't need to review every image. A 100-sample stratified audit catches systemic issues:

  • 10 from each class, picked at random.
  • 10 background images.
  • 10 "edge cases" from your spec.
  • The 10 images with the lowest-confidence predictions from a quick pretrained-model run on the train set.

The last bucket is gold: it surfaces the images the model finds unusual — usually mislabeled or genuinely odd.
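Assembling the first three buckets is a few lines of code. A sketch, assuming you already have filenames grouped by class, background frames, and edge cases from your spec (all names here are hypothetical):

```python
import random

def stratified_audit(images_by_class, background, edge_cases,
                     per_bucket=10, seed=0):
    """Fixed-size audit sample: per_bucket images per class,
    plus per_bucket backgrounds and per_bucket edge cases."""
    rng = random.Random(seed)          # seeded, so the audit is reproducible
    sample = []
    for imgs in images_by_class.values():
        sample += rng.sample(imgs, min(per_bucket, len(imgs)))
    sample += rng.sample(background, min(per_bucket, len(background)))
    sample += rng.sample(edge_cases, min(per_bucket, len(edge_cases)))
    return sample

by_class = {c: [f"{c}_{i}.jpg" for i in range(50)]
            for c in ("forklift", "person", "pallet")}
bg = [f"bg_{i}.jpg" for i in range(20)]
edges = [f"edge_{i}.jpg" for i in range(12)]

audit = stratified_audit(by_class, bg, edges)
print(len(audit))   # 3 classes x 10 + 10 background + 10 edge cases = 50
```

The fourth bucket (lowest-confidence predictions) needs an inference pass first; append those filenames to the same list.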

Class balance: what to do when it's broken

Real datasets are imbalanced. The question is whether it'll hurt training.

   forklift  ████████████████████████ 8000 instances
   person    ████████████████ 5400
   pallet    ██ 700                              ← red flag

A 10:1 imbalance like the one above will produce a forklift-confident model that misses pallets. Three fixes, in order of preference:

  1. Collect more pallets. Always the best answer. Aim for ≥ 1500 images and ≥ 10,000 labeled instances of every class, not just the easy ones.
  2. Oversample at training time. Augment minority-class images more aggressively so each epoch sees them more often.
  3. Class-weighted loss (advanced). Don't reach for this until 1 and 2 are exhausted.

Dedup once more, after labeling

Even after upload-time dedup, post-labeling dedup catches a different kind of duplicate: visually distinct frames with identical labels. They're usually:

  • Two camera angles of the same event with the same labels.
  • Sequential video frames where nothing moved.

Drop them — they pad volume without adding training signal.
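Frames that pass image-level dedup but carry identical labels can be caught by hashing the label set itself. A sketch, assuming boxes are stored as `(class, x, y, w, h)` tuples in normalized coordinates (`label_signature` and `find_label_duplicates` are illustrative helpers):

```python
def label_signature(boxes, precision=2):
    """Order-independent signature of one image's labels."""
    return tuple(sorted((c, round(x, precision), round(y, precision),
                         round(w, precision), round(h, precision))
                        for c, x, y, w, h in boxes))

def find_label_duplicates(labels_per_image):
    """Group images that share an identical label signature."""
    groups = {}
    for img, boxes in labels_per_image.items():
        groups.setdefault(label_signature(boxes), []).append(img)
    return [imgs for imgs in groups.values() if len(imgs) > 1]

frames = {
    "cam1_000.jpg": [(0, 0.50, 0.50, 0.20, 0.30)],
    "cam1_001.jpg": [(0, 0.50, 0.50, 0.20, 0.30)],  # nothing moved
    "cam1_050.jpg": [(0, 0.10, 0.80, 0.20, 0.30)],
}
print(find_label_duplicates(frames))  # [['cam1_000.jpg', 'cam1_001.jpg']]
```

Review each flagged group by eye before dropping; rounding to two decimals is an assumption you may want to loosen or tighten.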

The leakage check is not optional

Random per-image splits leak whenever images are correlated (same scene, same camera, same day). Lesson 6 covers the splitting strategy in depth — but the QC version of the question is:

Does the same scene / camera / day appear in both my train and val splits?

If yes, your val mAP is overstating reality, and the model that ships will be worse than the one you measured. The fix is a grouped split — train and val never share a scene-key. The preprocessing annotated data guide walks through the mechanics; the splits lesson next covers the strategy.
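The grouped split itself is short to sketch. Assuming you can map every image to a scene-key (here, recording day — `grouped_split` is an illustrative helper; scikit-learn's `GroupShuffleSplit` does the same job):

```python
import random

def grouped_split(image_groups, val_fraction=0.2, seed=0):
    """Split whole scene-key groups, so no group spans train and val."""
    keys = sorted(image_groups)
    random.Random(seed).shuffle(keys)
    n_val = max(1, round(len(keys) * val_fraction))
    val_keys, train_keys = keys[:n_val], keys[n_val:]
    train = [img for k in train_keys for img in image_groups[k]]
    val = [img for k in val_keys for img in image_groups[k]]
    return train, val

# scene-key = recording day; every image from one day stays in one split
groups = {f"day_{d}": [f"day_{d}_img_{i}.jpg" for i in range(5)]
          for d in range(10)}
train, val = grouped_split(groups)
print(len(train), len(val))          # 40 10
print(set(train) & set(val))         # set() -- no leakage
```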

Why fixing labels beats tuning hyperparameters

A common pattern: model val mAP is 0.62, team spends two weeks on hyperparameter tuning, mAP gets to 0.64. Then someone audits the data and finds 8% of forklift labels are mis-classified as pallet. They fix the labels and retrain — mAP jumps to 0.78.

This is the rule of thumb: before changing anything else, fix the data. The fine-tuning guide, the YOLO performance metrics guide, and the preprocessing annotated data guide all reinforce it. Models are good at generalizing from clean signal; they can't generalize from noise.

Document the QC result

Before training, write a one-page QC report:

Dataset: forklift_v1
Audit date: 2026-05-09
Total images: 4,820   |   Total labels: 18,442
Background images: 380 (7.9%)
Per-class instances:
  forklift  6,801
  person    8,310
  pallet    3,331
Visual audit: 120 images sampled, 4 mislabeled (3.3%), fixed.
Duplicates removed: 312
Leakage check: split-by-day, no scene overlap.
Class imbalance: pallet ~2.5× lower than person; ok, not collecting more this round.
Ready to train: yes.

This report is also the artifact you hand to the auditor or the customer. "Yes, we audited the dataset" with no document is not the same as "yes, here's the audit."
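Since the numbers in the report come from the same stats you computed during QC, the report is worth generating rather than hand-typing, so it can't drift from the data. A sketch (`qc_report` is a hypothetical helper; extend the fields to match the template above):

```python
def qc_report(name, per_class, n_images, n_background, notes=""):
    """Render a plain-text QC report from dataset statistics."""
    total = sum(per_class.values())
    lines = [
        f"Dataset: {name}",
        f"Total images: {n_images:,}   |   Total labels: {total:,}",
        f"Background images: {n_background} "
        f"({100 * n_background / n_images:.1f}%)",
        "Per-class instances:",
    ]
    lines += [f"  {cls:<9} {n:,}" for cls, n in per_class.items()]
    if notes:
        lines.append(notes)
    return "\n".join(lines)

report = qc_report(
    "forklift_v1",
    {"forklift": 6801, "person": 8310, "pallet": 3331},
    n_images=4820, n_background=380,
)
print(report)
```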

Try It

Run the seven checks on your dataset. Find at least one issue (you will). Fix it, then re-run. Write the QC report — one page, the format above.

Done When
You've finished the lesson when all of these are true.
  • All seven QC checks pass or have a documented exception.

  • Class histogram is acceptable (no class < 1500 images / 10,000 instances, or you've documented why).

  • Background images are 0–10% of the dataset.

  • You have a written one-page QC report ready to hand off.

What's next

Clean dataset. Next: split it without leaking the test set into your val mAP.