Split the Dataset Correctly
Train, val, and test — and the leakage rules that decide whether your metrics are honest.
A bad split is worse than no split — it gives you a number you trust that isn't true. The split decision is one-way: every metric you'll ever publish about this model is tied to it. Get it right once, and never touch it again.
Produce train / val / test splits with no leakage, balanced class distributions, and documentation that future-you (or an auditor) can verify.
Baseline: 70% train, 15% val, 15% test (of objects, not images).
Split by the right unit — scene, camera, day, location — not always by image.
Stratify class proportions across splits.
Lock the test set away. Touch it once at the end, not weekly.
Hands-on
The 70/15/15 baseline

For most enterprise projects, a sensible starting split is:
| Split | Share | Purpose |
|---|---|---|
| Train | 70% | Model learns weights from this |
| Val | 15% | Pick best epoch, tune hyperparameters, compare runs |
| Test | 15% | Final go/no-go before deployment — touch once |
Adjust if your dataset is small (more weight on train), or if you want a stricter regression suite (more weight on test). The exact percentages matter less than the discipline.
Train ~70%, val ~15%, test ~15% — measured in objects, not just images.
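As a concrete starting point, here is a minimal sketch of a plain 70/15/15 random split that reports both image and object counts, so you can check the object-level shares and not just the image-level ones. The `datasets/forklift` path and the YOLO-style one-label-file-per-image layout are assumptions, not requirements.

```python
# Minimal sketch: random 70/15/15 split, reported in images AND objects.
# Assumes a hypothetical datasets/forklift layout with YOLO-style labels
# (images/foo.jpg -> labels/foo.txt, one object per line).
import random
from pathlib import Path

random.seed(0)  # fix the shuffle so the split is reproducible
images = sorted(Path("datasets/forklift/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.70 * n)],
    "val":   images[int(0.70 * n) : int(0.85 * n)],
    "test":  images[int(0.85 * n) :],
}

def object_count(img: Path) -> int:
    # Count label rows in the sidecar .txt file, one row per object.
    label = Path(str(img).replace("/images/", "/labels/")).with_suffix(".txt")
    return len(label.read_text().splitlines()) if label.exists() else 0

for name, imgs in splits.items():
    print(f"{name}: {len(imgs)} images, {sum(object_count(p) for p in imgs)} objects")
```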
Split by the right unit
This is the lesson most teams learn the hard way. Random per-image splits leak when frames are correlated:
```
bad: random per-image split
┌──────────────────────────────────────┐
│ video1-frame-0124 → TRAIN            │
│ video1-frame-0125 → VAL   ← leaked!  │
│ video1-frame-0126 → TRAIN            │
└──────────────────────────────────────┘
val mAP looks great. production mAP doesn't.

good: split by scene (group all frames from one video)
┌──────────────────────────────────────┐
│ video1 → TRAIN (all 1800 frames)     │
│ video2 → VAL   (all 1500 frames)     │
│ video3 → TEST  (all 2200 frames)     │
└──────────────────────────────────────┘
```
Pick the split unit that matches the generalization you care about at deployment:
| Deployment generalization | Split by |
|---|---|
| Same camera, new days | Day |
| Same site, new cameras | Camera |
| Same product line, new sites | Location |
| New customer entirely | Tenant / customer |
| Video-based detection | Scene / source video |
When in doubt, split by the most general unit — it gives the most honest metrics.
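A group-aware split is one call to scikit-learn away. The sketch below uses `GroupShuffleSplit` with the source video as the split unit; the frame names are made up, and the two-stage 15%-then-15% carve-out mirrors the baseline above.

```python
# Sketch: group-aware 70/15/15 split with scikit-learn. All frames sharing a
# group key (here: source video) land in the same split, so nothing leaks.
from sklearn.model_selection import GroupShuffleSplit

frames = [f"video{v}-frame-{i:04d}" for v in range(1, 21) for i in range(100)]  # toy data
groups = [f.split("-frame-")[0] for f in frames]  # split unit = scene / source video

# Carve out 15% of groups as test, then 15% of the remainder as val.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
trainval_idx, test_idx = next(gss.split(frames, groups=groups))

gss_val = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.85, random_state=0)
inner_train, inner_val = next(
    gss_val.split(trainval_idx, groups=[groups[i] for i in trainval_idx])
)
train_idx, val_idx = trainval_idx[inner_train], trainval_idx[inner_val]

# No scene may appear in more than one split.
train_g, val_g, test_g = ({groups[i] for i in idx} for idx in (train_idx, val_idx, test_idx))
assert not (train_g & val_g or train_g & test_g or val_g & test_g)
print(len(train_idx), "train /", len(val_idx), "val /", len(test_idx), "test frames")
```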
Stratify class distributions
Random splits can starve a small class. With 70 pallet labels, a 70/15/15 split might land 7 pallets in val — too few for a meaningful AP. Stratification fixes this by forcing class proportions to match across splits:
```
without stratification:              with stratification:
train: 50 pallets / 4500 imgs        train: 49 pallets / 4500 imgs
val:    7 pallets /  900 imgs        val:   11 pallets /  900 imgs
test:  13 pallets /  900 imgs        test:  10 pallets /  900 imgs
```
Most modern tooling (Ultralytics Platform, scikit-learn, Roboflow) does stratified splits with one toggle.
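If your tooling lacks that toggle, scikit-learn's `train_test_split` can do it via its `stratify` argument. The sketch below stratifies on one key per image (whether the image contains the rare class), which is a simplification of what detection-aware tooling does; the numbers are toy data sized to match the 70-pallet example above.

```python
# Sketch: stratified 70/15/15 split on a per-image key with scikit-learn.
# Toy data: 6300 images, 70 of which contain the rare "pallet" class.
from sklearn.model_selection import train_test_split

image_ids = [f"img_{i:05d}" for i in range(6300)]
strat_key = ["pallet" if i % 90 == 0 else "other" for i in range(6300)]

train_ids, rest_ids, train_y, rest_y = train_test_split(
    image_ids, strat_key, test_size=0.30, stratify=strat_key, random_state=0
)
val_ids, test_ids, val_y, test_y = train_test_split(
    rest_ids, rest_y, test_size=0.50, stratify=rest_y, random_state=0
)

for name, ys in (("train", train_y), ("val", val_y), ("test", test_y)):
    print(f"{name}: {ys.count('pallet')} pallet imgs / {len(ys)} imgs")
```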
Multiple validation sets
The Ultralytics YOLO data YAML accepts a list of val sets — useful when you want both a curated regression set and a fresh production-realistic one:
```yaml
path: ./datasets/forklift
train: images/train
val:
  - images/val_curated      # stable, your regression suite
  - images/val_production   # rotates with newly collected frames
test: images/test
```
The curated set tracks regressions; the production-realistic set tracks domain shift.
Lock the test set
The single most expensive mistake: comparing 10 model variants on the test set. Once the test set has informed any decision, it's training data, and your final number is dishonest. Two rules:
- All hyperparameter tuning and run comparison happens on val.
- Test is touched once, at the end, for a single go/no-go decision.
If you need richer comparisons, keep multiple val sets — never multiple touches of test.
Once you've trained against a split, every metric you publish is tied to it. Changing the split changes every number. Decide carefully, document the decision, and don't quietly re-shuffle later.
k-fold cross-validation for small datasets
If your dataset is small (< 500 images / class) and a single 70/15/15 split feels statistically flimsy, k-fold cross-validation trains k models on k different splits and averages the metric — see also the cross-validation glossary entry. Costs k× the GPU time; gives a much more reliable mAP estimate. Worth it for small, expensive datasets.
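A sketch of what that looks like when it also respects the split unit: `GroupKFold` keeps every scene inside a single fold, and the metric is averaged over the k runs. `train_and_eval` is a placeholder for your own training routine, not a real API.

```python
# Sketch: 5-fold cross-validation that respects the split unit (scene).
from statistics import mean, stdev
from sklearn.model_selection import GroupKFold

frames = [f"video{v}-frame-{i:04d}" for v in range(1, 11) for i in range(50)]  # toy data
groups = [f.split("-frame-")[0] for f in frames]

def train_and_eval(train_frames, val_frames) -> float:
    # Placeholder: train a model on train_frames, return its val mAP50-95.
    return 0.0

scores = [
    train_and_eval([frames[i] for i in tr], [frames[i] for i in va])
    for tr, va in GroupKFold(n_splits=5).split(frames, groups=groups)
]
print(f"mAP50-95: {mean(scores):.3f} ± {stdev(scores):.3f} over {len(scores)} folds")
```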
For more on splitting mechanics — folder layout, stratified sampling, and how to encode the split decision in a data.yaml — see the preprocessing annotated data guide.
Pick the split unit for your project (image / scene / camera / day / location). Justify the choice in one sentence. Then create a stratified 70/15/15 split in your tool of choice, and confirm class proportions match across splits.
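A minimal sketch of that last check, assuming YOLO-style label files where the class id is the first token on each row; the `datasets/forklift` path is illustrative.

```python
# Sketch: flag classes whose val/test share drifts more than 20% (relative)
# from their train share. Assumes YOLO-style labels under labels/{split}/.
from collections import Counter
from pathlib import Path

def class_shares(label_dir: str) -> dict[int, float]:
    counts = Counter()
    for f in Path(label_dir).glob("*.txt"):
        counts.update(int(line.split()[0]) for line in f.read_text().splitlines() if line.strip())
    total = sum(counts.values()) or 1
    return {cls: n / total for cls, n in counts.items()}

shares = {s: class_shares(f"datasets/forklift/labels/{s}") for s in ("train", "val", "test")}
for cls, ref in shares["train"].items():
    for split in ("val", "test"):
        share = shares[split].get(cls, 0.0)
        if abs(share - ref) > 0.20 * ref:
            print(f"class {cls}: {split} share {share:.3f} vs train {ref:.3f} (outside ±20%)")
```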
You've picked and documented the split unit.
Class proportions are within ±20% across train, val, and test.
No scene / camera / day appears in more than one split.
The test set is locked — only touched at the end.
Splits ready. Next: a careful look at augmentation — what to use, and what not to use.