Define the Dataset Specification
Plan the dataset on paper before a single image is collected — scenarios, environments, edge cases, and negatives.
A dataset specification is the bridge between the business objective and the camera. It enumerates what visual variation must appear in the dataset for the model to be deployable. Skipping this step is how teams end up with 5,000 well-lit warehouse photos and a model that breaks at dusk on day one. The data collection and annotation guide covers the underlying class-count and bias decisions in depth.
Produce a written dataset specification listing target classes, scenarios, environments, edge cases, and negative examples — with rough quotas.
- Enumerate scenarios: each combination of camera × location × time × condition.
- Set quantitative targets: ≥ 1,500 images per class, ≥ 10,000 labeled instances per class.
- Plan for edge cases up front: occluded, partial, small, stacked, unusual orientations.
- Reserve 0–10% background images (no objects) — they reduce false positives.
Hands-on
What a dataset spec looks like

The dataset spec slots into the wider lifecycle covered in steps of a CV project — between the business objective (lesson 1) and the camera:

```mermaid
graph LR
    A[Business<br/>objective] --> B[Vision task<br/>+ classes]
    B --> C[Dataset<br/>specification]
    C --> D[Collect]
    D --> E[Label]
    E --> F[Split + QC]
    F --> G[Fine-tune]
    G --> H{Meets<br/>metric?}
    H -- no --> C
    H -- yes --> I[Deploy]
    style C fill:#FF9800,color:#fff
    style F fill:#2196F3,color:#fff
    style G fill:#9C27B0,color:#fff
    style I fill:#4CAF50,color:#fff
```

A working spec is a table — one row per scenario. A scenario is a unique combination of capture variables that the model must handle:
```
scenario       camera     location    time-of-day   weather    quota  notes
─────────────────────────────────────────────────────────────────────────────────────
dock-day       cam-1..4   bay-A,B,C   08:00–17:00   any        1200   busiest hours
dock-night     cam-1..4   bay-A,B,C   22:00–04:00   any         400   fewer ops, harder lighting
loading-rain   cam-2,3    bay-B       all           rain only   200   rare but critical
stacked        any        any         any           any         150   pallets stacked > 2 high
negatives      any        any         any           any         150   empty docks, no targets
```

Scenarios make it concrete. A scenario you can't write down is a scenario the model won't see in training and will fail on in production.
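A spec like this is also easy to keep machine-readable, which lets you sanity-check quotas before collection starts. A minimal sketch; the field names and the list-of-dicts shape below mirror the example table and are illustrative, not a required schema:

```python
# Machine-readable version of the example scenario table above.
# Field names are illustrative, not a required schema.
spec = [
    {"scenario": "dock-day",     "cameras": "cam-1..4", "time": "08:00-17:00", "quota": 1200},
    {"scenario": "dock-night",   "cameras": "cam-1..4", "time": "22:00-04:00", "quota": 400},
    {"scenario": "loading-rain", "cameras": "cam-2,3",  "time": "all",         "quota": 200},
    {"scenario": "stacked",      "cameras": "any",      "time": "any",         "quota": 150},
    {"scenario": "negatives",    "cameras": "any",      "time": "any",         "quota": 150},
]

total = sum(row["quota"] for row in spec)
negatives = sum(row["quota"] for row in spec if row["scenario"] == "negatives")
print(f"total images planned: {total}")              # 2100
print(f"background share: {negatives / total:.1%}")  # ~7.1%, inside the 0-10% band
```

Keeping the spec as data means the quota math and the background-share check run every time the spec changes, instead of drifting silently in a document.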
Quantity targets that actually predict good results
The Ultralytics Tips for Best Training Results guide recommends these rules of thumb for production:
| Target | Recommendation |
|---|---|
| Images per class | ≥ 1500 |
| Labeled instances per class | ≥ 10,000 |
| Background images (no labels) | 0–10% of total — reduces false positives |
| Variance | Different times, seasons, weather, lighting, angles, sources, cameras |
Smaller datasets can work — narrow tasks (one class, one camera) often reach acceptable accuracy with 200–500 images. But for production, 1,500 images and 10,000 instances per class is the line below which you should expect to need active learning rounds before the model is shippable.
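You can check an existing label set against these thresholds before training. A sketch assuming the standard YOLO layout of one `.txt` label file per image, each line starting with a class id; the directory path is a placeholder:

```python
from collections import Counter
from pathlib import Path

def count_per_class(label_dir: str) -> tuple[Counter, Counter]:
    """Count labeled instances and images per class in YOLO-format .txt labels."""
    instances, images = Counter(), Counter()
    for txt in Path(label_dir).glob("*.txt"):
        classes_in_image = set()
        for line in txt.read_text().splitlines():
            if line.strip():
                cls = int(line.split()[0])  # YOLO line format: "class x y w h"
                instances[cls] += 1
                classes_in_image.add(cls)
        for cls in classes_in_image:
            images[cls] += 1
    return instances, images

labels_dir = Path("dataset/labels/train")  # placeholder path -- adjust to your layout
if labels_dir.exists():
    instances, images = count_per_class(str(labels_dir))
    for cls in sorted(instances):
        if images[cls] < 1500 or instances[cls] < 10000:
            print(f"class {cls}: {images[cls]} images / "
                  f"{instances[cls]} instances -- below target")
```

Running this on each collection batch shows which classes are still short, which is exactly the signal that drives the next collection round.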
The variance dimensions
For every dimension below, write down whether your dataset spec covers it. Empty cells are scenarios you'll discover in production:
| Dimension | Examples |
|---|---|
| Time | Day / night / dawn / dusk |
| Weather | Sunny / cloudy / rain / fog / snow |
| Lighting | Bright / dim / mixed / glare / shadow |
| Camera | Make, model, lens, mounting height, FoV |
| Geography | Site A / site B / different cities / countries |
| Operators | Different shifts, different teams, different uniforms |
| Product variants | Different SKUs, packaging revisions, color variants |
| Object size | Near (large in frame), mid, far (small) |
| Occlusion | Clear / partial / heavy / behind glass |
| Failure modes | Mislabeled boxes, broken pallets, dirty cameras |
Variance matters more than volume. 5,000 images from one camera at noon are worth less than 1,500 images that span the table above. The data collection and annotation guide goes deeper on diverse sourcing.
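One way to make the empty cells visible is to diff each scenario row against the dimension list. A sketch; the dimension keys and the `covers` tags below are made-up illustrations, not a fixed vocabulary:

```python
# Variance dimensions from the table above; each scenario row tags the
# dimensions it exercises. All names here are illustrative.
DIMENSIONS = ["time", "weather", "lighting", "camera", "geography",
              "operators", "product", "size", "occlusion", "failure_modes"]

scenarios = [
    {"name": "dock-day",     "covers": {"time", "camera", "size"}},
    {"name": "dock-night",   "covers": {"time", "lighting", "camera"}},
    {"name": "loading-rain", "covers": {"weather", "camera"}},
]

covered = set().union(*(s["covers"] for s in scenarios))
missing = [d for d in DIMENSIONS if d not in covered]
print("covered:", sorted(covered))
print("missing:", missing)  # the scenarios you'd otherwise discover in production
```

The `missing` list is the to-do list for the next spec revision: every uncovered dimension either gets a scenario row or an explicit note on why it's out of scope.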
Don't forget background and edge cases
Two categories that always look optional and always come back to bite teams:
- Background images — frames with no labeled objects. They teach the model "nothing is here." Aim for 0–10% of the dataset (COCO has ~1%). Without them, the model invents detections in empty scenes.
- Edge cases — heavy occlusion, partial objects at frame edges, unusual orientations, stacked objects, look-alikes (a forklift's mast vs. a column). Reserve a quota line for each edge case in the spec.
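In YOLO-format datasets, a background image is simply one whose label file is missing or empty, so the 0–10% band is easy to verify. A sketch assuming a flat images/labels layout; the paths are placeholders:

```python
from pathlib import Path

def background_share(image_dir: str, label_dir: str) -> float:
    """Fraction of images whose YOLO label file is missing or empty."""
    images = [p for p in Path(image_dir).iterdir()
              if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
    empty = 0
    for img in images:
        label = Path(label_dir) / (img.stem + ".txt")
        if not label.exists() or not label.read_text().strip():
            empty += 1
    return empty / len(images) if images else 0.0

if Path("dataset/images/train").exists():  # placeholder paths
    share = background_share("dataset/images/train", "dataset/labels/train")
    print(f"background images: {share:.1%} (target band: 0-10%)")
```

A share near 0% means the model never learns "nothing is here"; a share well above 10% dilutes the labeled signal.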
```
               common edge cases

┌──────────────┐      ┌──────────────┐
│ ▪▪▪▪▪▪▪▪▪▪▪▪ │      │ ░ ▪ ░ ░ ▪ ░ │  ← rare but
│ ▪▪▪▪▪▪▪▪▪▪▪▪ │      │ ░ ░ ░ ▪ ░ ░ │    diagnostic
│ (1500+ each) │      │ ░ ▪ ░ ░ ░ ▪ │
└──────────────┘      └──────────────┘
 model trains on       model fails on
 these by default      these in prod
```

A good spec over-samples edge cases on purpose — they're rare in the wild, so a stratified collection plan makes them visible during training.
Draft the dataset spec for your project as a table: 5–10 scenario rows, with quotas. Show it to someone who knows the deployment site (a foreman, a shift lead, a customer). They'll add 2–3 scenarios you missed — that's the value.
- Your spec lists ≥ 5 scenarios with explicit quotas.
- You've enumerated variance across at least 5 of the 10 dimensions above.
- You've reserved a row for background images.
- You've reserved at least one row per identified edge case.
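These checks can themselves be automated so the spec is re-linted on every revision. A sketch assuming the spec is kept as a list of dicts with scenario names and quotas; the field names and check wording are illustrative:

```python
def lint_spec(spec: list[dict], edge_cases: set[str]) -> list[str]:
    """Return checklist violations for a dataset spec (one dict per scenario row)."""
    problems = []
    if len(spec) < 5:
        problems.append("fewer than 5 scenarios")
    if any("quota" not in row or row["quota"] <= 0 for row in spec):
        problems.append("scenario missing an explicit quota")
    names = {row["scenario"] for row in spec}
    if "negatives" not in names:
        problems.append("no background-image row")
    for edge in sorted(edge_cases - names):
        problems.append(f"no row for edge case: {edge}")
    return problems

spec = [
    {"scenario": "dock-day",     "quota": 1200},
    {"scenario": "dock-night",   "quota": 400},
    {"scenario": "loading-rain", "quota": 200},
    {"scenario": "stacked",      "quota": 150},
    {"scenario": "negatives",    "quota": 150},
]
print(lint_spec(spec, edge_cases={"stacked"}))  # [] -- all checks pass
```

An empty result means the spec clears the checklist; anything returned is a concrete row to add before collection starts.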
Plan in hand. Next: actually collect images that match it.