Define the Dataset Specification
Plan the dataset on paper before a single image is collected — scenarios, environments, edge cases, and negatives.
A dataset specification is the bridge between the business objective and the camera. It enumerates what visual variation must appear in the dataset for the model to be deployable. Skipping this step is how teams end up with 5,000 well-lit warehouse photos and a model that breaks at dusk on day one. The data collection and annotation guide covers the underlying class-count and bias decisions in depth.
Produce a written dataset specification listing target classes, scenarios, environments, edge cases, and negative examples — with rough quotas.
- Enumerate scenarios: each combination of camera × location × time × condition.
- Set quantitative targets: ≥ 1,500 images per class, ≥ 10,000 labeled instances per class.
- Plan for edge cases up front: occluded, partial, small, stacked, unusual orientations.
- Reserve 0–10% background images (no objects) — they reduce false positives.
Hands-on
What a dataset spec looks like

The dataset spec slots into the wider lifecycle covered in steps of a CV project — between the business objective (lesson 1) and the camera:

```mermaid
graph LR
    A[Business<br/>objective] --> B[Vision task<br/>+ classes]
    B --> C[Dataset<br/>specification]
    C --> D[Collect]
    D --> E[Label]
    E --> F[Split + QC]
    F --> G[Fine-tune]
    G --> H{Meets<br/>metric?}
    H -- no --> C
    H -- yes --> I[Deploy]
    style C fill:#FF9800,color:#fff
    style F fill:#2196F3,color:#fff
    style G fill:#9C27B0,color:#fff
    style I fill:#4CAF50,color:#fff
```

A working spec is a table — one row per scenario. A scenario is a unique combination of capture variables that the model must handle:
```
scenario       camera     location    time-of-day   weather    quota  notes
─────────────────────────────────────────────────────────────────────────────────────
dock-day       cam-1..4   bay-A,B,C   08:00–17:00   any        1200   busiest hours
dock-night     cam-1..4   bay-A,B,C   22:00–04:00   any         400   fewer ops, harder lighting
loading-rain   cam-2,3    bay-B       all           rain only   200   rare but critical
stacked        any        any         any           any         150   pallets stacked > 2 high
negatives      any        any         any           any         150   empty docks, no targets
```

Scenarios make it concrete. A scenario you can't write down is a scenario the model won't see in training and will fail on in production.
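A spec like this is also easy to keep machine-readable, which lets you sanity-check quotas before collection starts. A minimal sketch; the field names and the list-of-dicts shape below mirror the example table and are illustrative, not a required schema:

```python
# Machine-readable version of the example scenario table above.
# Field names are illustrative, not a required schema.
spec = [
    {"scenario": "dock-day",     "cameras": "cam-1..4", "time": "08:00-17:00", "quota": 1200},
    {"scenario": "dock-night",   "cameras": "cam-1..4", "time": "22:00-04:00", "quota": 400},
    {"scenario": "loading-rain", "cameras": "cam-2,3",  "time": "all",         "quota": 200},
    {"scenario": "stacked",      "cameras": "any",      "time": "any",         "quota": 150},
    {"scenario": "negatives",    "cameras": "any",      "time": "any",         "quota": 150},
]

total = sum(row["quota"] for row in spec)
negatives = sum(row["quota"] for row in spec if row["scenario"] == "negatives")
print(f"total images planned: {total}")              # 2100
print(f"background share: {negatives / total:.1%}")  # ~7.1%, inside the 0-10% band
```

Keeping the spec as data means the quota math and the background-share check run every time the spec changes, instead of drifting silently in a document.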
Quantity targets that actually predict good results
The Ultralytics Tips for Best Training Results guide recommends these rules of thumb for production:
| Target | Recommendation |
|---|---|
| Images per class | ≥ 1500 |
| Labeled instances per class | ≥ 10,000 |
| Background images (no labels) | 0–10% of total — reduces false positives |
| Variance | Different times, seasons, weather, lighting, angles, sources, cameras |
Smaller datasets can work — narrow tasks (one class, one camera) often reach acceptable accuracy with 200–500 images. But for production, 1,500 images and 10,000 instances per class is the line below which you should expect to need active learning rounds before the model is shippable.
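You can check an existing label set against these thresholds before training. A sketch assuming the standard YOLO layout of one `.txt` label file per image, each line starting with a class id; the directory path is a placeholder:

```python
from collections import Counter
from pathlib import Path

def count_per_class(label_dir: str) -> tuple[Counter, Counter]:
    """Count labeled instances and images per class in YOLO-format .txt labels."""
    instances, images = Counter(), Counter()
    for txt in Path(label_dir).glob("*.txt"):
        classes_in_image = set()
        for line in txt.read_text().splitlines():
            if line.strip():
                cls = int(line.split()[0])  # YOLO line format: "class x y w h"
                instances[cls] += 1
                classes_in_image.add(cls)
        for cls in classes_in_image:
            images[cls] += 1
    return instances, images

labels_dir = Path("dataset/labels/train")  # placeholder path -- adjust to your layout
if labels_dir.exists():
    instances, images = count_per_class(str(labels_dir))
    for cls in sorted(instances):
        if images[cls] < 1500 or instances[cls] < 10000:
            print(f"class {cls}: {images[cls]} images / "
                  f"{instances[cls]} instances -- below target")
```

Running this on each collection batch shows which classes are still short, which is exactly the signal that drives the next collection round.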
The variance dimensions
For every dimension below, write down whether your dataset spec covers it. Empty cells are scenarios you'll discover in production:
| Dimension | Examples |
|---|---|
| Time | Day / night / dawn / dusk |
| Weather | Sunny / cloudy / rain / fog / snow |
| Lighting | Bright / dim / mixed / glare / shadow |
| Camera | Make, model, lens, mounting height, FoV |
| Geography | Site A / site B / different cities / countries |
| Operators | Different shifts, different teams, different uniforms |
| Product variants | Different SKUs, packaging revisions, color variants |
| Object size | Near (large in frame), mid, far (small) |
| Occlusion | Clear / partial / heavy / behind glass |
| Failure modes | Mislabeled boxes, broken pallets, dirty cameras |
Variance matters more than volume. 5,000 images from one camera at noon are worth less than 1,500 images that span the table above. The data collection and annotation guide goes deeper on diverse sourcing.
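One way to make the empty cells visible is to diff each scenario row against the dimension list. A sketch; the dimension keys and the `covers` tags below are made-up illustrations, not a fixed vocabulary:

```python
# Variance dimensions from the table above; each scenario row tags the
# dimensions it exercises. All names here are illustrative.
DIMENSIONS = ["time", "weather", "lighting", "camera", "geography",
              "operators", "product", "size", "occlusion", "failure_modes"]

scenarios = [
    {"name": "dock-day",     "covers": {"time", "camera", "size"}},
    {"name": "dock-night",   "covers": {"time", "lighting", "camera"}},
    {"name": "loading-rain", "covers": {"weather", "camera"}},
]

covered = set().union(*(s["covers"] for s in scenarios))
missing = [d for d in DIMENSIONS if d not in covered]
print("covered:", sorted(covered))
print("missing:", missing)  # the scenarios you'd otherwise discover in production
```

The `missing` list is the to-do list for the next spec revision: every uncovered dimension either gets a scenario row or an explicit note on why it's out of scope.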
Don't forget background and edge cases
Two categories that always look optional and always come back to bite teams:
- Background images — frames with no labeled objects. They teach the model "nothing is here." Aim for 0–10% of the dataset (COCO has ~1%). Without them, the model invents detections in empty scenes.
- Edge cases — heavy occlusion, partial objects at frame edges, unusual orientations, stacked objects, look-alikes (a forklift's mast vs. a column). Reserve a quota line for each edge case in the spec.
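In YOLO-format datasets, a background image is simply one whose label file is missing or empty, so the 0–10% band is easy to verify. A sketch assuming a flat images/labels layout; the paths are placeholders:

```python
from pathlib import Path

def background_share(image_dir: str, label_dir: str) -> float:
    """Fraction of images whose YOLO label file is missing or empty."""
    images = [p for p in Path(image_dir).iterdir()
              if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
    empty = 0
    for img in images:
        label = Path(label_dir) / (img.stem + ".txt")
        if not label.exists() or not label.read_text().strip():
            empty += 1
    return empty / len(images) if images else 0.0

if Path("dataset/images/train").exists():  # placeholder paths
    share = background_share("dataset/images/train", "dataset/labels/train")
    print(f"background images: {share:.1%} (target band: 0-10%)")
```

A share near 0% means the model never learns "nothing is here"; a share well above 10% dilutes the labeled signal.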
```
               common edge cases

┌──────────────┐      ┌──────────────┐
│ ▪▪▪▪▪▪▪▪▪▪▪▪ │      │ ░ ▪ ░ ░ ▪ ░ │  ← rare but
│ ▪▪▪▪▪▪▪▪▪▪▪▪ │      │ ░ ░ ░ ▪ ░ ░ │    diagnostic
│ (1500+ each) │      │ ░ ▪ ░ ░ ░ ▪ │
└──────────────┘      └──────────────┘
 model trains on       model fails on
 these by default      these in prod
```

A good spec over-samples edge cases on purpose — they're rare in the wild, so a stratified collection plan makes them visible during training.
Draft the dataset spec for your project as a table: 5–10 scenario rows, with quotas. Show it to someone who knows the deployment site (a foreman, a shift lead, a customer). They'll add 2–3 scenarios you missed — that's the value.
- Your spec lists ≥ 5 scenarios with explicit quotas.
- You've enumerated variance across at least 5 of the 10 dimensions above.
- You've reserved a row for background images.
- You've reserved at least one row per identified edge case.
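These checks can themselves be automated so the spec is re-linted on every revision. A sketch assuming the spec is kept as a list of dicts with scenario names and quotas; the field names and check wording are illustrative:

```python
def lint_spec(spec: list[dict], edge_cases: set[str]) -> list[str]:
    """Return checklist violations for a dataset spec (one dict per scenario row)."""
    problems = []
    if len(spec) < 5:
        problems.append("fewer than 5 scenarios")
    if any("quota" not in row or row["quota"] <= 0 for row in spec):
        problems.append("scenario missing an explicit quota")
    names = {row["scenario"] for row in spec}
    if "negatives" not in names:
        problems.append("no background-image row")
    for edge in sorted(edge_cases - names):
        problems.append(f"no row for edge case: {edge}")
    return problems

spec = [
    {"scenario": "dock-day",     "quota": 1200},
    {"scenario": "dock-night",   "quota": 400},
    {"scenario": "loading-rain", "quota": 200},
    {"scenario": "stacked",      "quota": 150},
    {"scenario": "negatives",    "quota": 150},
]
print(lint_spec(spec, edge_cases={"stacked"}))  # [] -- all checks pass
```

An empty result means the spec clears the checklist; anything returned is a concrete row to add before collection starts.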
Plan in hand. Next: actually collect images that match it.