Build with Ultralytics Platform·Build the Dataset·Lesson 2/10
Lesson · Intermediate

Source Data with Discovery

Pull frames from videos and image stores, deduplicate, and tag for labeling.

The dataset is the project — quality training data outweighs almost every modeling choice. Discovery is the first surface in Platform: it ingests images and videos, samples frames intelligently, dedupes near-identical shots, and surfaces what's worth labeling. Done well, it cuts annotation cost by half before any human looks at a single image — and feeds an active learning loop you'll lean on for every retrain.

Outcome

Ingest a video or image folder via Discovery, sample frames, and produce a deduplicated candidate set ready for annotation.

Fast Track
If you already know your way around, here's the short version.
  1. Upload videos or images, or connect a cloud bucket.

  2. Sample frames at a sensible rate (1–5 fps for videos).

  3. Run dedup — perceptual hash + embedding distance.

  4. Tag the survivors with environment metadata (camera, time of day, lighting).

Hands-on

Two kinds of input data

  Input                   Best for                                  Notes
  Image folder / bucket   Curated datasets, photo shoots, scrapes   Each image is independent — no sampling needed
  Video files / RTSP      Real-world capture                        Need to sample — most consecutive frames are near-duplicates

Platform handles both. For videos the key parameter is the sample rate.

Sampling rate: the most important knob

A 30-fps, 30-minute video has 54,000 frames. Most of them are visually identical to their neighbors. Labeling all of them is waste.

A useful starting rule:

  • Static cameras (warehouse, retail, fixed cameras): 1 frame every 1–5 seconds.
  • Moving cameras (drone, vehicle): 1 frame every 0.2–0.5 seconds.
  • Event-driven (a bird flies through the frame): if you have detection cues, sample only when something is happening.

Discovery lets you configure sample rate per source. Start aggressive (lower fps), check coverage, increase if you're missing important moments.
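The arithmetic behind the rule above is simple: keep one frame every `video_fps / sample_fps` frames. Platform does this for you, but a tiny sketch makes the frame counts concrete (the helper name and signature are hypothetical, not a Platform API):

```python
def sample_indices(video_fps: float, n_frames: int, sample_fps: float) -> list[int]:
    """Indices of the frames to keep when thinning a video to sample_fps.

    Hypothetical helper for illustration: keep one frame every
    (video_fps / sample_fps) source frames.
    """
    step = video_fps / sample_fps
    return [round(i * step) for i in range(int(n_frames / step))]

# 30 fps x 30 min = 54,000 frames
kept_1fps = sample_indices(video_fps=30, n_frames=54_000, sample_fps=1.0)   # 1,800 frames
kept_02fps = sample_indices(video_fps=30, n_frames=54_000, sample_fps=0.2)  # 360 frames
```

At 1 fps you keep every 30th frame; at 0.2 fps, every 150th. Either way, dedup still has plenty left to remove.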

Dedup, then dedup again

Even at 1 fps, a fixed camera shows the same empty corridor for hours. Platform applies two layers of dedup:

  1. Perceptual hash (cheap) — drops frames that are near-identical at the pixel level.
  2. Embedding distance (expensive but smart) — drops frames whose semantic content matches an earlier frame.
   Original 30 min @ 30 fps:           54,000 frames
   Sample at 1 fps:                     1,800 frames
   Perceptual hash dedup:               ~ 600 frames    (66% similar to neighbors)
   Embedding-distance dedup:            ~ 300 frames    (50% semantically redundant)

That's a 99.4% reduction. The 300 surviving frames are the ones worth showing a human.
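The first, cheap layer works roughly like this: hash each frame down to a few bits, then keep a frame only if its hash is far (in Hamming distance) from every hash you've already kept. This toy sketch uses 2×2 grayscale "frames" and a naive average hash; real systems hash larger DCT grids and add an embedding-distance pass with the same keep/drop logic, so treat this purely as an illustration:

```python
def avg_hash(gray):
    """Tiny perceptual hash: one bit per pixel, above/below the frame mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return tuple(int(p > mean) for p in flat)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def hash_dedup(frames, max_dist=2):
    """Keep a frame only if its hash is far from every kept frame's hash."""
    kept, hashes = [], []
    for i, frame in enumerate(frames):
        h = avg_hash(frame)
        if all(hamming(h, k) > max_dist for k in hashes):
            kept.append(i)
            hashes.append(h)
    return kept

frames = [
    [[10, 10], [10, 200]],    # empty corridor
    [[12, 11], [9, 201]],     # same corridor, sensor noise -> dropped
    [[200, 200], [200, 10]],  # lighting change -> kept
]
survivors = hash_dedup(frames)  # -> [0, 2]
```

The embedding layer replaces `avg_hash` with a learned feature vector and `hamming` with cosine distance, which is why it catches semantic repeats that pixel hashing misses.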

Dedup happens before annotation, not after

Annotators are expensive. The point of dedup is to not hand near-duplicates to humans. Discovery dedups before pushing frames into the annotation queue.

Metadata tags

Tag every surviving frame with environment metadata:

  • Camera ID
  • Time of day (auto-derived from EXIF or filename if available)
  • Weather (manual, optional)
  • Location

This metadata pays off later: stratified splits, data drift analysis, and "what's our coverage of nighttime shots?" sanity checks all live on these tags.
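If your capture pipeline encodes the camera and timestamp in filenames, camera ID and time-of-day tags can be derived automatically. A minimal sketch, assuming a hypothetical filename pattern like `cam3_20240115T0742.jpg` (in practice EXIF timestamps are more reliable, as the list above notes):

```python
import re
from datetime import datetime

def tag_frame(path: str) -> dict:
    """Derive camera ID and a time-of-day bucket from a filename.

    Assumes names like 'cam3_20240115T0742.jpg' -- the pattern is
    hypothetical, not a Platform convention.
    """
    m = re.search(r"(cam\d+)_(\d{8}T\d{4})", path)
    if m is None:
        return {"camera": None, "time_of_day": None}
    ts = datetime.strptime(m.group(2), "%Y%m%dT%H%M")
    # hour < 6 -> night, < 12 -> morning, < 18 -> afternoon, < 22 -> evening
    buckets = [(6, "night"), (12, "morning"), (18, "afternoon"), (22, "evening")]
    tod = next((name for limit, name in buckets if ts.hour < limit), "night")
    return {"camera": m.group(1), "time_of_day": tod}

tags = tag_frame("cam3_20240115T0742.jpg")
```

However you derive them, the point is that tags are attached at ingest time, when the provenance is still known, not reconstructed months later.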

When to pull from cloud storage directly

For large datasets (> 100k images), uploading is impractical. Ultralytics Platform supports direct connections to:

  • AWS S3 buckets.
  • Google Cloud Storage.
  • Azure Blob Storage.

Platform reads in place, samples, and dedups without copying everything. This is also the right pattern when data residency matters — the source data never leaves your bucket. If you'd rather pull from public benchmarks, the Ultralytics datasets catalog lists drop-in options like COCO and Open Images V7, and you can pipe in third-party labeling work via the Roboflow integration.

A common pitfall: oversampling rare events

If your source footage is mostly empty corridors with the occasional forklift, naïve sampling gives you a dataset that is almost entirely empty corridor. Use Discovery's event-driven sampling: run a quick pretrained Ultralytics YOLO model on the source first, sample heavily when something is detected, sparsely when not. The data collection and annotation guide and the synthetic data glossary entry both go deeper on filling rare-class gaps without overfitting.

   plain 1 fps sampling           event-driven sampling
   ┌─────────────────────────┐    ┌─────────────────────────┐
   │ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │    │ ▪      ▪▪▪▪▪      ▪▪▪▪▪ │
   │ (empty corridor)        │    │ (sparse)  ↑ (rich)      │
   │                         │    │      forklift visible   │
   └─────────────────────────┘    └─────────────────────────┘
   1800 frames, ~50 with target   ~400 frames, ~80 with target

Smaller dataset, more useful labels.
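The event-driven logic reduces to switching the sampling stride based on whether anything was detected. A sketch under that assumption, where `det_counts[i]` is the number of detections in frame i (in practice it would come from a quick pass with a pretrained Ultralytics YOLO model; here it is just a list of ints):

```python
def event_driven_sample(det_counts, base_stride=30, event_stride=6):
    """Keep every event_stride-th frame while detections are present,
    every base_stride-th frame otherwise.

    Strides are in source frames: at 30 fps, base_stride=30 is ~1 fps
    and event_stride=6 is ~5 fps. Both values are illustrative defaults.
    """
    kept = []
    for i, n in enumerate(det_counts):
        stride = event_stride if n > 0 else base_stride
        if not kept or i - kept[-1] >= stride:
            kept.append(i)
    return kept

# 60 empty frames, 30 frames with a forklift, 60 empty again (30 fps)
counts = [0] * 60 + [1] * 30 + [0] * 60
kept = event_driven_sample(counts)
```

On this toy clip the forklift window (frames 60–89) contributes five kept frames while each equally long empty stretch contributes one or two, which is exactly the density skew in the diagram above.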

Try It

Upload a 5–10 minute video to Discovery. Try sample rates of 1 fps and 0.2 fps. Compare the surviving frame count after dedup. Visually scan the survivors at 1 fps — are they diverse, or do you see redundancy that dedup missed?

Done When
You've finished the lesson when all of these are true.
  • You've ingested a video or image folder via Discovery.

  • You've set a sample rate and run dedup.

  • Your survivors are tagged with at least camera and time-of-day metadata.

What's next

We have candidate frames. Next: turning them into labels at scale with auto-annotation.