Source Data with Discovery
Pull frames from videos and image stores, deduplicate, and tag for labeling.
The dataset is the project — quality training data outweighs almost every modeling choice. Discovery is the first surface in Platform: it ingests images and videos, samples frames intelligently, dedupes near-identical shots, and surfaces what's worth labeling. Done well, it cuts annotation cost by half before any human looks at a single image — and feeds an active learning loop you'll lean on for every retrain.
Ingest a video or image folder via Discovery, sample frames, and produce a deduplicated candidate set ready for annotation.
Upload videos or images, or connect a cloud bucket.
Sample frames at a sensible rate (roughly 0.2–5 fps for videos, depending on camera motion).
Run dedup — perceptual hash + embedding distance.
Tag the survivors with environment metadata (camera, time of day, lighting).
Hands-on
Two kinds of input data
| Input | Best for | Notes |
|---|---|---|
| Image folder / bucket | Curated datasets, photo shoots, scrapes | Each image is independent — no sampling needed |
| Video files / RTSP | Real-world capture | Need to sample — most consecutive frames are near-duplicates |
Platform handles both. For videos the key parameter is the sample rate.
Sampling rate: the most important knob
A 30-fps, 30-minute video has 54,000 frames. Most of them are visually identical to their neighbors. Labeling all of them is waste.
A useful starting rule:
- Static cameras (warehouse, retail, fixed installations): 1 frame every 1–5 seconds.
- Moving cameras (drone, vehicle): 1 frame every 0.2–0.5 seconds.
- Event-driven (a bird flies through the frame): if you have detection cues, sample only when something is happening.
Discovery lets you configure sample rate per source. Start aggressive (lower fps), check coverage, increase if you're missing important moments.
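If you want to reason about a sample rate outside the Platform UI, the same idea is easy to prototype locally. The sketch below is not Platform's implementation; it is a minimal OpenCV loop that keeps one frame every `interval_s` seconds (the function and file names are illustrative).

```python
# Minimal local sketch of fixed-rate frame sampling (not the Platform API).
# Assumes OpenCV is installed: pip install opencv-python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, interval_s: float = 1.0) -> int:
    """Write one frame every `interval_s` seconds; return how many were kept."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    step = max(1, round(fps * interval_s))    # frames to skip between kept samples
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    kept, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", frame)
            kept += 1
        index += 1
    cap.release()
    return kept

# Example: a static camera sampled once every 2 seconds.
# print(sample_frames("corridor.mp4", "frames/", interval_s=2.0))
```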
Dedup, then dedup again
Even at 1 fps, a fixed camera shows the same empty corridor for hours. Platform applies two layers of dedup:
- Perceptual hash (cheap) — drops frames that are nearly pixel-identical to an earlier frame.
- Embedding distance (expensive but smart) — drops frames whose semantic content matches an earlier frame.
Original 30 min @ 30 fps: 54,000 frames
Sample at 1 fps: 1,800 frames
Perceptual hash dedup: ~ 600 frames (66% similar to neighbors)
Embedding-distance dedup: ~ 300 frames (50% semantically redundant)

That's a 99.4% reduction. The 300 surviving frames are the ones worth showing a human.
Annotators are expensive. The point of dedup is to not hand near-duplicates to humans. Discovery dedups before pushing frames into the annotation queue.
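The two layers are easy to approximate with open-source tools if you want to sanity-check the numbers on your own frames. The sketch below uses the `imagehash` and `torchvision` packages; the thresholds and the choice of embedding model are illustrative assumptions, not what Platform uses internally.

```python
# Illustrative two-stage dedup: perceptual hash first (cheap), embeddings second (smarter).
# Assumes: pip install imagehash pillow torch torchvision
import imagehash
import torch
from PIL import Image
from torchvision.models import resnet18, ResNet18_Weights

def dedup(paths, hash_cutoff=5, cos_cutoff=0.97):
    # Stage 1: perceptual hash. Hamming distance <= hash_cutoff => near-identical, drop.
    survivors, hashes = [], []
    for p in paths:
        h = imagehash.phash(Image.open(p))
        if all(h - prev > hash_cutoff for prev in hashes):
            survivors.append(p)
            hashes.append(h)

    # Stage 2: embedding distance. Cosine similarity >= cos_cutoff => semantically redundant, drop.
    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights).eval()
    model.fc = torch.nn.Identity()            # use pooled features as the embedding
    preprocess = weights.transforms()

    kept, embs = [], []
    with torch.no_grad():
        for p in survivors:
            emb = model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))[0]
            emb = emb / emb.norm()
            if all(float(emb @ prev) < cos_cutoff for prev in embs):
                kept.append(p)
                embs.append(emb)
    return kept
```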
Metadata tags
Tag every surviving frame with environment metadata:
- Camera ID
- Time of day (auto-derived from EXIF or filename if available)
- Weather (manual, optional)
- Location
This metadata pays off later: stratified splits, data drift analysis, and "what's our coverage of nighttime shots?" sanity checks all live on these tags.
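Deriving the automatic tags is mostly a matter of reading EXIF and parsing filenames. A small sketch of the idea follows; the tag names and the filename convention are assumptions for illustration, not a Platform schema.

```python
# Illustrative metadata tagging for surviving frames (tag names are hypothetical).
# Assumes Pillow is installed and filenames look like "cam03_20240517_1430_000123.jpg".
from PIL import Image, ExifTags
from pathlib import Path

def tag_frame(path: str) -> dict:
    tags = {"camera_id": None, "time_of_day": None, "location": None}

    # Filename convention (assumed): <camera>_<date>_<time>_<frame>.jpg
    parts = Path(path).stem.split("_")
    if len(parts) >= 3 and parts[2][:2].isdigit():
        tags["camera_id"] = parts[0]
        hour = int(parts[2][:2])
        tags["time_of_day"] = "day" if 6 <= hour < 20 else "night"

    # Fall back to EXIF capture time when the filename carries no timestamp.
    exif = Image.open(path).getexif()
    ts = exif.get_ifd(ExifTags.IFD.Exif).get(ExifTags.Base.DateTimeOriginal) or exif.get(ExifTags.Base.DateTime)
    if ts and tags["time_of_day"] is None:
        hour = int(str(ts)[11:13])            # EXIF format: "YYYY:MM:DD HH:MM:SS"
        tags["time_of_day"] = "day" if 6 <= hour < 20 else "night"
    return tags
```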
When to pull from cloud storage directly
For large datasets (> 100k images), uploading is impractical. Ultralytics Platform supports direct connections to:
- AWS S3 buckets.
- Google Cloud Storage.
- Azure Blob Storage.
Platform reads in place, samples, and dedups without copying everything. This is also the right pattern when data residency matters — the source data never leaves your bucket. If you'd rather pull from public benchmarks, the Ultralytics datasets catalog lists drop-in options like COCO and Open Images V7, and you can pipe in third-party labeling work via the Roboflow integration.
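Connecting a bucket happens in the Discovery UI, but the "read in place" idea is the same as listing keys and streaming only what you need. A rough sketch with `boto3` against an assumed S3 bucket (the bucket name and prefix are placeholders):

```python
# Illustrative: enumerate candidate images in S3 without copying the whole bucket.
# Assumes: pip install boto3, and AWS credentials available in the environment.
import boto3

def list_image_keys(bucket: str, prefix: str = "", exts=(".jpg", ".jpeg", ".png")):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(exts):
                yield obj["Key"], obj["Size"]

# Example: count keys under a (hypothetical) prefix before deciding what to sample.
# total = sum(1 for _ in list_image_keys("my-raw-footage", prefix="site-a/2024/"))
```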
A common pitfall: oversampling rare events
If your dataset is mostly empty corridors and rarely a forklift, naïve sampling gives you a boring dataset. Use Discovery's event-driven sampling: run a quick pretrained Ultralytics YOLO model on the source first, sample heavily when something is detected, sparsely when not. The data collection and annotation guide and the synthetic data glossary entry both go deeper on filling rare-class gaps without overfitting.
   plain 1 fps sampling        event-driven sampling
┌─────────────────────────┐ ┌─────────────────────────┐
│ ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪ │ │ ▪     ▪▪▪▪▪     ▪▪▪▪▪   │
│    (empty corridor)     │ │ (sparse)  ↑   (rich)    │
│                         │ │    forklift visible     │
└─────────────────────────┘ └─────────────────────────┘
 1800 frames, ~50 with target    ~400 frames, ~80 with target

Smaller dataset, more useful labels.
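A rough way to prototype the event-driven idea locally is to run a small pretrained YOLO model over the sampled frames and keep a frame only when it detects something of interest. The model file, class filter, and confidence threshold below are illustrative choices, not Discovery's trigger logic.

```python
# Illustrative event-driven filter: keep frames only when a pretrained model sees something.
# Assumes: pip install ultralytics
from ultralytics import YOLO

def keep_interesting(frame_paths, classes=None, conf=0.4):
    """Yield frames with at least one detection (optionally restricted to `classes`)."""
    model = YOLO("yolo11n.pt")   # small pretrained detector
    for path in frame_paths:
        result = model.predict(path, conf=conf, classes=classes, verbose=False)[0]
        if len(result.boxes) > 0:
            yield path

# Example: keep only frames where COCO class 7 ("truck") is detected,
# as a stand-in for a forklift-like target.
# interesting = list(keep_interesting(all_frames, classes=[7]))
```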
Upload a 5–10 minute video to Discovery. Try sample rates of 1 fps and 0.2 fps. Compare the surviving frame count after dedup. Visually scan the survivors at 1 fps — are they diverse, or do you see redundancy that dedup missed?
You've ingested a video or image folder via Discovery.
You've set a sample rate and run dedup.
Your survivors are tagged with at least camera and time-of-day metadata.
We have candidate frames. Next: turning them into labels at scale with auto-annotation.