Train your first YOLO model·Ship·Lesson 9/10
Lesson · Beginner

Run Inference on Video

From single-image predict to real-time video — and what changes when frames are continuous.

Multi-object tracking

A model that detects in a still image is interesting. A model that detects in video, frame by frame, is what people actually deploy. Video adds two new questions: speed (can you keep up with the frame rate?) and continuity (is the car in frame 30 the same car in frame 31?). Object tracking answers the second by giving each object a persistent ID. We'll handle both.

Outcome

Run Ultralytics YOLO on a video file or a webcam stream, save annotated output, and use tracking to keep object IDs consistent across frames.

Fast Track
If you already know your way around, here's the short version.
  1. model('path/to/video.mp4', save=True) saves an annotated MP4.

  2. model.track(...) adds persistent object IDs across frames.

  3. Use stream=True for live sources to avoid memory blowup.

  4. Pick a model size that runs at your target fps — see lesson 3.

Hands-on

A video file

Predict mode accepts video paths the same way it accepts images:

from ultralytics import YOLO

model = YOLO("runs/detect/forklift_v1/weights/best.pt")
model("input.mp4", save=True, conf=0.4)
# Output: runs/detect/predict/input.mp4 (annotated)

That decodes the video, runs each frame through the model, draws boxes, and writes a new MP4. Easy. Slow if the video is long.

Webcam or live stream

Pass an integer (webcam index) or an RTSP URL. For a polished UI, the Streamlit live inference guide wraps this same loop in a browser app:

model(0, stream=True, save=True)                                  # default webcam
model("rtsp://camera.local/stream", stream=True, save=True)        # IP camera

stream=True is critical for live or long sources — it processes frames lazily as they arrive, instead of loading everything into memory.

Always use `stream=True` for live or long sources

Without stream=True, YOLO accumulates the results for every frame in a list before returning. A 4-hour CCTV recording will exhaust RAM. With stream=True you get a generator instead and process one frame at a time.
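The memory difference is the usual list-versus-generator trade-off in Python. A model-free sketch (synthetic per-frame results standing in for decoded video) of the pattern stream=True enables:

```python
def frames(n):
    """Stand-in for a decoded video source: yields one result at a time."""
    for i in range(n):
        yield {"index": i, "detections": i % 3}  # fake per-frame result

# Lazy consumption: only one frame's result is alive at a time,
# so memory stays flat no matter how long the source runs.
total_detections = 0
for frame in frames(100_000):
    total_detections += frame["detections"]

print(total_detections)
```

Swap `frames(...)` for `model("input.mp4", stream=True)` and the shape of the loop is the same: keep running aggregates, never the full list of results.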

Tracking: persistent IDs across frames

Detection alone gives you "there's a car here in frame 30, there's a car here in frame 31." It does not say it's the same car. Multi-object tracking does. (Tracking is also the foundation for broader analytics: object counting, heatmaps, speed estimation, and action recognition.)

results = model.track(
    "input.mp4",
    save=True,
    persist=True,
    tracker="bytetrack.yaml",   # or "botsort.yaml"
)

Each box now has a box.id — an integer that follows the object across frames. Use it for counting (count unique IDs), trajectories, or "this object has been here for 3 seconds" logic.
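The "here for 3 seconds" logic needs nothing beyond box.id and the source frame rate. A minimal sketch on synthetic tracker output — the (frame index, track ID) pairs below are assumed stand-ins for what you'd collect from box.id in a model.track() loop:

```python
from collections import Counter

FPS = 30  # assumed source frame rate

# Synthetic tracker output: (frame_index, track_id) pairs.
# ID 1 is visible for 120 frames, ID 2 for 45.
observations = [(f, 1) for f in range(120)] + [(f, 2) for f in range(45)]

frames_seen = Counter(track_id for _, track_id in observations)

for track_id, n_frames in sorted(frames_seen.items()):
    seconds = n_frames / FPS
    print(f"ID {track_id}: visible for {seconds:.1f} s")
    if seconds >= 3:  # dwell-time alert threshold
        print(f"ID {track_id} has been here for 3+ seconds")
```

ID 1 (120 frames at 30 fps = 4.0 s) trips the alert; ID 2 (1.5 s) does not.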

| Tracker | Speed | Robustness |
|---|---|---|
| bytetrack.yaml | Faster, simpler | Loses IDs when objects briefly disappear |
| botsort.yaml | Slower | Better at recovering IDs after occlusion (uses appearance features) |

A frame-by-frame loop

When you need custom logic (count, alert, draw extra overlays), drop into the generator:

from ultralytics import YOLO
from collections import defaultdict

model = YOLO("runs/detect/forklift_v1/weights/best.pt")

per_class_unique_ids = defaultdict(set)

for frame_results in model.track("input.mp4", stream=True, persist=True):
    for box in frame_results.boxes:
        if box.id is None:
            continue
        cls_name = frame_results.names[int(box.cls)]
        per_class_unique_ids[cls_name].add(int(box.id))

for cls, ids in per_class_unique_ids.items():
    print(f"{cls}: {len(ids)} unique objects seen")

Performance: keep up with the source

Your effective fps = min(decode fps, model inference fps, write fps). On a modern GPU, model fps is rarely the bottleneck for yolo26n/s; for larger models, it can be.
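You can profile each stage with a small timing helper. A sketch: the stub callable and the decode/write numbers below are placeholders — swap in your real per-frame calls (cap.read, a single-frame predict, writer.write) to measure your own pipeline:

```python
import time

def measure_fps(step, n_frames=200):
    """Average rate of a per-frame callable, in frames per second."""
    start = time.perf_counter()
    for _ in range(n_frames):
        step()
    return n_frames / (time.perf_counter() - start)

# Stub for the expensive stage; replace with your real inference call.
fake_inference = lambda: sum(range(1000))
inference_fps = measure_fps(fake_inference)

decode_fps, write_fps = 180.0, 250.0  # assumed; measure these the same way

# The slowest stage caps the whole pipeline.
effective_fps = min(decode_fps, inference_fps, write_fps)
print(f"bottleneck caps you at {effective_fps:.0f} fps")
```

If effective_fps is below your source's frame rate, you're falling behind and need one of the knobs below.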

If you can't keep up:

  • Smaller model. n over s, or quantize to INT8 (next lesson).
  • Lower imgsz. 480 is much faster than 640, modest accuracy hit.
  • Skip frames. Detect every Nth frame, interpolate boxes between detections (cheap if your tracker handles it).
  • GPU. Two orders of magnitude over CPU for the same model — pushing the model to a TensorRT engine on edge devices is the usual next step.
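Frame skipping can be as simple as detecting every Nth frame and holding the last boxes in between (Ultralytics also exposes a vid_stride argument that skips frames at decode time). A model-free sketch with a stub detector standing in for the real model call:

```python
N = 3  # run detection on every Nth frame

def detect(frame_index):
    """Stub for the expensive model call."""
    return [f"box@{frame_index}"]

last_boxes, detect_calls = [], 0
per_frame_boxes = []

for frame_index in range(10):
    if frame_index % N == 0:            # pay for inference here...
        last_boxes = detect(frame_index)
        detect_calls += 1
    per_frame_boxes.append(last_boxes)  # ...hold (or interpolate) otherwise

print(detect_calls, "detections for", len(per_frame_boxes), "frames")
```

Ten frames cost only four model calls; a tracker that interpolates between detections hides the held boxes from the viewer.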

Try It

Run model.track('any_video.mp4', save=True) on a real video. Open the result and confirm the boxes keep persistent IDs through occlusions. If you see ID flicker, switch the tracker from bytetrack.yaml to botsort.yaml.

Done When
You've finished the lesson when all of these are true.
  • You've run inference on a video file and saved an annotated output.

  • You've used model.track(...) and confirmed box.id is set.

  • You know your effective fps and which knob would speed it up if needed.

Show solution
from ultralytics import YOLO

model = YOLO("runs/detect/forklift_v1/weights/best.pt")
results = model.track(
    source="input.mp4",
    save=True,
    persist=True,
    tracker="botsort.yaml",
    conf=0.4,
)
What's next

Last lesson: take the trained model out of Python and into the runtime your product actually uses.