Multi-Stream Inference
Run multiple cameras concurrently without dropping frames.
Production computer vision is rarely one camera. It's 4, or 16, or 100. Naïvely calling model(stream) per camera in a Python loop maxes out a single CPU core and falls behind. The right architecture batches across cameras (a batch size tuning problem), decodes in parallel, and keeps the GPU saturated for real-time inference.
Run 4 or more concurrent video streams through one model and keep up with all of them.
- Decode in worker threads/processes — never inline.
- Batch frames from multiple cameras into one model call.
- One Ultralytics YOLO instance, many sources, single GPU.
- Profile until the bottleneck is the GPU, not decode or I/O.
Hands-on
Where the time goes

Per-frame work in a multi-stream system:
1. Decode — RTSP / file → numpy array. ~5–20 ms per frame on CPU.
2. Preprocess — resize, letterbox, normalize. ~1–3 ms.
3. Inference — model forward + NMS. ~5–30 ms depending on model and GPU.
4. Postprocess + downstream — tracking, counting, sinks. ~1–5 ms.
If you run (1)→(2)→(3)→(4) serially for every frame of every camera in a single thread, the GPU sits idle most of the time waiting on decode. The fix is concurrency; the thread-safe inference guide covers the failure modes to avoid.
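A quick way to confirm where the time actually goes is to time the stages yourself before redesigning anything. A minimal single-stream profiling sketch, assuming OpenCV, an engine file named as in the examples below, and a placeholder stream URL:

import time

import cv2
from ultralytics import YOLO

model = YOLO("yolo26n.engine")  # any model file you already have works here
cap = cv2.VideoCapture("rtsp://camera-1.local/stream")  # placeholder source

decode_ms, model_ms, frames = 0.0, 0.0, 0
while frames < 200:
    t0 = time.perf_counter()
    ok, frame = cap.read()       # stage 1: decode
    if not ok:
        break
    t1 = time.perf_counter()
    model(frame, verbose=False)  # stages 2–3: preprocess + inference (+ NMS)
    t2 = time.perf_counter()
    decode_ms += (t1 - t0) * 1000
    model_ms += (t2 - t1) * 1000
    frames += 1

cap.release()
frames = max(frames, 1)
print(f"decode {decode_ms / frames:.1f} ms/frame, model {model_ms / frames:.1f} ms/frame")

If decode time is in the same ballpark as model time, a serial loop can never keep the GPU busy; that is exactly what the architecture below fixes.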
Architecture: producers + batched consumer
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│cam 1 dec│  │cam 2 dec│  │cam 3 dec│  │cam 4 dec│   ← thread per camera
└────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
     ▼            ▼            ▼            ▼
┌──────────────────────────────────┐
│      frame queue (bounded)       │
└────────────┬─────────────────────┘
             ▼
┌──────────────────────────────────┐
│       model.predict(batch)       │   ← single GPU consumer
└────────────┬─────────────────────┘
             ▼
┌──────────────────────────────────┐
│  per-stream postprocess + sinks  │
└──────────────────────────────────┘

Built-in batched streams
Ultralytics has a built-in shortcut: pass a list of sources and stream=True to Predict mode.
from ultralytics import YOLO

model = YOLO("yolo26n.engine")  # use the optimized engine

sources = [
    "rtsp://camera-1.local/stream",
    "rtsp://camera-2.local/stream",
    "rtsp://camera-3.local/stream",
    "rtsp://camera-4.local/stream",
]

for results in model.track(sources, stream=True, persist=True):
    # `results` is a list — one per source
    for stream_idx, r in enumerate(results):
        for box in r.boxes:
            # process box; box.id is per-stream-and-class unique
            pass

Under the hood, Ultralytics opens a thread per source, decodes in parallel, and batches frames into one model call when they arrive close in time. Pair this with a TensorRT or OpenVINO engine for the best per-stream throughput.
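The same batched behaviour also works from a plain text source list: Ultralytics accepts a .streams file with one stream URL per line as the source argument. A minimal sketch (the file name and its contents are placeholders):

from ultralytics import YOLO

# cameras.streams is a text file with one RTSP/HTTP stream URL per line.
model = YOLO("yolo26n.engine")
for results in model.track("cameras.streams", stream=True, persist=True):
    ...  # process results exactly as in the example above

This keeps camera lists in configuration rather than code, which is convenient once the fleet grows.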
When you need a custom architecture
The built-in is great up to ~16 streams. Above that, or when you have non-trivial postprocessing (DB writes, network sinks), a custom architecture wins:
- Decode in subprocesses for true parallelism (the Python GIL doesn't block native decode, but it does serialize any Python work in the loop).
- Use a frame queue with a max size. If decode is faster than inference, drop old frames rather than blow memory.
- Pin the inference thread to one GPU and pre-load the engine.
import queue
import threading
import time

import cv2
import numpy as np
from ultralytics import YOLO

# Same camera list as the built-in example above.
sources = [
    "rtsp://camera-1.local/stream",
    "rtsp://camera-2.local/stream",
    "rtsp://camera-3.local/stream",
    "rtsp://camera-4.local/stream",
]

frame_queue: queue.Queue = queue.Queue(maxsize=64)
stop = threading.Event()


def decoder(source, stream_id):
    """Decode one camera in its own thread and push frames onto the shared queue."""
    cap = cv2.VideoCapture(source)
    while not stop.is_set():
        ok, frame = cap.read()
        if not ok:
            break
        try:
            frame_queue.put((stream_id, frame), timeout=0.1)
        except queue.Full:
            pass  # consumer is behind: drop this frame rather than grow memory
    cap.release()


def consumer():
    """Single GPU consumer: pull up to 8 queued frames and run one batched call."""
    model = YOLO("yolo26n.engine")  # pre-load the engine once, in the inference thread
    while not stop.is_set():
        batch = []
        ids = []
        for _ in range(8):
            try:
                sid, f = frame_queue.get(timeout=0.05)
            except queue.Empty:
                break
            ids.append(sid)
            batch.append(f)
        if not batch:
            continue
        results = model(batch, verbose=False)
        for sid, r in zip(ids, results):
            # postprocess per stream
            pass


threads = [threading.Thread(target=decoder, args=(src, i), daemon=True) for i, src in enumerate(sources)]
for t in threads:
    t.start()
threading.Thread(target=consumer, daemon=True).start()

# Keep the main thread alive; Ctrl-C signals the workers to stop.
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    stop.set()

If your inference can't keep up, queueing every frame just turns the backlog into permanently higher latency. Set a max queue size, drop the oldest frames, and measure the drop rate. Real-time CV systems should drop frames during overload, not lag.
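The decoder above drops the newest frame when the queue is full; a latest-wins policy (evict the oldest, keep the newest) is usually what you want for live video, and counting evictions gives you the drop rate to watch. A sketch of such a helper; the function and counter names are made up, and the global counter is only approximate under concurrency:

import queue

dropped = 0  # approximate count of dropped frames; read it periodically to compute a drop rate


def put_latest(q: queue.Queue, item) -> None:
    """Insert item, evicting the oldest queued frame if the queue is full."""
    global dropped
    try:
        q.put_nowait(item)
    except queue.Full:
        try:
            q.get_nowait()  # evict the oldest frame to make room
            dropped += 1
        except queue.Empty:
            pass  # another thread drained the queue first
        try:
            q.put_nowait(item)
        except queue.Full:
            dropped += 1  # still full; give up and drop the new frame instead

Swap this in for frame_queue.put(...) in the decoder and log dropped divided by total decoded frames; a persistently non-zero drop rate at normal load means the GPU side needs a faster engine, a bigger batch, or fewer cameras.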
Capacity planning
Rough rule of thumb on a single A10/T4 GPU at 640×640:
- yolo26n FP16 engine: ~50–80 streams at 5–10 fps.
- yolo26s FP16 engine: ~25–40 streams at 5–10 fps.
- yolo26m FP16 engine: ~12–20 streams.
For higher per-stream fps, scale the stream count down proportionally. For higher resolution (1280×1280), divide by roughly 4. Always benchmark on the real hardware before sizing the fleet. For larger fleets, Triton Inference Server and DeepStream on Jetson are the natural production homes.
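The sizing arithmetic behind those numbers is just engine throughput divided by per-stream rate, with some headroom for decode jitter. A back-of-the-envelope helper (the example numbers are illustrative, not benchmarks):

def max_streams(engine_fps: float, per_stream_fps: float, headroom: float = 0.8) -> int:
    """Estimate how many cameras one engine can serve, reserving headroom for decode and postprocess."""
    return int(engine_fps * headroom / per_stream_fps)


# e.g. an engine you benchmarked at ~400 inferences/s, with cameras sampled at 5 fps:
print(max_streams(engine_fps=400, per_stream_fps=5))  # -> 64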
Run 4 video files (or 4 RTSP cams) through model.track(sources, stream=True). Measure end-to-end fps per stream. Try doubling to 8 sources. Note where it stops keeping up.
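One way to get the per-stream fps number for this exercise is to count how many results each source produces over a fixed window. A sketch that mirrors the structure of the built-in example above (the file names are placeholders; substitute your own files or RTSP URLs):

import time
from collections import Counter

from ultralytics import YOLO

model = YOLO("yolo26n.engine")
sources = ["video1.mp4", "video2.mp4", "video3.mp4", "video4.mp4"]  # or RTSP URLs

frame_counts = Counter()
start = time.perf_counter()
elapsed = 0.0

for results in model.track(sources, stream=True, persist=True):
    for stream_idx, r in enumerate(results):
        frame_counts[stream_idx] += 1
    elapsed = time.perf_counter() - start
    if elapsed > 60:  # measure for about a minute
        break

for stream_idx, n in sorted(frame_counts.items()):
    print(f"stream {stream_idx}: {n / max(elapsed, 1e-6):.1f} fps")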
- You can run 4+ sources through a single model and read per-stream results.
- You've measured your effective per-stream fps on real hardware.
- Your architecture drops frames cleanly under overload instead of growing memory.
We can run at scale. Now: monitoring it well enough to notice when something goes wrong.