
Choose a Deployment Target

Map your hardware and constraints to the runtime that wins on it.

There is no universally best runtime. There is a best runtime for your hardware, your latency budget, and your operational constraints. The model deployment options guide covers the broader landscape, and the model deployment glossary entry frames the trade-offs; picking deliberately early saves you weeks of "but it ran fine on my laptop" debugging.

Outcome

Pick a deployment target — server GPU, edge device, mobile, browser — and name the matching runtime and export format.

Fast Track
If you already know your way around, here's the short version; an export sketch follows the list.
  1. Server NVIDIA GPU → TensorRT (engine).

  2. Server CPU (Intel) → OpenVINO (openvino).

  3. Cross-platform server → ONNX Runtime (onnx).

  4. iOS / macOS → CoreML (coreml). Android → TFLite (tflite).

  5. Browser → ONNX Runtime Web or TF.js (with a small model).
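
As a minimal sketch, here are the matching exports via the Ultralytics Python API. Here yolo26n.pt stands in for your own weights; each call writes the exported model next to the checkpoint.

```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")  # your trained weights

model.export(format="engine")    # server NVIDIA GPU → TensorRT (build on a machine with the GPU)
model.export(format="openvino")  # Intel CPU → OpenVINO
model.export(format="onnx")      # cross-platform server → ONNX Runtime
model.export(format="coreml")    # iOS / macOS → CoreML (typically exported on macOS)
model.export(format="tflite")    # Android → TFLite
```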

Hands-on

The matrix

[Figure: Ultralytics Platform deploy page world map with overview cards]

| Hardware | Best runtime | Format | Typical speedup vs PyTorch |
| --- | --- | --- | --- |
| NVIDIA dGPU / Jetson | TensorRT | engine | 2–4× (FP16), 4–8× (INT8) |
| Intel CPU | OpenVINO | openvino | 1.5–3× |
| AMD CPU / generic | ONNX Runtime | onnx | 1.2–2× |
| Apple Silicon | CoreML | coreml | 2–5× via ANE |
| Browser | ONNX Runtime Web | onnx | Highly variable |

Wait, the Android row:

| Android | TFLite | tflite | 1.5–3× (INT8) |

Decision questions, in order

Walk this list before picking. Stop at the first answer that locks in a runtime.

  1. Where does the model run? Cloud, on-prem server, edge box, phone, browser? This is by far the biggest constraint.
  2. What's the budget? A 4×A100 server happily runs yolo26x; a Jetson Nano won't survive past yolo26n.
  3. What's the latency target? "Realtime" can mean 30 fps or 5 fps. Be specific: the runtime you pick has to meet your inference latency budget at the model size you can afford (see the timing sketch after this list).
  4. What's the SDK constraint? "Must use Python on Linux" is different from "must call from Swift on iOS." Some runtimes have great C++ APIs, others don't.
  5. What's the fleet size? One deployment vs 10,000 edge devices changes whether per-device engine builds are realistic.
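
A minimal way to sanity-check a latency budget on the target device, assuming the yolo26n.pt checkpoint; bus.jpg is a placeholder image, so use frames representative of your actual workload:

```python
from ultralytics import YOLO

BUDGET_MS = 33.3  # e.g. "realtime" at 30 fps → ~33 ms per frame
model = YOLO("yolo26n.pt")

model("bus.jpg")  # warm-up run so lazy initialization doesn't skew timings

results = model("bus.jpg")
speed = results[0].speed  # ms per stage: preprocess / inference / postprocess
total_ms = sum(speed.values())
verdict = "within" if total_ms <= BUDGET_MS else "over"
print(f"{speed} -> {total_ms:.1f} ms total ({verdict} budget)")
```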

Don't optimize for hypothetical portability

A common mistake: pick ONNX everywhere "for portability," then run on a single NVIDIA fleet for 18 months. ONNX is fine, but TensorRT would have been 3× faster and saved a lot of GPU spend.

Optimize the runtime for the deployment that actually exists today.

If the deployment changes later, re-export. Exports are cheap; perpetual underperformance is not.

TensorRT engines are device-specific

A TensorRT engine compiled on an A100 will not run on a Jetson Orin. For edge fleets, you either build engines on-device at install time (slower first boot, but portable across SKUs) or keep a build matrix per device class.
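
A sketch of the build-on-device-at-install-time approach, assuming Ultralytics is installed on the device and best.pt ships with the image; export writes best.engine next to the weights by default:

```python
from pathlib import Path
from ultralytics import YOLO

ENGINE = Path("best.engine")

# Build the engine once, on first boot, on this exact device.
# The resulting .engine file is tied to this GPU and TensorRT version.
if not ENGINE.exists():
    YOLO("best.pt").export(format="engine", half=True)  # slow first boot

model = YOLO(str(ENGINE))  # fast on every boot after that
```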

Edge-specific gotchas

Edge devices add operational concerns the cloud doesn't. The Raspberry Pi and DeepStream on Jetson guides cover the most common targets in detail.

  • Cold start. TensorRT engine builds can take 10–60 seconds the first time. Bake the engine into the device image.
  • Power. Sustained yolo26m at 30 fps on a Jetson eats real wattage. Plan thermal headroom.
  • Memory. Edge SoCs are tighter than they look. INT8 buys you headroom for image preprocessing buffers and tracker state (see the export sketch after this list).
  • Updates. How does a new best.pt get to the device? OTA infrastructure is half of edge ML — a Docker-based image makes this far less painful.
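
A minimal INT8 export sketch covering the two common edge paths; coco8.yaml is a stand-in calibration dataset, so calibrate on images from your own domain:

```python
from ultralytics import YOLO

model = YOLO("best.pt")

# TensorRT INT8 (Jetson / NVIDIA edge) needs calibration data
model.export(format="engine", int8=True, data="coco8.yaml")

# TFLite INT8 (Android and other ARM SoCs)
model.export(format="tflite", int8=True)
```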

Where to validate

Whatever runtime you pick, validate on the target hardware before committing. Numbers from a different machine are noise. The cheapest version of validation is Benchmark mode:

```bash
yolo benchmark model=yolo26n.pt imgsz=640
```

That sweeps PyTorch / ONNX / OpenVINO / TensorRT / etc. and reports fps + mAP per format. Run it on the device you'll deploy to. For cloud fleets, Triton Inference Server is the natural next step once you've picked a runtime.
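
The same sweep is available from Python if you'd rather script it, via the benchmark helper in ultralytics.utils.benchmarks:

```python
from ultralytics.utils.benchmarks import benchmark

# Exports to each format, runs inference, and validates accuracy on this machine
benchmark(model="yolo26n.pt", imgsz=640)
```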

Try It

Make a list: deployment hardware, latency target, model size budget, SDK constraint. Pick a runtime that satisfies all four. If two satisfy them, pick the one with better tooling on the target — usually TensorRT on NVIDIA, OpenVINO on Intel, ONNX everywhere else.

Done When
You've finished the lesson when all of these are true.
  • You can name the deployment hardware and the matching runtime.

  • You've thought through fleet size, SDK, and update story — not just latency.

  • You've verified the runtime is even available on your target before exporting.

What's next

We've picked a target. Next: actually export and verify the model.