Choose a Deployment Target
Map your hardware and constraints to the runtime that wins on it.
There is no universally best runtime. There is a best runtime for your hardware, your latency budget, and your operational constraints. The model deployment options guide covers the broader landscape, and the model deployment glossary entry frames the trade-offs; picking deliberately early saves you weeks of "but it ran fine on my laptop" debugging.
Pick a deployment target — server GPU, edge device, mobile, browser — and name the matching runtime and export format.
- Server NVIDIA GPU → TensorRT (`engine`)
- Server CPU (Intel) → OpenVINO
- Cross-platform server → ONNX Runtime (`onnx`)
- iOS / macOS → CoreML; Android → TFLite
- Browser → ONNX Runtime Web or TF.js (with a small model)
Hands-on
The matrix

| Hardware | Best runtime | Format | Typical speedup vs PyTorch |
|---|---|---|---|
| NVIDIA dGPU / Jetson | TensorRT | engine | 2–4× (FP16), 4–8× (INT8) |
| Intel CPU | OpenVINO | openvino | 1.5–3× |
| AMD CPU / generic | ONNX Runtime | onnx | 1.2–2× |
| Apple Silicon | CoreML | coreml | 2–5× via ANE |
| Android | TFLite | tflite | 1.5–3× (INT8) |
| Browser | ONNX Runtime Web | onnx | Highly variable; nano/small models only |
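
If you're using the Ultralytics Python API, each row of the matrix maps to one `format` string in `model.export()`. A minimal sketch (the weights file and flags are illustrative; check the export docs for your release):

```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")  # any trained checkpoint

# Pick the format string that matches your row in the matrix above.
model.export(format="engine", half=True)    # NVIDIA dGPU / Jetson: TensorRT, FP16
# model.export(format="openvino")           # Intel CPU
# model.export(format="onnx")               # AMD CPU / generic (also ONNX Runtime Web)
# model.export(format="coreml")             # Apple Silicon
# model.export(format="tflite", int8=True)  # Android, INT8
```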
Decision questions, in order
Walk this list before picking. Stop at the first answer that locks in a runtime.
- Where does the model run? Cloud, on-prem server, edge box, phone, browser? This is by far the biggest constraint.
- What's the budget? A 4×A100 server happily runs `yolo26x`; a Jetson Nano won't survive past `yolo26n`.
- What's the latency target? "Realtime" can mean 30 fps or 5 fps. Be specific. The runtime you pick has to meet the inference latency budget at the model size you can afford.
- What's the SDK constraint? "Must use Python on Linux" is different from "must call from Swift on iOS." Some runtimes have great C++ APIs, others don't.
- What's the fleet size? One deployment vs 10,000 edge devices changes whether per-device engine builds are realistic.
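
For a concrete feel of that ordering, here is a minimal, hypothetical sketch: the first question (where the model runs) locks in the runtime, and the remaining questions mostly tune model size within it. The `pick_runtime` helper and its labels are illustrative, not a real API:

```python
def pick_runtime(location: str, silicon: str) -> tuple[str, str]:
    """Return (runtime, export format); stop at the first constraint that decides it."""
    if location == "browser":
        return "ONNX Runtime Web", "onnx"
    if location == "mobile":
        return ("CoreML", "coreml") if silicon == "apple" else ("TFLite", "tflite")
    # Server or edge box: let the silicon decide.
    if silicon == "nvidia":
        return "TensorRT", "engine"
    if silicon == "intel":
        return "OpenVINO", "openvino"
    return "ONNX Runtime", "onnx"

print(pick_runtime("edge", "nvidia"))  # ('TensorRT', 'engine')
```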
Don't optimize for hypothetical portability
A common mistake: pick ONNX everywhere "for portability," then run on a single NVIDIA fleet for 18 months. ONNX is fine, but TensorRT would have been 3× faster and saved a lot of GPU spend.
Optimize the runtime for the deployment that actually exists today.
If the deployment changes later, re-export. Exports are cheap; perpetual underperformance is not.
A TensorRT engine compiled on an A100 will not run on a Jetson Orin. For edge fleets, you either build engines on-device at install time (slower first boot, but portable across SKUs) or keep a build matrix per device class.
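
A minimal sketch of the on-device route, assuming the Ultralytics Python API and a first-boot script baked into the device image (paths are illustrative, and the sketch assumes the exported engine lands next to the `.pt` file):

```python
from pathlib import Path

from ultralytics import YOLO

WEIGHTS = Path("/opt/models/yolo26n.pt")  # shipped in the device image
ENGINE = WEIGHTS.with_suffix(".engine")   # built locally; only valid for this device's GPU

# Build the TensorRT engine on the device itself so it matches the local GPU/SoC.
# This is the slow first boot; every later boot just loads the cached engine.
if not ENGINE.exists():
    YOLO(str(WEIGHTS)).export(format="engine", half=True)

model = YOLO(str(ENGINE))  # inference from here on uses the device-specific engine
```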
Edge-specific gotchas
Edge devices add operational concerns the cloud doesn't. The Raspberry Pi and DeepStream on Jetson guides cover the most common targets in detail.
- Cold start. TensorRT engine builds can take 10–60 seconds the first time. Bake the engine into the device image.
- Power. Sustained `yolo26m` at 30 fps on a Jetson eats real wattage. Plan thermal headroom.
- Memory. Edge SoCs are tighter than they look. INT8 buys you headroom for image preprocessing buffers and tracker state.
- Updates. How does a new `best.pt` get to the device? OTA infrastructure is half of edge ML; a Docker-based image makes this far less painful.
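
For the update story, a heavily hedged sketch of the flow, assuming a hypothetical artifact URL you control (no such endpoint exists in any library); the point is just that the device pulls new weights and rebuilds its own engine:

```python
import urllib.request
from pathlib import Path

from ultralytics import YOLO

MODEL_URL = "https://example.com/fleet/best.pt"  # hypothetical OTA endpoint; replace with yours
WEIGHTS = Path("/opt/models/best.pt")

def update_model() -> None:
    """Fetch the latest best.pt and rebuild the local TensorRT engine for this device."""
    urllib.request.urlretrieve(MODEL_URL, WEIGHTS)         # pull new weights
    YOLO(str(WEIGHTS)).export(format="engine", half=True)  # rebuild for the local GPU
```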
Where to validate
Whatever runtime you pick, validate on the target hardware before committing. Numbers from a different machine are noise. The cheapest version of validation is Benchmark mode:
`yolo benchmark model=yolo26n.pt imgsz=640`

That sweeps PyTorch / ONNX / OpenVINO / TensorRT / etc. and reports fps + mAP per format. Run it on the device you'll deploy to. For cloud fleets, Triton Inference Server is the natural next step once you've picked a runtime.
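
If you'd rather drive the same sweep from Python, the Ultralytics benchmarks helper should do the equivalent (argument names can shift between releases, so treat this as a sketch):

```python
from ultralytics.utils.benchmarks import benchmark

# Exports the model to each available format, then reports speed and mAP per format.
# Run this on the device you intend to deploy to, not on your workstation.
benchmark(model="yolo26n.pt", imgsz=640, device=0)  # device=0 -> first CUDA GPU; use "cpu" otherwise
```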
Make a list: deployment hardware, latency target, model size budget, SDK constraint. Pick a runtime that satisfies all four. If two satisfy them, pick the one with better tooling on the target — usually TensorRT on NVIDIA, OpenVINO on Intel, ONNX everywhere else.
- You can name the deployment hardware and the matching runtime.
- You've thought through fleet size, SDK, and update story, not just latency.
- You've verified the runtime is even available on your target before exporting.
We've picked a target. Next: actually export and verify the model.