Build with Ultralytics Platform·Deploy and Monitor·Lesson 7/10
Lesson · Intermediate

Deploy a Model

From best.pt to a managed inference endpoint with one click.

Platform deployment is the same idea as Vercel for code: pick a region, click deploy, get a URL. Authentication, scaling, export-format optimization, and model-deployment observability come bundled. Your job is to point traffic at it and wire your client correctly.

Outcome

Deploy a trained model as a managed endpoint and call it from your application.

Fast Track
If you already know your way around, here's the short version.
  1. Pick the run, pick a region, click Deploy.

  2. Copy the endpoint URL and API key.

  3. Test with curl or the SDK.

  4. Wire your application client; switch traffic when ready.

Hands-on

What you get

[Figure: Ultralytics Platform Deploy tab with region map and latency]

A Platform deployment gives you:

  • A unique HTTPS endpoint URL like https://predict-abc123.run.app/predict.
  • An API key for auth, sent as a Bearer token.
  • A single-tenant Cloud Run service with scale-to-zero behavior, so you only pay for active inference time.
  • Built-in latency, error-rate, and request-volume dashboards plus structured logs.
  • A binding to one specific model run and region.

Deploy

In the UI: Deploy → New Deployment (or click Deploy on any model's Deploy tab). Pick:

Setting | Notes
Model | The training run with the model you want to deploy
Region | One of 43 global regions; pick the closest to your callers
Deployment name | Auto-generated (e.g. yolo26n-iowa); editable in the dialog
CPU / Memory | Default 1 vCPU, 2 GiB (fixed for now); scale-to-zero is on

Click Deploy. After ~5–45 seconds (cold-start, depending on whether the container image is already cached in the region) you have a Ready endpoint.

Calling the endpoint

curl -X POST "https://predict-abc123.run.app/predict" \
  -H "Authorization: Bearer $ULTRALYTICS_API_KEY" \
  -F "file=@test.jpg" \
  -F "conf=0.25" \
  -F "iou=0.7" \
  -F "imgsz=640"

The endpoint returns JSON with per-image detections in this format:

{
  "images": [
    {
      "shape": [1080, 1920],
      "results": [
        {
          "class": 0,
          "name": "forklift",
          "confidence": 0.92,
          "box": {"x1": 120, "y1": 234, "x2": 480, "y2": 700}
        }
      ],
      "speed": {"preprocess": 1.2, "inference": 12.5, "postprocess": 2.3}
    }
  ],
  "metadata": {"imageCount": 1, "model": "model.pt"}
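}

The same request in Python:

import os

import requests

url = "https://predict-abc123.run.app/predict"
# Read the API key from the environment; never hard-code it.
headers = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}
data = {"conf": 0.25, "iou": 0.7, "imgsz": 640}

with open("test.jpg", "rb") as f:
    # Multipart upload: the image goes in "file", thresholds in form fields.
    # The timeout leaves room for a cold start.
    response = requests.post(url, headers=headers, data=data, files={"file": f}, timeout=60)
response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
print(response.json())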

The deployment card's Code tab shows ready-to-paste Python, JavaScript, and cURL snippets with your real endpoint URL and API key already filled in.
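
Most clients flatten the response into a list of detections before using it. A minimal sketch, assuming the response always follows the format shown earlier in this section; `Detection` and `parse_detections` are illustrative names, not part of any Ultralytics SDK:

```python
from dataclasses import dataclass


@dataclass
class Detection:
    image_index: int
    name: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def parse_detections(payload: dict, min_confidence: float = 0.0) -> list:
    """Flatten images[].results[] into a list of Detection objects."""
    detections = []
    for index, image in enumerate(payload.get("images", [])):
        for result in image.get("results", []):
            if result["confidence"] < min_confidence:
                continue  # drop low-confidence detections on the client side
            box = result["box"]
            detections.append(
                Detection(
                    image_index=index,
                    name=result["name"],
                    confidence=result["confidence"],
                    box=(box["x1"], box["y1"], box["x2"], box["y2"]),
                )
            )
    return detections


# Sample payload copied from the response format in this lesson.
sample = {
    "images": [
        {
            "shape": [1080, 1920],
            "results": [
                {
                    "class": 0,
                    "name": "forklift",
                    "confidence": 0.92,
                    "box": {"x1": 120, "y1": 234, "x2": 480, "y2": 700},
                }
            ],
            "speed": {"preprocess": 1.2, "inference": 12.5, "postprocess": 2.3},
        }
    ],
    "metadata": {"imageCount": 1, "model": "model.pt"},
}

for det in parse_detections(sample):
    print(det.name, det.confidence, det.box)
```

In a real client you would pass `response.json()` instead of `sample`. The `min_confidence` filter is applied after the fact and is independent of the `conf` form field sent with the request.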

Cold starts and scale-to-zero

Dedicated endpoints scale to zero when idle — you only pay for active inference time, but the first request after idle pays a cold start. Typical cold-start ranges:

Scenario | Cold start
Cached container in the region | ~5–15 seconds
First deploy in a region | ~15–45 seconds

For real-time clients (browsers, IoT, edge AI callbacks) you have two options:

  • Send a periodic warmup request to keep a replica live.
  • For lower network latency or availability, deploy in multiple regions and route to the closest one — but warm each region that must avoid cold starts.
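
The first option can be sketched as a small loop that pings the endpoint on a fixed interval. This is a sketch under assumptions: the lesson does not say how long an idle replica stays alive, so the interval is a placeholder to tune, and `keep_warm` and `send_warmup` are hypothetical helper names. Each ping is an ordinary inference request, so expect it to appear in the request-volume dashboard.

```python
import time
from typing import Callable, List, Optional


def keep_warm(
    send: Callable[[], int],
    interval_s: float,
    max_pings: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[int]]:
    """Call send() max_pings times, sleeping interval_s between calls.

    send() should issue one small inference request and return the HTTP
    status code. Returns the status codes, with None for failed pings.
    """
    statuses: List[Optional[int]] = []
    for ping in range(max_pings):
        try:
            statuses.append(send())
        except Exception:
            # Keep the loop alive on transient network errors.
            statuses.append(None)
        if ping < max_pings - 1:
            sleep(interval_s)
    return statuses


def send_warmup(url: str, api_key: str, image_path: str) -> int:
    """Send one small image to the endpoint and return the status code."""
    import requests  # third-party; imported lazily so keep_warm has no dependencies

    with open(image_path, "rb") as f:
        response = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            timeout=60,  # generous, since a ping may itself hit a cold start
        )
    return response.status_code


# Example wiring (interval and ping count are assumptions, not Platform guidance):
# keep_warm(
#     lambda: send_warmup(url, os.environ["ULTRALYTICS_API_KEY"], "test.jpg"),
#     interval_s=240,
#     max_pings=15,
# )
```

For multi-region setups, run one such loop per region that must stay warm.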

For batch jobs that tolerate cold starts, scale-to-zero is the right default: it keeps costs flat through quiet periods. The guides on throughput vs. latency, model deployment options, and deployment practices go deeper on the tradeoffs.

Auth keys are secrets

The endpoint accepts any request that carries a valid Bearer token, so don't ship the token to the browser. Keep it on a server-side proxy that adds the Authorization header before forwarding requests to the endpoint.
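
One way to keep the key off the client is a thin server-side proxy. The sketch below uses only the Python standard library. The port, the `UPSTREAM_URL` value, and the names `build_upstream_request`, `PredictProxy`, and `serve` are assumptions for illustration, and a production proxy would also need its own authentication, rate limiting, and TLS.

```python
import os
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder endpoint from this lesson; replace with your deployment's URL.
UPSTREAM_URL = "https://predict-abc123.run.app/predict"


def build_upstream_request(body: bytes, content_type: str, api_key: str) -> urllib.request.Request:
    """Wrap the client's body in a request that carries the Bearer token."""
    return urllib.request.Request(
        UPSTREAM_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,  # keeps the multipart boundary intact
        },
    )


class PredictProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        request = build_upstream_request(
            body,
            self.headers.get("Content-Type", "application/octet-stream"),
            os.environ["ULTRALYTICS_API_KEY"],  # the key lives only on the server
        )
        try:
            with urllib.request.urlopen(request, timeout=60) as upstream:
                status, payload = upstream.status, upstream.read()
        except urllib.error.HTTPError as err:
            # Pass upstream error responses back to the caller unchanged.
            status, payload = err.code, err.read()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the proxy on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictProxy).serve_forever()
```

The browser then posts its multipart form to the proxy, and the token never leaves the server.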

Replacing or rolling back

Regions are fixed once chosen — to move a deployment to a new region or swap the underlying model, delete the existing endpoint and create a new one. The replacement gets a fresh URL, so plan a brief client-config swap. For a true zero-downtime cutover, deploy the new model in a second region or to a new service ahead of time and switch traffic at your DNS or load balancer.
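
If there is no load balancer in front of the endpoints, a gradual cutover can be approximated in the client instead. This is a stand-in for DNS or load-balancer switching, not a Platform feature; `choose_endpoint`, the second URL, and the rollout fraction are all hypothetical.

```python
import hashlib


def choose_endpoint(request_id: str, old_url: str, new_url: str, new_fraction: float) -> str:
    """Route a stable share of requests to new_url.

    new_fraction=0.0 keeps all traffic on old_url; 1.0 completes the cutover.
    Hashing the request ID keeps the choice deterministic per ID.
    """
    if not 0.0 <= new_fraction <= 1.0:
        raise ValueError("new_fraction must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return new_url if bucket < new_fraction else old_url


OLD_URL = "https://predict-abc123.run.app/predict"  # example URL from this lesson
NEW_URL = "https://predict-def456.run.app/predict"  # made-up URL for the replacement

# Start small, watch the new endpoint's dashboards, then raise the fraction.
target = choose_endpoint("request-0001", OLD_URL, NEW_URL, new_fraction=0.1)
```

Setting `new_fraction` back to 0.0 doubles as a rollback, provided the old endpoint has not been deleted yet.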

If you'd rather self-host, Triton Inference Server is the canonical option for a YOLO-based gateway running in your own cloud.

Custom domain

Custom domains are coming soon. Today, endpoints use the auto-generated platform URL — front them with your own gateway (Cloudflare, an API router on a domain you own) if you need a branded host name.

Pricing and quotas

Basic dedicated endpoints are free on all plans today; higher-resource configurations (more vCPUs, more memory, warm-start) will be usage-based in the future. Endpoint count limits depend on plan:

Plan | Endpoints
Free | Up to 3
Pro | Up to 10
Enterprise | Unlimited

Dedicated endpoints are not subject to the Platform shared-inference rate limits — throughput is bounded only by the endpoint's compute. If you ever need to pause billing without losing the URL, the deployment card has a Stop action that suspends the service (resume any time).

Try It

Deploy your trained model as a Platform endpoint. Hit it with curl and Python. Time the cold-start request (5–45 s) and a warm follow-up — the gap is what your users see after a quiet period.
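
A small timing helper makes the comparison concrete. This sketch assumes the endpoint has been idle long enough to scale to zero before the first call; `timed` and `compare_cold_and_warm` are illustrative helpers, and `call` stands for whatever function sends your real request.

```python
import time
from typing import Any, Callable, Tuple


def timed(call: Callable[[], Any]) -> Tuple[Any, float]:
    """Run call() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def compare_cold_and_warm(call: Callable[[], Any]) -> dict:
    """Time two back-to-back calls; the first is the presumed cold start."""
    _, cold = timed(call)
    _, warm = timed(call)
    return {"cold_s": cold, "warm_s": warm, "gap_s": cold - warm}


# Example wiring, where send_request is your own function that posts test.jpg:
# print(compare_cold_and_warm(send_request))
```

If the two numbers come out nearly equal, the endpoint was probably still warm, so wait longer before retrying.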

Done When
You've finished the lesson when all of these are true.
  • Your model is deployed at a Platform endpoint.

  • You've called it from curl and gotten a JSON response with images[].results[].

  • You've decided on a warm-up strategy (none, periodic ping, or multi-region) based on your latency tolerance.

What's next

We're live. Now we have to know if we're still right — that's monitoring.