Build with Ultralytics Platform · Deploy and Monitor · Lesson 7/10 · Intermediate

Deploy a Model

From best.pt to a managed inference endpoint with one click.

Deploying on Platform is the same idea as Vercel for code: pick a region, click deploy, get a URL. Authentication, scaling, export-format optimization, and observability come bundled. Your job is to point traffic at the endpoint and wire your client correctly.

Outcome

Deploy a trained model as a managed endpoint and call it from your application.

Fast Track
If you already know your way around, here's the short version.
  1. Pick the run, pick a region, click Deploy.

  2. Endpoint URL + API key.

  3. Test with curl or the SDK.

  4. Wire your application client; switch traffic when ready.

Hands-on

What you get

[Screenshot: Ultralytics Platform Deploy tab showing the region map with per-region latency]

A Platform deployment is:

  • An HTTPS endpoint URL (https://api.platform.ultralytics.com/v1/predict/<deployment-id>).
  • An API key for auth.
  • Auto-scaled compute behind it (the deployment scales to traffic).
  • Built-in latency and accuracy dashboards.
  • Versioned: each deployment is tied to a specific run + checkpoint.

Deploy

In the UI: Deploy → New Deployment. Pick:

Setting            | Notes
Source run         | The training run with the model you want to deploy
Region             | Place compute close to where calls come from
Hardware           | CPU for low-volume / cheap; GPU for low-latency
Min/max replicas   | Min ≥ 1 to avoid cold starts; max bounded by your budget
Auto-export format | Platform exports to the right format for the hardware automatically

Click Deploy. After ~30 seconds you have an endpoint.

Calling the endpoint

curl -X POST "https://api.platform.ultralytics.com/v1/predict/<id>" \
  -H "Authorization: Bearer $ULTRALYTICS_API_KEY" \
  -F "image=@test.jpg"

That returns JSON with detections:

{
  "deployment": "forklift_v3",
  "model_version": "v3_pallets_added",
  "detections": [
    {
      "class": "forklift",
      "class_id": 0,
      "confidence": 0.92,
      "box": [120, 234, 480, 700]
    }
  ],
  "inference_ms": 18,
  "request_id": "req_abc123"
}
Or from Python, with the SDK:

import os

from ultralytics_platform import Client

client = Client(api_key=os.environ["ULTRALYTICS_API_KEY"])
result = client.predict("forklift_v3", image="test.jpg")
for det in result.detections:
    print(f"{det.class_name} {det.confidence:.2f}")

Latency vs cost

Two knobs you'll touch most:

  • Min replicas. Min = 0 is cheapest, but the first request after a quiet period pays a 5–20 s cold start. Min = 1 keeps a hot replica waiting.
  • Hardware tier. GPU vs CPU, and smaller GPUs vs bigger ones. The throughput-vs-latency guide is a good companion when you need to choose between batched and per-request modes.

For real-time clients (browsers, IoT devices, edge deployments calling back into Platform), min ≥ 1 is mandatory. For batch jobs that tolerate cold starts, min = 0 is correct and saves a lot of money. The model deployment options and best-practices guides go deeper on the cost/latency tradeoffs.

Auth keys are secrets

The endpoint accepts any request that carries a valid Bearer token, so don't ship the token to the browser. Keep it on a server-side proxy that adds the Authorization header before forwarding requests to the endpoint.
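
A minimal sketch of such a proxy, assuming FastAPI and the requests library; the /detect route, deployment id, and environment variable name are illustrative.

import os

import requests
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
UPSTREAM = "https://api.platform.ultralytics.com/v1/predict/<deployment-id>"
HEADERS = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}

@app.post("/detect")
async def detect(image: UploadFile = File(...)):
    # Forward the uploaded image to the Platform endpoint, adding the auth
    # header server-side so the API key never reaches the client.
    upstream = requests.post(
        UPSTREAM,
        headers=HEADERS,
        files={"image": (image.filename, await image.read())},
        timeout=30,
    )
    return upstream.json()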

Versioned deploys and rollbacks

Each deployment is tied to a specific run. To deploy a new model:

  1. Train v4. Validate. Decide to ship.
  2. Click "Update deployment" → pick the v4 run.
  3. Platform rolls v4 in alongside v3, then switches traffic.
  4. If v4 misbehaves, click "Rollback" — instant return to v3.

This is the same gradual rollout pattern Vercel uses for code. Use it; never edit a live deployment by swapping files in place. If you'd rather self-host, Triton Inference Server is the canonical option for serving YOLO models in your own cloud.
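
During a rollout you can confirm which version is actually serving by reading the model_version field shown in the response above. A small sketch, with the deployment id and test image as placeholders:

import os
import time

import requests

ENDPOINT = "https://api.platform.ultralytics.com/v1/predict/<deployment-id>"
HEADERS = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}

# Poll the endpoint and print which model version answered each request,
# useful while traffic is being switched from v3 to v4.
for _ in range(10):
    with open("test.jpg", "rb") as f:
        resp = requests.post(ENDPOINT, headers=HEADERS, files={"image": f}, timeout=30)
    print(resp.json().get("model_version"))
    time.sleep(5)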

Custom domain

Production usually wants detect.your-company.com instead of the platform-hosted URL. Add a CNAME record pointing your domain at the deployment URL and Platform handles TLS.

Billing and quotas

Platform charges per inference (or per replica-hour for always-on deployments). Set quotas:

  • Hard quota → endpoint returns 429 above the quota.
  • Soft quota → endpoint warns; you can decide policy.

Quotas prevent a runaway client from blowing your budget.
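
On the client side, plan for the 429 a hard quota returns. A sketch of retry-with-backoff, assuming the requests library; the retry limits are illustrative.

import os
import time

import requests

ENDPOINT = "https://api.platform.ultralytics.com/v1/predict/<deployment-id>"
HEADERS = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}

def predict_with_backoff(image_path, retries=5):
    # Retry on 429 (quota exceeded) with exponential backoff instead of
    # hammering the endpoint; raise on any other HTTP error.
    delay = 1.0
    for _ in range(retries):
        with open(image_path, "rb") as f:
            resp = requests.post(ENDPOINT, headers=HEADERS, files={"image": f}, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("still over quota after retries")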

Try It

Deploy your trained model as a Platform endpoint. Hit it with curl and the SDK. Note the cold-start latency on the first request and the warm latency on the second. Write down the difference — it's what your users will experience after a quiet period.
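
A sketch for timing the two requests back to back, assuming the requests library; the deployment id and image path are placeholders.

import os
import time

import requests

ENDPOINT = "https://api.platform.ultralytics.com/v1/predict/<deployment-id>"
HEADERS = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}

def timed_call(label):
    # Measure end-to-end latency for one request against the endpoint.
    start = time.perf_counter()
    with open("test.jpg", "rb") as f:
        requests.post(ENDPOINT, headers=HEADERS, files={"image": f}, timeout=60)
    print(f"{label}: {time.perf_counter() - start:.2f} s")

timed_call("cold (first request after idle)")
timed_call("warm (immediately after)")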

Done When
You've finished the lesson when all of these are true.
  • Your model is deployed at a Platform endpoint.

  • You've called it from curl and gotten a JSON response.

  • You've decided on min replicas based on your latency tolerance.

What's next

We're live. Now we have to know if we're still right — that's monitoring.