Deploy a Model
From best.pt to a managed inference endpoint with one click.
Platform deployment is the same idea as Vercel for code: pick a region, click deploy, get a URL. Authentication, scaling, export-format optimization, and model-deployment observability come bundled. Your job is to point traffic at it and wire your client correctly.
Deploy a trained model as a managed endpoint and call it from your application.
- Pick the run, pick a region, click Deploy.
- Endpoint URL + API key.
- Test with `curl` or the SDK.
- Wire your application client; switch traffic when ready.
Hands-on
What you get

A Platform deployment is:
- A unique HTTPS endpoint URL like `https://predict-abc123.run.app/predict`.
- An API key for auth (Bearer token).
- A single-tenant Cloud Run service with scale-to-zero behavior — you only pay for active inference time.
- Built-in latency, error-rate, and request-volume dashboards plus structured logs.
- Tied to a specific model run and region.
Deploy
In the UI: Deploy → New Deployment (or click Deploy on any model's Deploy tab). Pick:
| Setting | Notes |
|---|---|
| Model | The training run with the model you want to deploy |
| Region | One of 43 global regions — pick the closest to your callers |
| Deployment name | Auto-generated (e.g. yolo26n-iowa); editable in the dialog |
| CPU / Memory | Default 1 vCPU, 2 GiB — fixed for now; scale-to-zero is on |
Click Deploy. After ~5–45 seconds (cold-start, depending on whether the container image is already cached in the region) you have a Ready endpoint.
Calling the endpoint
```bash
curl -X POST "https://predict-abc123.run.app/predict" \
  -H "Authorization: Bearer $ULTRALYTICS_API_KEY" \
  -F "file=@test.jpg" \
  -F "conf=0.25" \
  -F "iou=0.7" \
  -F "imgsz=640"
```

That returns JSON with per-image detections in the following format:
```json
{
  "images": [
    {
      "shape": [1080, 1920],
      "results": [
        {
          "class": 0,
          "name": "forklift",
          "confidence": 0.92,
          "box": {"x1": 120, "y1": 234, "x2": 480, "y2": 700}
        }
      ],
      "speed": {"preprocess": 1.2, "inference": 12.5, "postprocess": 2.3}
    }
  ],
  "metadata": {"imageCount": 1, "model": "model.pt"}
}
```

The same request from Python:

```python
import os

import requests

url = "https://predict-abc123.run.app/predict"
# Read the API key from the environment rather than hard-coding it.
headers = {"Authorization": f"Bearer {os.environ['ULTRALYTICS_API_KEY']}"}
data = {"conf": 0.25, "iou": 0.7, "imgsz": 640}

# Send the image as multipart form data, matching the curl -F fields.
with open("test.jpg", "rb") as f:
    response = requests.post(url, headers=headers, data=data, files={"file": f})

response.raise_for_status()  # fail on 4xx/5xx instead of printing an error body
print(response.json())
```

The deployment card's Code tab shows ready-to-paste Python, JavaScript, and cURL snippets with your real endpoint URL and API key already filled in.
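Once the JSON comes back, most clients only need a flat list of detections. The sketch below walks `images[].results[]` from the response format shown earlier and keeps detections at or above a confidence threshold. The function name `extract_detections`, the `min_confidence` parameter, and the output field names are illustrative choices, not part of the Platform API.

```python
def extract_detections(payload: dict, min_confidence: float = 0.5) -> list:
    """Flatten images[].results[] into a list of detections.

    Assumes the response shape documented on this page; the output
    field names are hypothetical and chosen for illustration only.
    """
    detections = []
    for image_index, image in enumerate(payload.get("images", [])):
        for result in image.get("results", []):
            if result["confidence"] < min_confidence:
                continue  # skip low-confidence detections
            box = result["box"]
            detections.append(
                {
                    "image_index": image_index,
                    "name": result["name"],
                    "confidence": result["confidence"],
                    "width": box["x2"] - box["x1"],
                    "height": box["y2"] - box["y1"],
                }
            )
    return detections


# Sample payload copied from the documented response format.
sample = {
    "images": [
        {
            "shape": [1080, 1920],
            "results": [
                {
                    "class": 0,
                    "name": "forklift",
                    "confidence": 0.92,
                    "box": {"x1": 120, "y1": 234, "x2": 480, "y2": 700},
                }
            ],
            "speed": {"preprocess": 1.2, "inference": 12.5, "postprocess": 2.3},
        }
    ],
    "metadata": {"imageCount": 1, "model": "model.pt"},
}

for detection in extract_detections(sample):
    print(detection["name"], detection["width"], detection["height"])
```

In a real client you would pass `response.json()` instead of `sample`.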
Cold starts and scale-to-zero
Dedicated endpoints scale to zero when idle — you only pay for active inference time, but the first request after idle pays a cold start. Typical cold-start ranges:
| Scenario | Cold start |
|---|---|
| Cached container in the region | ~5–15 seconds |
| First deploy in a region | ~15–45 seconds |
For real-time clients (browsers, IoT, edge AI callbacks) you have two options:
- Send a periodic warmup request to keep a replica live.
- For lower network latency or availability, deploy in multiple regions and route to the closest one — but warm each region that must avoid cold starts.
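The periodic warmup option can be sketched as a small loop. This is a minimal illustration, not a Platform feature: `keep_warm` and its parameters are hypothetical names, `send_request` is any function you supply that posts a small image to the endpoint and returns the HTTP status code, and the right interval depends on how long the service stays alive when idle, which this page does not specify.

```python
import time
from typing import Callable, List


def keep_warm(
    send_request: Callable[[], int],
    interval_seconds: float,
    max_pings: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Call send_request max_pings times, pausing interval_seconds between calls.

    Returns the status code from each ping so the caller can spot failures.
    The sleep function is injectable so the loop can be tested without waiting.
    """
    statuses = []
    for ping in range(max_pings):
        statuses.append(send_request())
        if ping < max_pings - 1:
            sleep(interval_seconds)  # no pause needed after the final ping
    return statuses
```

In practice a scheduled job (cron or a cloud scheduler) that sends one request per run does the same thing without a long-lived process. Keep in mind that each warmup request counts as active inference time.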
For batch jobs that tolerate cold starts, scale-to-zero is the right default: it keeps the cost line flat through quiet periods. The throughput-vs-latency, model deployment options, and model deployment best practices guides go deeper on the tradeoffs.
The endpoint accepts requests with a valid Bearer token. Don't ship the token to the browser. Keep it on a server proxy that adds the header to incoming requests.
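The header handling in such a proxy might look like the sketch below. It is an illustration under stated assumptions: `build_upstream_headers` and the `FORWARDED_HEADERS` allow-list are hypothetical names, and the request body would be forwarded separately by whatever HTTP client your server uses.

```python
# Headers the proxy is willing to pass through from the browser (an assumption;
# adjust to your needs). Everything else, including any client-supplied
# Authorization or Cookie header, is dropped.
FORWARDED_HEADERS = {"accept", "user-agent"}


def build_upstream_headers(client_headers: dict, api_key: str) -> dict:
    """Build the headers for the call from your server to the endpoint.

    The Bearer token comes from server-side configuration only, so it is
    never present in anything the browser sends or receives.
    """
    upstream = {
        name: value
        for name, value in client_headers.items()
        if name.lower() in FORWARDED_HEADERS
    }
    upstream["Authorization"] = f"Bearer {api_key}"
    return upstream


# Example: a browser request that tries to supply its own credentials.
incoming = {
    "Accept": "application/json",
    "authorization": "Bearer attacker-supplied",
    "Cookie": "session=abc",
}
print(build_upstream_headers(incoming, api_key="server-side-secret"))
```

On the server, the key would typically be read from an environment variable such as `ULTRALYTICS_API_KEY`, as in the client examples on this page.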
Replacing or rolling back
Regions are fixed once chosen — to move a deployment to a new region or swap the underlying model, delete the existing endpoint and create a new one. The replacement gets a fresh URL, so plan a brief client-config swap. For a true zero-downtime cutover, deploy the new model in a second region or to a new service ahead of time and switch traffic at your DNS or load balancer.
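If you control the client or a proxy in front of it, one way to stage the swap is a deterministic percentage split between the old and new endpoint URLs. This is a sketch of a client-side alternative to switching at DNS or a load balancer, not a Platform feature; `choose_endpoint`, `new_percent`, and the `predict-def456` URL are hypothetical.

```python
import hashlib

OLD_URL = "https://predict-abc123.run.app/predict"  # example URL used on this page
NEW_URL = "https://predict-def456.run.app/predict"  # hypothetical replacement endpoint


def choose_endpoint(
    request_id: str,
    new_percent: int,
    old_url: str = OLD_URL,
    new_url: str = NEW_URL,
) -> str:
    """Route roughly new_percent percent of request IDs to the new endpoint.

    Hashing the ID keeps the choice stable, so the same caller or request
    always lands on the same endpoint for a given percentage.
    """
    if not 0 <= new_percent <= 100:
        raise ValueError("new_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return new_url if bucket < new_percent else old_url
```

Raising `new_percent` from 0 to 100 moves traffic over gradually, and setting it back to 0 is the rollback, as long as the old endpoint has not been deleted yet.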
If you'd rather self-host, Triton Inference Server is the canonical option for a YOLO-based gateway running in your own cloud.
Custom domain
Custom domains are coming soon. Today, endpoints use the auto-generated platform URL — front them with your own gateway (Cloudflare, an API router on a domain you own) if you need a branded host name.
Pricing and quotas
Basic dedicated endpoints are free on all plans today; higher-resource configurations (more vCPUs, more memory, warm-start) will be usage-based in the future. Endpoint count limits depend on plan:
| Plan | Endpoints |
|---|---|
| Free | Up to 3 |
| Pro | Up to 10 |
| Enterprise | Unlimited |
Dedicated endpoints are not subject to the Platform shared-inference rate limits — throughput is bounded only by the endpoint's compute. If you ever need to pause billing without losing the URL, the deployment card has a Stop action that suspends the service (resume any time).
Deploy your trained model as a Platform endpoint. Hit it with curl and Python. Time the cold-start request (5–45 s) and a warm follow-up — the gap is what your users see after a quiet period.
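A small timing helper makes the comparison concrete. In this sketch, `time_call` and `cold_vs_warm` are hypothetical names, and `send_request` is a function you provide that performs one request against your endpoint (for example, the `requests.post` call shown earlier). Run it after the endpoint has been idle long enough to scale to zero.

```python
import time
from typing import Callable, Dict


def time_call(send_request: Callable[[], object]) -> float:
    """Return the wall-clock seconds one call to send_request takes."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def cold_vs_warm(send_request: Callable[[], object]) -> Dict[str, float]:
    """Time two back-to-back requests and report the gap between them.

    The first call is only a true cold start if the endpoint was idle
    beforehand; the second call should hit the already-running replica.
    """
    cold = time_call(send_request)
    warm = time_call(send_request)
    return {"cold_seconds": cold, "warm_seconds": warm, "gap_seconds": cold - warm}
```

A single pair of measurements is noisy, so repeating the experiment a few times after separate idle periods gives a more trustworthy picture.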
- Your model is deployed at a Platform endpoint.
- You've called it from `curl` and gotten a JSON response with `images[].results[]`.
- You've decided on a warm-up strategy (none, periodic ping, or multi-region) based on your latency tolerance.
We're live. Now we have to know if we're still right — that's monitoring.