Meet YOLO26: next-gen vision AI.
Build with Ultralytics Platform·Train and Evaluate·Lesson 5/10
Lessonintermediate

Cloud Training

One click, one model size, one GPU class — and a trained model when you come back.

Local training is fine for small experiments. Production training wants more GPU than your laptop has, in less time than spinning up a cloud instance takes. Platform cloud training runs are managed: pick a model and a dataset version, click train, walk away. The artifacts come back the same shape as the train mode outputs you already know from local runs.

Outcome

Launch a training run on Platform, configure the right defaults for your task, and pull the trained checkpoint locally.

Fast Track
If you already know your way around, here's the short version.
  1. Pick model size based on your latency budget (course 2 lesson 3).

  2. Pick GPU class based on dataset size and patience.

  3. Default hyperparameters are good — change only with a reason.

  4. Pull the resulting best.pt for local validation.

Hands-on

Link to this sectionLaunch a cloud training run#

Ultralytics Platform training dialog GPU selector and cost

In the Platform UI: Train → New Run. Pick:

SettingDefaultWhen to change
Dataset(your dataset version)Always pick a tagged version, never the live working state
Modelyolo26nLarger only if benchmarks demand it
Epochs100Lower for tiny datasets, higher for very large ones
Image size640320 / 416 / 512 / 640 / 1280 in the dropdown; any multiple of 32 from 32–4096 in the YAML editor
Batch size-1 (auto)Manually set to fit GPU memory or stabilize gradients
Learning rate0.01Lower (0.001) when fine-tuning from a previous best.pt
GPURTX PRO 6000 (default)A100 / H100 / H200 / B200 / B300 for larger jobs; budget tier (RTX A5000, RTX 4090) for cheap iterations
Run name(auto)Override with a descriptive name

The defaults inherit from the train mode reference — the same flags you'd pass to model.train(...) locally. When you start fine-tuning a previous checkpoint, the transfer learning glossary entry covers the why behind the lower LR; for systematic sweeps, the hyperparameter tuning guide is the canonical reference.

Link to this sectionPick the GPU thoughtfully#

The Cloud Training dialog exposes the full range — see the cloud training docs for the live list. A useful mental model, with rough hourly cost and VRAM:

TierExamplesVRAM~$/hrBest for
BudgetRTX A5000, RTX 4000 Ada, RTX 409020–24 GB$0.25–$0.69Small datasets, yolo26n/s, cheap iterations
MidRTX 6000 Ada, L40S, RTX 509032–48 GB$0.77–$0.99yolo26m workloads, image size 1024+
ProA100 80 GB, RTX PRO 6000 (96 GB Blackwell — default)80–96 GB$1.39–$1.89Default workhorse for most jobs
EnterpriseH100 PCIe / SXM / NVL, H200, B200, B30080–288 GB$2.39–$7.39Very large datasets, fastest iteration (B200 / B300 require Pro / Enterprise)

For most academy-scale projects the default RTX PRO 6000 is the right call. As a reference point, an H100 SXM trains roughly 3.4× faster than an RTX 4090 — useful when you're paying for engineer time more than GPU time. Cost scales linearly with selected GPU rate × actual training hours; failed runs aren't billed. Higher tiers are also where distributed training and mixed precision start to pay back the configuration overhead.

Link to this sectionWatch the run live#

Every run has a live dashboard:

  • Training/val loss — should both descend.
  • Per-class AP — updates per epoch.
  • Sample predictions — visually verify the model is learning what you think.
  • Resource usage — GPU utilization should be > 80%; lower is wasted money.

Platform alerts you if a run looks unhealthy: loss diverging, GPU starved (input pipeline bottleneck), or out-of-memory.

Kill bad runs early

The first 5 epochs tell you most of what you need. If loss isn't descending or sample predictions look random, kill the run and investigate before burning cost. The metric to watch is val mAP — if it's flat after 5 epochs, something's wrong.

Link to this sectionPulling the model locally#

Every Platform run produces a downloadable best.pt. The file is identical to what you'd produce locally with the same config — download it from the model page, then load it with the open-source package:

from ultralytics import YOLO

model = YOLO("best.pt")     # path to the file you downloaded
model.export(format="onnx")

You can now run inference, validate, and export exactly as in course 2. If you'd rather train locally and just stream metrics into Platform, set ULTRALYTICS_API_KEY and pass project=username/my-project name=experiment-1 to model.train(...) — see remote training.

Link to this sectionFine-tune from a previous run#

When you want to keep iterating, pick a previous model as the base model in the New Model dialog. Platform initializes the new run from your last best.pt automatically — drop the learning rate (e.g. lr0=0.001) in Advanced Settings for fine-tuning. The fine-tuning guide covers layer freezing and two-stage training when defaults aren't enough; augmentation knobs in Advanced Settings map directly to the data augmentation guide. The Cloud Training dialog also accepts a YAML or JSON config copied from any previous run, so you can reproduce or tweak earlier experiments without re-typing every parameter.

Link to this sectionWhat a healthy run looks like#

After ~30 epochs of training a small model on a clean dataset, you should see:

  • Train and val loss both descending (not diverging).
  • Val mAP@0.5:0.95 climbing past 0.4.
  • Per-class AP not too unbalanced.
  • Sample predictions visually correct.

If any of those don't hold by epoch 30, the dataset usually needs work — not the hyperparameters. The next lesson on experiment tracking gives you the tools to compare runs and find the dataset change that fixed things, and the Comet integration shows how to mirror Platform metrics into a third-party dashboard if your team already lives there.

Try It

Launch a Platform run on your dataset with yolo26n for 100 epochs. While it trains, browse the live dashboard. Note GPU utilization — if it's < 70%, your input pipeline (image loading) is the bottleneck.

Done When
You've finished the lesson when all of these are true.
  • You've launched and completed a Platform training run.

  • You can pull the resulting best.pt locally.

  • You've seen the live dashboard and watched at least one run end-to-end.

What's next

Real CV work is many runs, not one. Next: experiment tracking and choosing the best model.