Skip to main content
Build with Ultralytics Platform·Train and Evaluate·Lesson 6/10
Lessonintermediate

Track Experiments

Compare runs, name them well, and keep the lessons.

After a couple of weeks on a project, you'll have ten or twenty training runs. Half of them won the experiment but never got documented; the other half failed in instructive ways nobody captured. Experiment tracking — whether on Platform models, Comet, Weights & Biases, or MLflow — is the discipline of making the lessons stick, and of finding the best run when you actually have to ship.

Outcome

Compare multiple Platform training runs, decide which to ship, and archive the lessons in a way your future self will find.

Fast Track
If you already know your way around, here's the short version.
  1. Name runs descriptively — what changed.

  2. Pin the dataset version to every run.

  3. Compare runs on the same metric — pick mAP@0.5:0.95 by default.

  4. Document why each run was launched and what it taught you.

Hands-on

Comparing runs

Ultralytics Platform project comparison table view

Platform's experiment view lets you select multiple runs and overlay their metric curves:

   mAP@0.5:0.95
   ▲
0.6│                     ◇ v3_pallets_added (best)
   │             ◇    ◇  ◇
0.5│        ◇ ◇        ◇
   │     ◇      ◇   ◇
0.4│  ◇           ◇
   │   ◇    ◇ v2_imgsz_1024
   │
0.3│ ◇ v1 (baseline)
   └────────────────────────▶  epoch

You're looking for:

  1. Best final value — the simple "which is best" question.
  2. Convergence speed — a model that hits 0.55 at epoch 30 is more efficient than one that hits 0.55 at epoch 100.
  3. Stability — a curve with wild swings is harder to trust than a smoothly rising one.

Naming runs

Bad: run_5, final, final_actually_final.

Good: v3_pallets_added_imgsz_640, v4_lr_0.0005_no_mosaic.

The pattern: <version>_<what_changed>. Future-you, looking through 30 runs three months later, will thank present-you.

Pin every run to a dataset version

It's easy to launch a run on "the dataset I had on Tuesday" and have no idea later what that meant. Platform attaches the dataset version to every run — but only if you reference a tagged version, not the working state.

Always train against a tagged dataset version. Never against the live working dataset.

Pick a single comparison metric

For most CV projects, mAP@0.5:0.95 is the right comparison metric — it's strict enough to catch sloppy boxes and forgiving enough to be stable. The YOLO performance metrics guide walks through how it's computed; the confusion matrix view in Platform is the other half of any honest run review. Other useful metrics in specific contexts:

MetricWhen
mAP@0.5Quick check, large-object datasets
Per-class AP for class XWhen one class drives business value
Recall @ precision = 0.95Safety / alarming systems
Inference latency on targetProduction fitness

Pick one before comparing — comparing on different metrics is where motivated reasoning sneaks in.

The decision template

When deciding whether a new run "wins":

  1. Did it improve the comparison metric by more than noise (~0.5 mAP points)?
  2. Does it pass a separate sanity check — at least 90% of v_previous's per-class APs?
  3. Has it been validated on the test set (not val)?
  4. Did latency stay within budget?

If all four, it wins. If only some, the answer is "more experiments" not "ship it."

Reserve the test set for the final compare

Tune hyperparameters on val. Reserve the test set for the final go/no-go before deployment. If you compare ten runs on the test set, the test set is now training data and your number is dishonest.

Archive the lessons

Every meaningful run should produce a one-paragraph note: what you tried, what you saw, and what to try next. Platform has a notes field per run — use it.

v3_pallets_added:
  Hypothesis: pallet AP was 0.31 because too few examples (7 in val).
  Change: collected 80 more pallet images from camera-3 and camera-7,
  retained existing labels, retrained from v2 best.pt with lr0=0.001.
  Result: pallet AP 0.31 → 0.58. Overall mAP 0.51 → 0.59.
  Next: forklift confidence still drops at night; collect dawn/dusk batches.

These notes are the most valuable artifact of the project. Models go stale; lessons don't. If your team prefers an external tracker, the ClearML and TensorBoard integrations expose the same per-epoch metrics Platform stores natively.

Try It

Launch two runs that differ in one thing — image size, model size, or dataset version. Compare them in Platform's experiment view. Write the one-paragraph note for each. The discipline is the value, not the runs.

Done When
You've finished the lesson when all of these are true.
  • You've launched at least two Platform runs and compared them.

  • Each run has a descriptive name and a one-paragraph note.

  • You've decided which run to deploy based on a single, pre-chosen metric.

What's next

We've picked a winner. Next: turning that winner into a deployed endpoint.