The Six Vision Tasks

Match an output sentence to classification, detection, instance segmentation, semantic segmentation, pose, or oriented bounding boxes.

[Figure: Examples of Ultralytics YOLO object detection]

There are roughly six vision tasks you'll meet in production: classification, detection, instance segmentation, semantic segmentation, pose estimation, and oriented bounding boxes (OBB). They are not interchangeable. The wrong choice can double your annotation budget or turn a 95% model into a 60% one. Pick deliberately.

Outcome

Match a project's output sentence to one of the six tasks and explain why in one line.

Fast Track
If you already know your way around, here's the short version.
  1. Image-level decision? Classification.

  2. Where and what (rough shape OK)? Detection.

  3. Exact shape for each object? Instance segmentation.

  4. Class label for every pixel? Semantic segmentation.

  5. Joints, keypoints, body language? Pose.

  6. Rectangular things at any angle (ships, vehicles from above, packages on a belt)? OBB.

Hands-on

The six tasks at a glance

[Figure: Annotation types: bounding boxes, polygons, masks, keypoints]

Task                  | What you predict                    | When it's the right answer
----------------------|-------------------------------------|---------------------------------------------------------------------------
Classification        | One label per image                 | "Is this image of an X?" (no localization needed)
Detection             | Class + axis-aligned box per object | "Where are the X's?" (counting, presence, rough location)
Instance segmentation | Class + pixel mask per object       | "What's the exact shape of each object?" (area, occlusion, boundaries)
Semantic segmentation | Class label for every pixel         | "What kind of region is every pixel?" (drivable area, land cover, scene parsing)
Pose                  | Set of keypoints per person/object  | "What is the body doing?" (fall detection, ergonomics, sport)
OBB                   | Class + rotated rectangle           | "Where, at what angle?" (aerial, top-down, conveyor)
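
To make the "What you predict" column concrete, here is a minimal sketch of what each task's output could look like as plain Python data. All class and field names are hypothetical and chosen for illustration; real libraries define their own result types.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical containers for illustration; real libraries define their own result types.

@dataclass
class Classification:
    label: str  # one label for the whole image

@dataclass
class Detection:
    label: str
    box_xyxy: Tuple[float, float, float, float]  # axis-aligned corners: x1, y1, x2, y2

@dataclass
class InstanceMask:
    label: str
    mask: List[List[int]]  # H x W grid of 0/1 covering this one object

@dataclass
class SemanticMap:
    class_ids: List[List[int]]  # H x W grid, one class id per pixel, no instances

@dataclass
class Pose:
    keypoints: List[Tuple[float, float]]  # (x, y) per joint or fixed point

@dataclass
class OrientedBox:
    label: str
    center_xy: Tuple[float, float]
    size_wh: Tuple[float, float]
    angle_deg: float  # rotation of the rectangle relative to the image axes

# The same toy scene (two cars side by side in a 2 x 4 image), expressed two ways.
detections = [
    Detection("car", (0, 0, 2, 2)),
    Detection("car", (2, 0, 4, 2)),
]
semantic = SemanticMap(class_ids=[[1, 1, 1, 1], [1, 1, 1, 1]])  # 1 = car

car_count = len(detections)  # detection keeps objects separate, so counting works
classes_present = {c for row in semantic.class_ids for c in row}  # semantic map only knows classes
```

The toy scene shows the practical difference between the two: the detection list can be counted, while the semantic map says only that "car" pixels exist, not how many cars there are.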

Detection vs segmentation: the most common mistake

Most teams reach for segmentation when detection would do, then run out of annotation budget. Mask annotation costs 3–10× more than box annotation per object. The right question to ask:

Does my downstream code need to know the shape of the object, or just where it is?

If you're counting cars in a parking lot, bounding boxes are fine. If you're computing the painted area of a wall, you need a mask.
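
The wall example can be sketched numerically. The snippet below compares the area of a polygon outline (standing in for a mask) with the area of its bounding box. The L-shaped outline and its coordinates are made up for illustration, and the shoelace formula is used here as one simple way to get a polygon's area.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def polygon_area(points: List[Point]) -> float:
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def bbox_area(points: List[Point]) -> float:
    """Area of the axis-aligned box that encloses the polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# Hypothetical painted region: an L shape (a 4 x 4 square with a 2 x 2 corner missing).
wall = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]

mask_area = polygon_area(wall)  # 12.0, the true painted area
box_area = bbox_area(wall)      # 16, what a detection box would report
```

For this shape the box overstates the area by a third. That error does not matter for counting, but it does matter whenever area is the quantity the downstream code consumes.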

Don't pay for masks you won't use

"Masks just in case" is a budget trap. Mask labels are slower to draw, slower to review, and need stricter rules for occlusion and edges. Start with boxes, and add masks only when a real downstream consumer needs the shape.

Pose vs detection

Pose returns keypoints: coordinates of joints (shoulders, hips, knees) or fixed points on an object. Use it when posture is the signal. A person who has fallen to the ground is in a very different pose from one walking past, so the keypoints alone can support a simple kind of action recognition.

You don't need pose to answer "is there a person here?" — that's just detection. Use pose when the body language is the prediction.
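
As a rough illustration of posture as the signal, the sketch below turns two keypoints into a "looks fallen" flag by measuring how far the torso leans from vertical. The function names, the 60-degree threshold, and the sample coordinates are all assumptions made for this example; a production fall detector would likely use more keypoints and some temporal context.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixels, y grows downward

def torso_angle_from_vertical(shoulder: Point, hip: Point) -> float:
    """Angle in degrees between the hip-to-shoulder line and the image's vertical axis."""
    dx = shoulder[0] - hip[0]
    dy = shoulder[1] - hip[1]
    # abs() folds all directions into 0..90 degrees: 0 is upright, 90 is horizontal.
    return math.degrees(math.atan2(abs(dx), abs(dy)))

def looks_fallen(shoulder: Point, hip: Point, threshold_deg: float = 60.0) -> bool:
    """Hypothetical heuristic: a torso leaning past the threshold counts as fallen."""
    return torso_angle_from_vertical(shoulder, hip) > threshold_deg

# Made-up keypoints: one upright person, one lying on the ground.
standing = looks_fallen(shoulder=(100, 50), hip=(102, 150))   # torso nearly vertical
lying = looks_fallen(shoulder=(50, 200), hip=(150, 205))      # torso nearly horizontal
```

A detection box could not tell these two people apart reliably, since both are simply "person". The keypoints are what carry the difference.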

OBB vs detection

A regular detection box is axis-aligned: edges parallel to the image sides. An OBB rotates with the object. The classic OBB use cases are top-down or near-top-down imagery — aerial photos, satellite images, conveyor belts — where objects don't naturally line up with the frame.

   Detection box (axis-aligned)         OBB (rotated)
   ┌─────────────────┐                  ┌───────┐
   │       ▲         │                       \   \
   │      ╱ ╲        │   →  rotated object  → \   \  ← box hugs the actual shape
   │     ╱   ╲       │                         \___\
   │    ╱_____╲      │                          (fits)
   └─────────────────┘
   (lots of empty space)                       (tight)

If you're seeing big detection boxes that are mostly empty, that's a sign OBB is a better fit.
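
The "mostly empty" observation can be quantified. For a rectangle of width w and height h rotated by an angle θ, the enclosing axis-aligned box measures (w·|cos θ| + h·|sin θ|) by (w·|sin θ| + h·|cos θ|). The sketch below uses that to compute how much of the axis-aligned box the object actually fills; the function name and the example dimensions are illustrative assumptions.

```python
import math

def aabb_fill_ratio(w: float, h: float, angle_deg: float) -> float:
    """Fraction of the enclosing axis-aligned box covered by a w x h rectangle rotated by angle_deg."""
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    aabb_w = w * c + h * s  # horizontal extent of the rotated rectangle
    aabb_h = w * s + h * c  # vertical extent of the rotated rectangle
    return (w * h) / (aabb_w * aabb_h)

# Hypothetical long, thin object (say a ship seen from above), 100 x 20 pixels.
aligned = aabb_fill_ratio(100, 20, 0)    # lined up with the frame: box is fully used
diagonal = aabb_fill_ratio(100, 20, 45)  # at 45 degrees: most of the box is background
```

In this example the aligned object fills its whole box, while the diagonal one fills a little over a quarter of it. Long, thin objects at steep angles are where an axis-aligned box wastes the most space, which is why they are the typical OBB case.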

Choosing in practice

Walk through the questions in order. Stop at the first "yes."

  1. Is the output one label for the whole image? → Classification.
  2. Do you need to count, locate, or filter objects, but not draw them? → Detection. Add object tracking on top if you need cross-frame identity.
  3. Do you need the exact shape, boundary, or area of each object? → Instance segmentation.
  4. Do you need a class label for every pixel without separating instances? → Semantic segmentation.
  5. Is the posture of the subject the signal? → Pose.
  6. Are objects rotated relative to the frame? → OBB.
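
The ladder above can be sketched as a small function that returns the task for the first question answered "yes". The question keys and the `pick_task` name are hypothetical labels invented for this example, not part of any library.

```python
from typing import Dict, Optional

# Questions in the same order as the ladder; the keys are hypothetical shorthand.
LADDER = [
    ("one_label_for_whole_image", "classification"),
    ("locate_or_count_without_shape", "detection"),
    ("exact_shape_per_object", "instance segmentation"),
    ("label_every_pixel_no_instances", "semantic segmentation"),
    ("posture_is_the_signal", "pose"),
    ("objects_rotated_in_frame", "OBB"),
]

def pick_task(answers: Dict[str, bool]) -> Optional[str]:
    """Return the task for the first 'yes' in ladder order, or None if every answer is 'no'."""
    for question, task in LADDER:
        if answers.get(question, False):  # unanswered questions count as 'no'
            return task
    return None

# Example: a project that needs the painted area of each wall.
wall_project = pick_task({"exact_shape_per_object": True})

# Example: counting cars in a parking lot.
parking_project = pick_task({"locate_or_count_without_shape": True})
```

Because the function stops at the first "yes", each question has to be answered strictly as worded. Saying "yes" to locating objects when the real need is their shape would stop the ladder one rung too early.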
Try It

Take the output sentence from lesson 1. Walk down the six-question ladder and stop at the first "yes." That's your task. Write why you stopped there in one line — that one line is what you'll defend in a design review.

Done When
You've finished the lesson when all of these are true.
  • You've named one task — classification, detection, instance segmentation, semantic segmentation, pose, or OBB — for your project.

  • You can explain why the next task on the list is overkill.

  • You haven't accidentally picked segmentation just because it sounds more impressive than detection.

What's next

We'll go deep on object detection — the workhorse task and the one most people start with.