The Five Vision Tasks

Match an output sentence to classification, detection, segmentation, pose, or oriented bounding boxes.

There are roughly five vision tasks you'll meet in production: classification, detection, segmentation, pose estimation, and oriented bounding boxes. They are not interchangeable. The wrong choice can double your annotation budget or turn a 95% model into a 60% one. Pick deliberately.

Outcome

Match a project's output sentence to one of the five tasks and explain why in one line.

Fast Track

If you already know your way around, here's the short version.

Image-level decision? Classification.
Where and what (rough shape OK)? Detection.
Exact shape, area, or boundary? Segmentation.
Joints, keypoints, body language? Pose.
Rectangular things at any angle (ships, vehicles from above, packages on a belt)? OBB.

Hands-on

The five tasks at a glance

Task	What you predict	When it's the right answer
Classification	One label per image	"Is this image of an X?" — no localization needed
Detection	Class + axis-aligned box per object	"Where are the X's?" — counting, presence, rough location
Segmentation	Class + pixel mask per object	"What's the exact shape?" — area, occlusion, boundaries
Pose	Set of keypoints per person/object	"What is the body doing?" — fall detection, ergonomics, sport
OBB	Class + rotated rectangle	"Where, at what angle?" — aerial, top-down, conveyor
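
To make the table concrete, it can help to write down what each task hands to downstream code. The sketch below is illustrative only: the class and field names are hypothetical and do not mirror any particular library's output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Classification:
    label: str  # one label for the whole image


@dataclass
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max (axis-aligned)


@dataclass
class Segmentation:
    label: str
    mask: List[List[int]]  # 0/1 per pixel, same height and width as the image


@dataclass
class Pose:
    keypoints: List[Tuple[float, float]]  # (x, y) per joint, in a fixed order


@dataclass
class OrientedBox:
    label: str
    center: Tuple[float, float]
    size: Tuple[float, float]  # width, height of the rotated rectangle
    angle_deg: float           # rotation relative to the image x-axis


# The same pallet described by three different tasks (made-up numbers).
as_class = Classification(label="pallet")
as_box = Detection(label="pallet", box=(40.0, 60.0, 200.0, 180.0))
as_obb = OrientedBox(label="pallet", center=(120.0, 120.0), size=(150.0, 90.0), angle_deg=30.0)
```

Each step down the table adds fields, and every added field is something an annotator has to draw and a reviewer has to check.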

Detection vs segmentation: the most common mistake

Most teams reach for segmentation when detection would do, then run out of annotation budget. Mask annotation costs 3–10× more than box annotation per object. The right question to ask:

Does my downstream code need to know the shape of the object, or just where it is?

If you're counting cars in a parking lot, bounding boxes are fine. If you're computing the painted area of a wall, you need a mask.
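
The two examples above can be sketched in a few lines. This is a minimal illustration with made-up data and hypothetical helper names (`count_objects`, `mask_area_px`, `box_area_px`); it only shows which kind of label each question actually consumes.

```python
def count_objects(detections, label):
    """Counting needs only boxes: one box is one object."""
    return sum(1 for d in detections if d["label"] == label)


def mask_area_px(mask):
    """Area needs a mask: count the pixels marked 1."""
    return sum(sum(row) for row in mask)


def box_area_px(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


# Parking lot: boxes are enough to answer "how many cars?"
detections = [
    {"label": "car", "box": (0, 0, 4, 2)},
    {"label": "car", "box": (5, 0, 9, 2)},
    {"label": "truck", "box": (10, 0, 18, 3)},
]
car_count = count_objects(detections, "car")

# Wall: an L-shaped painted region inside a 4x4 image.
wall_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
painted_px = mask_area_px(wall_mask)      # 10 pixels are actually painted
enclosing_px = box_area_px((0, 0, 4, 4))  # the tightest box covers 16 pixels
```

The box around the painted region overestimates its area (16 pixels against 10), which is why an area measurement needs the mask while the car count does not.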

Don't pay for masks you won't use

"Masks just in case" is a budget trap. Mask labels are slower to draw, slower to review, and need stricter rules for occlusion and edges. Start with boxes, and add masks only when a real downstream consumer needs the shape.

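A quick back-of-the-envelope calculation shows how fast the difference adds up. The sketch below reads the 3–10× figure from this lesson as a per-object cost multiplier; the object count and the per-box price are hypothetical placeholders, so substitute your own vendor's numbers.

```python
def annotation_cost_cents(num_objects, box_cost_cents, mask_multiplier=1):
    """Total labeling cost; a multiplier of 1 means plain boxes."""
    return num_objects * box_cost_cents * mask_multiplier


NUM_OBJECTS = 50_000  # hypothetical dataset size
BOX_COST_CENTS = 5    # hypothetical price per box

boxes_total = annotation_cost_cents(NUM_OBJECTS, BOX_COST_CENTS)
masks_low = annotation_cost_cents(NUM_OBJECTS, BOX_COST_CENTS, mask_multiplier=3)
masks_high = annotation_cost_cents(NUM_OBJECTS, BOX_COST_CENTS, mask_multiplier=10)

print(f"boxes: {boxes_total / 100:.0f}")
print(f"masks: {masks_low / 100:.0f} to {masks_high / 100:.0f}")
```

With these placeholder numbers, boxes come to 2,500 currency units and masks to somewhere between 7,500 and 25,000 for the same images.
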
Pose vs detection

Pose returns keypoints: coordinates of joints (shoulders, hips, knees) or fixed points on an object. Use it when posture is the signal. A person who has fallen to the ground is in a very different pose from one walking past, and telling the two apart from keypoints is a simple form of action recognition.

You don't need pose to answer "is there a person here?" — that's just detection. Use pose when the body language is the prediction.
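
To illustrate what "posture is the signal" means in code, here is a deliberately naive fall heuristic. It assumes keypoints arrive as a dict of named (x, y) pixel coordinates; the key names, the `looks_fallen` helper, and the 60 degree threshold are all hypothetical, and a production system would need far more than a single angle.

```python
import math


def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def torso_angle_deg(shoulder_mid, hip_mid):
    """Angle between the torso and the vertical image axis (0 = upright)."""
    dx = abs(hip_mid[0] - shoulder_mid[0])
    dy = abs(hip_mid[1] - shoulder_mid[1])
    return math.degrees(math.atan2(dx, dy))


def looks_fallen(keypoints, threshold_deg=60.0):
    """Flag a pose whose torso is closer to horizontal than to vertical."""
    shoulders = midpoint(keypoints["left_shoulder"], keypoints["right_shoulder"])
    hips = midpoint(keypoints["left_hip"], keypoints["right_hip"])
    return torso_angle_deg(shoulders, hips) > threshold_deg


standing = {
    "left_shoulder": (90, 100), "right_shoulder": (110, 100),
    "left_hip": (92, 180), "right_hip": (108, 180),
}
lying = {
    "left_shoulder": (100, 290), "right_shoulder": (100, 310),
    "left_hip": (180, 292), "right_hip": (180, 308),
}

standing_flag = looks_fallen(standing)  # torso is vertical
lying_flag = looks_fallen(lying)        # torso is horizontal
```

A detector would return a "person" box for both of these people. Only the keypoints separate them.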

OBB vs detection

A regular detection box is axis-aligned: edges parallel to the image sides. An OBB rotates with the object. The classic OBB use cases are top-down or near-top-down imagery — aerial photos, satellite images, conveyor belts — where objects don't naturally line up with the frame.

   Detection box (axis-aligned)         OBB (rotated)
   ┌─────────────────┐                       ____
   │       ▲         │                       \   \
   │      ╱ ╲        │   →  rotated object  → \   \  ← box hugs the actual shape
   │     ╱   ╲       │                         \___\
   │    ╱_____╲      │                          (fits)
   └─────────────────┘
   (lots of empty space)                       (tight)

If you're seeing big detection boxes that are mostly empty, that's a sign OBB is a better fit.
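
"Mostly empty" can be put into a number. For a rectangle of size w × h rotated by an angle θ, the enclosing axis-aligned box measures (w·|cos θ| + h·|sin θ|) by (w·|sin θ| + h·|cos θ|). The sketch below computes how much of that box the object actually fills; the function names are hypothetical, and any cutoff you apply to the ratio is a judgment call rather than a rule from this lesson.

```python
import math


def aabb_of_rotated_rect(width, height, angle_deg):
    """Width and height of the axis-aligned box enclosing a rotated rectangle."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return (width * c + height * s, width * s + height * c)


def fill_ratio(width, height, angle_deg):
    """Fraction of the axis-aligned box covered by the rotated rectangle."""
    box_w, box_h = aabb_of_rotated_rect(width, height, angle_deg)
    return (width * height) / (box_w * box_h)


# A long, thin object such as a ship seen from above: 100 x 20 pixels.
aligned = fill_ratio(100, 20, 0)    # 1.0: the box is tight
diagonal = fill_ratio(100, 20, 45)  # about 0.28: the box is mostly background
```

The effect is strongest for elongated objects near 45 degrees. A square object loses much less, and at 0 or 90 degrees a regular box is already tight.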

Choosing in practice

Walk through the questions in order. Stop at the first "yes."

Is the output a single answer for the whole image? → Classification.
Do you need to count, locate, or filter objects, but not draw them? → Detection. Add object tracking on top if you need cross-frame identity.
Do you need the exact shape, boundary, or area? → Segmentation.
Is the posture of the subject the signal? → Pose.
Are objects rotated relative to the frame? → OBB.
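
The ladder above can be written as a small lookup that stops at the first "yes." This is only a sketch of the procedure: the answer keys and the `choose_task` function are hypothetical names, and the order is copied from the five questions as listed.

```python
from typing import Dict, Optional

# (answer key, task) pairs in the same order as the five questions.
LADDER = [
    ("image_level", "classification"),  # one answer for the whole image
    ("count_or_locate", "detection"),   # count, locate, or filter objects
    ("exact_shape", "segmentation"),    # exact shape, boundary, or area
    ("posture_is_signal", "pose"),      # posture is the prediction
    ("rotated_objects", "obb"),         # objects rotated relative to the frame
]


def choose_task(answers: Dict[str, bool]) -> Optional[str]:
    """Return the task for the first question answered 'yes', or None."""
    for key, task in LADDER:
        if answers.get(key, False):
            return task
    return None


parking_lot = choose_task({"count_or_locate": True})
painted_wall = choose_task({"exact_shape": True})
fall_alert = choose_task({"posture_is_signal": True})
```

Because the function returns on the first match, the order of the questions matters: a project that answers "yes" to an earlier question never reaches the later ones.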

Try It

Take the output sentence from lesson 1. Walk down the five-question ladder and stop at the first "yes." That's your task. Write why you stopped there in one line — that one line is what you'll defend in a design review.

Done When

You've finished the lesson when all of these are true.

You've named one task — classification, detection, segmentation, pose, or OBB — for your project.
You can explain why the next task on the list is overkill.
You haven't accidentally picked segmentation just because it sounds more impressive than detection.

What's next

We'll go deep on object detection — the workhorse task and the one most people start with.
