The Six Vision Tasks
Match an output sentence to classification, detection, instance segmentation, semantic segmentation, pose, or oriented bounding boxes.

There are roughly six vision tasks you'll meet in production: classification, detection, instance segmentation, semantic segmentation, pose estimation, and oriented bounding boxes. They are not interchangeable. The wrong choice doubles your annotation budget or turns a 95% model into a 60% one. Pick deliberately.
Match a project's output sentence to one of the six tasks and explain why in one line.
Image-level decision? Classification.
Where and what (rough shape OK)? Detection.
Exact shape for each object? Instance segmentation.
Class label for every pixel? Semantic segmentation.
Joints, keypoints, body language? Pose.
Rectangular things at any angle (ships, vehicles from above, packages on a belt)? OBB.
Hands-on
Link to this sectionThe six tasks at a glance#

| Task | What you predict | When it's the right answer |
|---|---|---|
| Classification | One label per image | "Is this image of an X?" — no localization needed |
| Detection | Class + axis-aligned box per object | "Where are the X's?" — counting, presence, rough location |
| Instance segmentation | Class + pixel mask per object | "What's the exact shape of each object?" — area, occlusion, boundaries |
| Semantic segmentation | Class label for every pixel | "What kind of region is every pixel?" — drivable area, land cover, scene parsing |
| Pose | Set of keypoints per person/object | "What is the body doing?" — fall detection, ergonomics, sport |
| OBB | Class + rotated rectangle | "Where, at what angle?" — aerial, top-down, conveyor |
Link to this sectionDetection vs segmentation: the most common mistake#
Most teams reach for segmentation when detection would do, then run out of annotation budget. Mask annotation costs 3–10× more than box annotation per object. The right question to ask:
Does my downstream code need to know the shape of the object, or just where it is?
If you're counting cars in a parking lot, bounding boxes are fine. If you're computing the painted area of a wall, you need a mask.
"Masks just in case" is a budget trap. Mask labels are slower to draw, slower to review, and need stricter rules for occlusion and edges. Start with boxes, add masks only when a real downstream consumer needs the shape.
Link to this sectionPose vs detection#
Pose returns keypoints — coordinates of joints (shoulders, hips, knees) or fixed points on an object. You use it when posture is the signal: a person fallen on the ground is in a very different pose than one walking past — a kind of action recognition.
You don't need pose to answer "is there a person here?" — that's just detection. Use pose when the body language is the prediction.
Link to this sectionOBB vs detection#
A regular detection box is axis-aligned: edges parallel to the image sides. An OBB rotates with the object. The classic OBB use cases are top-down or near-top-down imagery — aerial photos, satellite images, conveyor belts — where objects don't naturally line up with the frame.
Detection box (axis-aligned) OBB (rotated)
┌─────────────────┐ ┌───────┐
│ ▲ │ \ \
│ ╱ ╲ │ → rotated object → \ \ ← box hugs the actual shape
│ ╱ ╲ │ \___\
│ ╱_____╲ │ (fits)
└─────────────────┘
(lots of empty space) (tight)If you're seeing big detection boxes that are mostly empty, that's a sign OBB is a better fit.
Link to this sectionChoosing in practice#
Walk through the questions in order. Stop at the first "yes."
- Is the output one label for the whole image? → Classification.
- Do you need to count, locate, or filter objects, but not draw them? → Detection. Add object tracking on top if you need cross-frame identity.
- Do you need the exact shape, boundary, or area of each object? → Instance segmentation.
- Do you need a class label for every pixel without separating instances? → Semantic segmentation.
- Is the posture of the subject the signal? → Pose.
- Are objects rotated relative to the frame? → OBB.
Take the output sentence from lesson 1. Walk down the six-question ladder and stop at the first "yes." That's your task. Write why you stopped there in one line — that one line is what you'll defend in a design review.
You've named one task — classification, detection, instance segmentation, semantic segmentation, pose, or OBB — for your project.
You can explain why the next task on the list is overkill.
You haven't accidentally picked segmentation just because it sounds more impressive than detection.
We'll go deep on object detection — the workhorse task and the one most people start with.