Unit 5-Computer Vision


What is Computer Vision?

  • A domain of AI that enables machines to “see” like humans using images or visual data.

  • Machines can capture, process, analyze, and interpret visual information.

How Humans See

  • Eye → captures visuals

  • Brain → interprets

How Machines See

  • Camera / Sensor → captures image

  • Computer Vision Algorithms → analyze & interpret

 Quick Overview of Computer Vision

Computer Vision = extracting useful information from:

  • Images

  • Videos

  • Text appearing inside images (read via OCR)

  • Visual signals

It helps machines understand visuals just like humans do.

Relationship: AI → Computer Vision → Deep Learning

  • Artificial Intelligence (AI): Makes computers think intelligently.

  • Computer Vision (CV): Enables computers to see.

  • Deep Learning (DL): Helps CV models learn automatically from large image datasets.

Computer Vision vs Image Processing

Computer Vision

  • Focus: Understanding images.

  • What it does: Extracts meaningful information to make decisions.

  • Examples:

    • Object detection

    • Face recognition

    • Handwriting recognition

Image Processing

  • Focus: Improving images.

  • What it does: Enhances or prepares images for further tasks.

  • Examples:

    • Resizing

    • Adjusting brightness

    • Changing color tones

Relationship

  • Image Processing is a subset of Computer Vision.

  • Computer Vision is the bigger field.

Applications of Computer Vision

    1. Facial Recognition

    • Used in smart homes, smart cities, and security systems.

    • Helps recognize guests, maintain visitor logs, and manage access.

    • Schools use it for automatic attendance.

    • CV identifies and matches facial features.

    2. Face Filters (Instagram, Snapchat)

    • Apps detect facial movements and landmarks.

    • Filters (dog ears, glasses, beauty filters, etc.) are placed accurately on the face.

    • Computer Vision tracks nose, eyes, lips, etc., in real time.

    3. Google’s Search by Image

    • Instead of typing text, you upload an image to search.

    • CV compares features of the image with millions of images in Google’s database.

    • Helps identify objects, locations, people, products, etc.

    4. Computer Vision in Retail

    a) Customer Tracking

    • CV tracks customer movement inside stores.

    • Helps understand walking paths and behavior.

    b) Inventory Management

    • Analyzes camera footage to estimate stock levels.

    • Detects empty shelves, misplaced items, and stocking errors.

    • Suggests better product placement for improved sales.

    5. Self-Driving Cars

    • Computer Vision is the core technology.

    • Used for:

      • Object detection (cars, humans, traffic lights)

      • Lane detection

      • Route navigation

      • Environment monitoring

    • Enables hands-free or autonomous driving.

    6. Medical Imaging

    • Helps doctors see, analyze, and interpret medical images.

    • Converts 2D scans like CT, MRI into 3D models.

    • Provides detailed views of organs, helping diagnosis and treatment planning.

    • Acts as an assistant to medical professionals.

    7. Google Translate App (Live Translation)

    • Point your camera at foreign text → instantly translated.

    • Uses:

      • OCR (Optical Character Recognition) to read text.

      • Augmented Reality to display translated text on the screen.

    • Very useful for travel and communication.

Computer Vision Tasks



Computer Vision applications work by performing certain key tasks to extract information from images. These tasks help in prediction, analysis, and understanding of visuals.

1. Image Classification

Definition

  • Assigning a single label/category to an entire image.

  • Only identifies what object is present.

Key point

  • Works with one object per image (most basic CV task).

Examples

  • Cat vs Dog

  • Identifying handwritten digits

  • Classifying a fruit as apple/mango/orange
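The idea of assigning a single label to an entire image can be sketched with a toy rule. Real classifiers learn from large datasets; the brightness threshold below is purely illustrative, not an actual classification method:

```python
import numpy as np

def classify_brightness(image: np.ndarray) -> str:
    """Toy classifier: one label for the whole image, based on mean pixel value."""
    return "bright" if image.mean() >= 128 else "dark"

# Two tiny 4x4 grayscale images, filled with a single intensity each.
dark_img = np.full((4, 4), 30, dtype=np.uint8)
bright_img = np.full((4, 4), 200, dtype=np.uint8)

print(classify_brightness(dark_img))    # dark
print(classify_brightness(bright_img))  # bright
```

The point to notice is the output shape: one label per image, with no information about where anything is.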

2. Classification + Localization

Definition

  • Performs two tasks together:

    1. Classification → What object is present

    2. Localization → Where it is located in the image

Key point

  • Works for a single object in an image.

  • Uses bounding boxes to show location.

Example

  • Detecting and marking where a football is in an image.

 3. Object Detection

Definition

  • Identifies multiple objects in an image and their locations.

  • Detects all instances of objects like cars, people, animals, etc.

How it works

  • Uses features + learning algorithms.

Examples

  • Detecting pedestrians and vehicles in self-driving cars

  • Automated car parking systems

  • Face detection in camera apps

Key point

  • Gives bounding boxes for all detected objects.
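A detector's output is typically a list of labelled bounding boxes with confidence scores. A minimal sketch in plain Python — the labels, scores, and coordinates below are made-up example values, not real detector output:

```python
# Each detection: (label, confidence, bounding box as x, y, width, height).
detections = [
    ("car",        0.97, (34, 120, 200, 80)),
    ("pedestrian", 0.88, (300, 90, 40, 110)),
    ("car",        0.75, (420, 130, 180, 70)),
]

def keep_confident(dets, threshold=0.8):
    """Keep only detections at or above a confidence threshold."""
    return [d for d in dets if d[1] >= threshold]

for label, score, (x, y, w, h) in keep_confident(detections):
    print(f"{label}: {score:.0%} at box ({x}, {y}, {w}, {h})")
```

Unlike classification, every object gets its own entry, so one image can yield many boxes.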

4. Instance Segmentation

Definition

  • The most detailed CV task.

  • Detects each object, assigns it a category, and also labels each pixel belonging to that object.

Key point

  • Separates individual objects, even of the same type.

Example

  • Identifying each person in a crowd separately.

  • For two dogs in an image, it labels Dog 1 and Dog 2 pixel-wise.

Output

  • Creates segments/regions for different objects in the image.
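Pixel-wise instance labels can be pictured as a small NumPy mask in which each value is an instance ID. The tiny mask below is a made-up example for the two-dogs case (0 = background, 1 = Dog 1, 2 = Dog 2):

```python
import numpy as np

# Instance mask: each pixel stores the ID of the object it belongs to.
mask = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
], dtype=np.uint8)

for instance_id in (1, 2):
    area = int((mask == instance_id).sum())
    print(f"Dog {instance_id}: {area} pixels")
```

Because every pixel carries an instance ID, the two dogs stay separate even though they share the same category.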



Basics of Images

Images that we see on mobiles or computers are stored in a digital format. 

1. Pixels

  • Pixel = picture element (smallest unit of an image).

  • An image is made of thousands or millions of pixels arranged in rows and columns.

  • When zoomed in, an image looks like many coloured or gray squares (pixels).

  • More pixels = clearer and sharper image.

2. Resolution

Resolution refers to the number of pixels in an image.

Two ways to express resolution:

  1. Width × Height

    • Example: 1280 × 1024 means:

      • 1280 pixels across

      • 1024 pixels vertically

  2. Megapixels (MP)

    • 1 megapixel = 1 million pixels

    • Example:

      • 1280 × 1024 = 1,310,720 pixels

      • 1.31 megapixels

Higher resolution = more detail.
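The megapixel arithmetic above can be checked with a few lines of Python:

```python
# Convert an image resolution (width x height) to megapixels.
def megapixels(width: int, height: int) -> float:
    return width * height / 1_000_000  # 1 megapixel = 1,000,000 pixels

print(1280 * 1024)                        # 1310720 total pixels
print(round(megapixels(1280, 1024), 2))   # 1.31
```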

3. Pixel Value

Each pixel has a value that describes:

  • Brightness

  • Color

Most common format:

  • 8-bit pixel → values range from 0 to 255

Meaning:

  • 0 = black

  • 255 = white

  • Middle values = gray shades (for grayscale images)

Why 0–255?

  • Pixel uses 1 byte = 8 bits

  • Each bit can be 0 or 1

  • Total combinations = 2⁸ = 256 values → 0 to 255
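The count of 2⁸ = 256 values can be verified directly:

```python
bits = 8
values = 2 ** bits   # each extra bit doubles the number of combinations
print(values)        # 256
print(values - 1)    # 255 -> the highest value, since counting starts at 0
```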

4. Grayscale Images

  • Contain shades of gray (no color).

  • 0 = black, 255 = white, values in between = gray.

  • Each pixel requires 1 byte.

  • Stored as a 2D array (Height × Width).

Example:

  • A grayscale image might look black, white, or gray depending on pixel intensity.
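A grayscale image is just a 2D array of bytes, as described above. A minimal NumPy sketch (the pixel values are arbitrary examples):

```python
import numpy as np

# A tiny 3x4 grayscale image: one 2D array of 8-bit values (0 = black, 255 = white).
gray = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [  0, 128, 128, 255],
], dtype=np.uint8)

print(gray.shape)    # (3, 4) -> Height x Width
print(gray.nbytes)   # 12 -> one byte per pixel
```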

 5. RGB Images

RGB = Red, Green, Blue
These are the three additive primary colours of light, which combine to make all other colours.

How RGB images are stored:

  • Stored using three channels:

    1. R channel

    2. G channel

    3. B channel

  • Each pixel has three values (R, G, B) between 0 and 255.

  • All three channels combine to form the final color image.
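The three-channel layout can be sketched with NumPy. Here is a tiny 2 × 2 image, with pixel values chosen purely for illustration:

```python
import numpy as np

h, w = 2, 2
rgb = np.zeros((h, w, 3), dtype=np.uint8)  # three channels: R, G, B
rgb[0, 0] = [255, 0, 0]      # pure red pixel
rgb[0, 1] = [0, 255, 0]      # pure green pixel
rgb[1, 0] = [0, 0, 255]      # pure blue pixel
rgb[1, 1] = [255, 255, 255]  # white = all three channels at maximum

print(rgb.shape)    # (2, 2, 3) -> Height x Width x Channels
print(rgb.nbytes)   # 12 -> three bytes per pixel
```

Compare this with the grayscale case: the same number of pixels now needs three times the storage, one byte per channel.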

