XII UNIT 2 Data Science Methodology: An Analytic Approach to Capstone Project


What is a Methodology?

A methodology is a structured way to plan and execute an AI project. It helps data scientists and teams decide on:

  • Methods (how to analyze the data)
  • Processes (steps to follow)
  • Strategies (best ways to get correct results)

Using a methodology ensures the project is organized, efficient, and cost-effective.

What is Data Science Methodology?

It is a step-by-step process that helps data scientists solve problems using data.

  • The process follows a prescribed sequence of steps.
  • It enables better understanding and handling of data.
  • It ensures a systematic approach to finding solutions.

Who Developed It?

Data Science Methodology was introduced by John Rollins, a Data Scientist at IBM Analytics.

  • It consists of 10 steps to guide AI projects from start to finish.
  • The methodology helps teams approach AI projects efficiently.

Modules of Data Science Methodology

The methodology is divided into five modules, each covering two stages:

  1. From Problem to Approach → Identifying the problem and choosing the right strategy.
  2. From Requirements to Collection → Gathering the necessary data for analysis.
  3. From Understanding to Preparation → Cleaning and organizing data for better insights.
  4. From Modeling to Evaluation → Creating AI models and checking their performance.
  5. From Deployment to Feedback → Using AI solutions and improving them over time.

1. Understanding Business Problems (Problem Scoping)

Before we start an AI project, we must understand the problem we want to solve.

  • We ask questions to know exactly what people need.
  • We list down the goals and objectives to help solve the problem.
  • We can use a method called 5W1H Problem Canvas to deeply understand the issue (What, Why, Who, When, Where, and How).

2. Choosing the Right Data Science Approach

Once we understand the problem, we use Data Science Methods to find the right solution.

  • We use questions to guide our AI project.
  • Some questions include:
    • How much or how many? → (Regression)
    • What category does this belong to? → (Classification)
    • Can the data be grouped? → (Clustering)
    • Is there something unusual in the data? → (Anomaly Detection)
    • What should we suggest to the user? → (Recommendation)

Four Types of Data Analytics

To understand data better, AI uses four types of analytics:

  1. Descriptive Analytics – "What happened?"

    • It looks at past data to find patterns.
    • Example: Checking students' average marks in exams.
  2. Diagnostic Analytics – "Why did it happen?"

    • It finds reasons behind problems.
    • Example: If a store has fewer sales, AI can check whether it's because of price changes or fewer customers.
  3. Predictive Analytics – "What will happen next?"

    • It predicts future trends based on past data.
    • Example: AI forecasting what food will be popular next month.
  4. Prescriptive Analytics – "What should we do?"

    • It suggests actions based on data.
    • Example: AI recommending the best price to sell a product during festival season.

3. Data Requirements – What Data Do We Need?

Before starting an AI project, we need specific data to solve the problem.
To figure out the right data, we ask:
What type of data do we need? (Numbers, words, images)
How should the data be organized? (Tables, files, or databases)
Where can we collect this data from? (Websites, surveys, pictures)
Does the data need cleaning? (Fixing errors, making it neat)

Data can be in three forms:
📊 Structured Data – Neatly arranged in tables (Example: Customer details)
📜 Semi-Structured Data – Some organization but not completely structured (Example: Emails, XML files)
🎨 Unstructured Data – No set structure (Example: Social media posts, images, videos)

4. Data Collection – Gathering the Information

To collect data, we need to find the right sources.

There are two types of data sources:
Primary Data – Directly collected through surveys, interviews, observations. (Example: Feedback forms, sensors)
Secondary Data – Already available data from websites, books, journals. (Example: Google search, Kaggle datasets)

💡 How do we collect large amounts of data?

  • Using websites, smart forms, and databases.
  • Storing data in computers, cloud storage, or AI databases.

5. Data Understanding – Is This Data Useful?

Once we collect data, we check if it helps in solving our problem!
We analyze:
Is the data complete? (No missing details)
Is the data correct? (No mistakes)
Does the data match our needs? (Useful for the problem)

We use graphs, charts, and stats to study our data before using it.
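These checks can be sketched in plain Python. A minimal example, using a hypothetical list of student records (the names and marks are invented for illustration):

```python
# Hypothetical student records; one entry has a missing mark
records = [
    {"name": "Asha", "marks": 78},
    {"name": "Ravi", "marks": None},   # missing detail -> data is incomplete
    {"name": "Meena", "marks": 91},
]

# Is the data complete? Count entries with missing marks.
missing = sum(1 for r in records if r["marks"] is None)

# Simple statistics on the values that are present
valid = [r["marks"] for r in records if r["marks"] is not None]
average = sum(valid) / len(valid)

print("Missing entries:", missing)   # 1
print("Average marks:", average)     # 84.5
```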

6. Data Preparation – Getting Data Ready!

Before using data in AI, we must clean and prepare it so that the computer can understand it.
Here’s what happens in this stage:
Cleaning Data – Fixing mistakes, removing duplicate entries, and making sure everything is neat.
Combining Data – Merging information from different sources (tables, files, websites).
Changing Data – Making sure numbers and words are in a format AI can use.
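The cleaning step can be sketched in Python. A minimal example using a hypothetical list of survey answers (the city names are invented):

```python
# Hypothetical survey responses with extra spaces, inconsistent capitals, duplicates
responses = ["Delhi", "delhi ", "Mumbai", "Mumbai", "Chennai"]

# Cleaning: trim spaces, fix capitalisation, drop duplicates (keeping order)
cleaned = []
for city in responses:
    city = city.strip().title()
    if city not in cleaned:
        cleaned.append(city)

print(cleaned)   # ['Delhi', 'Mumbai', 'Chennai']
```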

πŸ” Feature Engineering – This means creating new useful features from raw data!
For example:
🏠 If we are predicting house prices, we can use:
Age of the house = Current year – Year built
Price per square foot = Total price / Area
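The two house-price features above can be computed directly. A small sketch with a made-up house record (all values are invented for illustration):

```python
# Hypothetical raw record for one house
house = {"year_built": 2005, "total_price": 9_000_000, "area_sqft": 1500}
current_year = 2025   # fixed here so the example is repeatable

# Feature engineering: derive new, more useful columns from the raw ones
house["age"] = current_year - house["year_built"]                    # Age of the house
house["price_per_sqft"] = house["total_price"] / house["area_sqft"]  # Price per square foot

print(house["age"])             # 20
print(house["price_per_sqft"])  # 6000.0
```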

7. AI Modeling – Making AI Learn from Data!

Once the data is ready, we train AI models to learn and make predictions.
AI models can be of two types:

1. Descriptive Modeling – Understanding Data!

This helps us summarize and describe data without making predictions.
🔹 We use graphs, charts, and numbers to study data.
Examples:
Mean (average) – Finding the average marks in a class.
Bar charts – Showing student attendance in a school.
Histograms – Checking how many people bought different products.

2. Predictive Modeling – Guessing the Future!

This uses past data to predict future outcomes!
Examples:
✔ Forecasting weather based on old weather reports.
✔ Predicting exam scores based on previous test results.
✔ Suggesting movies based on what people watched before.

πŸ” AI learns from a training set (a set of past data where the answers are already known).
By adjusting different AI algorithms, we make the model more accurate!
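As a sketch of predictive modeling, the example below fits a straight line to a tiny, made-up training set (hours studied vs. exam score) and uses it to predict a new score. The numbers are invented, and a real project would use a library rather than this hand-written least-squares fit:

```python
# Hypothetical training set: hours studied -> exam score (answers already known)
train_x = [1, 2, 3, 4, 5]
train_y = [52, 58, 66, 71, 79]

# Fit a line y = a*x + b by least squares: a simple predictive model
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) \
    / sum((x - mean_x) ** 2 for x in train_x)
b = mean_y - a * mean_x

# Predict the score for a student who studied 6 hours
prediction = a * 6 + b
print(round(prediction, 1))   # 85.3
```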

8. Evaluation – Is the AI Model Working Correctly?

Before using an AI model in real life, we test it to make sure it gives accurate answers.
We check the model using test data to see how well it performs.

There are two steps in the evaluation process:
Diagnostic Check: Does the model give the right results?

  • If it's a predictive model, its answers (for example, from a decision tree) are compared with the expected outcomes on test data.
  • If it's a descriptive model, we check whether the patterns it found actually hold in past data.
Statistical Test: This checks that the model's results are statistically sound, not just due to chance.

9. Deployment – How Do People Use AI in Real Life?

Once the AI model is tested and ready, we deploy it so people can use it!
✔ The AI tool is introduced to real users.
✔ Businesses may test it in small groups first.
✔ Different teams help launch the AI model across websites or apps.

💡 Example:
Netflix uses an AI model to recommend movies based on what people watched before. Once tested, it gets released so users can see smart movie suggestions!

10. Feedback – Can We Make AI Better?

After deployment, users give feedback on how well the AI performs.
✔ Users tell what works and what needs fixing.
✔ Developers improve the model based on feedback.
✔ AI keeps learning and updating to become smarter!

💡 Example:
AI chatbots improve their answers based on user feedback. If a chatbot's response is confusing, developers update it to give clearer replies next time!




 Model Validation in Machine Learning

What is Model Validation?

Model validation checks how well a trained AI model performs on new, unseen data.
It helps us measure accuracy and reliability so that the model gives useful predictions.

Why is Model Validation Important?

✔ Improves model quality
✔ Reduces errors
✔ Prevents overfitting (model memorizing data) and underfitting (not learning enough)

Model Validation Techniques

There are different methods to check if an AI model is reliable, including:
✔ Train-Test Split
✔ K-Fold Cross Validation
✔ Leave-One-Out Cross Validation
✔ Time Series Cross Validation

📌 The syllabus covers Train-Test Split and K-Fold Cross Validation.

Train-Test Split

This is one of the simplest validation methods.
📌 The dataset is divided into two parts:
Training Set → Used to train the AI model.
Test Set → Used to check if the trained model gives correct predictions.

💡 Common Train-Test Splits:
✔ 80% Train, 20% Test
✔ 70% Train, 30% Test
✔ 67% Train, 33% Test
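A minimal sketch of an 80/20 train-test split in plain Python, on a made-up dataset of 10 items (libraries such as scikit-learn offer the same idea ready-made):

```python
import random

# Hypothetical dataset of 10 labelled examples (just numbers here)
data = list(range(10))

random.seed(42)   # fixed seed so the shuffle is repeatable
random.shuffle(data)

split = int(0.8 * len(data))             # 80% train, 20% test
train_set, test_set = data[:split], data[split:]

print(len(train_set), len(test_set))     # 8 2
```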

K-Fold Cross Validation

Instead of splitting the data just once, K-Fold Cross Validation divides data into multiple folds (parts).
📌 The AI model is trained and tested multiple times on different folds for better accuracy.
Example:
✔ If K=5, the data is divided into 5 parts (each part is 20% of total data).
✔ The AI model is trained on 4 parts and tested on 1 part repeatedly.
✔ This helps get a more accurate result.

💡 Why use Cross Validation?
✔ Gives a better estimate of model accuracy.
✔ Helps avoid errors and wrong predictions.
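The K=5 example above can be sketched like this (the dataset is invented; each pass would train and test a model on the parts shown):

```python
# Hypothetical dataset of 10 examples; K = 5 gives folds of 2 items each
data = list(range(10))
k = 5
fold_size = len(data) // k

for i in range(k):
    # 1 fold for testing, the other k-1 folds for training
    test_fold = data[i * fold_size:(i + 1) * fold_size]
    train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
    print(f"Fold {i + 1}: test={test_fold}, train size={len(train_folds)}")
```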


Model Performance & Evaluation Metrics

Why Do We Evaluate AI Models?

📌 Evaluation metrics help check how good a model is at making predictions.
📌 They allow comparison between models to find the best one for a specific task.
📌 Classification models categorize data into groups (like Yes/No), while regression models predict continuous values (like prices).

Evaluation Metrics for Classification Models

These metrics check if a model correctly categorizes items into the right groups.

Confusion Matrix – A table to compare actual results vs predictions.

True Positives (TP) – Model predicted Yes, and actual result was Yes.
True Negatives (TN) – Model predicted No, and actual result was No.
False Positives (FP) – Model predicted Yes, but actual result was No.
False Negatives (FN) – Model predicted No, but actual result was Yes.

Precision & Recall – Checking how reliable predictions are!

Precision = TP / (TP + FP) → How many predicted positives were actually correct?
Recall = TP / (TP + FN) → How many actual positives were correctly predicted?

F1 Score – A balanced score between precision & recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Best Score = 1, Worst Score = 0

Accuracy – The percentage of correct predictions.

Accuracy = (TP + TN) / (TP + FP + FN + TN)
✔ Higher accuracy = better model performance!
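All four classification metrics can be computed from a small, made-up set of actual vs. predicted labels:

```python
# Hypothetical labels: 1 = Yes, 0 = No
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion-matrix counts
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)                    # 3 3 1 1
print(precision, recall, f1, accuracy)   # all 0.75 for this data
```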

Evaluation Metrics for Regression Models

Regression models predict continuous values (like house prices).

Mean Absolute Error (MAE)

✔ Measures the absolute difference between actual & predicted values.
Lower MAE = Better predictions!

Mean Squared Error (MSE)

✔ Calculates the average squared difference between actual & predicted values.
Lower MSE = More accurate model!

Root Mean Squared Error (RMSE)

✔ The square root of MSE, making it easier to understand.
Smaller RMSE = More reliable model predictions!
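The three regression metrics, computed on made-up actual vs. predicted prices:

```python
import math

# Hypothetical actual vs predicted house prices (values invented)
actual    = [50, 60, 70, 80]
predicted = [48, 63, 69, 85]

errors = [a - p for a, p in zip(actual, predicted)]    # [2, -3, 1, -5]

mae  = sum(abs(e) for e in errors) / len(errors)       # (2+3+1+5)/4  = 2.75
mse  = sum(e ** 2 for e in errors) / len(errors)       # (4+9+1+25)/4 = 9.75
rmse = math.sqrt(mse)                                  # about 3.12

print(mae, mse, round(rmse, 2))
```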



 Reference book:

CBSE handbook for class XII

