Thursday, June 18, 2026

NVIDIA Agentic AI prep

If you're preparing for the NVIDIA Agentic AI and LLMs Certification, expect questions around LLM fundamentals, RAG, agents, vector databases, orchestration, tool calling, evaluation, deployment, and NVIDIA's AI stack.

LLM Fundamentals

Q1. What is the difference between pre-training and fine-tuning?

A: Pre-training learns general language patterns from large corpora; fine-tuning adapts the model to a specific task using labeled data.

Q2. What is a token?

A: A token is the basic unit processed by an LLM, representing words, subwords, or characters.

Q3. What causes hallucinations in LLMs?

A: Missing knowledge, ambiguous prompts, outdated training data, and probabilistic text generation.

Q4. What is the context window?

A: The maximum number of tokens an LLM can process in a single request, counting both the input prompt and the generated output.

---

RAG (Retrieval-Augmented Generation)

Q5. Why use RAG instead of fine-tuning?

A: RAG injects up-to-date knowledge without retraining the model.

Q6. What are the main components of a RAG pipeline?

A: Ingestion, chunking, embedding, vector store, retrieval, reranking, and generation.

Q7. Why is chunking important?

A: It improves retrieval accuracy by breaking documents into semantically meaningful sections.
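As a rough illustration of chunking, here is a minimal sketch of fixed-size chunks with overlap. It assumes whitespace-separated words stand in for tokens; `chunk_words`, `chunk_size`, and `overlap` are illustrative names, and real pipelines often split on sentence or section boundaries instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.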

Q8. What is embedding?

A: A numerical vector representation capturing semantic meaning of text.

Q9. How does semantic search differ from keyword search?

A: Semantic search retrieves based on meaning, while keyword search matches exact terms.

Q10. What metrics are used to evaluate retrieval quality?

A: Recall@K, Precision@K, MRR, and NDCG.
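These four metrics can be sketched in a few lines, assuming binary relevance (a document is either relevant or not) and that `ranked` is the retriever's output in rank order. The function names are illustrative, not taken from any particular library.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Recall@K and Precision@K ignore order within the top K, while MRR and NDCG reward putting relevant results earlier.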

---

Vector Databases

Q11. Why use a vector database?

A: To efficiently store and search embeddings using nearest-neighbor algorithms.

Q12. What is ANN search?

A: Approximate Nearest Neighbor search trades slight accuracy for faster retrieval.

Q13. Why not store embeddings in a traditional database like MongoDB?

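A: General-purpose databases such as MongoDB are primarily optimized for key-value and document retrieval. Efficient high-dimensional similarity search needs specialized ANN indexes, which dedicated vector databases provide by default (some general-purpose databases, including MongoDB Atlas, now offer vector search as an added feature).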

Q14. What is cosine similarity?

A: A measure of similarity based on the angle between two vectors.
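Concretely, cosine similarity is the dot product divided by the product of the vector lengths. A minimal sketch in plain Python (the function name is illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
```

Because only the angle matters, scaling a vector (for example `[1, 2, 3]` versus `[2, 4, 6]`) does not change the score.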

---

Agentic AI

Q15. What is an AI Agent?

A: An autonomous system that reasons, plans, uses tools, and executes actions to achieve goals.

Q16. How is Agentic AI different from a chatbot?

A: Agents can perform actions and interact with external systems; chatbots mainly generate responses.

Q17. What are the key components of an agent?

A: LLM, memory, planning, tools, and execution loop.

Q18. What is tool calling?

A: Allowing an LLM to invoke external APIs, databases, or functions.
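One common shape for tool calling is that the model emits a structured request (tool name plus arguments) and the application dispatches it. The sketch below assumes a JSON payload of the form `{"name": ..., "arguments": {...}}`; the tools `get_weather` and `add` and the helper `dispatch_tool_call` are hypothetical and return canned data rather than calling any real API.

```python
import json


def get_weather(city):
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "forecast": "sunny"}


def add(a, b):
    """Hypothetical tool: adds two numbers."""
    return a + b


# Registry of tools the agent is allowed to call.
TOOLS = {"get_weather": get_weather, "add": add}


def dispatch_tool_call(raw_call):
    """Run one tool call expressed as JSON and return a result or an error."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        result = TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:  # e.g. wrong or missing arguments
        return {"error": str(exc)}
    return {"result": result}


print(dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Keeping an explicit registry also acts as a simple access control: the model can only trigger functions that were deliberately exposed.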

Q19. What is agent memory?

A: Mechanisms for storing conversation history or long-term knowledge.

Q20. What is a planner agent?

A: An agent that decomposes complex tasks into executable subtasks.

---

Multi-Agent Systems

Q21. What is a multi-agent architecture?

A: Multiple specialized agents collaborating to solve a task.

Q22. How do agents communicate?

A: Through messages, shared memory, event buses, or orchestration frameworks.

Q23. When should you use multiple agents instead of one?

A: When tasks require specialized expertise or parallel execution.

Q24. What is a supervisor agent?

A: An agent that routes tasks and coordinates worker agents.

Q25. What are common multi-agent patterns?

A: Supervisor-worker, hierarchical, peer-to-peer, blackboard, and swarm.

---

Prompt Engineering

Q26. What is chain-of-thought prompting?

A: Prompting the model to reason through intermediate steps.

Q27. What is few-shot prompting?

A: Providing examples to guide model behavior.

Q28. What is prompt injection?

A: Malicious instructions intended to manipulate agent behavior.

Q29. How can prompt injection attacks be mitigated?

A: Input validation, instruction hierarchy, and tool access controls.

---

Evaluation

Q30. How do you evaluate an LLM application?

A: Measure answer quality, groundedness, latency, cost, and retrieval effectiveness.

Q31. What is groundedness?

A: The extent to which responses are supported by retrieved evidence.

Q32. Name hallucination benchmarks.

A: HaluEval, HaluBench, and RAGTruth.

---

NVIDIA-Specific Questions

Q33. What is NVIDIA NIM?

A: A containerized inference microservice for deploying AI models.

Q34. What is NVIDIA NeMo?

A: NVIDIA's framework for training, customizing, and deploying generative AI models.

Q35. What is NVIDIA TensorRT-LLM?

A: An inference optimization framework for accelerating LLMs on NVIDIA GPUs.

Q36. What is quantization?

A: Reducing numerical precision (FP16 → INT8/INT4) to improve inference efficiency.
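To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an assumption-level illustration of the arithmetic only; production toolchains use calibrated, often per-channel schemes, and the function names are illustrative.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.0, 0.31, -1.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each value is stored in 8 bits instead of 16, and the reconstruction error per value is at most about half of `scale`.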

Q37. Why use TensorRT-LLM?

A: Lower latency, higher throughput, and optimized GPU utilization.

Q38. What is KV Cache?

A: Cached attention key and value tensors from earlier tokens, reused during generation so they are not recomputed at every decoding step.

Q39. What is speculative decoding?

A: Using a smaller model to generate candidate tokens that a larger model verifies.

Q40. What is model parallelism?

A: Splitting a model across multiple GPUs to handle large parameter sizes.

---

Scenario-Based Questions

Q41. Your RAG system retrieves irrelevant chunks. What would you improve?

A: Chunking strategy, embeddings, metadata filtering, and reranking.

Q42. An agent repeatedly calls the same tool. How would you fix it?

A: Add memory, loop detection, and tool usage constraints.
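Loop detection and usage constraints can be as simple as counting identical calls. The sketch below is one possible guard, with hypothetical names (`ToolCallGuard`, `max_repeats`, `max_total_calls`); an agent loop would consult it before executing each tool call.

```python
import json
from collections import Counter


class ToolCallGuard:
    """Blocks repeated identical tool calls and caps the total call count."""

    def __init__(self, max_repeats=2, max_total_calls=10):
        self.max_repeats = max_repeats
        self.max_total_calls = max_total_calls
        self.seen = Counter()
        self.total = 0

    def allow(self, tool_name, arguments):
        # Identical name + arguments (order-independent) count as the same call.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        if self.total >= self.max_total_calls or self.seen[key] >= self.max_repeats:
            return False
        self.seen[key] += 1
        self.total += 1
        return True


guard = ToolCallGuard(max_repeats=2)
for _ in range(3):
    print(guard.allow("search", {"q": "refund policy"}))
```

When `allow` returns `False`, the agent can be told that the call was blocked so it tries a different plan instead of retrying.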

Q43. Latency is too high in production. What optimizations can you apply?

A: Quantization, batching, KV caching, TensorRT-LLM, and smaller models.

Q44. When would you choose fine-tuning over RAG?

A: When changing model behavior or domain-specific reasoning rather than adding knowledge.

Q45. Design an agentic system for QE automation.

A: Supervisor agent → Requirement Analysis Agent → Test Case Generator → Automation Script Generator → Review Agent → Execution Agent → Reporting Agent.

These 45 questions cover roughly 80–90% of the concepts typically tested in Agentic AI, RAG, LLMs, and NVIDIA deployment-focused certifications.

Data Modeling with Databricks

Data modeling is the process of creating a visual blueprint of your business data to structure how it is collected, stored, and related. It translates real-world business rules into organized technical schemas, ensuring consistency, scalability, and efficiency in databases and data warehouses. [1, 2]

The 3 Levels of Data Modeling

Data models progress from abstract business ideas to concrete technical blueprints.

• Conceptual Data Model: The highest level. It defines what data is needed (e.g., customers, products, orders) and general business rules. It acts as a shared language between technical teams and business stakeholders.

• Logical Data Model: The middle layer. It outlines detailed data structures, attributes, and exact relationships. It is independent of any specific database management system.

• Physical Data Model: The technical implementation layer. It details how data will be physically stored in a specific system (e.g., SQL Server, Oracle, data lakehouse), including data types, indexes, and partitions. [1, 2]

Core Modeling Components

Regardless of the model, these are the fundamental building blocks:

• Entities: The "things" or concepts you want to track (e.g., Customer, Employee, Product). These typically become tables in a database.

• Attributes: The specific characteristics of an entity. For example, a Customer entity might have attributes like Name, Email, and Phone Number.

• Relationships: How entities interact with each other. For example, a Customer "places" an Order.

• Cardinality: Defines the numerical relationship between entities (e.g., One-to-One, One-to-Many, or Many-to-Many).

• Primary & Foreign Keys: Unique identifiers. A Primary Key uniquely identifies a specific record (like a Customer ID), while a Foreign Key is an attribute that links back to the primary key in another table, establishing a relationship. [1, 11, 12, 13, 14]

Key Methodologies

Depending on whether you are building a transactional application or an analytical dashboard, you'll use different modeling styles:

• Entity-Relationship (ER) Modeling: Used primarily for Operational/Transactional systems (OLTP). It focuses on reducing data redundancy through a process called normalization, ensuring every piece of data is stored in exactly one place.

• Dimensional Modeling: Used for Data Warehouses and Analytics (OLAP). It organizes data into Facts (quantitative events like sales transactions) and Dimensions (descriptive contexts like store locations or dates). [2]
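A tiny star schema makes the fact/dimension split and the role of keys concrete. This sketch uses Python's built-in `sqlite3` purely as a stand-in for a warehouse engine; the table and column names (`dim_store`, `fact_sales`, `store_key`) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,   -- surrogate key
    store_code TEXT NOT NULL UNIQUE,  -- natural (business) key
    city       TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id   INTEGER PRIMARY KEY,
    store_key INTEGER NOT NULL REFERENCES dim_store(store_key),  -- foreign key
    sale_date TEXT NOT NULL,
    amount    REAL NOT NULL
);
""")

conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "S-001", "Pune"), (2, "S-002", "Delhi")])
conn.executemany(
    "INSERT INTO fact_sales (store_key, sale_date, amount) VALUES (?, ?, ?)",
    [(1, "2026-06-01", 120.0), (1, "2026-06-02", 80.0), (2, "2026-06-01", 50.0)],
)

# Facts hold the numbers; dimensions supply the descriptive context to group by.
totals = dict(conn.execute("""
    SELECT d.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store d USING (store_key)
    GROUP BY d.city
""").fetchall())
print(totals)
```

The query joins the fact table to a dimension, never to another fact table, which is the pattern dimensional models are designed around.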

Best Practices

• Understand the Business Purpose: Technical design must always serve business needs; knowing exactly what metrics the business wants to track dictates the model's structure.

• Avoid Fact-to-Fact Joins: In dimensional modeling, joining two fact tables directly often indicates an error in the model.

• Use Surrogate Keys: When building data warehouses, professionals on Reddit generally agree that using artificial, integer-based keys (surrogate keys) simplifies joining tables and managing historical data. [19, 20, 21]

[1] https://www.databricks.com/blog/what-is-data-modeling

[2] https://www.sap.com/resources/what-is-data-modeling

[3] https://www.mongodb.com/resources/basics/databases/data-modeling

[4] https://www.geeksforgeeks.org/data-analysis/data-modeling-a-comprehensive-guide-for-analysts/

[5] https://www.scribd.com/document/610970256/DATA-MODELLING

[6] https://learning.sap.com/courses/becoming-an-sap-data-architect/transforming-business-concepts-with-data-modeling

[7] https://community.sap.com/t5/technology-q-a/conceptual-logical-physical-modeling/qaq-p/11584240

[8] https://agiledata.org/essays/datamodeling101.html

[9] https://atlan.com/what-is/data-modeling-concepts/

[10] https://www.quest.com/learn/conceptual.aspx

[11] https://medium.com/business-architected/conceptual-data-modelling-start-with-business-use-cases-10b3f2670d47

[12] https://www.datamation.com/big-data/types-of-data-modeling/

[13] https://www.workday.com/en-us/perspectives/ai/intro-to-data-modeling.html

[14] https://jcsites.juniata.edu/faculty/rhodes/dbms/ermodel.htm

[15] https://www.packtpub.com/en-us/learning/how-to-tutorials/implementing-data-modeling-techniques-in-qlik-sense-tutorial

[16] https://www.sciencedirect.com/topics/computer-science/normalized-model

[17] https://atlan.com/what-is-data-modeling/

[18] https://www.red-gate.com/blog/database-design-patterns/

[19] https://www.reddit.com/r/dataengineering/comments/1onxcfo/data_modeling_what_is_the_most_important_concept/

[20] https://www.reddit.com/r/dataengineering/comments/1onxcfo/data_modeling_what_is_the_most_important_concept/

[21] https://www.reddit.com/r/dataengineering/comments/1onxcfo/data_modeling_what_is_the_most_important_concept/

Wednesday, June 17, 2026

Learning Classical Machine Learning

You should learn these five classical machine learning topics in the following order: Linear Regression $\rightarrow$ Logistic Regression $\rightarrow$ Naive Bayes $\rightarrow$ Support Vector Machines (SVM) $\rightarrow$ Matrix Factorization. [1, 2]

This specific sequence builds a smooth mathematical and conceptual path, moving from basic lines to probabilities, optimization boundaries, and finally unsupervised matrix decompositions.

------------------------------

## 1. Linear Regression (Start Here)

* Why first: It is the foundational stepping stone of all parametric machine learning.

* Core Concepts to Learn: You will master Loss Functions (Mean Squared Error), Gradient Descent (how weights update), and Regularization (L1/L2 or Lasso/Ridge).

* Math required: Basic algebra and simple derivatives. [3, 4, 5, 6, 7]
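To see these three concepts working together, here is a minimal sketch of one-feature linear regression trained by batch gradient descent on mean squared error, with an optional L2 (Ridge) penalty. The function name, learning rate, and toy data are assumptions chosen for illustration.

```python
def fit_linear_regression(xs, ys, lr=0.05, epochs=2000, l2=0.0):
    """Fit y ≈ w*x + b by gradient descent on MSE (+ optional L2 penalty on w)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of mean((w*x + b - y)^2) + l2 * w^2
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs)) + 2 * l2 * w
        grad_b = (2 / n) * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 2x + 1
w, b = fit_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

Setting `l2` above zero shrinks `w` toward zero, which is the Ridge effect; Lasso (L1) would instead use the sign of `w` in the gradient.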

## 2. Logistic Regression

* Why second: It uses the same core linear combination ($wx + b$) as Linear Regression but passes it through a Sigmoid function to turn the output into a probability.

* Core Concepts to Learn: You will learn about Classification, Log Loss (Binary Cross-Entropy), and decision boundaries.

* Math required: Logarithms and exponent math. [3, 4, 8, 9, 10]
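The two new pieces here, the sigmoid and log loss, fit in a few lines. This is a small sketch with illustrative function names; it only evaluates the functions and does not train a model.

```python
import math


def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Probability of the positive class for a single feature x."""
    return sigmoid(w * x + b)


def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


print(predict_proba(0.0, w=1.5, b=0.0))  # on the decision boundary
```

The decision boundary is where $wx + b = 0$, because that is where the sigmoid outputs exactly 0.5.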

## 3. Naive Bayes

* Why third: This shifts your perspective from optimization (finding the best line) to pure probabilistic classification.

* Core Concepts to Learn: You will learn Bayes' Theorem, conditional probability, and text classification (like spam filtering). Learning this right after Logistic Regression allows you to easily compare Discriminative models (Logistic) with Generative models (Naive Bayes).

* Math required: Basic probability and conditional probability rules. [3, 4, 11, 12, 13]
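A toy spam filter shows Bayes' Theorem in action. The sketch below is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing; the four training messages are made up, and words never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Pick the label with the highest log P(label) + sum of log P(word | label)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes([
    ("win cash prize now", "spam"),
    ("cheap prize win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(predict("win a prize", model))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log-probabilities are simply added.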

## 4. Support Vector Machines (SVM)

* Why fourth: SVMs handle classification like Logistic Regression but use a much more advanced geometric concept. Instead of finding any line that separates the data, it finds the line with the absolute maximum margin. [11, 14, 15, 16, 17]

* Core Concepts to Learn: You will learn about Hyperplanes, Margin Maximization, and the Kernel Trick (which lets the model find non-linear separations by implicitly working in a higher-dimensional feature space, without ever computing the projection explicitly). [18, 19, 20]

* Math required: Vector geometry and optimization theory.

## 5. Matrix Factorization (End Here)

* Why last: This is a distinct shift into Unsupervised Learning and recommendation systems. It breaks a single large matrix down into smaller component matrices to find hidden relationships. [21, 22, 23, 24]

* Core Concepts to Learn: You will learn about Latent Factors, Collaborative Filtering (how Netflix or Spotify recommend content), and Singular Value Decomposition (SVD). [21, 25, 26, 27]

* Math required: Advanced Linear Algebra (matrix multiplication, dimensions, and rank). [28, 29]
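Latent factors can be learned directly from observed ratings with stochastic gradient descent. The sketch below is one simple variant of matrix factorization for collaborative filtering (not SVD itself); the users, items, ratings, and hyperparameters are all made up for illustration.

```python
import random


def predict(P, Q, user, item):
    """Predicted rating = dot product of user and item latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items())
    return (se / len(ratings)) ** 0.5


def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """ratings: {(user, item): value}. Returns latent factors P (users), Q (items)."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    P = {u: [rng.random() for _ in range(n_factors)] for u in users}
    Q = {i: [rng.random() for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


ratings = {
    ("ann", "m1"): 5, ("ann", "m2"): 3, ("bob", "m1"): 4, ("bob", "m3"): 1,
    ("cat", "m2"): 1, ("cat", "m3"): 5, ("dan", "m1"): 5, ("dan", "m3"): 1,
}
P, Q = factorize(ratings)
print(round(rmse(ratings, P, Q), 3))
```

Once trained, `predict(P, Q, "ann", "m3")` gives a score for a user-item pair that was never rated, which is what a recommender ranks on.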

The Machine Learning Lifecycle is the end-to-end process of building, deploying, monitoring, and maintaining an ML model. In production systems, it extends beyond just training to include data engineering, deployment, and continuous improvement.

Complete Machine Learning Lifecycle


                    Business Problem
                           │
                           ▼
                  Problem Definition
                           │
                           ▼
                  Data Collection
                           │
                           ▼
                  Data Exploration (EDA)
                           │
                           ▼
                  Data Cleaning
                           │
                           ▼
                 Feature Engineering
                           │
                           ▼
                  Feature Selection
                           │
                           ▼
              Train / Validation / Test Split
                           │
                           ▼
                  Model Selection
                           │
                           ▼
                  Model Training
                           │
                           ▼
              Hyperparameter Tuning
                           │
                           ▼
                  Model Evaluation
                           │
                           ▼
              Error Analysis & Iteration
                           │
                           ▼
                 Model Deployment
                           │
                           ▼
              Monitoring & Logging
                           │
                           ▼
          Retraining / Continuous Learning

1. Problem Definition

Clearly define:

Business objective
ML objective
Success metrics
Constraints

Example

Business Problem:

Reduce customer churn.

ML Problem:

Binary Classification

Success Metric:

Accuracy
F1-score
ROC-AUC

2. Data Collection

Collect data from various sources.

Examples:

SQL databases
CSV files
APIs
IoT sensors
Images
Audio
Text
Data lakes
Streaming systems (Kafka)

Example:


Customer Database

+

Website Clickstream

+

Purchase History

+

Support Tickets

3. Exploratory Data Analysis (EDA)

Understand the dataset before modeling.

Tasks include:

Data distribution
Missing values
Duplicate records
Outliers
Feature correlations
Class imbalance
Visualizations

Typical questions:

Which features have many missing values?
Is the target balanced?
Which features are highly correlated?

4. Data Cleaning

Prepare high-quality data.

Common operations:

Remove duplicates
Fill missing values
Remove invalid records
Handle outliers
Standardize formats
Correct inconsistent values

Example:


Age

25

NULL

31

↓

Age

25

28

31
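The Age example above fills the missing value with the mean of the known values. A minimal sketch, assuming missing values are represented as `None`:

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


print(impute_mean([25, None, 31]))
```

Mean imputation is only one option; the median is often preferred when the column has outliers.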

5. Feature Engineering

Create better features from existing data.

Examples:

Date

↓


Purchase Date

↓

Day

Month

Quarter

Weekend

Text

↓

TF-IDF

Embeddings

Images

↓

CNN Features

Time Series

↓

Moving Average

Lag Features

Rolling Window
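The date and time-series transformations above can be sketched with the standard library alone. Function names are illustrative, and the weekend rule assumes a Saturday/Sunday weekend.

```python
from datetime import date


def date_features(d):
    """Derive day, month, quarter, and a weekend flag from a date."""
    return {
        "day": d.day,
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "is_weekend": d.weekday() >= 5,  # Monday is 0, so 5 and 6 are Sat/Sun
    }


def lag(series, k=1):
    """Shift a series forward by k steps, padding the start with None."""
    return ([None] * k + list(series))[:len(series)]


def moving_average(series, window):
    """Rolling mean over the last `window` values; None until enough history."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out


print(date_features(date(2026, 6, 20)))
print(moving_average([10, 20, 30, 40], window=2))
```

Lag and rolling features only look backwards in time, which keeps future information from leaking into the training rows.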

6. Feature Selection

Remove irrelevant or redundant features.

Methods:

Correlation
Chi-Square
Mutual Information
Recursive Feature Elimination (RFE)
Lasso (L1)
Random Forest Feature Importance

Benefits:

Faster training
Reduced overfitting
Better interpretability

7. Train / Validation / Test Split

Typical split:


Dataset

↓

70% Training

15% Validation

15% Testing

Purpose:

Training → Learn model parameters
Validation → Tune hyperparameters
Test → Final unbiased evaluation
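A 70/15/15 split amounts to shuffling once and slicing. This is a minimal sketch with an illustrative function name; it assumes rows are independent, so it is not appropriate as-is for time series or grouped data.

```python
import random


def train_val_test_split(rows, train=0.70, val=0.15, seed=42):
    """Shuffle rows reproducibly and cut them into three disjoint parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


train_rows, val_rows, test_rows = train_val_test_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, so later experiments are compared on the same test rows.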

8. Model Selection

Choose an algorithm based on the problem.

Examples:

Classification

Logistic Regression
Random Forest
XGBoost
Neural Networks

Regression

Linear Regression
Decision Trees
Gradient Boosting

Clustering

K-Means
DBSCAN

NLP

BERT
Llama
T5

Computer Vision

ResNet
Vision Transformer (ViT)

9. Model Training

The model learns patterns from the training data.

Example:


Features

↓

Model

↓

Predictions

↓

Loss Function

↓

Gradient Descent

↓

Update Weights

↓

Repeat

For deep learning, training typically involves:

Forward propagation
Loss computation
Backpropagation
Optimizer step (e.g., SGD, Adam)

10. Hyperparameter Tuning

Optimize parameters that are not learned automatically.

Examples:

Learning rate
Batch size
Number of trees
Tree depth
Number of layers
Dropout rate

Techniques:

Grid Search
Random Search
Bayesian Optimization
Hyperband
Optuna

11. Model Evaluation

Measure performance on unseen data.

Classification Metrics:

Accuracy
Precision
Recall
F1-score
ROC-AUC
PR-AUC
Confusion Matrix

Regression Metrics:

RMSE
MAE
MSE
R² Score

Clustering Metrics:

Silhouette Score
Davies-Bouldin Index
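The main classification metrics all come from the four confusion-matrix counts. A small sketch for the binary case, with illustrative names and made-up labels:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]))
```

On imbalanced data, accuracy can look good while recall is poor, which is why precision, recall, and F1 are usually reported alongside it.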

12. Error Analysis

Understand where the model fails.

Questions:

Which classes are confused?
Which features contribute to errors?
Is there bias?
Are certain groups underperforming?
Is more data needed?

This often leads back to:

More data
Better features
Different model
Better preprocessing

13. Model Deployment

Deploy the trained model.

Common deployment options:

REST API
gRPC
Batch inference
Edge devices
Mobile applications
Cloud services
Kubernetes

Example:


Client

↓

REST API

↓

ML Model

↓

Prediction

14. Monitoring

Production models require continuous monitoring.

Monitor:

Latency
Throughput
Error rate
Prediction distribution
Feature drift
Data drift
Concept drift
Model accuracy (when labels become available)

Example:


Production Data

↓

Drift Detection

↓

Alert
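One way to implement the drift-detection step for a numeric feature is the Population Stability Index (PSI), which compares binned distributions of reference and production data. The sketch below is an assumption-level illustration: the equal-width binning, the `eps` floor for empty bins, and the function names are choices made here, and the alert threshold is something you would tune for your own data.

```python
import bisect
import math


def bin_fractions(values, edges, eps=1e-6):
    """Share of values falling in each bin; empty bins are floored at eps."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference, production, n_bins=5):
    """PSI = sum over bins of (p - r) * ln(p / r), with bins set by the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    edges = [lo + width * i for i in range(1, n_bins)]
    ref = bin_fractions(reference, edges)
    prod = bin_fractions(production, edges)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


reference = [i / 10 for i in range(100)]
shifted = [v + 4 for v in reference]  # simulated drift
print(population_stability_index(reference, shifted))
```

A monitoring job could compute this per feature on a schedule and raise an alert when the value crosses the chosen threshold.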

15. Retraining

Models degrade over time as data changes.

Triggers:

New customer behavior
Seasonal trends
New products
Regulatory changes
Data drift
Concept drift

Pipeline:


New Data

↓

Retraining

↓

Validation

↓

Deploy New Model

MLOps Lifecycle


Data Collection
        │
        ▼
Data Versioning
        │
        ▼
Model Training
        │
        ▼
Experiment Tracking
        │
        ▼
Model Registry
        │
        ▼
Deployment
        │
        ▼
Monitoring
        │
        ▼
Retraining

Common tools:

Stage	Popular Tools
Data Versioning	DVC, LakeFS
Experiment Tracking	MLflow, Weights & Biases
Feature Store	Feast, Tecton
Pipeline Orchestration	Apache Airflow, Kubeflow, Prefect
Model Registry	MLflow Model Registry, SageMaker Model Registry
Deployment	Docker, Kubernetes, KServe, BentoML, TensorFlow Serving
Monitoring	Evidently AI, WhyLabs, Arize AI, Prometheus, Grafana

Complete Lifecycle Summary

Stage	Goal	Typical Output
Problem Definition	Define business and ML objectives	Problem statement and success metrics
Data Collection	Gather raw data	Raw dataset
Exploratory Data Analysis	Understand data characteristics	Insights and quality report
Data Cleaning	Fix quality issues	Clean dataset
Feature Engineering	Create useful features	Engineered feature set
Feature Selection	Keep the most relevant features	Reduced feature set
Train/Validation/Test Split	Separate data for training and evaluation	Three datasets
Model Selection	Choose the appropriate algorithm	Candidate model(s)
Model Training	Learn patterns from data	Trained model
Hyperparameter Tuning	Optimize model configuration	Best hyperparameters
Model Evaluation	Measure performance	Evaluation metrics and reports
Error Analysis	Identify weaknesses	Improvement plan
Deployment	Serve the model in production	Production inference service
Monitoring	Track health and performance	Alerts, logs, drift reports
Retraining	Keep the model up to date	Updated production model

This lifecycle is iterative rather than linear. It is common to revisit earlier stages, such as feature engineering, model selection, or data collection, multiple times before achieving a model that meets business and technical requirements.

Feature Selection

Feature selection is the process of selecting the most relevant features (columns) from a dataset while removing irrelevant, redundant, or noisy features. It can improve model accuracy, reduce overfitting, shorten training time, and make the model more interpretable.

There are three main approaches:

Method	How it works	Examples	Model dependent
Filter	Uses statistical measures before training	Correlation, Chi-Square, ANOVA, Mutual Information	No
Wrapper	Evaluates different feature subsets by repeatedly training the model	Recursive Feature Elimination (RFE), Forward Selection, Backward Elimination	Yes
Embedded	Performs feature selection during model training	Lasso (L1), Decision Trees, Random Forest, XGBoost	Yes

1. Filter Methods

These methods rank features independently of the machine learning algorithm.

a) Correlation

Removes highly correlated features.

Example:

Age	Experience	Salary
25	2	30000
30	7	60000
35	12	90000

If Age and Experience have a correlation of 0.98, keeping both adds little new information.
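The Pearson correlation behind this check can be computed by hand. In the three sample rows shown, Age and Experience move in perfect lockstep, so the sketch below returns 1.0 (up to floating-point rounding) rather than the hypothetical 0.98 used in the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (std_x * std_y)


age = [25, 30, 35]
experience = [2, 7, 12]
print(pearson(age, experience))
```

A typical filter keeps one feature from each pair whose absolute correlation exceeds a chosen threshold.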

b) Chi-Square Test

Used for categorical features.

Measures dependence between feature and target.

Example:

Feature: Own House (Yes/No)

Target: Loan Default (Yes/No)

Higher Chi-square score ⇒ more useful feature.
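The chi-square statistic compares observed counts in a contingency table with the counts expected if feature and target were independent. The counts below are made up for the Own House versus Loan Default example.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: Own House = Yes / No. Columns: Loan Default = Yes / No (hypothetical counts).
print(chi_square([[10, 30], [30, 10]]))
```

A statistic of 0 means the observed counts match independence exactly; larger values indicate a stronger association.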

c) ANOVA F-Test

Used for numerical features with categorical targets.

Example:

Determine whether average salary differs significantly across job categories.

d) Mutual Information

Measures how much information a feature provides about the target.

Unlike correlation, it captures non-linear relationships.
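For discrete variables, mutual information can be computed straight from counts. The sketch below (illustrative name, result in bits) also shows the non-linear case: with y = x² on symmetric inputs, the linear correlation is zero but the mutual information is not.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # non-linear, symmetric relationship
print(mutual_information(xs, ys))
```

Continuous features need binning or a density estimate first, which is what library implementations handle for you.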

2. Wrapper Methods

These methods repeatedly train the model using different feature subsets.

a) Forward Selection

Start with zero features.


{}

↓

{Age}

↓

{Age, Salary}

↓

{Age, Salary, Experience}

Stop when adding another feature does not improve performance.
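Forward selection is a greedy loop around whatever scoring function you use. In the sketch below, `score` stands in for "train the model on this subset and return a validation score"; the toy scorer and its per-feature gains are hypothetical numbers chosen to reproduce the sequence above.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that improves the score most; stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidate_score, candidate = max(
            ((score(selected + [f]), f) for f in remaining), key=lambda t: t[0]
        )
        if candidate_score - best <= min_gain:
            break  # adding another feature does not improve performance
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


# Hypothetical stand-in for cross-validated model performance.
GAINS = {"Age": 0.10, "Salary": 0.08, "Experience": 0.03, "City": 0.0}


def toy_score(subset):
    return sum((GAINS[f] for f in subset), 0.5)


print(forward_selection(["Age", "Salary", "Experience", "City"], toy_score))
```

With a real model each call to `score` is a full training run, which is why wrapper methods are the most expensive of the three families.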

b) Backward Elimination

Start with every feature.


{Age, Salary, Experience, Gender, City}

↓

Remove City

↓

Remove Gender

↓

Final features

c) Recursive Feature Elimination (RFE)

Train the model.

Remove the least important feature.

Train again.

Repeat.

Example:


Age
Salary
City
Gender
Experience

Iteration 1:


Feature Importance

Salary      0.42
Experience  0.31
Age         0.18
Gender      0.06
City        0.03

Remove City.

Train again.

Repeat until the desired number of features remains.

3. Embedded Methods

Feature selection happens during training.

a) Lasso Regression (L1)

Lasso pushes less useful coefficients to exactly zero.

Example:


Age          0.82
Salary       1.35
Experience   0
City         0

Features with zero coefficients are removed.

b) Decision Trees

A decision tree naturally selects important features, because it only splits on features that improve its purity criterion.

Example:


        Salary
       /      \
     Age     Experience

Unused features are considered less important.

c) Random Forest

Average feature importance across many trees.

Example:


Salary       0.41
Experience   0.26
Age          0.20
Gender       0.08
City         0.05

Select top features.

d) Gradient Boosting / XGBoost

Boosted trees compute feature importance using metrics such as gain, cover, or frequency.

Dimensionality Reduction vs Feature Selection

Feature Selection	Dimensionality Reduction
Keeps original features	Creates new features
Easier to interpret	Harder to interpret
Removes irrelevant features	Combines information from multiple features
Examples: RFE, Lasso	Examples: PCA, t-SNE, UMAP

Typical Feature Selection Workflow

Remove features with many missing values.
Remove constant or near-constant features.
Remove duplicate features.
Remove highly correlated features (e.g., correlation > 0.9).
Apply a filter method (Mutual Information, Chi-Square, ANOVA) to rank features.
Use an embedded method (Lasso, Random Forest, XGBoost) to estimate feature importance.
Optionally refine the subset using a wrapper method such as RFE with cross-validation.
Evaluate model performance using cross-validation.
Select the smallest feature set that achieves the desired performance.
Validate the final model on a separate test set to ensure it generalizes well.

Choosing the Right Method

Scenario	Recommended Method
Very large dataset with thousands of features	Filter methods (Correlation, Mutual Information)
Maximum predictive performance	Wrapper methods (RFE, Forward/Backward Selection)
Tree-based models	Random Forest or XGBoost feature importance
Linear models	Lasso (L1 Regularization)
High-dimensional data (e.g., text, genomics)	Filter methods followed by Lasso or tree-based importance
Need fast preprocessing	Filter methods
Need interpretable selected features	Embedded methods (Lasso, Decision Trees)

In practice, practitioners often combine methods: use a filter method to quickly remove obviously irrelevant features, apply an embedded method to rank the remaining features, and, if computationally feasible, use a wrapper method like RFE to fine-tune the final feature subset. This balances computational efficiency with predictive performance.

Ensembling

Ensemble learning combines predictions from multiple models to produce a stronger and more robust model than any individual model. The main idea is that different models make different errors, so combining them reduces variance, bias, or both.

There are four major types of ensembling methods:

Method	Main Idea	Models Built	Training	Final Prediction	Examples
Bagging	Train multiple models independently on different samples	Parallel	Independent	Average / Majority Vote	Random Forest
Boosting	Train models sequentially, each correcting previous errors	Sequential	Dependent	Weighted Sum	AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost
Stacking	Train multiple different models and use another model to combine them	Parallel + Meta Model	Independent + Meta Learning	Meta-model prediction	Random Forest + SVM + XGBoost → Logistic Regression
Voting	Combine predictions from different models without additional training	Parallel	Independent	Majority Vote / Average	Hard Voting, Soft Voting

1. Bagging (Bootstrap Aggregating)

Idea

Reduce variance by training many models on different random subsets of the training data.

Each model:

sees a slightly different dataset
learns independently
contributes a prediction that is combined with the others

Step-by-step

Suppose the dataset has 1000 rows.

Create bootstrap samples.


Dataset

1000 rows

↓

Sample 1 (1000 rows with replacement)

↓

Decision Tree 1

----------------

Sample 2

↓

Decision Tree 2

----------------

Sample 3

↓

Decision Tree 3

Because sampling is done with replacement, each bootstrap sample typically contains some rows more than once and leaves other rows out.
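Drawing a bootstrap sample is a one-liner, and counting distinct rows shows the effect of sampling with replacement: for large datasets, roughly 1 - 1/e (about 63%) of the original rows appear in any one sample. The function name and seed below are illustrative.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)
rows = list(range(1000))
sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)
print(len(sample), unique_fraction)
```

The rows left out of a given sample are the "out-of-bag" rows, which Random Forest implementations can use as a built-in validation set.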

Final prediction

Classification


Tree1 → Cat

Tree2 → Dog

Tree3 → Dog

Tree4 → Dog

Tree5 → Cat

Majority Vote

↓

Dog

Regression


Tree1 → 150

Tree2 → 160

Tree3 → 155

Average

↓

155

Advantages

Reduces overfitting
Parallelizable
Works well with high-variance models

Example

Random Forest

Random Forest adds another level of randomness:

bootstrap samples
random subset of features

2. Boosting

Idea

Train models sequentially.

Every new model focuses on mistakes made by previous models.

Instead of many strong learners,

build many weak learners.

Example


Tree 1

Accuracy = 70%

↓

Wrong predictions identified

↓

Tree 2

focuses on those mistakes

↓

Tree 3

focuses on remaining mistakes

↓

Final weighted prediction

Unlike bagging


Bagging

Tree1

Tree2

Tree3

Independent


Boosting

Tree1

↓

Tree2

↓

Tree3

Sequential

AdaBoost

Every observation starts with equal weight.


Sample A  weight = 1

Sample B  weight = 1

Sample C  weight = 1

Wrong predictions receive larger weights.


Sample A  weight = 5

Sample B  weight = 1

Sample C  weight = 8

Next learner focuses more on A and C.
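The weights above are illustrative. In the classic binary AdaBoost formulation, weights start at 1/N and are renormalized after every round, and the size of the update depends on the learner's weighted error. A minimal sketch of one reweighting round, with an illustrative function name and a made-up four-sample example:

```python
import math


def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (new normalized weights, learner weight alpha)."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("a weak learner needs weighted error strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correctly classified samples, grow the misclassified ones.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the learner got only the last sample wrong
new_weights, alpha = adaboost_reweight(weights, correct)
print(new_weights, alpha)
```

After the update the misclassified samples carry half of the total weight, so the next learner cannot do well without addressing them. `alpha` is also the learner's vote in the final weighted prediction.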

Gradient Boosting

Instead of changing sample weights,

fit the next model on the residual errors.

Example

Actual house prices

Predicted

Residuals

Next tree learns

Final prediction


Prediction

+

Residual prediction
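The residual-fitting loop can be sketched with one-split trees (stumps) on a single feature. Everything concrete here is an assumption for illustration: the house sizes and prices are made up, the learning rate and round count are arbitrary, and squared error is the loss (for which the negative gradient is exactly the residual).

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on x, predicting the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=20, lr=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


sizes = [50, 80, 100, 120, 150]     # hypothetical house sizes
prices = [150, 200, 260, 300, 380]  # hypothetical prices
model = fit_gradient_boosting(sizes, prices)
print(model(100))
```

Each round adds a small correction to the running prediction, which is the "Prediction + Residual prediction" step applied repeatedly.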

XGBoost

Improves Gradient Boosting by adding:

Regularization
Parallel tree construction where possible
Missing value handling
Tree pruning
Faster optimization

LightGBM

Uses histogram-based learning.

Instead of evaluating every split,

it groups feature values into bins.


Age

18

19

20

21

22

↓

Bin 1

18-20

↓

Bin 2

21-22

Training becomes much faster.
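The binning idea itself is simple to sketch. The code below only illustrates mapping raw values to bin indices using the two bins from the example above; it is not LightGBM's actual binning algorithm, and the function name and edge list are assumptions.

```python
from bisect import bisect_left


def bin_index(value, upper_edges):
    """Index of the first bin whose inclusive upper edge is >= value."""
    return bisect_left(upper_edges, value)


ages = [18, 19, 20, 21, 22]
upper_edges = [20]  # bin 0 covers ages up to 20, bin 1 covers the rest
bins = [bin_index(a, upper_edges) for a in ages]
print(bins)
```

With five distinct ages there are four possible split points, but after binning the tree only has to evaluate the one boundary between the two bins.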

CatBoost

Designed for categorical variables.

Instead of manual encoding,

it automatically converts categorical values.

Example


City

Delhi

Mumbai

Delhi

Pune

CatBoost learns useful numeric representations internally.
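One idea behind this conversion is an ordered target statistic: each row's category is replaced by a smoothed average of the target computed only from earlier rows, so a row never sees its own label. The sketch below is a simplified illustration of that idea, not CatBoost's exact implementation; the targets, `prior`, and `weight` are hypothetical.

```python
def ordered_target_encoding(categories, targets, prior=0.5, weight=1.0):
    """Encode each row using target statistics from the rows before it."""
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        s = sums.get(category, 0.0)
        c = counts.get(category, 0)
        # Smoothed mean of earlier targets for this category.
        encoded.append((s + prior * weight) / (c + weight))
        sums[category] = s + target
        counts[category] = c + 1
    return encoded


cities = ["Delhi", "Mumbai", "Delhi", "Pune"]
targets = [1, 0, 1, 0]  # hypothetical binary labels
print(ordered_target_encoding(cities, targets))
```

The first time a city appears it gets the prior; the second "Delhi" row is encoded using only the first Delhi row's label.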

3. Stacking

Idea:

Different models capture different patterns.

Use another model to combine them.

Example

Level 1 models


Random Forest

↓

Prediction = 0.82

----------------

XGBoost

↓

Prediction = 0.91

----------------

Neural Network

↓

Prediction = 0.88

These predictions become features.

Meta model


Logistic Regression

↓

Final Prediction

Architecture


Training Data

↓

Random Forest

↓

Prediction

----------------

Training Data

↓

XGBoost

↓

Prediction

----------------

Training Data

↓

SVM

↓

Prediction

↓

Meta Model

↓

Final Output

Advantages

Often highest accuracy
Combines strengths of multiple algorithms
Can model complex relationships between base model predictions

Disadvantages

More computationally expensive
Requires careful cross-validation to avoid data leakage

4. Voting

Simplest ensemble.

No retraining.

Just combine predictions.

Hard Voting


Model A → Dog

Model B → Dog

Model C → Cat

↓

Dog

Majority wins.

Soft Voting

Uses probabilities.


Model A

Dog = 0.90

Cat = 0.10

----------------

Model B

Dog = 0.60

Cat = 0.40

----------------

Model C

Dog = 0.55

Cat = 0.45

Average


Dog = 0.68

Cat = 0.32

↓

Dog

Soft voting often performs better than hard voting because it takes each model's confidence into account, provided the predicted probabilities are reasonably well calibrated.
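Both voting schemes are a few lines each. The sketch below reuses the numbers from the examples above; the function names are illustrative.

```python
from collections import Counter


def hard_vote(labels):
    """Return the most common predicted label."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities across models and pick the best class."""
    classes = probabilities[0].keys()
    averaged = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(averaged, key=averaged.get), averaged


print(hard_vote(["Dog", "Dog", "Cat"]))
label, averaged = soft_vote([
    {"Dog": 0.90, "Cat": 0.10},
    {"Dog": 0.60, "Cat": 0.40},
    {"Dog": 0.55, "Cat": 0.45},
])
print(label, {c: round(p, 2) for c, p in averaged.items()})
```

Hard voting only needs class labels, while soft voting requires every model to output probabilities.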

Comparison

Method	Training Style	Base Models	Main Goal	Reduces	Parallelizable	Example
Bagging	Parallel	Usually same type	Reduce variance	Variance	Yes	Random Forest
Boosting	Sequential	Usually weak learners	Reduce bias and improve accuracy	Bias (and often variance)	Mostly No	AdaBoost, XGBoost
Stacking	Parallel + Meta Model	Different types	Learn optimal combination	Depends	Partly	RF + SVM + XGBoost
Voting	Parallel	Different or same	Combine predictions	Depends	Yes	Hard Voting, Soft Voting

When to Use Which?

Scenario	Best Choice	Reason
Decision trees overfit	Bagging / Random Forest	Reduces variance by averaging many trees
Need the highest predictive accuracy on structured/tabular data	Boosting (XGBoost, LightGBM, CatBoost)	Sequentially corrects previous errors and models complex patterns
Have several strong but diverse models	Stacking	Learns how to combine complementary strengths
Need a simple ensemble without extra training	Voting	Easy to implement and often improves robustness
Large datasets requiring fast training	LightGBM	Optimized histogram-based algorithm with efficient tree growth
Data contains many categorical features	CatBoost	Natively handles categorical variables with minimal preprocessing

Summary

Bagging: Build many independent models on bootstrapped data and aggregate their predictions to reduce variance.
Boosting: Build models sequentially, with each new model correcting errors made by previous ones to improve accuracy.
Stacking: Train diverse base models and a meta-model that learns the best way to combine their predictions.
Voting: Combine predictions from multiple models directly using majority vote (hard voting) or averaged probabilities (soft voting).


