Fundamentals of AI and ML
Generative AI Concepts
Foundation Models
Prompt Engineering
Responsible AI
AWS AI Services
AWS Generative AI Services
Security and Compliance
Common Use Cases
Exam Tips

1. Fundamentals of AI and ML

Artificial Intelligence (AI)

Machines performing tasks that normally require human intelligence.

Machine Learning (ML)

Subset of AI where systems learn patterns from data.

Deep Learning

Uses neural networks with multiple layers.

Generative AI

Creates new content such as:

Text
Images
Audio
Video
Code

Training vs Inference

Term	Meaning
Training	Model learns from data
Inference	Model makes predictions using learned knowledge
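
A minimal sketch of the difference, assuming a one-feature linear model fitted by least squares. The data and the function names `train` and `infer` are illustrative only.

```python
def train(xs, ys):
    """Training: estimate slope and intercept from labeled data (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x):
    """Inference: apply the already-learned parameters to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = train([1, 2, 3, 4], [3, 5, 7, 9])  # toy data following y = 2x + 1
prediction = infer(model, 10)
```

Training runs once over the labeled data; inference reuses the stored parameters for every new input.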

2. Generative AI Concepts

Large Language Model (LLM)

Examples:

OpenAI GPT models
Anthropic Claude
Meta Llama

Tokens

Text is broken into small units called tokens.

Example:


I love AWS

May become:


[I] [love] [AWS]

Hallucination

Model generates incorrect information while sounding confident.

Context Window

Amount of information an LLM can consider at once.

3. Foundation Models

Foundation Model (FM)

Large pretrained model that can be adapted for many tasks.

Examples:

Text generation
Summarization
Classification
Translation
Chatbots

Multimodality

Modalities	Example Models
Text → Text	GPT-4o, Claude Sonnet, Llama 3, Mistral
Image → Text	LLaVA, Qwen2.5-VL, BLIP-2, InstructBLIP
Image + Text → Text	GPT-4o, Gemini 2.5, Claude Sonnet, Kosmos-2
Image ↔ Text (Similarity/Retrieval)	CLIP, SigLIP, ALIGN, Florence
Text → Image	Stable Diffusion, DALL-E 3, Imagen, FLUX.1
Image → Image	Stable Diffusion XL, ControlNet, InstructPix2Pix
Text → Audio (Speech)	Tacotron 2, VALL-E, Bark
Audio → Text (ASR)	Whisper, wav2vec 2.0, Conformer
Text + Audio → Text	GPT-4o, Gemini 2.5
Audio ↔ Text	CLAP, AudioCLIP
Text → Video	Sora, Veo, Gen-3, Pika
Image → Video	Runway Gen-3, Pika, Luma Dream Machine
Video → Text	Video-LLaVA, VideoChatGPT, Gemini 2.5
Video + Text → Text	GPT-4o, Gemini 2.5, Qwen2.5-VL
Text + Image + Audio → Text	GPT-4o, Gemini 2.5
Text + Image + Audio + Video → Text	GPT-4o, Gemini 2.5
Text + Image + Audio + Video → Text + Audio	GPT-4o Realtime, Gemini Live

Fine Tuning

Further training a pretrained model on domain-specific data to adapt its behavior.

Retrieval Augmented Generation (RAG)

Instead of retraining:

Retrieve documents
Send documents to LLM
Generate response

Benefits:

Lower cost
More current data
Reduced hallucinations

RAG Pipeline: Model Types Used at Each Step

RAG Step	Purpose	Model Type	Example Models
1. Document Ingestion	Read PDFs, DOCX, HTML, Images	OCR / Document AI	Tesseract, LayoutLM, Donut
2. Chunking	Split documents into passages	Rule-based / NLP	Sentence Splitter, Recursive Text Splitter
3. Text → Embeddings	Convert chunks into vectors	Embedding Model (Encoder-only Transformer)	BERT, Sentence-BERT, E5, BGE
4. Vector Storage	Store embeddings	Vector Database	FAISS, Milvus, Weaviate, Pinecone
5. Query → Embedding	Convert user query to vector	Same Embedding Model	BGE, E5, SBERT
6. Retrieval	Find nearest chunks	ANN or Exact Search Algorithm	HNSW, IVF (approximate); Flat Search (exact)
7. Re-ranking (Optional)	Improve retrieved results	Cross Encoder	MonoBERT, Cohere Rerank, BGE Reranker
8. Context Construction	Build prompt with retrieved chunks	Prompt Builder	Template Engine
9. Answer Generation	Generate final answer	Decoder-only LLM	GPT-4o, Claude Sonnet, Llama 3, Mistral
10. Citation Generation (Optional)	Show sources	LLM / Metadata Layer	GPT-4o, Claude

Transformer Architecture Used at Each Step

Step	Transformer Type
Embedding Generation	Encoder-only
Re-ranking	Encoder-only (Cross Encoder)
Answer Generation	Decoder-only
Translation (optional)	Encoder-Decoder
Summarization (optional)	Encoder-Decoder
OCR Understanding	Encoder or Encoder-Decoder
Multimodal RAG	Vision Encoder + LLM Decoder

Common Models by Transformer Family

Transformer Family	Example Models	Used For
Encoder-only	BERT, RoBERTa, SBERT, E5, BGE	Embeddings, Retrieval
Decoder-only	GPT, Llama, Claude, Mistral, Qwen	Generation
Encoder-Decoder	T5, FLAN-T5, BART	Summarization, Translation
Vision Encoder	ViT, CLIP Vision Encoder	Image Embeddings
Vision-Language	LLaVA, Qwen-VL, GPT-4o	Multimodal RAG

Typical Modern RAG Stack

Layer	Common Choice
Chunking	LangChain Recursive Splitter
Embeddings	BGE-large, E5-large
Vector DB	FAISS, Milvus
Retrieval	HNSW
Re-ranker	BGE-Reranker
Generator	GPT-4o, Claude, Llama 3
Orchestration	LangChain, LlamaIndex

Mental Model


Documents
    ↓
Chunking
    ↓
Encoder Model
(BERT / E5 / BGE)
    ↓
Embeddings
    ↓
Vector DB
    ↓
End

User Query
    ↓
Encoder Model
(BERT / E5 / BGE)
    ↓
Similarity Search
    ↓
Top K Chunks
    ↓
Cross Encoder Re-ranker
(Optional)
    ↓
Prompt Construction
    ↓
Decoder LLM
(GPT / Claude / Llama)
    ↓
Final Answer
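
The ingestion and query flows above can be sketched in a few lines. This is a toy illustration, not a production stack: the bag-of-words `embed` function stands in for a real encoder such as BGE or E5, a Python list stands in for the vector DB, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an encoder model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Ingestion: chunk, embed, and store (the list plays the role of the vector DB).
chunks = [
    "amazon bedrock provides access to foundation models",
    "amazon polly converts text to speech",
    "amazon textract extracts text from documents",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Query time: embed the query and return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, k=1):
    """Prompt construction: retrieved chunks plus the user question."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


top = retrieve("which service gives access to foundation models")
prompt = build_prompt("which service gives access to foundation models")
```

The resulting `prompt` is what would be sent to the decoder LLM in the last step of the diagram.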

User Memory Pipeline (Mem0)
User Conversations
        ↓
Mem0
(Extracts Important Facts)
        ↓
Memory Store
(Vector DB / PostgreSQL / Redis)
        ↓
End

Query-Time Retrieval Pipeline
User Query
    ↓
Encoder Model
(BERT / E5 / BGE)
    ├─────────────────────────────┐
    │                             │
    ▼                             ▼
Similarity Search          Mem0 Memory Retrieval
(Vector DB)                (Relevant User Memories)
    │                             │
    ▼                             ▼
Top K Chunks              Relevant Memories
    │                             │
    └──────────────┬──────────────┘
                   ▼
      Cross Encoder Re-ranker
            (Optional, for documents)
                   ▼
        Prompt Construction
   (Documents + User Memory + Query)
                   ▼
          Decoder LLM
     (GPT / Claude / Llama)
                   ▼
            Final Answer
                   ▼
      Mem0 Memory Update
 (Store new preferences/facts if needed)

Mem0

Before generation: Retrieves relevant user preferences, past interactions, and long-term memory.
After generation: Extracts new important facts from the conversation and stores them for future use.
Difference from RAG: RAG retrieves knowledge from documents, while Mem0 retrieves knowledge about the user or previous interactions. Both are complementary and are typically combined before prompt construction.

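Hybrid Search combines keyword search (e.g., BM25) with vector search to improve retrieval recall (finding the right documents).

HyDE (Hypothetical Document Embeddings) has the LLM draft a hypothetical answer and retrieves with that answer's embedding, which improves query understanding (making difficult or ambiguous queries easier to retrieve).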

Task	Model Type
Create Embeddings	Encoder-only
Retrieve Documents	Vector Search
Re-rank Results	Cross Encoder
Generate Answer	Decoder-only LLM
Summarize Documents	Encoder-Decoder
Multimodal Retrieval	CLIP / Vision Encoder
Multimodal Generation	GPT-4o / Gemini / Qwen-VL

4. Prompt Engineering

Zero-Shot Prompting


Translate this sentence to French.

One-Shot Prompting

Provide one example.

Few-Shot Prompting

Provide multiple examples.

Chain of Thought

Ask model to reason step-by-step.

Prompt Components

Role
Context
Instructions
Examples
Constraints
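
A small sketch of how these components might be assembled into one prompt string. The function name `build_prompt`, the field labels, and the sample texts are assumptions for illustration; no particular model requires this exact layout. Passing zero, one, or several examples yields zero-shot, one-shot, and few-shot prompts respectively.

```python
def build_prompt(role, context, instructions, constraints, user_input, examples=()):
    """Assemble a prompt from role, context, instructions, examples, and constraints."""
    parts = [
        f"Role: {role}",
        f"Context: {context}",
        f"Instructions: {instructions}",
        f"Constraints: {constraints}",
    ]
    if examples:
        shots = "\n".join(f"Input: {q}\nOutput: {a}" for q, a in examples)
        parts.append(f"Examples:\n{shots}")
    # The prompt ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)


common = dict(
    role="You are a translator.",
    context="The user is learning French.",
    instructions="Translate the input sentence to French.",
    constraints="Reply with the translation only.",
    user_input="Good night",
)
zero_shot = build_prompt(**common)
one_shot = build_prompt(**common, examples=[("Hello", "Bonjour")])
few_shot = build_prompt(**common, examples=[("Hello", "Bonjour"), ("Thank you", "Merci")])
```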

LLM Architecture

Large Language Model (LLM) Architecture

At a high level, an LLM is a Transformer-based neural network that converts input text into tokens, processes them through multiple Transformer blocks, and predicts the next token repeatedly to generate text.


                 Input Text
                     │
                     ▼
             Text Tokenization
                     │
                     ▼
              Token IDs (Integers)
                     │
                     ▼
            Token Embedding Layer
                     │
                     ▼
         Positional Encoding/Embedding
                     │
                     ▼
        N Transformer Decoder Blocks
                     │
                     ▼
           Final Hidden Representation
                     │
                     ▼
            Linear (Output Projection)
                     │
                     ▼
                 Softmax Layer
                     │
                     ▼
           Probability for Every Token
                     │
                     ▼
          Select Next Token (Sampling)
                     │
                     ▼
              Append to Input & Repeat

Components of an LLM

Component	Purpose
Tokenizer	Converts text into tokens
Embedding Layer	Converts token IDs into dense vectors
Positional Encoding	Gives the model information about token order
Transformer Decoder Blocks	Learns relationships between tokens
Multi-Head Self-Attention	Determines which tokens are relevant to each other
Feed Forward Network (FFN)	Learns complex patterns
Residual Connections	Prevent information loss
Layer Normalization	Stabilizes training
Output Linear Layer	Maps hidden vectors to vocabulary logits
Softmax	Converts logits into probabilities
Sampling Strategy	Chooses the next token

Step 1: Tokenization

The tokenizer splits text into tokens.

Example:

Input


I love machine learning

Tokenizer


["I", "love", "machine", "learning"]

Convert to IDs


[52, 908, 3210, 4567]

The neural network only understands numbers.
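
A toy word-level tokenizer, as a sketch only. Real LLM tokenizers use subword schemes such as BPE, so the actual tokens and IDs differ from both this example and the illustrative IDs above. The names `build_vocab`, `encode`, and `decode` are hypothetical.

```python
def build_vocab(corpus):
    """Assign an integer ID to each distinct word; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Text -> list of token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def decode(ids, vocab):
    """List of token IDs -> text."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)


vocab = build_vocab("I love machine learning")
ids = encode("I love machine learning", vocab)
unseen = encode("I love AWS", vocab)  # "AWS" is not in the vocabulary
```

Words missing from the vocabulary map to `<unk>`; subword tokenizers avoid this by splitting rare words into smaller known pieces.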

Step 2: Embedding Layer

Each token ID is mapped to a dense vector.

Example

Vocabulary


I → 52
love → 908
machine → 3210

Embedding


52

↓

[0.21, -0.84, 1.02, ..., 0.67]

Instead of one integer,

every token becomes a high-dimensional vector.

Typical embedding size

Model	Embedding Dimension
GPT-2 Small	768
Llama 7B	4096
GPT-3	12288

Step 3: Positional Encoding

Attention alone has no concept of sequence order.

Without positional information,

these two sentences appear identical.


Dog bites man

Man bites dog

Positional embeddings add information like:


Dog     Position 0

bites   Position 1

man     Position 2

Final embedding


Token Embedding

+

Position Embedding

Step 4: Transformer Decoder Blocks

This is where most computation happens.

Modern LLMs stack many identical decoder blocks.

Examples:

Model	Decoder Blocks
GPT-2 Small	12
GPT-3 175B	96
Llama 3 8B	32
DeepSeek-R1 671B	61 (MoE: most blocks route tokens to expert FFNs)

Each block contains:


Input
   │
   ▼
LayerNorm
   │
   ▼
Masked Multi-Head Attention
   │
   ▼
Residual Connection
   │
   ▼
LayerNorm
   │
   ▼
Feed Forward Network
   │
   ▼
Residual Connection
   │
   ▼
Output

Step 5: Multi-Head Self-Attention

This is the core innovation of Transformers.

Every token attends to itself and every previous token (causal masking).

Sentence


The animal didn't cross the road because it was tired.

When processing "it"

attention may focus on


animal

road

The model learns that "it" refers to animal. In a causal decoder, "it" cannot attend to "tired", because that token comes later in the sequence.

Query, Key and Value

Each token produces three vectors.


Embedding

↓

Query (Q)

Key (K)

Value (V)

Meaning

Vector	Purpose
Query	What am I looking for?
Key	What information do I contain?
Value	Information to pass forward

Attention score


Q × Kᵀ

↓

Similarity Score

↓

Softmax

↓

Attention Weights

↓

Weighted Sum of Values

Formula


Attention(Q,K,V)

=

softmax(QKᵀ / √dk)V
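
The formula can be traced with a pure-Python sketch of single-head attention under a causal mask. The tiny Q, K, V matrices are made-up values; real models compute them from learned projections and use batched tensor operations.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Q, K, V are lists of vectors, one per token. Returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Causal mask: token i only scores itself and earlier tokens (j <= i).
        scores = [
            sum(qc * kc for qc, kc in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
```

The first token can only see itself, so its output equals its own value vector; later tokens blend the values of everything before them.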

Step 6: Multi-Head Attention

Instead of one attention calculation,

multiple attention heads run in parallel.

Example


Head 1

Grammar

----------------

Head 2

Subject

----------------

Head 3

Object

----------------

Head 4

Long-range dependency

Outputs are concatenated and passed through a final linear projection.

Step 7: Feed Forward Network (FFN)

After attention,

every token passes through the same small neural network independently.

Typical structure


Linear

↓

Activation (GELU, SwiGLU, etc.)

↓

Linear

Purpose

Learn nonlinear patterns
Transform features
Increase model capacity

Step 8: Residual Connections

Instead of replacing the input,

the block adds the original input back.


Output

=

Input

+

Attention Output

Benefits

Prevents vanishing gradients
Preserves information
Enables very deep networks

Step 9: Layer Normalization

Keeps activations stable during training.

Without it,

training becomes unstable as models get deeper.
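
A minimal sketch of the normalization itself, assuming a single token vector and omitting the learnable gain and bias that real layers include. The function name `layer_norm` and the `eps` value are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```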

Step 10: Output Projection

Final hidden vector

↓

Linear layer

↓

Vocabulary size

Example vocabulary

50,000 tokens

Output


Dog     3.2

Cat     8.9

Car     1.4

Paris   10.1

These are logits (unnormalized scores).

Step 11: Softmax

Softmax converts logits into probabilities.

Example


Paris   0.72

London  0.12

Berlin  0.08

Rome    0.08

Probabilities sum to 1.
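
A short sketch of softmax applied to the example logits from Step 10 (Dog, Cat, Car, Paris). Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = {"Dog": 3.2, "Cat": 8.9, "Car": 1.4, "Paris": 10.1}
probs = dict(zip(logits, softmax(list(logits.values()))))
best = max(probs, key=probs.get)
```

With these logits, "Paris" receives roughly three quarters of the probability mass and "Cat" most of the rest.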

Step 12: Token Sampling

The next token is selected.

Methods include:

Method	Description
Greedy	Choose the highest probability token
Beam Search	Explore multiple candidate sequences simultaneously
Top-k Sampling	Sample only from the top k most probable tokens
Top-p (Nucleus) Sampling	Sample from the smallest set of tokens whose cumulative probability exceeds p
Temperature Sampling	Adjust randomness by scaling logits before Softmax
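
The sampling methods in the table can be sketched as small filters over a probability list. This is an illustration using the example probabilities from Step 11; the function names are hypothetical, and beam search is omitted because it tracks whole sequences rather than single tokens.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature < 1 sharpens the distribution; > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


tokens = ["Paris", "London", "Berlin", "Rome"]
probs = [0.72, 0.12, 0.08, 0.08]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
choice = tokens[sample(top_k_filter(probs, 2), rng)]
```

In practice these filters are often combined: temperature is applied to the logits first, then top-k or top-p trims the candidates before sampling.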

Generated token

↓

Append to prompt

↓

Run the decoder again

↓

Predict the next token

This repeats until an end-of-sequence token or another stopping condition is reached.
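
The loop can be sketched as follows. `toy_model` is a hypothetical stand-in for the whole decoder stack: it looks only at the last token and returns made-up scores, whereas a real LLM conditions on the full sequence.

```python
def toy_model(tokens):
    """Hypothetical stand-in for the decoder: scores for the next token."""
    table = {
        "is": {"Paris": 3.0, "London": 1.0},
        "Paris": {"<eos>": 2.0, "is": 0.5},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):           # stopping condition 1: length limit
        scores = toy_model(tokens)
        next_token = max(scores, key=scores.get)  # greedy selection
        if next_token == "<eos>":             # stopping condition 2: end-of-sequence
            break
        tokens.append(next_token)             # append and run the model again
    return tokens


result = generate(["The", "capital", "of", "France", "is"])
```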

Why is it called a Decoder-only Transformer?

The original Transformer introduced in the paper "Attention Is All You Need" had two parts:


Encoder

↓

Decoder

LLMs such as GPT, Llama, Qwen, and DeepSeek use only the decoder stack with masked (causal) self-attention, so each token can attend only to itself and previous tokens. This enables autoregressive next-token prediction.

End-to-End Example

Input:


The capital of France is

Processing:


Text
   │
   ▼
Tokenizer
   │
   ▼
Token IDs
   │
   ▼
Embeddings + Position Embeddings
   │
   ▼
32–100+ Transformer Decoder Blocks
   │
   ▼
Hidden Representation
   │
   ▼
Linear Projection
   │
   ▼
Softmax
   │
   ▼
Predicted Token: "Paris"
   │
   ▼
Append "Paris" to the input
   │
   ▼
Predict the next token

Architecture Summary

Component	Function
Tokenizer	Converts text to token IDs
Embedding Layer	Converts token IDs to dense vectors
Positional Embedding	Encodes token order
Masked Multi-Head Self-Attention	Captures relationships with previous tokens
Feed Forward Network	Learns nonlinear transformations
Residual Connections	Preserve information and improve gradient flow
Layer Normalization	Stabilizes training
Linear Output Layer	Projects hidden states to vocabulary logits
Softmax	Produces probabilities over the vocabulary
Sampling	Selects the next token for generation

Modern LLMs extend this core architecture with optimizations such as Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) or Multi-Query Attention (MQA), Mixture of Experts (MoE), FlashAttention, KV caching, and quantization, but the fundamental decoder-only Transformer pipeline remains the same.

Training

LLMs are trained in multiple stages, each with a different objective. Not every model goes through every stage, but modern models such as GPT, Llama, DeepSeek, and Qwen typically follow a pipeline similar to the one below.

Stage	Purpose	Input Data	Output
1. Pre-training	Learn language and world knowledge	Massive unlabeled text	Base model
2. Mid-training (Continued Pre-training)	Specialize in a domain or language	Domain-specific unlabeled data	Domain-adapted base model
3. Supervised Fine-Tuning (SFT)	Learn to follow instructions	Prompt-response pairs	Instruction-following model
4. Preference Alignment	Align responses with human preferences	Ranked responses or AI feedback	Aligned assistant
5. Task Fine-Tuning	Improve performance on a specific task	Task-specific labeled data	Specialized model
6. Distillation	Transfer knowledge from a large model	Teacher model outputs	Smaller model
7. Continuous Learning	Periodically update knowledge	New datasets	Updated model

1. Pre-training

This is where the model learns:

Grammar
Facts
Reasoning patterns
Coding syntax
Mathematics
General world knowledge

Dataset examples:

Books
Wikipedia
GitHub
Research papers
Web pages

Training objective:

Predict the next token.

Example:


The capital of France is _____

Target:


Paris

Output:


Base Model

Example models:

Llama Base
Qwen Base
DeepSeek Base

2. Mid-training (Continued Pre-training)

Sometimes called:

Continued Pre-training
Domain Adaptive Pre-training (DAPT)
Domain Adaptation

Purpose:

Teach the model a particular domain without changing its fundamental training objective.

Example:

General model

↓

Train on


100 million medical papers

↓

Medical model

The objective is still:


Predict next token

Examples:

General model →

Legal documents

↓

Legal LLM

General model →

Financial reports

↓

Finance LLM

General model →

AWS documentation

↓

Cloud Assistant

No instruction data is required.

3. Supervised Fine-Tuning (SFT)

Now teach the model how humans want it to answer.

Dataset:


Question

↓

Ideal Answer

Example:


Prompt

Explain recursion.

Target:


Recursion is a programming technique...

Loss:

Compare generated answer with target answer.
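
In practice this comparison is usually a token-level cross-entropy: the loss is low when the model assigns high probability to each token of the target answer. A minimal sketch, assuming a three-token vocabulary and made-up probabilities:

```python
import math


def cross_entropy(step_probs, target_ids):
    """Average negative log-probability assigned to each target token."""
    losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)


target_ids = [2, 0]                          # hypothetical IDs of the ideal answer
good = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]    # model mostly agrees with the target
bad = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]     # model prefers other tokens
good_loss = cross_entropy(good, target_ids)
bad_loss = cross_entropy(bad, target_ids)
```

Training adjusts the weights to push `bad_loss`-style predictions toward `good_loss`-style ones.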

Output:

Instruction-following model.

Examples:

ChatGPT-style assistants
DeepSeek Chat
Llama Instruct

4. Preference Alignment

SFT teaches the model what a good answer looks like.

Alignment teaches which of several acceptable answers people prefer.

Example

Question:


Explain Java.

Response A


Very detailed

Response B


Clear and concise

Humans prefer B.

The model learns:


B > A

Methods include:

RLHF

Reinforcement Learning from Human Feedback

Humans rank responses.

Reward model learns preferences.

Policy optimized using reinforcement learning (historically often PPO).

RLAIF

Reinforcement Learning from AI Feedback

Instead of humans,

another AI ranks responses.

Cheaper than RLHF.

DPO

Direct Preference Optimization

Modern alternative.

No separate reward model or reinforcement learning loop.

Directly optimizes preferred responses.

Much simpler training.
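
A sketch of the DPO loss for a single preference pair, assuming the inputs are sequence log-probabilities (sums of token log-probs) under the policy being trained and a frozen reference model. The function name, argument names, and the `beta` value are illustrative.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin measures how much more the
    policy favors the chosen response than the reference model does."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy now rates the chosen response higher and the rejected one lower.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy shifts probability toward the preferred response relative to the reference, which is how "B > A" is learned without a reward model.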

5. Task Fine-Tuning

Now optimize for one specific task.

Examples

Sentiment Analysis


Review

↓

Positive

NER


Sentence

↓

Entities

Translation


English

↓

French

Hallucination Detection


Question + Answer

↓

Hallucination score

Example

General assistant

↓

Fine tune on


Medical QA

↓

Medical Assistant

6. Distillation

Large models are expensive.

Teacher

↓

Generate answers

↓

Student learns

Example


GPT-4

↓

Produces millions of examples

↓

7B model

↓

Learns to imitate GPT-4

Result:

Smaller

Faster

Cheaper

Examples:

DistilBERT
DeepSeek-R1-Distill
Llama distilled variants

7. Continuous Learning

Models become outdated.

Periodically update using:

New research papers
New laws
Latest documentation
New programming languages

Example

2024 model

↓

Train on 2025 data

↓

Updated model

Many production systems combine this with retrieval-based methods rather than continuously retraining the base model.

Complete Training Pipeline


Raw Text Corpus
        │
        ▼
Pre-training
        │
        ▼
Base Model
        │
        ▼
Mid-training (Domain Adaptation)
        │
        ▼
Domain Base Model
        │
        ▼
Supervised Fine-Tuning (SFT)
        │
        ▼
Instruction Model
        │
        ▼
Preference Alignment
(RLHF / RLAIF / DPO)
        │
        ▼
Aligned Chat Model
        │
        ▼
Task Fine-Tuning (Optional)
        │
        ▼
Specialized Model
        │
        ▼
Distillation (Optional)
        │
        ▼
Smaller Efficient Model
        │
        ▼
Periodic Updates / Continued Training

Comparison

Stage	Learns	Uses Labels?	Training Objective
Pre-training	Language, reasoning, world knowledge	No	Next-token prediction
Mid-training	Domain knowledge (medical, legal, finance, code, etc.)	No	Next-token prediction on domain-specific data
Supervised Fine-Tuning (SFT)	Instruction following	Yes	Predict the target response
Preference Alignment (RLHF, RLAIF, DPO)	Human or AI preferences, safety, helpfulness	Preference pairs/rankings	Optimize preferred responses
Task Fine-Tuning	Specific downstream task	Yes	Task-specific objective (classification, generation, etc.)
Distillation	Mimic a larger model	Teacher-generated outputs	Match teacher behavior
Continuous Learning	New information and capabilities	Depends	Continued pre-training, fine-tuning, or other update methods

When to use each stage

Pre-training: Build a general-purpose foundation model from scratch.
Mid-training: Adapt a foundation model to a domain (e.g., medicine, finance, law, source code) without changing its general objective.
SFT: Make the model follow instructions and produce conversational, task-oriented responses.
Preference Alignment: Improve helpfulness, harmlessness, and response quality according to human or AI preferences.
Task Fine-Tuning: Maximize performance on a specific application such as summarization, code generation, or hallucination detection.
Distillation: Deploy a smaller, faster model while retaining much of a larger model's capability.
Continuous Learning: Keep models up to date as knowledge, data, and requirements evolve.

5. Responsible AI

Fairness

Avoid bias.

Explainability

Understand why model produced output.

Privacy

Protect user data.

Robustness

Model behaves reliably.

Transparency

Users know AI is involved.

Evaluation

Model Group	Evaluation	Description of the Evaluation
Agentic Models	AgentBench	Evaluates autonomous task execution, planning, tool usage, and multi-step reasoning.
Agentic Models	GAIA	Tests real-world assistant capabilities like searching, tool calling, and reasoning.
Agentic Models	SWE-bench	Measures ability to solve real GitHub issues by editing codebases.
Bi-Encoder	BEIR	Evaluates embedding-based retrieval across multiple datasets and domains.
Bi-Encoder	MTEB	Measures embedding quality across retrieval, clustering, classification, and reranking tasks.
Bi-Encoder	MS MARCO	Evaluates dense retrieval and passage ranking performance.
Cross-Encoder	BEIR	Measures pairwise query-document relevance scoring quality.
Cross-Encoder	MS MARCO	Evaluates reranking precision for query-passage relevance.
Cross-Encoder	TREC Deep Learning Track	Measures ranking quality for search relevance tasks.
Decoder-only	GSM8K	Evaluates arithmetic and multi-step mathematical reasoning.
Decoder-only	HellaSwag	Measures commonsense reasoning through sentence completion (choosing the most plausible ending).
Decoder-only	HumanEval	Evaluates code generation correctness using executable unit tests.
Decoder-only	MMLU	Measures broad knowledge and reasoning across many academic domains.
Decoder-only	MT-Bench	Tests instruction following and conversational quality.
Decoder-only	Needle-in-a-Haystack	Measures ability to retrieve specific information from long contexts.
Decoder-only	TruthfulQA	Tests whether answers stay truthful on questions that invite common misconceptions.
Encoder-decoder	BLEU	Measures overlap between generated text and reference text, mainly for translation.
Encoder-decoder	ROUGE	Evaluates summarization quality based on n-gram overlap.
Encoder-decoder	SQuAD	Measures extractive question-answering accuracy.
Encoder-only	GLUE	Evaluates language understanding tasks like sentiment, entailment, and similarity.
Encoder-only	MTEB	Measures embedding performance across multiple NLP tasks.
Encoder-only	STS-B	Measures how well embeddings capture sentence similarity.
Encoder-only	SuperGLUE	Harder version of GLUE for advanced reasoning tasks.
Long-context Models	InfiniteBench	Evaluates memory retention and reasoning over very long contexts.
Long-context Models	LongBench	Tests summarization, retrieval, and reasoning on long documents.
Long-context Models	Needle-in-a-Haystack	Measures retrieval accuracy from large contexts.
Multimodal Models	MMBench	Evaluates image understanding and multimodal reasoning.
Multimodal Models	MMMU	Measures multimodal reasoning across academic and professional domains.
Multimodal Models	MMVet	Tests advanced visual reasoning and perception.
RAG Systems	CRUD-RAG	Measures retrieval robustness and update handling in RAG pipelines.
RAG Systems	RAGAS	Evaluates faithfulness, context precision, context recall, and answer relevance in RAG.
Reward Models	RewardBench	Evaluates preference model quality and alignment performance.
Tool-use Models	ToolBench	Measures correctness in tool selection, API usage, and tool chaining.

6. AWS AI Services

Amazon Rekognition

Image analysis
Face detection
Object detection

Amazon Comprehend

Sentiment analysis
Entity extraction
Language detection

Amazon Transcribe

Speech to text

Amazon Polly

Text to speech

Amazon Textract

Extract text from documents

Amazon Translate

Language translation

AWS DeepRacer

AWS DeepRacer is a cloud-based autonomous racing car platform used to learn, train, and evaluate reinforcement learning (RL) models through simulated and real-world racing.

AWS DeepLens

Run computer vision and deep learning models on an AI-enabled camera.

AWS DeepComposer

Learn generative AI and machine learning through music composition.

7. AWS Generative AI Services

Amazon Bedrock

Most important service for the exam.

Provides access to foundation models from:

Anthropic Claude
Meta Llama
Amazon Nova
Stability AI

Features:

RAG
Agents
Knowledge Bases
Guardrails
Fine-tuning

Streaming API

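- Tokens are streamed as they are generated (InvokeModelWithResponseStream, ConverseStream); a WebSocket is a common way to relay the stream to the client

- Control the buffer size for a better UX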

Amazon Q Business

Enterprise chatbot over company data.

Amazon Q Developer

Developer coding assistant.

Amazon SageMaker AI

Build, train and deploy ML models.

SageMaker Inference

SageMaker Pipelines

- Batch mode saves cost by preventing GPU underutilization.

8. Security and Compliance

Shared Responsibility Model

AWS secures:

Infrastructure
Hardware
Network

Customer secures:

Data
Access control
Configuration

IAM

Identity and access management.

Encryption

Data:

At rest
In transit

Lake Formation

- Fine-grained access control, up to column-level security

Adversarial Input

- Evaluate your model periodically using synthetic data to reduce the risk of adversarial inputs causing issues.

9. Common Use Cases

Classification


Spam or Not Spam

Sentiment Analysis


Positive / Negative

Summarization


Long article -> Short summary

Chatbots


Customer support

Code Generation


Generate Java/Python code

Document Processing


Invoice extraction

Frequently Tested Comparisons

Service	Purpose
Bedrock	Generative AI
SageMaker AI	Build/train ML models
Comprehend	NLP analysis
Rekognition	Image analysis
Textract	Document extraction
Transcribe	Speech to text
Polly	Text to speech
Translate	Translation

10. Exam Tips

Topic	Remember
Generative AI	Creates new content
Foundation Model	Large pretrained model
Hallucination	Confident wrong answer
RAG	Retrieve + Generate
Bedrock	Managed GenAI platform
SageMaker	ML lifecycle
Guardrails	Safety controls
Fine-Tuning	Further train a pretrained model on your data
Inference	Model prediction
Token	Small text unit

Positional Embeddings and Encodings

There are several ways to represent the position of tokens in a Transformer. Although people often say "embeddings like RoPE," RoPE is actually a positional encoding/embedding technique, not a semantic embedding like a word embedding.

Below are the major types of positional embeddings/encodings used in Transformers.

Method	Learnable	Extrapolates to Longer Sequences	Used In
Absolute Positional Embedding (APE)	Yes	❌ No	GPT-2, BERT
Sinusoidal Positional Encoding	No	In theory yes; limited in practice	Original Transformer
Relative Positional Embedding (RPE)	Usually Yes	Better than APE	T5, Transformer-XL
Rotary Positional Embedding (RoPE)	No	Limited without scaling (see NTK-aware, YaRN)	Llama, Qwen, DeepSeek, GPT-NeoX
ALiBi (Attention with Linear Biases)	No	✅ Excellent	BLOOM, MPT
xPos	No	✅ Better than RoPE	Long-context models
Dynamic RoPE	No	✅ Improved long-context support	Some Llama variants
NTK-aware RoPE Scaling	No	✅ Extends context window	Llama long-context adaptations
YaRN (Yet another RoPE extensioN)	No	✅ Very good	Long-context fine-tuned LLMs
LeX (Length Extrapolation)	Varies	✅ Designed for long context	Research models

1. Absolute Positional Embedding (APE)

Each position has its own learnable vector.

Example


Position 0 → [0.1, 0.3, ...]
Position 1 → [0.4, 0.8, ...]
Position 2 → [0.2, 0.9, ...]

Final input


Token Embedding

+

Position Embedding

Advantages

Simple
Learns position information

Disadvantages

Cannot naturally handle sequences longer than those seen during training
Every position needs its own learned vector

Used in:

GPT-2
BERT

2. Sinusoidal Positional Encoding

Introduced in the original Transformer paper.

Uses sine and cosine functions.

Formula


PE(pos,2i)=sin(pos/10000^(2i/d))

PE(pos,2i+1)=cos(pos/10000^(2i/d))
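
A direct transcription of the two formulas, as a sketch that assumes an even model dimension `d_model`. Even indices hold the sine terms and odd indices the cosine terms.

```python
import math


def sinusoidal_encoding(pos, d_model):
    """Return the positional encoding vector for one position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # PE(pos, 2i), PE(pos, 2i+1)
    return pe


first = sinusoidal_encoding(0, 4)
later = sinusoidal_encoding(5, 8)
```

Because the function is deterministic, an encoding can be computed for any position, including positions never seen during training.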

Advantages

No training required
Infinite positions can be computed
Generalizes better to unseen sequence lengths

Disadvantages

Less expressive than learned approaches

Used in:

Original Transformer

3. Relative Positional Embedding (RPE)

Instead of absolute positions,

the model learns the relative distance.

Example


Dog is 2 words before cat.

The model learns


Distance = +2

instead of


Dog at position 5

Cat at position 7

Advantages

Better captures local relationships
Handles varying sequence lengths more naturally

Used in:

T5
Transformer-XL
DeBERTa (with variants)

4. RoPE (Rotary Positional Embedding)

The most popular method in modern LLMs.

Instead of adding a positional vector,

RoPE rotates the Query and Key vectors by an angle based on position.


Embedding

↓

Query

↓

Rotate by position angle

↓

Attention

Advantages

Good long-context behavior, especially with scaling extensions (NTK-aware, YaRN)
Naturally preserves relative position information
No additional learned parameters

Used in:

Llama
Qwen
DeepSeek
GPT-NeoX
Mistral
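
A sketch of the rotation, assuming the interleaved pairing (x0, x1), (x2, x3), ... and a base of 10000; libraries differ in how they pair dimensions. It also checks the property that makes RoPE useful: the Query-Key dot product depends only on the relative distance between positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # lower frequency for later pairs
        angle = pos * theta
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_near = dot(rope(q, 3), rope(k, 1))     # positions 3 and 1, distance 2
score_far = dot(rope(q, 12), rope(k, 10))    # positions 12 and 10, distance 2
```

Both scores match because the distance is the same, even though the absolute positions differ.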

5. ALiBi (Attention with Linear Biases)

Instead of embeddings,

add a linear bias directly to attention scores.


Attention Score

+

Distance Bias

Farther tokens receive a progressively larger penalty.
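
A sketch of the bias for one attention head, assuming a causal model and the geometric slope schedule described in the ALiBi paper for a power-of-two number of heads. The function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """bias[i][j] = slope * (j - i) for j <= i, so distant tokens get a larger penalty.
    Future positions (j > i) are omitted; the causal mask handles them."""
    return [[slope * (j - i) for j in range(i + 1)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
```

Each row of `bias` is added to that query token's attention scores before softmax.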

Advantages

Extremely simple
Strong length extrapolation
No positional vectors

Used in:

BLOOM
MPT

6. xPos

An extension of RoPE.

Designed to improve long-sequence stability.

In addition to the RoPE rotation,

it applies a position-dependent scaling to the Query and Key vectors so that attention decays with relative distance.

Advantages

Better than RoPE for very long contexts
Preserves attention stability

7. Dynamic RoPE

Standard RoPE uses fixed frequencies.

Dynamic RoPE adjusts frequencies dynamically for longer contexts.

Advantages

Better context extrapolation
Improved performance on long documents

8. NTK-aware RoPE Scaling

Originally developed for extending Llama's context window.

Idea

Stretch the RoPE frequencies.

Example


Original

4096 tokens

↓

Scaled

32768 tokens

No retraining is required in many implementations.
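
One widely circulated formulation raises the RoPE base to base × s^(d/(d−2)), where s is the context scale factor and d is the head dimension. The sketch below assumes that formulation, a base of 10000, and a head dimension of 128; real implementations differ in details.

```python
def ntk_scaled_base(base, scale, head_dim):
    """Adjusted RoPE base under the assumed NTK-aware formulation."""
    return base * scale ** (head_dim / (head_dim - 2))


def rope_frequencies(base, head_dim):
    """One rotation frequency per pair of dimensions."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]


scale = 32768 / 4096                      # 8x longer context
original = rope_frequencies(10000.0, 128)
scaled = rope_frequencies(ntk_scaled_base(10000.0, scale, 128), 128)
```

The highest frequency is left unchanged, which preserves local detail, while the lowest frequency is divided by the scale factor, which stretches the longest wavelengths to cover the extended window.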

Advantages

Very popular
Simple
Enables longer contexts

9. YaRN (Yet another RoPE extensioN)

Improves on NTK scaling.

Combines interpolation and scaling techniques.

Advantages

Better long-context quality
Less degradation than naive scaling
Widely used for 128K+ context extensions

10. LeX (Length Extrapolation)

Research methods that explicitly optimize for longer contexts.

Goal

Train models that naturally generalize beyond the training context length.

Which models use which?

Model	Positional Method
GPT-2	Absolute Positional Embedding
BERT	Absolute Positional Embedding
Original Transformer	Sinusoidal Encoding
Transformer-XL	Relative Positional Encoding
T5	Relative Positional Encoding
DeBERTa	Relative Position Bias
GPT-NeoX	RoPE
Llama 1/2/3	RoPE
Qwen	RoPE
DeepSeek	RoPE
Mistral	RoPE
BLOOM	ALiBi
MPT	ALiBi

Other Types of Embeddings in LLMs

In addition to positional embeddings, LLMs use several other embedding types.

Embedding Type	Purpose
Token Embedding	Represents each token as a dense vector
Positional Embedding/Encoding	Represents token order (e.g., RoPE, ALiBi, APE)
Segment (Token Type) Embedding	Distinguishes sentence A from sentence B (used in BERT)
Word Embedding	Maps words/subwords to vectors
Character Embedding	Represents characters instead of words
Sentence Embedding	Represents an entire sentence with a single vector
Document Embedding	Represents an entire document
Instruction Embedding	Encodes task or instruction information in some architectures
Multimodal Embedding	Maps text, images, audio, etc., into a shared embedding space

Summary

Method	Basic Idea	Best For	Limitation
Absolute Positional Embedding	Add a learned vector for each position	Short, fixed-length sequences	Cannot extrapolate well
Sinusoidal Encoding	Use deterministic sine/cosine functions	General sequence modeling	Less expressive than learned methods
Relative Positional Embedding	Encode distances between tokens	Better relative reasoning	More complex attention computation
RoPE	Rotate Query/Key vectors based on position	Modern LLMs with long context	Standard RoPE still has finite context limits
ALiBi	Add a linear distance bias to attention scores	Efficient long-context inference	May underperform RoPE on some benchmarks
xPos / Dynamic RoPE / NTK Scaling / YaRN	Variants that improve or extend RoPE	Very long-context LLMs (32K–1M+ tokens)	Additional implementation complexity

Today, RoPE is the de facto standard for decoder-only LLMs (Llama, Qwen, Mistral, DeepSeek). ALiBi is valued for its simplicity and strong length extrapolation. RoPE extensions such as NTK-aware scaling and YaRN are commonly used to extend context windows without retraining the entire model.

Inference Engineering

Inference Engineering is the discipline of designing, optimizing, deploying, and scaling Large Language Models (LLMs) for efficient inference (prediction) in production environments.

While training focuses on making a model smarter, inference engineering focuses on making the model faster, cheaper, more scalable, and capable of serving millions of requests.

Why do we need Inference Engineering?

A trained model is usually huge.

Example:

Model	Parameters	FP16 Memory
Llama 3 8B	8 Billion	~16 GB
Llama 3 70B	70 Billion	~140 GB
DeepSeek R1 671B	671 Billion	>1.3 TB
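The FP16 column follows from simple arithmetic: each parameter takes 2 bytes. A minimal sketch (the function name is illustrative, and it counts weights only, ignoring KV cache and activations):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9), ("DeepSeek R1 671B", 671e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB in FP16")

# The same 70B model at 4 bits per weight (0.5 bytes) needs about a quarter of that.
print(f"Llama 3 70B in INT4: ~{weight_memory_gb(70e9, 0.5):.0f} GB")
```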

Serving these models directly causes problems:

High GPU memory usage
High latency (slow response time)
Low throughput
Expensive GPUs
Poor utilization

Inference engineering solves these issues.

ML Lifecycle


Collect Data
      │
      ▼
Pretraining
      │
      ▼
Fine-tuning
      │
      ▼
Model Evaluation
      │
      ▼
Model Registry
      │
      ▼
Inference Engineering
      │
      ▼
Production APIs

Inference engineering starts after training is complete.

Responsibilities of an Inference Engineer

An inference engineer works on:

Loading huge models efficiently
Reducing GPU memory
Optimizing attention computation
Continuous batching
KV Cache optimization
Speculative decoding
Quantization
Tensor parallelism
Pipeline parallelism
Multi-GPU serving
Autoscaling
API serving
Streaming tokens
Monitoring GPU utilization

Architecture


                     User

                       │

               HTTP / gRPC API

                       │

               Inference Server
             (vLLM / SGLang / TGI)

                       │

         ------------------------------
         |            Scheduler         |
         ------------------------------

               Continuous Batching

                       │

               Token Generation

                       │

             CUDA / FlashAttention

                       │

                 GPU Memory

                       │

                 Llama / Qwen

Problems During Inference

1. Large Model Size

Example


70B model
↓

140GB FP16

↓

Need multiple GPUs

Solution

Quantization
Tensor Parallelism

2. Slow Token Generation

Generating


Hello
↓

How

↓

are

↓

you

One token at a time is expensive.

Solution

KV Cache
Flash Attention
Speculative Decoding

3. Multiple Users

Imagine


User A
User B
User C
User D

Without batching


GPU

Run A

Run B

Run C

Run D

GPU remains underutilized.

Solution

Continuous batching.

Major Components

1. Model Loading

Instead of


Load model

Wait

Serve request

Inference servers use:

Lazy loading
Memory mapping
Sharded checkpoints

2. Scheduler

The scheduler decides


Which request?

Which GPU?

How many tokens?

Batch size?

3. KV Cache

Without cache

For every token


Input

↓

Transformer Layer 1

↓

Layer 2

↓

...

↓

Layer N

Entire sequence is recomputed.

With KV Cache


Past Keys

Past Values

↓

Reuse

↓

Only compute new token

Huge speedup.
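The saving can be shown with a toy single-head attention in plain Python. This is a sketch under simplifying assumptions (one layer, one head, random weights, all helper names hypothetical): both decoders produce the same outputs, but the cached one projects each token to a key and value only once.

```python
import math
import random


def project(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def decode_without_cache(tokens, w_q, w_k, w_v):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(w_k, x) for x in prefix]
        values = [project(w_v, x) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(w_q, tokens[t]), keys, values))
    return outputs, projections


def decode_with_cache(tokens, w_q, w_k, w_v):
    """Keep past keys/values and project only the newest token."""
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        keys.append(project(w_k, x))
        values.append(project(w_v, x))
        projections += 2
        outputs.append(attend(project(w_q, x), keys, values))
    return outputs, projections


random.seed(0)
DIM = 4
w_q, w_k, w_v = ([[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
                 for _ in range(3))
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]

full_out, full_cost = decode_without_cache(tokens, w_q, w_k, w_v)
cached_out, cached_cost = decode_with_cache(tokens, w_q, w_k, w_v)
print(full_cost, cached_cost)  # 42 vs 12 key/value projections for 6 tokens
```

The uncached cost grows quadratically with sequence length, while the cached cost grows linearly, at the price of storing the keys and values in GPU memory.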

4. Continuous Batching

Traditional batching


Request A

Request B

Request C

↓

Wait

↓

Run together

Bad because new users wait.

Continuous batching


GPU Running

↓

New request arrives

↓

Insert into running batch

↓

Continue execution

Much better GPU utilization.
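A small simulation makes the difference concrete. This is a toy model, not how any particular server schedules work: each request is (arrival_step, tokens_to_generate), the GPU has a fixed number of batch slots, and one step produces one token per running request.

```python
from collections import deque


def static_batching(requests, max_batch):
    """Form a batch, run it until its longest request finishes, then form the next."""
    pending = deque(sorted(enumerate(requests), key=lambda item: item[1][0]))
    finish, clock = {}, 0
    while pending:
        clock = max(clock, pending[0][1][0])  # idle until a request is available
        batch = []
        while pending and len(batch) < max_batch and pending[0][1][0] <= clock:
            batch.append(pending.popleft())
        clock += max(length for _, (_, length) in batch)
        for req_id, _ in batch:
            finish[req_id] = clock
    return finish


def continuous_batching(requests, max_batch):
    """Admit waiting requests into free slots before every decoding step."""
    pending = deque(sorted(enumerate(requests), key=lambda item: item[1][0]))
    running, finish, clock = {}, {}, 0
    while pending or running:
        while pending and len(running) < max_batch and pending[0][1][0] <= clock:
            req_id, (_, length) = pending.popleft()
            running[req_id] = length
        if not running:
            clock = pending[0][1][0]  # nothing to run yet, jump to next arrival
            continue
        clock += 1
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                finish[req_id] = clock
                del running[req_id]
    return finish


requests = [(0, 8), (0, 2), (1, 2), (2, 2)]  # one long request, three short ones
print(static_batching(requests, max_batch=2))      # short requests wait for the long one
print(continuous_batching(requests, max_batch=2))  # short requests finish early
```

In this example the long request finishes at step 8 either way, but the three short requests finish at steps 2, 4, and 6 instead of 8, 10, and 10.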

5. FlashAttention

Normal attention


Q × K

↓

Huge matrix

↓

Softmax

↓

Multiply V

Consumes enormous memory.

FlashAttention

Tiles computation
Uses fast on-chip SRAM (GPU shared memory)
Avoids writing large intermediate matrices to HBM
Fuses multiple attention operations into one GPU kernel

Benefits:

Lower memory usage
Higher throughput
Faster inference, especially for long sequences
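The tiling idea rests on an "online softmax": scores are processed one tile at a time while a running maximum, running sum, and running output are kept, so the full score matrix is never stored. The sketch below shows only that arithmetic on the CPU in plain Python (function names are illustrative); the real FlashAttention is a fused GPU kernel.

```python
import math
import random


def naive_attention(query, keys, values):
    """Materialize every score, apply softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def tiled_attention(query, keys, values, tile_size=2):
    """Process keys/values tile by tile, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        tile_keys = keys[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tile_keys]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        accumulator = [a * correction for a in accumulator]
        for score, value in zip(scores, tile_values):
            weight = math.exp(score - new_max)
            running_sum += weight
            accumulator = [a + weight * v for a, v in zip(accumulator, value)]
        running_max = new_max
    return [a / running_sum for a in accumulator]


random.seed(0)
query = [random.uniform(-1, 1) for _ in range(4)]
keys = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
values = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(7)]

reference = naive_attention(query, keys, values)
tiled = tiled_attention(query, keys, values, tile_size=3)
print(max(abs(a - b) for a, b in zip(reference, tiled)))  # difference is at rounding level
```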

6. Quantization

Original weights


FP32

↓

FP16

↓

INT8

↓

INT4

Benefits

Smaller model
Lower VRAM
Faster inference
Slight accuracy tradeoff
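As a rough illustration of the accuracy tradeoff, here is a sketch of symmetric per-tensor INT8 quantization. It is one simple scheme among many; production methods (per-channel scales, GPTQ, AWQ, and others) are more elaborate, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: weight ~= scale * q, with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # fall back to 1.0 if all zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.45, -0.66]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)           # small integers, 1 byte each instead of 2 or 4
print(max_error <= scale)  # rounding error is bounded by the step size
```

Note how the small weight 0.003 rounds to 0: values far below the largest weight lose the most precision, which is why finer-grained scales help.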

7. Tensor Parallelism

Suppose


70B model

Split across GPUs


GPU1

First half of each weight matrix

GPU2

Second half of each weight matrix

Both GPUs compute simultaneously.
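One common way to do the split is by columns of a weight matrix: each device multiplies the same activations by its own column shard, and the partial outputs are concatenated. A minimal sketch with plain lists standing in for GPUs (helper names are hypothetical, and the column count is assumed to divide evenly):

```python
def matmul(a, b):
    """Plain matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Cut a matrix into equal-width column shards, one per device."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]


def column_parallel_matmul(activations, weight, num_devices):
    shards = split_columns(weight, num_devices)
    # Each device computes its slice independently (in parallel on real hardware).
    partial_outputs = [matmul(activations, shard) for shard in shards]
    # All-gather step: concatenate the output columns from every device.
    return [sum((part[r] for part in partial_outputs), [])
            for r in range(len(activations))]


activations = [[1, 2], [3, 4]]
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(activations, weight, num_devices=2))
print(matmul(activations, weight))  # same result as the unsplit multiply
```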

8. Pipeline Parallelism

Instead of splitting tensors

Split layers.


GPU1

Layers 1-20

↓

GPU2

21-40

↓

GPU3

41-60

9. Speculative Decoding

Use

Small model

↓

Guess tokens

↓

Large model verifies

↓

Accept if correct



If guesses are correct

Large speedup.
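The guess-and-verify loop can be sketched with two toy "models". Everything here is an assumption for illustration: both models are deterministic functions, decoding is greedy, and the draft model is rigged to be wrong on every fourth position. Real systems verify sampled tokens with a rejection-sampling rule.

```python
def target_next(context):
    """Stand-in for the large model (an arbitrary deterministic rule)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in for the small model: usually agrees, wrong when len % 4 == 0."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt, num_tokens):
    """Baseline: one large-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, num_tokens, lookahead=4):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < num_tokens:
        draft = []
        for _ in range(lookahead):
            draft.append(draft_next(tokens + draft))
        # One large-model pass checks all drafted positions (parallel on a GPU).
        target_passes += 1
        accepted = []
        for guess in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # the large model's token is always kept
            if guess != expected:
                break                  # first mismatch: drop the remaining guesses
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


prompt = [1, 2, 3]
output, passes = speculative_decode(prompt, num_tokens=12)
print(output == greedy_decode(prompt, 12), passes)  # same tokens, fewer large-model passes
```

The output is identical to plain greedy decoding because every kept token comes from the large model; the draft only decides how many tokens one verification pass can confirm.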

---

# Popular Inference Frameworks

| Framework | Company | Primary Focus |
|-----------|----------|---------------|
| vLLM | UC Berkeley / community | High-throughput LLM serving |
| SGLang | LMSYS | Programmable LLM inference and agent workflows |
| Text Generation Inference (TGI) | Hugging Face | Production inference |
| TensorRT-LLM | NVIDIA | GPU-optimized inference |
| :contentReference[oaicite:6]{index=6} | :contentReference[oaicite:7]{index=7} | Cross-platform inference |
| :contentReference[oaicite:8]{index=8} | Community | CPU and edge inference |
| :contentReference[oaicite:9]{index=9} | :contentReference[oaicite:10]{index=10} | Efficient deployment |
| :contentReference[oaicite:11]{index=11} | :contentReference[oaicite:12]{index=12} | CPU inference |

---

# vLLM

vLLM is currently one of the most popular inference engines for serving LLMs efficiently.

Architecture

HTTP Request

↓

vLLM API Server

↓

Scheduler

↓

Continuous Batch

↓

PagedAttention

↓

GPU

↓

Generated Tokens



## Key Features

- Continuous batching
- PagedAttention (efficient KV cache management)
- Tensor parallelism
- Streaming responses
- OpenAI-compatible APIs
- Multi-GPU serving
- High throughput

### Why is vLLM fast?

Instead of storing KV cache in one large contiguous block, it uses **PagedAttention**, which organizes cache into fixed-size memory pages (similar to virtual memory in operating systems). This reduces fragmentation, allows cache sharing, and enables continuous batching without frequent memory reallocations.
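The bookkeeping behind this can be sketched as a block table. The class below is a toy model of the idea, not vLLM's actual implementation or API: it only tracks which fixed-size physical blocks belong to which sequence.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("request-a")  # 5 tokens need 2 blocks
for _ in range(3):
    cache.append_token("request-b")  # 3 tokens need 1 block
print(cache.block_tables)

cache.free("request-a")              # blocks 0 and 1 return to the pool
for _ in range(9):
    cache.append_token("request-c")  # reuses freed blocks, which need not be contiguous
print(cache.block_tables["request-c"])
```

Because a sequence wastes at most one partially filled block, memory that a contiguous layout would reserve up front for the maximum length stays available to other requests.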

---

# SGLang

SGLang is both an inference engine and a programming framework for building complex LLM applications.

Architecture

Application

↓

SGLang Runtime

↓

Request Scheduler

↓

Structured Generation

↓

Native SGLang Engine (RadixAttention)

↓

GPU



Unlike a simple REST API server, SGLang lets developers define workflows involving:

- Multiple prompts
- Tool calling
- Agent loops
- Structured outputs
- Parallel execution
- Cached intermediate computations

### Key Features

- High-performance inference
- Structured generation
- Multi-turn conversations
- Agent workflows
- Tool execution
- Grammar-constrained decoding (e.g., valid JSON)
- Runs on its own native runtime, with RadixAttention for KV cache reuse across requests

---

# vLLM vs SGLang

| Feature | vLLM | SGLang |
|----------|------|---------|
| Primary Goal | Fast LLM serving | LLM application runtime |
| Continuous Batching | ✅ | ✅ |
| PagedAttention | ✅ | KV cache managed by RadixAttention (prefix reuse) |
| Structured Output | Limited | Excellent |
| Agent Workflows | External orchestration | Built in |
| Tool Calling | Basic | Advanced |
| Multi-step Reasoning | External | Native |
| OpenAI API Compatible | Yes | Yes |
| Backend | Native engine | Native engine (SGLang Runtime) |

---

# Example Production Architecture


            Users

              │

      API Gateway

              │

    Load Balancer

              │

 ------------------------

 vLLM Instance 1

 vLLM Instance 2

 vLLM Instance 3

 ------------------------

              │

        GPU Cluster

  (Tensor Parallel)

              │

         Llama 3 70B



---

# Skills Required for an Inference Engineer

An inference engineer should be familiar with:

| Area | Topics |
|------|--------|
| GPU Architecture | CUDA, SMs, HBM, memory hierarchy |
| Transformer Internals | Attention, KV Cache, decoding strategies |
| Memory Optimization | Quantization, PagedAttention, cache management |
| Parallelism | Tensor, pipeline, data, and expert parallelism |
| Serving Frameworks | vLLM, SGLang, TGI, TensorRT-LLM |
| Deployment | Kubernetes, Docker, autoscaling, load balancing |
| APIs | REST, gRPC, OpenAI-compatible APIs |
| Performance | Throughput, latency, tokens/sec, GPU utilization |
| Monitoring | Prometheus, Grafana, tracing, logging |

---

# Summary

Inference engineering bridges the gap between a trained LLM and a production-ready AI service. It combines systems engineering, GPU optimization, distributed computing, and model-serving techniques to maximize throughput, minimize latency, and reduce infrastructure cost. Frameworks such as **vLLM** focus on efficient, high-throughput serving using innovations like continuous batching and PagedAttention, while **SGLang** builds on efficient inference to provide a runtime for complex agentic applications, structured generation, and tool-using workflows. Together, these tools enable organizations to deploy LLMs that are responsive, scalable, and cost-effective.

Agents

A2A (Agent-to-Agent) is a communication protocol that allows AI agents built by different frameworks, vendors, or organizations to discover each other, exchange messages, delegate tasks, and collaborate in a standardized way.

Without A2A, each AI agent operates like an isolated application.

Why do we need A2A?

1. Agents are specialized

Instead of building one giant agent that knows everything, organizations build specialized agents.

Example:

HR Agent
Finance Agent
Travel Agent
Calendar Agent
Code Generation Agent

Suppose a user asks:

"Book a flight for my business trip and make sure it fits my team's calendar and budget."

No single agent may have all the required capabilities.

Using A2A:


User
   │
   ▼
Travel Agent
   │
   ├──► Calendar Agent → Check availability
   │
   ├──► Finance Agent → Verify budget
   │
   └──► Approval Agent → Manager approval

Each agent performs its specialty and returns the result.

2. Different teams build different agents

In large companies:

HR develops HR agents
Finance develops Finance agents
IT develops Infrastructure agents

Each team may use:

LangGraph
CrewAI
AutoGen
Semantic Kernel
OpenAI SDK

Without a common protocol, every integration requires custom APIs.

A2A provides a common language for communication.

3. Avoid custom integrations

Without A2A:


HR Agent
   │
Custom REST API
   │
Finance Agent

Travel Agent
   │
Different API
   │
Calendar Agent

Each pair of agents needs custom integration.

If there are N agents, integrations can grow roughly as N × (N - 1) / 2 in the worst case.

With A2A:


HR Agent
     │
Finance Agent
     │
Travel Agent
     │
Calendar Agent

Everyone speaks the same protocol.
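The difference in integration effort is easy to quantify. A small sketch (function names are illustrative; it assumes the worst case where every pair of agents must talk, versus each agent implementing the shared protocol once):

```python
def point_to_point_integrations(num_agents: int) -> int:
    """Worst case: one custom integration per pair of agents."""
    return num_agents * (num_agents - 1) // 2


def shared_protocol_integrations(num_agents: int) -> int:
    """Each agent implements the common protocol once."""
    return num_agents


for n in (5, 20, 100):
    print(n, point_to_point_integrations(n), shared_protocol_integrations(n))
```

With 20 agents that is 190 pairwise integrations against 20 protocol implementations, and the gap widens quadratically as agents are added.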

4. Agent discovery

Suppose an agent needs legal advice.

Without A2A:

Hardcode endpoint
Hardcode authentication
Hardcode API

With A2A:


Planning Agent

↓

Discover

↓

Legal Agent

↓

Capabilities

↓

Can review contracts
Can summarize regulations

The planning agent discovers and uses the legal agent dynamically.

5. Task delegation

A planning agent doesn't need to solve every problem.

Example:


User

↓

Planning Agent

↓

Generate report

↓

Data Agent

↓

Fetch sales

↓

Analytics Agent

↓

Generate insights

↓

Visualization Agent

↓

Create charts

Each task is delegated to the most suitable agent.

6. Multi-vendor interoperability

Imagine:

Company A builds a Finance Agent.
Company B builds a Procurement Agent.
Company C builds a Compliance Agent.

Without A2A, these agents need custom integration.

With A2A, they can collaborate using a shared protocol, regardless of who built them.

7. Supports distributed systems

Agents may run:

On-premises
AWS
Azure
Google Cloud
Edge devices

A2A enables communication across these environments without tightly coupling implementations.

8. Reusability

Instead of building the same capability repeatedly:


Expense Agent

used by

Finance Team

HR Team

Travel Team

Audit Team

A single specialized agent can serve multiple workflows.

Real-world example

Consider an online shopping scenario.


User

↓

Shopping Agent

↓

Inventory Agent
Check stock

↓

Pricing Agent
Apply discounts

↓

Payment Agent
Process payment

↓

Shipping Agent
Arrange delivery

↓

Notification Agent
Send confirmation

Each agent focuses on its own domain, and A2A coordinates their interaction.

Benefits of A2A

Benefit	Explanation
Interoperability	Agents from different frameworks and vendors can work together.
Reusability	Specialized agents can be reused across multiple applications.
Scalability	New agents can be added without redesigning existing integrations.
Dynamic discovery	Agents can discover available capabilities at runtime.
Delegation	Complex tasks are split among specialized agents.
Maintainability	Reduces the need for numerous custom point-to-point integrations.
Vendor independence	Avoids locking systems into a single AI framework or provider.

A2A vs MCP

These protocols solve different problems and are often used together.

A2A (Agent-to-Agent)	MCP (Model Context Protocol)
Connects agents	Connects an AI model or agent to tools and data sources
Agent ↔ Agent	Agent ↔ Tool
Used for collaboration and delegation	Used for accessing capabilities like databases, APIs, files, and SaaS services
Enables multi-agent workflows	Enables tool invocation and context retrieval
Example: Travel Agent asks Finance Agent to approve a budget	Example: Finance Agent queries a PostgreSQL database or invokes a payment API

A common architecture is:


                User
                  │
                  ▼
           Planning Agent
                  │
        ┌─────────┴─────────┐
        ▼                   ▼
   Travel Agent        Finance Agent
        │                   │
      (MCP)               (MCP)
        │                   │
   Flight API         ERP Database
   Hotel API          Budget Service

Here, A2A allows the planning, travel, and finance agents to communicate, while MCP lets each agent interact with the external tools and data sources it needs. Together, they enable modular, interoperable, and scalable AI systems.

Building an LLM can mean very different things depending on your goal. There are three common paths:

Goal	Time	GPUs Needed	Cost	Example
Train from scratch	Months	Hundreds to thousands	Millions of dollars	GPT, Llama, DeepSeek
Continue pre-training an existing model	Days to weeks	8–128 GPUs	Thousands to tens of thousands of dollars	Domain-specific Llama
Fine-tune an existing model	Hours to days	1–8 GPUs	Tens to hundreds of dollars	Chatbot, coding assistant

If your goal is to understand how companies like OpenAI, Meta, or DeepSeek build an LLM, the lifecycle looks like this.

Complete LLM Lifecycle


                 Data Collection
                        │
                        ▼
                Data Cleaning
                        │
                        ▼
               Tokenizer Training
                        │
                        ▼
            Dataset Tokenization
                        │
                        ▼
             Transformer Design
                        │
                        ▼
              Distributed Training
                        │
                        ▼
              Checkpoint Saving
                        │
                        ▼
              Model Evaluation
                        │
                        ▼
            Instruction Fine-tuning
                        │
                        ▼
                  Preference Tuning
          (RLHF / DPO / GRPO / RFT)
                        │
                        ▼
              Safety Alignment
                        │
                        ▼
             Quantization (Optional)
                        │
                        ▼
            Inference Optimization
          (vLLM / SGLang / TensorRT)
                        │
                        ▼
              Production Deployment

Phase 1: Collect Data

An LLM learns from enormous amounts of text.

Typical sources include:

Books
Wikipedia
GitHub repositories
Research papers
Stack Overflow
News
Government documents
Web pages
Question-answer datasets
Conversations

Example (illustrative sizes, not measurements of real corpora):


Wikipedia

↓

50 TB

Books

↓

20 TB

GitHub

↓

30 TB

Research Papers

↓

10 TB

Total

↓

110 TB Raw Data

The raw data is noisy and cannot be used directly.

Phase 2: Clean the Data

Remove:

HTML
advertisements
spam
duplicate documents
corrupted files
offensive content (depending on policy)
very short documents
low-quality translations

Example:


110 TB

↓

Cleaning

↓

35 TB High Quality Data

Phase 3: Train a Tokenizer

The model cannot process raw text directly.

Instead it converts text into tokens.

Example:


Artificial Intelligence

↓

Artificial

Intelligence


playing

↓

play

ing

Modern models typically use:

Byte Pair Encoding (BPE)
SentencePiece
WordPiece

Vocabulary size:


32,000

or

50,000

or

128,000 tokens
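To see how a split like "play" + "ing" can emerge, here is a heavily simplified BPE trainer. It is a sketch only: the corpus and its word counts are made up, ties are broken arbitrarily, and real tokenizers add byte-level handling, word-boundary markers, and much larger corpora.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    # Ties are broken by the pair itself so the result is deterministic.
    return max(pairs, key=lambda pair: (pairs[pair], pair))


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + count
    return merged


corpus = {"playing": 4, "played": 3, "player": 2, "singing": 2}  # hypothetical counts
words = {tuple(word): count for word, count in corpus.items()}
merges = []
for _ in range(5):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))  # "playing" is now represented as ('play', 'ing')
```

Each merge adds one entry to the vocabulary, so the vocabulary size is essentially the number of base symbols plus the number of merges performed.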

Phase 4: Tokenize Everything

Every document becomes integers.

Example (the exact IDs depend on the tokenizer's vocabulary)


Hello World

↓

[15496, 2787]

The neural network only sees numbers.

Phase 5: Build the Transformer

Typical architecture:


Input Tokens

↓

Embedding Layer

↓

Transformer Block

↓

Transformer Block

↓

Transformer Block

↓

...

↓

Final Linear Layer

↓

Vocabulary Probabilities

Each transformer block contains:


LayerNorm

↓

Multi-Head Attention

↓

Residual

↓

LayerNorm

↓

Feed Forward Network

↓

Residual

Phase 6: Configure Hyperparameters

Typical configuration:

Parameter	Example
Layers	32
Hidden Size	4096
Attention Heads	32
Context Length	8192
Vocabulary	128K
Parameters	7B

Larger models increase these values.
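The parameter count in the table can be roughly recovered from the other values. This is a back-of-the-envelope sketch that assumes standard multi-head attention, a feed-forward layer with 4x expansion, and a single embedding table; it ignores biases, normalization weights, and architecture-specific details such as gated feed-forward layers or grouped-query attention.

```python
def estimate_parameters(layers, hidden_size, vocab_size, ffn_multiplier=4):
    attention = 4 * hidden_size * hidden_size  # Q, K, V, and output projections
    feed_forward = 2 * ffn_multiplier * hidden_size * hidden_size  # up and down projections
    embeddings = vocab_size * hidden_size  # token embedding table
    return layers * (attention + feed_forward) + embeddings


total = estimate_parameters(layers=32, hidden_size=4096, vocab_size=128_000)
print(f"{total / 1e9:.2f}B parameters")  # close to the 7B listed in the table
```

Most of the parameters sit in the transformer blocks (about 12 x hidden_size squared per layer), which is why doubling the hidden size roughly quadruples the model.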

Phase 7: Pre-training

The objective is simple:

Predict the next token.

Example:


The capital of France is

↓

?

↓

Paris

Training samples:


The

↓

cat

↓

sat

↓

on

↓

the

↓

mat

For every token:


Input

↓

Forward Pass

↓

Prediction

↓

Loss

↓

Backpropagation

↓

Update Weights

This process repeats across trillions of training tokens.
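The loss in that loop is the cross-entropy between the predicted distribution and the actual next token. A minimal sketch with a five-word vocabulary and hand-written logits (no real model is involved, and the function names are illustrative):

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy where position t must predict token t + 1."""
    losses = []
    for logits, target in zip(logits_per_position, token_ids[1:]):
        losses.append(-math.log(softmax(logits)[target]))
    return sum(losses) / len(losses)


vocab = ["the", "cat", "sat", "on", "mat"]
token_ids = [0, 1, 2, 3, 0, 4]  # "the cat sat on the mat"

# An untrained model is roughly uniform: the loss is ln(vocabulary size).
uniform_logits = [[0.0] * len(vocab) for _ in token_ids[1:]]
print(next_token_loss(uniform_logits, token_ids))

# A model that puts high scores on the correct next tokens gets a much lower loss.
confident_logits = [[5.0 if i == target else 0.0 for i in range(len(vocab))]
                    for target in token_ids[1:]]
print(next_token_loss(confident_logits, token_ids))
```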

Phase 8: Distributed Training

One GPU is not enough.

Example:


1000 GPUs

↓

Each GPU processes part of the data or model

↓

Synchronize gradients

↓

Repeat

Common techniques:

Data Parallelism
Tensor Parallelism
Pipeline Parallelism
Expert Parallelism (Mixture of Experts)

Frameworks:

PyTorch Distributed
DeepSpeed
Megatron-LM
Fully Sharded Data Parallel (FSDP)
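Data parallelism, the simplest of these techniques, works because averaging per-worker gradients gives the same result as one large batch. A toy sketch with a one-parameter model y = weight * x (the names are illustrative, workers are simulated sequentially, and equal shard sizes are assumed):

```python
def gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(weight, batch, num_workers):
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    local = [gradient(weight, shard) for shard in shards]  # one per worker
    return sum(local) / num_workers  # the all-reduce step: average across workers


batch = [(float(x), 2.0 * x) for x in range(1, 9)]  # samples of y = 2x
print(gradient(1.5, batch))
print(data_parallel_gradient(1.5, batch, num_workers=4))  # matches the full-batch gradient
```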

Phase 9: Save Checkpoints

Every few thousand training steps:


Model

↓

Checkpoint

↓

Resume Later

A checkpoint contains:

model weights
optimizer state
learning rate scheduler
training step
random number generator state

Phase 10: Evaluate

Benchmark the model.

Common evaluations:

MMLU
GSM8K
HumanEval
TruthfulQA
HellaSwag
ARC
GPQA

Metrics:

Perplexity
Accuracy
Pass@k
F1 Score
Exact Match
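Two of these metrics are short formulas. The sketch below computes perplexity from per-token log-probabilities and Pass@k with the commonly used combinatorial estimator (given n generated samples of which c pass the tests); the function names are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pass_at_k(num_samples, num_correct, k):
    """Chance that at least one of k samples drawn from num_samples is correct."""
    if num_samples - num_correct < k:
        return 1.0  # every possible draw of k samples contains a correct one
    return 1.0 - math.comb(num_samples - num_correct, k) / math.comb(num_samples, k)


# A model that assigns probability 0.25 to every correct token has perplexity 4:
# it is as uncertain as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 4))

# 10 code samples were generated and 3 passed the unit tests.
print(pass_at_k(10, 3, k=1))
print(pass_at_k(10, 3, k=5))
```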

Phase 11: Instruction Fine-tuning (SFT)

Pre-trained models complete text but do not reliably follow instructions.

Train on instruction-response pairs.

Example:


Instruction

Explain recursion

↓

Response

Recursion is...

Datasets:

Alpaca
Dolly
OpenHermes
ShareGPT
Custom enterprise datasets

Methods:

Full fine-tuning
LoRA
QLoRA
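LoRA keeps the base weight matrix frozen and trains two small matrices whose product is added to it. A minimal sketch of that arithmetic with plain lists (names and the tiny shapes are illustrative; libraries such as PEFT handle this inside the model layers):

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(base, lora_b, lora_a, alpha, rank):
    """W_eff = W + (alpha / rank) * B @ A, with B: (out x rank) and A: (rank x in)."""
    update = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(base, update)]


def trainable_parameters(out_features, in_features, rank):
    """Only A and B are trained; the base matrix stays frozen."""
    return rank * (out_features + in_features)


base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]  # out x rank
lora_a = [[0.5, 0.0]]    # rank x in
print(lora_effective_weight(base, lora_b, lora_a, alpha=2, rank=1))

full = 4096 * 4096
lora = trainable_parameters(4096, 4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}%")  # a small fraction of the full matrix
```

QLoRA applies the same idea on top of a base model stored in 4-bit precision, which reduces memory further.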

Phase 12: Preference Tuning

Improve response quality based on human preferences.

Typical pipeline:


Prompt

↓

Two Responses

↓

Human chooses better one

↓

Preference Dataset

↓

Optimization

Methods:

RLHF
DPO
GRPO
Reinforcement Fine-Tuning (RFT)

These help produce responses that are more helpful, harmless, and aligned with user intent.
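Of these methods, DPO has the most compact objective: it compares how much more the policy prefers the chosen response than a frozen reference model does. A sketch of the per-pair loss, assuming the inputs are summed log-probabilities of each full response (the argument names and the value of beta are illustrative):

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(policy_c - ref_c) - (policy_r - ref_r)])."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Policy identical to the reference: no preference learned yet, loss is ln 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has raised the chosen response and lowered the rejected one: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```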

Phase 13: Safety Alignment

Reduce harmful or unsafe outputs.

Examples:

jailbreak resistance
toxicity reduction
hallucination mitigation
refusal behavior
bias evaluation

Phase 14: Quantization

Reduce model size.


FP32

↓

FP16

↓

INT8

↓

INT4

Benefits:

Lower GPU memory
Faster inference
Lower cost

Phase 15: Inference Optimization

Production serving includes:


Model

↓

vLLM

↓

Continuous Batching

↓

KV Cache

↓

Streaming

↓

REST API

Optimizations include:

PagedAttention
FlashAttention
Continuous batching
Speculative decoding
Tensor parallelism
KV cache management

Phase 16: Deploy

Typical architecture:


          Client

             │

             ▼

      API Gateway

             │

             ▼

      Load Balancer

             │

             ▼

   vLLM / SGLang Cluster

             │

             ▼

        GPU Servers

             │

             ▼

      Monitoring & Logs

Technologies Used

Stage	Common Tools
Data Collection	Common Crawl, Wikipedia dumps, GitHub archives
Data Processing	Apache Spark, Ray, Python
Tokenizer	SentencePiece, Hugging Face Tokenizers
Training	PyTorch
Distributed Training	DeepSpeed, Megatron-LM, FSDP
Experiment Tracking	Weights & Biases, MLflow
Storage	S3, HDFS
Fine-tuning	PEFT, LoRA, QLoRA
Evaluation	lm-evaluation-harness, custom benchmarks
Inference	vLLM, SGLang, TensorRT-LLM, TGI
Deployment	Docker, Kubernetes, NVIDIA GPUs

Skills Needed to Build an LLM

Mathematics
- Linear Algebra
- Calculus
- Probability
- Statistics
Machine Learning
- Gradient Descent
- Backpropagation
- Loss Functions
- Optimization
Deep Learning
- Neural Networks
- Attention Mechanism
- Transformers
- Positional Embeddings
Distributed Systems
- Multi-GPU training
- Parallelism strategies
- High-speed networking (e.g., NVLink, InfiniBand)
GPU Programming
- CUDA basics
- GPU memory hierarchy
- Kernel optimization
MLOps
- Model versioning
- Experiment tracking
- Deployment
- Monitoring

Learning Roadmap

If your goal is to build an LLM yourself rather than just use one, a practical progression is:

Build a character-level language model from scratch.
Implement a Transformer in PyTorch.
Train a small GPT (50M–150M parameters) on a public dataset.
Learn distributed training with multiple GPUs.
Fine-tune an open-weight model such as Llama or Qwen using LoRA/QLoRA.
Serve it efficiently with vLLM or SGLang.
Build a complete chat application with retrieval, tool calling, monitoring, and scalable deployment.

This path teaches the same core concepts used to build production LLM systems, while remaining feasible on accessible hardware before scaling up to larger models.

High-Level AI Engineering Roles

Phase	Responsible Area	Typical Engineer
Data Collection	Data pipelines	Data Engineer
Data Cleaning	Data processing	Data Engineer
Tokenization	Tokenizer development	ML Engineer / Research Engineer
Model Architecture	Transformer design	AI Research Scientist
Pre-training	Training algorithms	Research Scientist
Distributed Training	Multi-GPU optimization	Training Systems Engineer
Gradient Accumulation	Memory optimization	Training Systems Engineer
Mixed Precision	Training optimization	Training Systems Engineer
Checkpointing	Fault tolerance	Training Systems Engineer
Fine-tuning	Model adaptation	ML Engineer
RLHF/DPO	Alignment	Alignment Engineer
Quantization	Compression	ML Systems Engineer
Inference	Serving optimization	Inference Engineer
Deployment	Production infrastructure	MLOps Engineer

Training Engineering

Inference Engineering focuses on making models serve requests efficiently, while Training Engineering focuses on making models train efficiently.

Typical responsibilities include:

Memory Optimization

Gradient Accumulation
Gradient Checkpointing (also called Activation Checkpointing)
CPU Offloading
ZeRO Optimization
Optimizer State Sharding
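Gradient accumulation, the first item above, trades time for memory: several small micro-batches are processed one after another and their gradients are summed before a single weight update. A toy sketch with a one-parameter model (names are illustrative, and equal micro-batch sizes are assumed):

```python
def gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(weight, batch, micro_batch_size):
    steps = len(batch) // micro_batch_size
    total = 0.0
    for i in range(steps):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += gradient(weight, micro) / steps  # scale so the sum is a mean
    return total  # apply the optimizer step once, after all micro-batches


batch = [(float(x), 3.0 * x) for x in range(1, 7)]  # samples of y = 3x
print(gradient(0.5, batch))                               # one batch of 6
print(accumulated_gradient(0.5, batch, micro_batch_size=2))  # three micro-batches of 2
```

Only one micro-batch of activations has to fit in memory at a time, while the update is equivalent to the large batch.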

Precision Optimization

FP32
BF16
FP16
Mixed Precision Training
Dynamic Loss Scaling

Parallel Training

Data Parallelism
Tensor Parallelism
Pipeline Parallelism
Sequence Parallelism
Expert Parallelism (MoE)

Distributed Communication

NCCL
AllReduce
ReduceScatter
AllGather
Broadcast

Optimizer Engineering

AdamW
Fused Adam
8-bit Adam
Lion
LAMB

Memory Management

Managing:


Weights

↓

Gradients

↓

Optimizer States

↓

Activations

↓

KV Cache (only when training involves generation, e.g., RLHF rollouts)

Complete AI Systems Engineering Pipeline


                Data Engineering
                       │
                       ▼
            Training Systems Engineering
        (Gradient Accumulation, ZeRO, FSDP)
                       │
                       ▼
             AI Research Engineering
         (Transformer, Attention, Losses)
                       │
                       ▼
              Alignment Engineering
        (RLHF, DPO, Safety, Preference)
                       │
                       ▼
            Model Compression Engineering
     (Quantization, Pruning, Distillation)
                       │
                       ▼
             Inference Engineering
   (vLLM, SGLang, FlashAttention, KV Cache)
                       │
                       ▼
               MLOps / Platform Engineering
      (Deployment, Autoscaling, Monitoring)

Common Optimizations by Domain

Training Systems Engineering	Inference Engineering
Gradient Accumulation	Continuous Batching
Gradient Checkpointing	KV Cache
Mixed Precision Training	Quantization
ZeRO Optimizer	PagedAttention
FSDP	FlashAttention
DeepSpeed	vLLM
Megatron-LM	SGLang
Activation Checkpointing	Speculative Decoding
Distributed Optimizers	TensorRT-LLM
Optimizer Sharding	Streaming Tokens

Popular Frameworks Used by Training Systems Engineers

Category	Common Frameworks
Distributed Training	PyTorch Distributed (DDP), FSDP, DeepSpeed, Megatron-LM
Memory Optimization	DeepSpeed ZeRO, Activation Checkpointing, Gradient Checkpointing
Precision	Automatic Mixed Precision (AMP), BF16, FP16
Communication	NCCL, Gloo
Experiment Tracking	Weights & Biases, MLflow
Cluster Scheduling	Kubernetes, Slurm, Ray

Thursday, June 18, 2026

AWS Generative AI Developer Prep

Table of Contents

1. Fundamentals of AI and ML

Artificial Intelligence (AI)

Machine Learning (ML)

Deep Learning

Generative AI

Training vs Inference

2. Generative AI Concepts

Large Language Model (LLM)

Tokens

Hallucination

Context Window

3. Foundation Models

Foundation Model (FM)

Multi Modality

Fine Tuning

Retrieval Augmented Generation (RAG)

RAG Pipeline: Model Types Used at Each Step

Transformer Architecture Used at Each Step

Common Models by Transformer Family

Typical Modern RAG Stack

Mental Model

User Memory Pipeline (Mem0)

Query-Time Retrieval Pipeline

Mem0

4. Prompt Engineering

Zero-Shot Prompting

One-Shot Prompting

Few-Shot Prompting

Chain of Thought

Prompt Components

Large Language Model (LLM) Architecture

Components of an LLM

Step 1: Tokenization

Step 2: Embedding Layer

Step 3: Positional Encoding

Step 4: Transformer Decoder Blocks

Step 5: Multi-Head Self-Attention

Query, Key and Value

Step 6: Multi-Head Attention

Step 7: Feed Forward Network (FFN)

Step 8: Residual Connections

Step 9: Layer Normalization

Step 10: Output Projection

Step 11: Softmax

Step 12: Token Sampling

Why is it called a Decoder-only Transformer?

End-to-End Example

Architecture Summary

1. Pre-training

2. Mid-training (Continued Pre-training)

3. Supervised Fine-Tuning (SFT)

4. Preference Alignment

RLHF

RLAIF

DPO

5. Task Fine-Tuning

6. Distillation

7. Continuous Learning

Complete Training Pipeline

Comparison

When to use each stage

5. Responsible AI

Fairness

Explainability

Privacy

Robustness

Transparency

6. AWS AI Services

7. AWS Generative AI Services

8. Security and Compliance

Shared Responsibility Model

IAM

Encryption

9. Common Use Cases

Classification

Sentiment Analysis

Summarization