Thursday, June 18, 2026

Build Lakehouse using Iceberg

 Flow Diagram of Data Lakehouse



A data lake excels at machine learning workloads and a data warehouse at business intelligence; a data lakehouse supports both.

Data Lakehouse supports :

  • ACID transactions
  • Schema enforcement and evolution

Flow Diagram of Data Warehouse




Flow Diagram of Data Lake 



Medallion Architecture

  • Bronze - Raw data, stored as received so it can be replayed
  • Silver - Cleaned and processed data
  • Gold - Curated features ready for consumption
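
A minimal sketch of how data moves through the three layers. It assumes hypothetical sensor records held in plain Python lists; in a real lakehouse each layer would be an Iceberg table, and the names `to_silver` and `to_gold` are illustrative only.

```python
from statistics import mean

# Hypothetical raw events, kept exactly as received (Bronze is the replay source).
bronze = [
    {"sensor": "a", "temp": "21.5"},
    {"sensor": "a", "temp": "bad"},   # malformed reading
    {"sensor": "b", "temp": "19.0"},
    {"sensor": "a", "temp": "22.5"},
]

def to_silver(rows):
    """Clean and type the raw rows; skip records that fail validation."""
    silver = []
    for row in rows:
        try:
            silver.append({"sensor": row["sensor"], "temp": float(row["temp"])})
        except (KeyError, ValueError):
            continue  # the bad record still exists in Bronze for a later replay
    return silver

def to_gold(rows):
    """Aggregate cleaned rows into per-sensor features."""
    by_sensor = {}
    for row in rows:
        by_sensor.setdefault(row["sensor"], []).append(row["temp"])
    return {sensor: {"avg_temp": mean(temps), "count": len(temps)}
            for sensor, temps in by_sensor.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is never modified, a fix to the cleaning logic can be applied by re-running `to_silver` over the same raw data.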


Kafka for Fan Out Design Pattern 



One stream is fanned out into multiple streams, so more than one consumer can read the same data independently.
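
In Kafka, fan-out typically comes from consumer groups: each group tracks its own offset into the same topic, so every group sees every message. The sketch below is a toy in-memory stand-in for that idea, not the Kafka client API; `Topic`, `publish`, and `poll` are hypothetical names.

```python
class Topic:
    """Toy append-only log; each consumer group tracks its own offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # group name -> index of the next unread message

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return messages this group has not seen yet and advance its offset."""
        start = self.offsets.get(group, 0)
        self.offsets[group] = len(self.log)
        return self.log[start:]

topic = Topic()
topic.publish("order-1")
topic.publish("order-2")

billing = topic.poll("billing")      # both groups receive the full stream
analytics = topic.poll("analytics")

topic.publish("order-3")
billing_next = topic.poll("billing")  # only the new message for this group
```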


AWS Generative AI Developer Prep


Table of Contents

  1. Fundamentals of AI and ML
  2. Generative AI Concepts
  3. Foundation Models
  4. Prompt Engineering
  5. Responsible AI
  6. AWS AI Services
  7. AWS Generative AI Services
  8. Security and Compliance
  9. AI Use Cases
  10. Exam Tips

1. Fundamentals of AI and ML

Artificial Intelligence (AI)

Machines performing tasks that normally require human intelligence.

Machine Learning (ML)

Subset of AI where systems learn patterns from data.

Deep Learning

Uses neural networks with multiple layers.

Generative AI

Creates new content such as:

  • Text
  • Images
  • Audio
  • Video
  • Code

Training vs Inference

Term | Meaning
Training | Model learns from data
Inference | Model makes predictions using learned knowledge

2. Generative AI Concepts

Large Language Model (LLM)

Examples:

  • OpenAI GPT models
  • Anthropic Claude
  • Meta Llama

Tokens

Text is broken into small units called tokens.

Example:

I love AWS

May become:

[I] [love] [AWS]

Hallucination

Model generates incorrect information while sounding confident.

Context Window

Amount of information an LLM can consider at once.

3. Foundation Models

Foundation Model (FM)

Large pretrained model that can be adapted for many tasks.

Examples:

  • Text generation
  • Summarization
  • Classification
  • Translation
  • Chatbots

Multi Modality

 
Modalities | Example Models
Text → Text | GPT-4o, Claude Sonnet, Llama 3, Mistral
Image → Text | LLaVA, Qwen2.5-VL, BLIP-2, InstructBLIP
Image + Text → Text | GPT-4o, Gemini 2.5, Claude Sonnet, Kosmos-2
Image ↔ Text (Similarity/Retrieval) | CLIP, SigLIP, ALIGN, Florence
Text → Image | Stable Diffusion, DALL-E 3, Imagen, FLUX.1
Image → Image | Stable Diffusion XL, ControlNet, InstructPix2Pix
Text → Audio (Speech) | Tacotron 2, VALL-E, Bark
Audio → Text (ASR) | Whisper, wav2vec 2.0, Conformer
Text + Audio → Text | GPT-4o, Gemini 2.5
Audio ↔ Text | CLAP, AudioCLIP
Text → Video | Sora, Veo, Gen-3, Pika
Image → Video | Runway Gen-3, Pika, Luma Dream Machine
Video → Text | Video-LLaVA, VideoChatGPT, Gemini 2.5
Video + Text → Text | GPT-4o, Gemini 2.5, Qwen2.5-VL
Text + Image + Audio → Text | GPT-4o, Gemini 2.5
Text + Image + Audio + Video → Text | GPT-4o, Gemini 2.5
Text + Image + Audio + Video → Text + Audio | GPT-4o Realtime, Gemini Live


Fine Tuning

Retraining model with domain-specific data.

Retrieval Augmented Generation (RAG)

Instead of retraining:

  1. Retrieve documents
  2. Send documents to LLM
  3. Generate response

Benefits:

  • Lower cost
  • More current data
  • Reduced hallucinations
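
The retrieve-then-generate steps above can be sketched in a few lines. This is a toy under stated assumptions: word counts stand in for a real embedding model, the documents are made up for illustration, and the final LLM call is left out, so only retrieval and prompt construction are shown.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a real system would use an encoder model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Step 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Step 2: place the retrieved documents in the prompt sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

documents = [
    "Amazon Bedrock provides access to foundation models",
    "Amazon Polly converts text to speech",
    "Amazon Transcribe converts speech to text",
]
query = "Which service provides foundation models"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)  # Step 3 would send this prompt to an LLM
```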

RAG Pipeline: Model Types Used at Each Step

RAG Step | Purpose | Model Type | Example Models
1. Document Ingestion | Read PDFs, DOCX, HTML, Images | OCR / Document AI | Tesseract, LayoutLM, Donut
2. Chunking | Split documents into passages | Rule-based / NLP | Sentence Splitter, Recursive Text Splitter
3. Text → Embeddings | Convert chunks into vectors | Embedding Model (Encoder-only Transformer) | BERT, Sentence-BERT, E5, BGE
4. Vector Storage | Store embeddings | Vector Database | FAISS, Milvus, Weaviate, Pinecone
5. Query → Embedding | Convert user query to vector | Same Embedding Model | BGE, E5, SBERT
6. Retrieval | Find nearest chunks | ANN Search Algorithm | HNSW, IVF, Flat Search
7. Re-ranking (Optional) | Improve retrieved results | Cross Encoder | MonoBERT, Cohere Rerank, BGE Reranker
8. Context Construction | Build prompt with retrieved chunks | Prompt Builder | Template Engine
9. Answer Generation | Generate final answer | Decoder-only LLM | GPT-4o, Claude Sonnet, Llama 3, Mistral
10. Citation Generation (Optional) | Show sources | LLM / Metadata Layer | GPT-4o, Claude

Transformer Architecture Used at Each Step

Step | Transformer Type
Embedding Generation | Encoder-only
Re-ranking | Encoder-only (Cross Encoder)
Answer Generation | Decoder-only
Translation (optional) | Encoder-Decoder
Summarization (optional) | Encoder-Decoder
OCR Understanding | Encoder or Encoder-Decoder
Multimodal RAG | Vision Encoder + LLM Decoder

Common Models by Transformer Family

Transformer Family | Example Models | Used For
Encoder-only | BERT, RoBERTa, SBERT, E5, BGE | Embeddings, Retrieval
Decoder-only | GPT, Llama, Claude, Mistral, Qwen | Generation
Encoder-Decoder | T5, FLAN-T5, BART | Summarization, Translation
Vision Encoder | ViT, CLIP Vision Encoder | Image Embeddings
Vision-Language | LLaVA, Qwen-VL, GPT-4o | Multimodal RAG

Typical Modern RAG Stack

Layer | Common Choice
Chunking | LangChain Recursive Splitter
Embeddings | BGE-large, E5-large
Vector DB | FAISS, Milvus
Retrieval | HNSW
Re-ranker | BGE-Reranker
Generator | GPT-4o, Claude, Llama 3
Orchestration | LangChain, LlamaIndex

Mental Model

Documents

Chunking

Encoder Model
(BERT / E5 / BGE)

Embeddings

Vector DB

End

User Query

Encoder Model
(BERT / E5 / BGE)

Similarity Search

Top K Chunks

Cross Encoder Re-ranker
(Optional)

Prompt Construction

Decoder LLM
(GPT / Claude / Llama)

Final Answer

User Memory Pipeline (Mem0)

User Conversations

Mem0
(Extracts Important Facts)

Memory Store
(Vector DB / PostgreSQL / Redis)

End

Query-Time Retrieval Pipeline

User Query

Encoder Model
(BERT / E5 / BGE)
        ├─────────────────────────────┐
        │                             │
        ▼                             ▼
Similarity Search             Mem0 Memory Retrieval
(Vector DB)                   (Relevant User Memories)
        │                             │
        ▼                             ▼
Top K Chunks                  Relevant Memories
        │                             │
        └──────────────┬──────────────┘

Cross Encoder Re-ranker
(Optional, for documents)

Prompt Construction
(Documents + User Memory + Query)

Decoder LLM
(GPT / Claude / Llama)

Final Answer

Mem0 Memory Update
(Store new preferences/facts if needed)


Mem0  

  • Before generation: Retrieves relevant user preferences, past interactions, and long-term memory.
  • After generation: Extracts new important facts from the conversation and stores them for future use.
  • Difference from RAG: RAG retrieves knowledge from documents, while Mem0 retrieves knowledge about the user or previous interactions. Both are complementary and are typically combined before prompt construction.
  • Hybrid Search improves retrieval recall (finding the right documents).
  • HyDE improves query understanding (making difficult or ambiguous queries easier to retrieve).
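
The before-generation and after-generation roles of a memory layer can be sketched as follows. `MemoryStore` is a hypothetical in-memory stand-in written for illustration; it is not the Mem0 API, and its keyword matching replaces the vector search a real memory layer would use.

```python
class MemoryStore:
    """Hypothetical stand-in for a user memory layer (not the Mem0 API)."""

    def __init__(self):
        self.facts = {}  # user_id -> list of remembered facts

    def add(self, user_id, fact):
        """After generation: store a new fact unless it is already known."""
        facts = self.facts.setdefault(user_id, [])
        if fact not in facts:
            facts.append(fact)

    def search(self, user_id, query):
        """Before generation: return facts sharing at least one word with the query."""
        words = set(query.lower().split())
        return [f for f in self.facts.get(user_id, [])
                if words & set(f.lower().split())]

def build_prompt(query, doc_chunks, memories):
    """Combine document chunks (RAG) and user memories into one prompt."""
    return (
        "Context:\n" + "\n".join(doc_chunks) + "\n\n"
        "User memory:\n" + "\n".join(memories) + "\n\n"
        "Question: " + query
    )

store = MemoryStore()
store.add("u1", "prefers Python examples")            # learned in an earlier chat

query = "show me a Python snippet"
memories = store.search("u1", query)                  # before generation
prompt = build_prompt(query, ["Chunk about list comprehensions"], memories)
store.add("u1", "asked about list comprehensions")    # after generation
```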








    Task | Model Type
    Create Embeddings | Encoder-only
    Retrieve Documents | Vector Search
    Re-rank Results | Cross Encoder
    Generate Answer | Decoder-only LLM
    Summarize Documents | Encoder-Decoder
    Multimodal Retrieval | CLIP / Vision Encoder
    Multimodal Generation | GPT-4o / Gemini / Qwen-VL

    4. Prompt Engineering

    Zero-Shot Prompting

    Translate this sentence to French.

    One-Shot Prompting

    Provide one example.

    Few-Shot Prompting

    Provide multiple examples.

    Chain of Thought

    Ask model to reason step-by-step.

    Prompt Components

    • Role
    • Context
    • Instructions
    • Examples
    • Constraints
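
A small sketch of how these components might be assembled into one prompt string. The function name `build_prompt`, the section labels, and the translation examples are assumptions made for illustration; passing an empty examples list gives a zero-shot prompt and passing several gives a few-shot prompt.

```python
def build_prompt(role, context, instructions, examples, constraints, user_input):
    """Assemble the five prompt components plus the user's input."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    parts = [
        f"Role: {role}",
        f"Context: {context}",
        f"Instructions: {instructions}",
        f"Examples:\n{shots}" if examples else "",   # omitted for zero-shot
        f"Constraints: {constraints}",
        f"Input: {user_input}\nOutput:",
    ]
    return "\n\n".join(p for p in parts if p)

examples = [("Good morning", "Bonjour"), ("Thank you", "Merci")]

few_shot = build_prompt(
    "You are a translator.", "Casual conversation.",
    "Translate the input to French.", examples,
    "Reply with the translation only.", "See you tomorrow",
)
zero_shot = build_prompt(
    "You are a translator.", "Casual conversation.",
    "Translate the input to French.", [],
    "Reply with the translation only.", "See you tomorrow",
)
```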

    LLM Architecture

    Large Language Model (LLM) Architecture

    At a high level, an LLM is a Transformer-based neural network that converts input text into tokens, processes them through multiple Transformer blocks, and predicts the next token repeatedly to generate text.

                     Input Text


    Text Tokenization


    Token IDs (Integers)


    Token Embedding Layer


    Positional Encoding/Embedding


    N Transformer Decoder Blocks


    Final Hidden Representation


    Linear (Output Projection)


    Softmax Layer


    Probability for Every Token


    Select Next Token (Sampling)


    Append to Input & Repeat

    Components of an LLM

    Component | Purpose
    Tokenizer | Converts text into tokens
    Embedding Layer | Converts token IDs into dense vectors
    Positional Encoding | Gives the model information about token order
    Transformer Decoder Blocks | Learns relationships between tokens
    Multi-Head Self-Attention | Determines which words are important
    Feed Forward Network (FFN) | Learns complex patterns
    Residual Connections | Prevent information loss
    Layer Normalization | Stabilizes training
    Output Linear Layer | Maps hidden vectors to vocabulary logits
    Softmax | Converts logits into probabilities
    Sampling Strategy | Chooses the next token

    Step 1: Tokenization

    The tokenizer splits text into tokens.

    Example:

    Input

    I love machine learning

    Tokenizer

    ["I", "love", "machine", "learning"]

    Convert to IDs

    [52, 908, 3210, 4567]

    The neural network only understands numbers.
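
A toy version of this step, assuming a whitespace tokenizer and reusing the illustrative IDs from the example above. Real LLM tokenizers learn subword vocabularies (for example with BPE) rather than splitting on spaces, and `UNK_ID` is a hypothetical placeholder for unknown words.

```python
# Toy vocabulary with the illustrative IDs used in the example above.
vocab = {"I": 52, "love": 908, "machine": 3210, "learning": 4567}
UNK_ID = 0  # hypothetical ID for words missing from the vocabulary

def tokenize(text):
    """Split text into tokens (whitespace only, for illustration)."""
    return text.split()

def encode(text):
    """Map each token to its integer ID."""
    return [vocab.get(token, UNK_ID) for token in tokenize(text)]

def decode(ids):
    """Map IDs back to text."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token.get(i, "<unk>") for i in ids)

ids = encode("I love machine learning")
```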


    Step 2: Embedding Layer

    Each token ID is mapped to a dense vector.

    Example

    Vocabulary

    I → 52
    love → 908
    machine → 3210

    Embedding

    52



    [0.21, -0.84, 1.02, ..., 0.67]

    Instead of one integer,

    every token becomes a high-dimensional vector.

    Typical embedding size

    Model | Embedding Dimension
    GPT-2 Small | 768
    Llama 7B | 4096
    GPT-3 | 12288
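
An embedding layer is essentially a table lookup: one row of floats per token ID. The sketch below uses deliberately tiny, assumed sizes (`VOCAB_SIZE`, `EMBED_DIM`) and random values; in a trained model the rows are learned parameters and the dimensions match the table above.

```python
import random

random.seed(0)
VOCAB_SIZE = 5000   # illustrative; real vocabularies are much larger
EMBED_DIM = 8       # illustrative; real models use hundreds or thousands

# One row of EMBED_DIM floats per token ID (random here, learned in practice).
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the dense vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([52, 908, 3210, 4567])  # 4 tokens -> 4 vectors of size EMBED_DIM
```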

    Step 3: Positional Encoding

    Attention alone has no concept of sequence order.

    Without positional information,

    these two sentences appear identical.

    Dog bites man

    Man bites dog

    Positional embeddings add information like:

    Dog     Position 0

    bites Position 1

    man Position 2

    Final embedding

    Token Embedding

    +

    Position Embedding

    Step 4: Transformer Decoder Blocks

    This is where most computation happens.

    Modern LLMs stack many identical decoder blocks.

    Examples:

    Model | Decoder Blocks
    GPT-2 Small | 12
    GPT-3 175B | 96
    Llama 3 8B | 32
    DeepSeek-R1 671B | 61 (most with Mixture-of-Experts feed-forward layers)

    Each block contains:

    Input


    LayerNorm


    Masked Multi-Head Attention


    Residual Connection


    LayerNorm


    Feed Forward Network


    Residual Connection


    Output

    Step 5: Multi-Head Self-Attention

    This is the core innovation of Transformers.

    Every word looks at every previous word.

    Sentence

    The animal didn't cross the road because it was tired.

    When processing "it"

    attention may focus on

    animal

    road

    tired

    The model learns that "it" refers to animal.


    Query, Key and Value

    Each token produces three vectors.

    Embedding



    Query (Q)

    Key (K)

    Value (V)

    Meaning

    Vector | Purpose
    Query | What am I looking for?
    Key | What information do I contain?
    Value | Information to pass forward

    Attention score

    Q × Kᵀ



    Similarity Score



    Softmax



    Attention Weights



    Weighted Sum of Values

    Formula

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
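
The formula can be traced with a small single-head implementation. This sketch uses plain Python lists and made-up 2-dimensional vectors; it also applies the causal mask used by decoder-only models, so token i only attends to tokens 0..i.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for one head, with a causal mask.

    Q, K, V are lists of vectors, one per token.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Similarity of token i with itself and earlier tokens only.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * V[j][c] for j, w in enumerate(weights))
               for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
```

The first token can only attend to itself, so its output equals its own value vector.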

    Step 6: Multi-Head Attention

    Instead of one attention calculation,

    multiple attention heads run in parallel.

    Example

    Head 1

    Grammar

    ----------------

    Head 2

    Subject

    ----------------

    Head 3

    Object

    ----------------

    Head 4

    Long-range dependency

    Outputs are concatenated.


    Step 7: Feed Forward Network (FFN)

    After attention,

    every token passes through the same small neural network independently.

    Typical structure

    Linear



    Activation (GELU, SwiGLU, etc.)



    Linear

    Purpose

    • Learn nonlinear patterns
    • Transform features
    • Increase model capacity

    Step 8: Residual Connections

    Instead of replacing the input,

    the block adds the original input back.

    Output = Input + Attention Output

    Benefits

    • Prevents vanishing gradients
    • Preserves information
    • Enables very deep networks

    Step 9: Layer Normalization

    Keeps activations stable during training.

    Without it,

    training becomes unstable as models get deeper.


    Step 10: Output Projection

    Final hidden vector

    Linear layer

    Vocabulary size

    Example vocabulary

    50,000 tokens

    Output

    Dog     3.2

    Cat 8.9

    Car 1.4

    Paris 10.1

    These are logits (unnormalized scores).


    Step 11: Softmax

    Softmax converts logits into probabilities.

    Example

    Paris   0.72

    London 0.12

    Berlin 0.08

    Rome 0.08

    Probabilities sum to 1.
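
A minimal softmax sketch applied to the logits from the Step 10 example. Subtracting the maximum logit before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Logits from the Step 10 example.
logits = {"Dog": 3.2, "Cat": 8.9, "Car": 1.4, "Paris": 10.1}
probs = softmax(logits)
best = max(probs, key=probs.get)  # token with the highest probability
```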


    Step 12: Token Sampling

    The next token is selected.

    Methods include:

    Method | Description
    Greedy | Choose the highest probability token
    Beam Search | Explore multiple candidate sequences simultaneously
    Top-k Sampling | Sample only from the top k most probable tokens
    Top-p (Nucleus) Sampling | Sample from the smallest set of tokens whose cumulative probability exceeds p
    Temperature Sampling | Adjust randomness by scaling logits before Softmax

    Generated token

    Append to prompt

    Run the decoder again

    Predict the next token

    This repeats until an end-of-sequence token or another stopping condition is reached.
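
Several of the sampling methods above can be combined in one function. This is a sketch under assumptions: the function name and parameters are illustrative, beam search is omitted, and a temperature of 0 is treated as greedy decoding.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token from a {token: logit} dict."""
    if temperature == 0:                       # treat 0 as greedy decoding
        return max(logits, key=logits.get)
    # Temperature: scale logits before softmax.
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    if top_k is not None:                      # keep the k most probable tokens
        ranked = ranked[:top_k]
    if top_p is not None:                      # keep the smallest set reaching p
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices treats the weights as relative, so no renormalizing is needed.
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 10.1, "Cat": 8.9, "Dog": 3.2, "Car": 1.4}
greedy = sample_next_token(logits, temperature=0)
top1 = sample_next_token(logits, top_k=1)
nucleus = sample_next_token(logits, top_p=0.5, rng=random.Random(0))
```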


    Why is it called a Decoder-only Transformer?

    The original Transformer introduced in the paper "Attention Is All You Need" had two parts:

    Encoder



    Decoder

    LLMs such as GPT, Llama, Qwen, and DeepSeek use only the decoder stack with masked (causal) self-attention, so each token can attend only to itself and previous tokens. This enables autoregressive next-token prediction.


    End-to-End Example

    Input:

    The capital of France is

    Processing:

    Text


    Tokenizer


    Token IDs


    Embeddings + Position Embeddings


    32–100+ Transformer Decoder Blocks


    Hidden Representation


    Linear Projection


    Softmax


    Predicted Token: "Paris"


    Append "Paris" to the input


    Predict the next token

    Architecture Summary

    Component | Function
    Tokenizer | Converts text to token IDs
    Embedding Layer | Converts token IDs to dense vectors
    Positional Embedding | Encodes token order
    Masked Multi-Head Self-Attention | Captures relationships with previous tokens
    Feed Forward Network | Learns nonlinear transformations
    Residual Connections | Preserve information and improve gradient flow
    Layer Normalization | Stabilizes training
    Linear Output Layer | Projects hidden states to vocabulary logits
    Softmax | Produces probabilities over the vocabulary
    Sampling | Selects the next token for generation

    Modern LLMs extend this core architecture with optimizations such as Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) or Multi-Query Attention (MQA), Mixture of Experts (MoE), FlashAttention, KV caching, and quantization, but the fundamental decoder-only Transformer pipeline remains the same.


    Training

    LLMs are trained in multiple stages, each with a different objective. Not every model goes through every stage, but modern models such as GPT, Llama, DeepSeek, and Qwen typically follow a pipeline similar to the one below.

    Stage | Purpose | Input Data | Output
    1. Pre-training | Learn language and world knowledge | Massive unlabeled text | Base model
    2. Mid-training (Continued Pre-training) | Specialize in a domain or language | Domain-specific unlabeled data | Domain-adapted base model
    3. Supervised Fine-Tuning (SFT) | Learn to follow instructions | Prompt-response pairs | Instruction-following model
    4. Preference Alignment | Align responses with human preferences | Ranked responses or AI feedback | Aligned assistant
    5. Task Fine-Tuning | Improve performance on a specific task | Task-specific labeled data | Specialized model
    6. Distillation | Transfer knowledge from a large model | Teacher model outputs | Smaller model
    7. Continuous Learning | Periodically update knowledge | New datasets | Updated model

    1. Pre-training

    This is where the model learns:

    • Grammar
    • Facts
    • Reasoning patterns
    • Coding syntax
    • Mathematics
    • General world knowledge

    Dataset examples:

    • Books
    • Wikipedia
    • GitHub
    • Research papers
    • Web pages

    Training objective:

    Predict the next token.

    Example:

    The capital of France is _____

    Target:

    Paris

    Output:

    Base Model

    Example models:

    • Llama Base
    • Qwen Base
    • DeepSeek Base

    2. Mid-training (Continued Pre-training)

    Sometimes called:

    • Continued Pre-training
    • Domain Adaptive Pre-training (DAPT)
    • Domain Adaptation

    Purpose:

    Teach the model a particular domain without changing its fundamental training objective.

    Example:

    General model

    Train on

    100 million medical papers

    Medical model

    The objective is still:

    Predict next token

    Examples:

    General model →

    Legal documents

    Legal LLM

    General model →

    Financial reports

    Finance LLM

    General model →

    AWS documentation

    Cloud Assistant

    No instruction data is required.


    3. Supervised Fine-Tuning (SFT)

    Now teach the model how humans want it to answer.

    Dataset:

    Question



    Ideal Answer

    Example:

    Prompt

    Explain recursion.

    Target:

    Recursion is a programming technique...

    Loss:

    Compare generated answer with target answer.

    Output:

    Instruction-following model.

    Examples:

    • ChatGPT-style assistants
    • DeepSeek Chat
    • Llama Instruct

    4. Preference Alignment

    SFT teaches correctness.

    Alignment teaches helpfulness.

    Example

    Question:

    Explain Java.

    Response A

    Very detailed

    Response B

    Clear and concise

    Humans prefer B.

    The model learns:

    B > A

    Methods include:

    RLHF

    Reinforcement Learning from Human Feedback

    Humans rank responses.

    Reward model learns preferences.

    Policy optimized using reinforcement learning (historically often PPO).


    RLAIF

    Reinforcement Learning from AI Feedback

    Instead of humans,

    another AI ranks responses.

    Cheaper than RLHF.


    DPO

    Direct Preference Optimization

    Modern alternative.

    No reinforcement learning.

    Directly optimizes preferred responses.

    Much simpler training.
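
For a single preference pair, the DPO objective can be written as a short function. The inputs are sequence log-probabilities of the preferred and rejected responses under the model being trained and under a frozen reference model; `beta=0.1` and the numeric log-probabilities below are assumed values chosen only for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(logit))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # Numerically stable form of -log(sigmoid(logit)).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen response and lowered the rejected one: lower loss.
better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
# Policy moved in the wrong direction: higher loss.
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

Minimizing this loss pushes the model to increase the probability of B relative to A without training a separate reward model.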


    5. Task Fine-Tuning

    Now optimize for one specific task.

    Examples

    Sentiment Analysis

    Review



    Positive

    NER

    Sentence



    Entities

    Translation

    English



    French

    Hallucination Detection

    Question



    Hallucination score

    Example

    General assistant

    Fine tune on

    Medical QA

    Medical Assistant


    6. Distillation

    Large models are expensive.

    Teacher

    Generate answers

    Student learns

    Example

    GPT-4



    Produces millions of examples



    7B model



    Learns to imitate GPT-4

    Result:

    Smaller

    Faster

    Cheaper

    Examples:

    • DistilBERT
    • DeepSeek-R1-Distill
    • Llama distilled variants

    7. Continuous Learning

    Models become outdated.

    Periodically update using:

    • New research papers
    • New laws
    • Latest documentation
    • New programming languages

    Example

    2024 model

    Train on 2025 data

    Updated model

    Many production systems combine this with retrieval-based methods rather than continuously retraining the base model.


    Complete Training Pipeline

    Raw Text Corpus


    Pre-training


    Base Model


    Mid-training (Domain Adaptation)


    Domain Base Model


    Supervised Fine-Tuning (SFT)


    Instruction Model


    Preference Alignment
    (RLHF / RLAIF / DPO)


    Aligned Chat Model


    Task Fine-Tuning (Optional)


    Specialized Model


    Distillation (Optional)


    Smaller Efficient Model


    Periodic Updates / Continued Training

    Comparison

    Stage | Learns | Uses Labels? | Training Objective
    Pre-training | Language, reasoning, world knowledge | No | Next-token prediction
    Mid-training | Domain knowledge (medical, legal, finance, code, etc.) | No | Next-token prediction on domain-specific data
    Supervised Fine-Tuning (SFT) | Instruction following | Yes | Predict the target response
    Preference Alignment (RLHF, RLAIF, DPO) | Human or AI preferences, safety, helpfulness | Preference pairs/rankings | Optimize preferred responses
    Task Fine-Tuning | Specific downstream task | Yes | Task-specific objective (classification, generation, etc.)
    Distillation | Mimic a larger model | Teacher-generated outputs | Match teacher behavior
    Continuous Learning | New information and capabilities | Depends | Continued pre-training, fine-tuning, or other update methods

    When to use each stage

    • Pre-training: Build a general-purpose foundation model from scratch.
    • Mid-training: Adapt a foundation model to a domain (e.g., medicine, finance, law, source code) without changing its general objective.
    • SFT: Make the model follow instructions and produce conversational, task-oriented responses.
    • Preference Alignment: Improve helpfulness, harmlessness, and response quality according to human or AI preferences.
    • Task Fine-Tuning: Maximize performance on a specific application such as summarization, code generation, or hallucination detection.
    • Distillation: Deploy a smaller, faster model while retaining much of a larger model's capability.
    • Continuous Learning: Keep models up to date as knowledge, data, and requirements evolve.

    5. Responsible AI

    Fairness

    Avoid bias.

    Explainability

    Understand why model produced output.

    Privacy

    Protect user data.

    Robustness

    Model behaves reliably.

    Transparency

    Users know AI is involved.


    Evaluation 

    Model Group | Evaluation | Description of the Evaluation
    Agentic Models | AgentBench | Evaluates autonomous task execution, planning, tool usage, and multi-step reasoning.
    Agentic Models | GAIA | Tests real-world assistant capabilities like searching, tool calling, and reasoning.
    Agentic Models | SWE-bench | Measures ability to solve real GitHub issues by editing codebases.
    Bi-Encoder | BEIR | Evaluates embedding-based retrieval across multiple datasets and domains.
    Bi-Encoder | MTEB | Measures embedding quality across retrieval, clustering, classification, and reranking tasks.
    Bi-Encoder | MS MARCO | Evaluates dense retrieval and passage ranking performance.
    Cross-Encoder | BEIR | Measures pairwise query-document relevance scoring quality.
    Cross-Encoder | MS MARCO | Evaluates reranking precision for query-passage relevance.
    Cross-Encoder | TREC Deep Learning Track | Measures ranking quality for search relevance tasks.
    Decoder-only | GSM8K | Evaluates arithmetic and multi-step mathematical reasoning.
    Decoder-only | HellaSwag | Measures commonsense reasoning and next-sentence prediction.
    Decoder-only | HumanEval | Evaluates code generation correctness using executable unit tests.
    Decoder-only | MMLU | Measures broad knowledge and reasoning across many academic domains.
    Decoder-only | MT-Bench | Tests instruction following and conversational quality.
    Decoder-only | Needle-in-a-Haystack | Measures ability to retrieve specific information from long contexts.
    Decoder-only | TruthfulQA | Tests factual consistency and resistance to hallucinations.
    Encoder-decoder | BLEU | Measures overlap between generated text and reference text, mainly for translation.
    Encoder-decoder | ROUGE | Evaluates summarization quality based on n-gram overlap.
    Encoder-decoder | SQuAD | Measures extractive question-answering accuracy.
    Encoder-only | GLUE | Evaluates language understanding tasks like sentiment, entailment, and similarity.
    Encoder-only | MTEB | Measures embedding performance across multiple NLP tasks.
    Encoder-only | STS-B | Measures how well embeddings capture sentence similarity.
    Encoder-only | SuperGLUE | Harder version of GLUE for advanced reasoning tasks.
    Long-context Models | InfiniteBench | Evaluates memory retention and reasoning over very long contexts.
    Long-context Models | LongBench | Tests summarization, retrieval, and reasoning on long documents.
    Long-context Models | Needle-in-a-Haystack | Measures retrieval accuracy from large contexts.
    Multimodal Models | MMBench | Evaluates image understanding and multimodal reasoning.
    Multimodal Models | MMMU | Measures multimodal reasoning across academic and professional domains.
    Multimodal Models | MMVet | Tests advanced visual reasoning and perception.
    RAG Systems | CRUD-RAG | Measures retrieval robustness and update handling in RAG pipelines.
    RAG Systems | RAGAS | Evaluates faithfulness, context precision, context recall, and answer relevance in RAG.
    Reward Models | RewardBench | Evaluates preference model quality and alignment performance.
    Tool-use Models | ToolBench | Measures correctness in tool selection, API usage, and tool chaining.

     

    6. AWS AI Services

    Amazon Rekognition

    • Image analysis
    • Face detection
    • Object detection

    Amazon Comprehend

    • Sentiment analysis
    • Entity extraction
    • Language detection

    Amazon Transcribe

    • Speech to text

    Amazon Polly

    • Text to speech

    Amazon Textract

    • Extract text from documents

    Amazon Translate

    • Language translation

    AWS DeepRacer

    • A cloud-based autonomous racing car platform used to learn, train, and evaluate reinforcement learning (RL) models through simulated and real-world racing.

    AWS DeepLens

    • Run computer vision and deep learning models on an AI-enabled camera.

    AWS DeepComposer

    • Learn generative AI and machine learning through music composition.

    7. AWS Generative AI Services

    Amazon Bedrock

    Most important service for the exam.

    Provides access to foundation models from:

    • Anthropic Claude
    • Meta Llama
    • Amazon Nova
    • Stability AI

    Features:

    • RAG
    • Agents
    • Knowledge Bases
    • Guardrails
    • Fine-tuning
    • Streaming API
      - using WebSocket
      - control buffer size for better UX


    Amazon Q Business

    Enterprise chatbot over company data.

    Amazon Q Developer

    Developer coding assistant.

    Amazon SageMaker AI

    Build, train and deploy ML models.

    SageMaker Inference

    SageMaker Pipelines

    • Batch mode saves cost by preventing GPU underutilization.




    8. Security and Compliance

    Shared Responsibility Model

    AWS secures:

    • Infrastructure
    • Hardware
    • Network

    Customer secures:

    • Data
    • Access control
    • Configuration

    IAM

    Identity and access management.

    Encryption

    Data:

    • At rest
    • In transit

    Lake Formation

    • Fine-grained access control, down to column-level security.

    Adversarial Input

    • Evaluate your model periodically with synthetic data to reduce the risk of adversarial inputs causing problems.

    9. Common Use Cases

    Classification

    Spam or Not Spam

    Sentiment Analysis

    Positive / Negative

    Summarization

    Long article -> Short summary

    Chatbots

    Customer support

    Code Generation

    Generate Java/Python code

    Document Processing

    Invoice extraction

    Frequently Tested Comparisons

    Service | Purpose
    Bedrock | Generative AI
    SageMaker AI | Build/train ML models
    Comprehend | NLP analysis
    Rekognition | Image analysis
    Textract | Document extraction
    Transcribe | Speech to text
    Polly | Text to speech
    Translate | Translation

    Tips:

    Topic | Remember
    Generative AI | Creates new content
    Foundation Model | Large pretrained model
    Hallucination | Confident wrong answer
    RAG | Retrieve + Generate
    Bedrock | Managed GenAI platform
    SageMaker | ML lifecycle
    Guardrails | Safety controls
    Fine-Tuning | Retrain model
    Inference | Model prediction
    Token | Small text unit


    There are several ways to represent the position of tokens in a Transformer. Although people often say "embeddings like RoPE," RoPE is actually a positional encoding/embedding technique, not a semantic embedding like a word embedding.

    Below are the major types of positional embeddings/encodings used in Transformers.

    Method | Learnable | Extrapolates to Longer Sequences | Used In
    Absolute Positional Embedding (APE) | Yes | ❌ No | GPT-2, BERT
    Sinusoidal Positional Encoding | No | ✅ Yes | Original Transformer
    Relative Positional Embedding (RPE) | Usually Yes | Better than APE | T5, Transformer-XL
    Rotary Positional Embedding (RoPE) | No | ✅ Yes | Llama, Qwen, DeepSeek, GPT-NeoX
    ALiBi (Attention with Linear Biases) | No | ✅ Excellent | BLOOM, MPT
    xPos | No | ✅ Better than RoPE | Long-context models
    Dynamic RoPE | No | ✅ Improved long-context support | Some Llama variants
    NTK-aware RoPE Scaling | No | ✅ Extends context window | Llama long-context adaptations
    YaRN (Yet another RoPE extensioN) | No | ✅ Very good | Long-context fine-tuned LLMs
    LeX (Length Extrapolation) | Varies | ✅ Designed for long context | Research models

    1. Absolute Positional Embedding (APE)

    Each position has its own learnable vector.

    Example

    Position 0 → [0.1, 0.3, ...]
    Position 1 → [0.4, 0.8, ...]
    Position 2 → [0.2, 0.9, ...]

    Final input

    Token Embedding

    +

    Position Embedding

    Advantages

    • Simple
    • Learns position information

    Disadvantages

    • Cannot naturally handle sequences longer than those seen during training
    • Every position needs its own learned vector

    Used in:

    • GPT-2
    • BERT

    2. Sinusoidal Positional Encoding

    Introduced in the original Transformer paper.

    Uses sine and cosine functions.

    Formula

    PE(pos,2i)=sin(pos/10000^(2i/d))

    PE(pos,2i+1)=cos(pos/10000^(2i/d))
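
The two formulas translate directly into code. In this sketch `d_model` is the embedding dimension (written `d` in the formula) and is assumed to be even; the size 8 is chosen only to keep the output small.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional encoding for position pos."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

pe0 = sinusoidal_encoding(0, 8)   # position 0: sin terms are 0, cos terms are 1
pe5 = sinusoidal_encoding(5, 8)
```

Because the values are computed rather than learned, an encoding exists for any position, including ones never seen in training.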

    Advantages

    • No training required
    • Infinite positions can be computed
    • Generalizes better to unseen sequence lengths

    Disadvantages

    • Less expressive than learned approaches

    Used in:

    • Original Transformer

    3. Relative Positional Embedding (RPE)

    Instead of absolute positions,

    the model learns the relative distance.

    Example

    Dog is 2 words before cat.

    The model learns

    Distance = +2

    instead of

    Dog at position 5

    Cat at position 7

    Advantages

    • Better captures local relationships
    • Handles varying sequence lengths more naturally

    Used in:

    • T5
    • Transformer-XL
    • DeBERTa (with variants)

    4. RoPE (Rotary Positional Embedding)

    The most popular method in modern LLMs.

    Instead of adding a positional vector,

    RoPE rotates the Query and Key vectors by an angle based on position.

    Embedding



    Query



    Rotate by position angle



    Attention

    Advantages

    • Excellent long-context behavior
    • Naturally preserves relative position information
    • No additional learned parameters

    Used in:

    • Llama
    • Qwen
    • DeepSeek
    • GPT-NeoX
    • Mistral
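
A small sketch of the rotation, assuming the formulation that rotates adjacent pairs of dimensions (some implementations pair dimensions differently, which is equivalent up to a permutation). The vectors `q` and `k` are made-up values. The point it illustrates is that the attention score depends only on the distance between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # rotation angle for this pair
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

score_a = dot(rope(q, 7), rope(k, 4))       # positions 7 and 4, distance 3
score_b = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, distance 3
```

The two scores match up to floating-point error, which is how RoPE preserves relative position information without any learned parameters.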

    5. ALiBi (Attention with Linear Biases)

    Instead of embeddings,

    add a linear bias directly to attention scores.

    Attention Score

    +

    Distance Bias

    Farther tokens receive a progressively larger penalty.

    Advantages

    • Extremely simple
    • Strong length extrapolation
    • No positional vectors

    Used in:

    • BLOOM
    • MPT
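
The bias can be sketched as a matrix added to the attention scores before softmax. The slope schedule below is the geometric sequence described for ALiBi when the number of heads is a power of two; the head count and sequence length are assumed values for illustration.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for j <= i; future positions are masked."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])   # bias matrix for the first head
```

Each row shows the penalty growing linearly with distance from the current token, with no positional vectors involved.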

    6. xPos

    An extension of RoPE.

    Designed to improve long-sequence stability.

    Instead of using a fixed rotation,

    it rescales rotations.

    Advantages

    • Better than RoPE for very long contexts
    • Preserves attention stability

    7. Dynamic RoPE

    Standard RoPE uses fixed frequencies.

    Dynamic RoPE adjusts frequencies dynamically for longer contexts.

    Advantages

    • Better context extrapolation
    • Improved performance on long documents

    8. NTK-aware RoPE Scaling

    Originally developed for extending Llama's context window.

    Idea

    Stretch the RoPE frequencies.

    Example

    Original

    4096 tokens



    Scaled

    32768 tokens

    No retraining is required in many implementations.

    Advantages

    • Very popular
    • Simple
    • Enables longer contexts

    9. YaRN (Yet another RoPE extensioN)

    Improves on NTK scaling.

    Combines interpolation and scaling techniques.

    Advantages

    • Better long-context quality
    • Less degradation than naive scaling
    • Widely used for 128K+ context extensions

    10. LeX (Length Extrapolation)

    Research methods that explicitly optimize for longer contexts.

    Goal

    Train models that naturally generalize beyond the training context length.


    Which models use which?

    | Model | Positional Method |
    |-------|-------------------|
    | GPT-2 | Absolute Positional Embedding |
    | BERT | Absolute Positional Embedding |
    | Original Transformer | Sinusoidal Encoding |
    | Transformer-XL | Relative Positional Encoding |
    | T5 | Relative Positional Encoding |
    | DeBERTa | Relative Position Bias |
    | GPT-NeoX | RoPE |
    | Llama 1/2/3 | RoPE |
    | Qwen | RoPE |
    | DeepSeek | RoPE |
    | Mistral | RoPE |
    | BLOOM | ALiBi |
    | MPT | ALiBi |

    Other Types of Embeddings in LLMs

    In addition to positional embeddings, LLMs use several other embedding types.

    | Embedding Type | Purpose |
    |----------------|---------|
    | Token Embedding | Represents each token as a dense vector |
    | Positional Embedding/Encoding | Represents token order (e.g., RoPE, ALiBi, APE) |
    | Segment (Token Type) Embedding | Distinguishes sentence A from sentence B (used in BERT) |
    | Word Embedding | Maps words/subwords to vectors |
    | Character Embedding | Represents characters instead of words |
    | Sentence Embedding | Represents an entire sentence with a single vector |
    | Document Embedding | Represents an entire document |
    | Instruction Embedding | Encodes task or instruction information in some architectures |
    | Multimodal Embedding | Maps text, images, audio, etc., into a shared embedding space |

    Summary

    | Method | Basic Idea | Best For | Limitation |
    |--------|------------|----------|------------|
    | Absolute Positional Embedding | Add a learned vector for each position | Short, fixed-length sequences | Cannot extrapolate well |
    | Sinusoidal Encoding | Use deterministic sine/cosine functions | General sequence modeling | Less expressive than learned methods |
    | Relative Positional Embedding | Encode distances between tokens | Better relative reasoning | More complex attention computation |
    | RoPE | Rotate Query/Key vectors based on position | Modern LLMs with long context | Standard RoPE still has finite context limits |
    | ALiBi | Add a linear distance bias to attention scores | Efficient long-context inference | May underperform RoPE on some benchmarks |
    | xPos / Dynamic RoPE / NTK Scaling / YaRN | Variants that improve or extend RoPE | Very long-context LLMs (32K–1M+ tokens) | Additional implementation complexity |

    Today, RoPE has become the de facto standard for decoder-only LLMs (Llama, Qwen, Mistral, DeepSeek). ALiBi is valued for its simplicity and strong length extrapolation, and RoPE extensions such as NTK-aware scaling and YaRN are commonly used to extend context windows without retraining the entire model.


    Inference Engineering

    Inference Engineering is the discipline of designing, optimizing, deploying, and scaling Large Language Models (LLMs) for efficient inference (prediction) in production environments.

    While training focuses on making a model smarter, inference engineering focuses on making the model faster, cheaper, more scalable, and capable of serving millions of requests.


    Why do we need Inference Engineering?

    A trained model is usually huge.

    Example:

    | Model | Parameters | FP16 Memory |
    |-------|------------|-------------|
    | Llama 3 8B | 8 Billion | ~16 GB |
    | Llama 3 70B | 70 Billion | ~140 GB |
    | DeepSeek R1 671B | 671 Billion | >1.3 TB |
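
    The FP16 column follows from two bytes per parameter. A rough sketch of that arithmetic, counting weights only (KV cache, activations, and framework overhead come on top); the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9), ("DeepSeek R1 671B", 671e9)]:
    print(f"{name}: {weight_memory_gb(params):.0f} GB in FP16, "
          f"{weight_memory_gb(params, 'int4'):.0f} GB in INT4")
```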

    Serving these models directly causes problems:

    • High GPU memory usage
    • Slow response time
    • High latency
    • Low throughput
    • Expensive GPUs
    • Poor utilization

    Inference engineering solves these issues.


    ML Lifecycle

    Collect Data


    Pretraining


    Fine-tuning


    Model Evaluation


    Model Registry


    Inference Engineering


    Production APIs

    Inference engineering starts after training is complete.


    Responsibilities of an Inference Engineer

    An inference engineer works on:

    • Loading huge models efficiently
    • Reducing GPU memory
    • Optimizing attention computation
    • Continuous batching
    • KV Cache optimization
    • Speculative decoding
    • Quantization
    • Tensor parallelism
    • Pipeline parallelism
    • Multi-GPU serving
    • Autoscaling
    • API serving
    • Streaming tokens
    • Monitoring GPU utilization

    Architecture

                         User



    HTTP / gRPC API



    Inference Server
    (vLLM / SGLang / TGI)



    ------------------------------
    | Scheduler |
    ------------------------------

    Continuous Batching



    Token Generation



    CUDA / FlashAttention



    GPU Memory



    Llama / Qwen

    Problems During Inference

    1. Large Model Size

    Example

    70B model


    140GB FP16



    Need multiple GPUs

    Solution

    • Quantization
    • Tensor Parallelism

    2. Slow Token Generation

    Generating

    Hello


    How



    are



    you

    One token at a time is expensive.

    Solution

    • KV Cache
    • FlashAttention
    • Speculative Decoding

    3. Multiple Users

    Imagine

    User A
    User B
    User C
    User D

    Without batching

    GPU

    Run A

    Run B

    Run C

    Run D

    GPU remains underutilized.

    Solution

    Continuous batching.


    Major Components


    1. Model Loading

    Instead of

    Load model

    Wait

    Serve request

    Inference servers use:

    • Lazy loading
    • Memory mapping
    • Sharded checkpoints

    2. Scheduler

    The scheduler decides

    Which request?

    Which GPU?

    How many tokens?

    Batch size?

    3. KV Cache

    Without cache

    For every token

    Input



    Transformer Layer 1



    Layer 2



    ...



    Layer N

    Entire sequence is recomputed.

    With KV Cache

    Past Keys

    Past Values



    Reuse



    Only compute new token

    Huge speedup.
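
    A toy single-head attention in NumPy that compares the two approaches. The weights are random and the class name KVCache is illustrative. The point is only that caching past Keys and Values gives the same outputs while projecting one new token per step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_full(tokens):
    """Recompute K and V for the whole sequence; return the last token's output."""
    K, V = tokens @ Wk, tokens @ Wv
    q = tokens[-1] @ Wq
    return softmax(q @ K.T / np.sqrt(d)) @ V

class KVCache:
    def __init__(self):
        self.K, self.V = [], []

    def step(self, token):
        """Project only the new token, then attend over cached keys and values."""
        self.K.append(token @ Wk)
        self.V.append(token @ Wv)
        q = token @ Wq
        K, V = np.stack(self.K), np.stack(self.V)
        return softmax(q @ K.T / np.sqrt(d)) @ V

tokens = rng.normal(size=(6, d))
cache = KVCache()
cached_out = [cache.step(t) for t in tokens]
full_out = [attend_full(tokens[: i + 1]) for i in range(len(tokens))]
print(all(np.allclose(a, b) for a, b in zip(cached_out, full_out)))
```

    The price is memory: the cache grows linearly with sequence length, which is why cache management matters so much in serving.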


    4. Continuous Batching

    Traditional batching

    Request A

    Request B

    Request C



    Wait



    Run together

    Bad because new users wait.

    Continuous batching

    GPU Running



    New request arrives



    Insert into running batch



    Continue execution

    Much better GPU utilization.
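
    The difference can be illustrated with a small step-by-step simulation. This is a hypothetical scheduler model, not how any particular server is implemented: each request is (arrival step, tokens to generate), and every decode step produces one token for each running request.

```python
from collections import deque

def simulate(requests, max_batch, continuous):
    """Return {request id: step at which it finished}."""
    pending = deque(sorted(enumerate(requests), key=lambda r: r[1][0]))
    running = {}      # request id -> tokens still to generate
    finished = {}
    step = 0
    while pending or running:
        # Static batching admits new work only when the batch is empty;
        # continuous batching fills free slots at every step.
        can_admit = continuous or not running
        while can_admit and pending and pending[0][1][0] <= step and len(running) < max_batch:
            rid, (_, n_tokens) = pending.popleft()
            running[rid] = n_tokens
        # One decode step: every running request produces one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished[rid] = step + 1
                del running[rid]
        step += 1
    return finished

requests = [(0, 8), (0, 2), (1, 2), (1, 2)]   # one long request, three short ones
static = simulate(requests, max_batch=2, continuous=False)
cont = simulate(requests, max_batch=2, continuous=True)
print("static:    ", static)
print("continuous:", cont)
```

    In the static run the short requests 2 and 3 wait for the long request to finish. In the continuous run they reuse the slot freed by request 1 and finish much earlier.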


    5. FlashAttention

    Normal attention

    Q × K



    Huge matrix



    Softmax



    Multiply V

    Consumes enormous memory.

    FlashAttention

    • Tiles computation
    • Keeps each tile in fast on-chip SRAM (GPU shared memory)
    • Avoids writing large intermediate matrices to HBM
    • Fuses multiple attention operations into one GPU kernel

    Benefits:

    • Lower memory usage
    • Higher throughput
    • Faster inference, especially for long sequences
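
    The core trick behind the tiling is an "online" softmax: Keys and Values are processed block by block while only a running maximum, running sum, and partial output are kept. The NumPy sketch below shows that idea for a single query. It is not the fused GPU kernel, and the block size is arbitrary.

```python
import numpy as np

def naive_attention(q, K, V):
    s = K @ q / np.sqrt(q.size)          # full score vector materialized at once
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def tiled_attention(q, K, V, block=4):
    """Process K/V in blocks, keeping only a running max, sum, and output."""
    running_max = -np.inf
    running_sum = 0.0
    out = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(q.size)
        new_max = max(running_max, s.max())
        # Rescale what was accumulated so far to the new maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum()
        out = out * correction + p @ V[start:start + block]
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(10, 8)), rng.normal(size=(10, 5))
print(np.allclose(naive_attention(q, K, V), tiled_attention(q, K, V)))
```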

    6. Quantization

    Original weights

    FP32



    FP16



    INT8



    INT4

    Benefits

    • Smaller model
    • Lower VRAM
    • Faster inference

    Trade-off

    • Slight accuracy loss
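
    A minimal sketch of symmetric per-tensor INT8 quantization, one of the simplest schemes. Production methods usually work per channel or per group and are more careful with outliers; the function names here are illustrative.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
max_error = np.abs(w - dequantize(q, scale)).max()

print(f"FP32: {w.nbytes} bytes, INT8: {q.nbytes} bytes, max error: {max_error:.4f}")
```

    The rounding error per weight is bounded by half the scale, which is where the accuracy loss comes from.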

    7. Tensor Parallelism

    Suppose

    70B model

    Split across GPUs

    GPU1

    First half of each weight matrix

    GPU2

    Second half of each weight matrix

    Both GPUs compute simultaneously.
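
    A NumPy sketch of a column-parallel split for one linear layer, with two arrays standing in for two GPUs. The shapes are arbitrary, and the final concatenation plays the role of the all-gather a real system would perform over NVLink or the network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # activations: batch x hidden
W = rng.normal(size=(16, 32))         # one layer's weight matrix

# Column-parallel split: each "GPU" holds half of the output columns.
W_gpu1, W_gpu2 = np.split(W, 2, axis=1)
y_gpu1 = x @ W_gpu1                   # computed on GPU 1
y_gpu2 = x @ W_gpu2                   # computed on GPU 2, at the same time

# An all-gather along the column axis reassembles the full output.
y_parallel = np.concatenate([y_gpu1, y_gpu2], axis=1)
y_reference = x @ W
print(np.allclose(y_parallel, y_reference))
```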


    8. Pipeline Parallelism

    Instead of splitting tensors

    Split layers.

    GPU1

    Layers 1-20



    GPU2

    21-40



    GPU3

    41-60

    9. Speculative Decoding

    Use

    Small model

    Guess tokens

    Large model verifies

    Accept if correct


    If guesses are correct

    Large speedup.
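
    A greedy-decoding sketch of the draft-and-verify loop. The two "models" are hypothetical toy functions over integer tokens, chosen so the draft is right most of the time. In a real system the verification of all k guesses happens in one batched forward pass of the large model.

```python
def target_next(context):
    """Stand-in for the large model's greedy next token (toy rule)."""
    return (sum(context) * 7 + len(context)) % 50

def draft_next(context):
    """Stand-in for the small model: wrong whenever the context length is a multiple of 4."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft model proposes k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target model checks all k positions.
        target_passes += 1
        for i, tok in enumerate(draft):
            expected = target_next(out + draft[:i])
            if tok != expected:
                out.extend(draft[:i] + [expected])   # keep the agreed prefix, fix the miss
                break
        else:
            out.extend(draft)                        # every guess was accepted
    return out[: len(prompt) + n_new], target_passes

prompt = [3, 1, 4]
tokens, passes = speculative_decode(prompt, n_new=20)
print(tokens == greedy_decode(prompt, 20), "target passes:", passes)
```

    The output is identical to plain greedy decoding with the large model. The saving is in how many times the large model has to run.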

    ---

    # Popular Inference Frameworks

    | Framework | Company | Primary Focus |
    |-----------|----------|---------------|
    | vLLM | UC Berkeley / community | High-throughput LLM serving |
    | SGLang | LMSYS | Programmable LLM inference and agent workflows |
    | Text Generation Inference (TGI) | Hugging Face | Production inference |
    | TensorRT-LLM | NVIDIA | GPU-optimized inference |
    | ONNX Runtime | Microsoft | Cross-platform inference |
    | llama.cpp | Community | CPU and edge inference |
    | OpenVINO | Intel | CPU inference |

    ---

    # vLLM

    vLLM is currently one of the most popular inference engines for serving LLMs efficiently.

    Architecture

    HTTP Request

    vLLM API Server

    Scheduler

    Continuous Batch

    PagedAttention

    GPU

    Generated Tokens


    ## Key Features

    - Continuous batching
    - PagedAttention (efficient KV cache management)
    - Tensor parallelism
    - Streaming responses
    - OpenAI-compatible APIs
    - Multi-GPU serving
    - High throughput

    ### Why is vLLM fast?

    Instead of storing KV cache in one large contiguous block, it uses **PagedAttention**, which organizes cache into fixed-size memory pages (similar to virtual memory in operating systems). This reduces fragmentation, allows cache sharing, and enables continuous batching without frequent memory reallocations.
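
    The paging idea can be sketched with a toy allocator that maps each sequence to a list of fixed-size physical blocks (a "block table"). The class and method names are hypothetical and this is not vLLM's actual code. It only illustrates why memory is wasted in at most one partially filled block per sequence.

```python
class PagedKVCache:
    """Toy allocator: maps each sequence's tokens to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:        # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("A")      # 5 tokens -> 2 blocks
for _ in range(3):
    cache.append_token("B")      # 3 tokens -> 1 block
print(cache.block_tables)
```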

    ---

    # SGLang

    SGLang is both an inference engine and a programming framework for building complex LLM applications.

    Architecture

    Application

    SGLang Runtime

    Request Scheduler

    Structured Generation

    RadixAttention (KV cache reuse)

    GPU


    Unlike a simple REST API server, SGLang lets developers define workflows involving:

    - Multiple prompts
    - Tool calling
    - Agent loops
    - Structured outputs
    - Parallel execution
    - Cached intermediate computations

    ### Key Features

    - High-performance inference
    - Structured generation
    - Multi-turn conversations
    - Agent workflows
    - Tool execution
    - Grammar-constrained decoding (e.g., valid JSON)
    - RadixAttention for automatic KV cache (prefix) reuse across requests

    ---

    # vLLM vs SGLang

    | Feature | vLLM | SGLang |
    |----------|------|---------|
    | Primary Goal | Fast LLM serving | LLM application runtime |
    | Continuous Batching | ✅ | ✅ |
    | PagedAttention | ✅ | Paged KV cache with RadixAttention prefix reuse |
    | Structured Output | Supported (guided decoding) | Supported (grammar-constrained decoding, frontend language) |
    | Agent Workflows | External orchestration | Built in |
    | Tool Calling | Basic | Advanced |
    | Multi-step Reasoning | External | Native |
    | OpenAI API Compatible | Yes | Yes |
    | Backend | Native engine | Native runtime (SGLang Runtime) |

    ---

    # Example Production Architecture
                Users



    API Gateway



    Load Balancer



    ------------------------

    vLLM Instance 1

    vLLM Instance 2

    vLLM Instance 3

    ------------------------



    GPU Cluster

    (Tensor Parallel)



    Llama 3 70B

    ---

    # Skills Required for an Inference Engineer

    An inference engineer should be familiar with:

    | Area | Topics |
    |------|--------|
    | GPU Architecture | CUDA, SMs, HBM, memory hierarchy |
    | Transformer Internals | Attention, KV Cache, decoding strategies |
    | Memory Optimization | Quantization, PagedAttention, cache management |
    | Parallelism | Tensor, pipeline, data, and expert parallelism |
    | Serving Frameworks | vLLM, SGLang, TGI, TensorRT-LLM |
    | Deployment | Kubernetes, Docker, autoscaling, load balancing |
    | APIs | REST, gRPC, OpenAI-compatible APIs |
    | Performance | Throughput, latency, tokens/sec, GPU utilization |
    | Monitoring | Prometheus, Grafana, tracing, logging |

    ---

    # Summary

    Inference engineering bridges the gap between a trained LLM and a production-ready AI service. It combines systems engineering, GPU optimization, distributed computing, and model-serving techniques to maximize throughput, minimize latency, and reduce infrastructure cost. Frameworks such as **vLLM** focus on efficient, high-throughput serving using innovations like continuous batching and PagedAttention, while **SGLang** builds on efficient inference to provide a runtime for complex agentic applications, structured generation, and tool-using workflows. Together, these tools enable organizations to deploy LLMs that are responsive, scalable, and cost-effective.


    Agents:

    A2A (Agent-to-Agent) is a communication protocol that allows AI agents built by different frameworks, vendors, or organizations to discover each other, exchange messages, delegate tasks, and collaborate in a standardized way.

    Without A2A, each AI agent operates like an isolated application.


    Why do we need A2A?

    1. Agents are specialized

    Instead of building one giant agent that knows everything, organizations build specialized agents.

    Example:

    • HR Agent
    • Finance Agent
    • Travel Agent
    • Calendar Agent
    • Code Generation Agent

    Suppose a user asks:

    "Book a flight for my business trip and make sure it fits my team's calendar and budget."

    No single agent may have all the required capabilities.

    Using A2A:

    User


    Travel Agent

    ├──► Calendar Agent → Check availability

    ├──► Finance Agent → Verify budget

    └──► Approval Agent → Manager approval

    Each agent performs its specialty and returns the result.


    2. Different teams build different agents

    In large companies:

    • HR develops HR agents
    • Finance develops Finance agents
    • IT develops Infrastructure agents

    Each team may use:

    • LangGraph
    • CrewAI
    • AutoGen
    • Semantic Kernel
    • OpenAI SDK

    Without a common protocol, every integration requires custom APIs.

    A2A provides a common language for communication.


    3. Avoid custom integrations

    Without A2A:

    HR Agent

    Custom REST API

    Finance Agent

    Travel Agent

    Different API

    Calendar Agent

    Each pair of agents needs custom integration.

    If there are N agents, integrations can grow roughly as N × (N - 1) / 2 in the worst case.

    With A2A:

    HR Agent

    Finance Agent

    Travel Agent

    Calendar Agent

    Everyone speaks the same protocol.
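
    The growth in integration work can be made concrete with a few lines of arithmetic. The assumption here is that a shared protocol needs one adapter per agent, while point-to-point integration needs one per pair in the worst case.

```python
def point_to_point_integrations(n_agents):
    """Worst case: every pair of agents has its own custom integration."""
    return n_agents * (n_agents - 1) // 2

def shared_protocol_adapters(n_agents):
    """Assumed: each agent implements the common protocol once."""
    return n_agents

for n in (4, 10, 50):
    print(n, point_to_point_integrations(n), shared_protocol_adapters(n))
```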


    4. Agent discovery

    Suppose an agent needs legal advice.

    Without A2A:

    • Hardcode endpoint
    • Hardcode authentication
    • Hardcode API

    With A2A:

    Planning Agent



    Discover



    Legal Agent



    Capabilities



    Can review contracts
    Can summarize regulations

    The planning agent discovers and uses the legal agent dynamically.
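
    A toy in-memory sketch of capability-based discovery. The AgentCard and Registry classes, the example URLs, and the skill names are all hypothetical. A real A2A deployment publishes this kind of capability metadata over the network rather than in a local list.

```python
from dataclasses import dataclass, field

@dataclass
class AgentCard:
    """Hypothetical description of what an agent can do and where to reach it."""
    name: str
    url: str
    skills: list = field(default_factory=list)

class Registry:
    def __init__(self):
        self._cards = []

    def register(self, card):
        self._cards.append(card)

    def discover(self, skill):
        """Return every registered agent that advertises the requested skill."""
        return [c for c in self._cards if skill in c.skills]

registry = Registry()
registry.register(AgentCard("Legal Agent", "https://legal.example.com",
                            ["review_contracts", "summarize_regulations"]))
registry.register(AgentCard("Finance Agent", "https://finance.example.com",
                            ["verify_budget"]))

matches = registry.discover("review_contracts")
print([c.name for c in matches])
```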


    5. Task delegation

    A planning agent doesn't need to solve every problem.

    Example:

    User



    Planning Agent



    Generate report



    Data Agent



    Fetch sales



    Analytics Agent



    Generate insights



    Visualization Agent



    Create charts

    Each task is delegated to the most suitable agent.


    6. Multi-vendor interoperability

    Imagine:

    • Company A builds a Finance Agent.
    • Company B builds a Procurement Agent.
    • Company C builds a Compliance Agent.

    Without A2A, these agents need custom integration.

    With A2A, they can collaborate using a shared protocol, regardless of who built them.


    7. Supports distributed systems

    Agents may run:

    • On-premises
    • AWS
    • Azure
    • Google Cloud
    • Edge devices

    A2A enables communication across these environments without tightly coupling implementations.


    8. Reusability

    Instead of building the same capability repeatedly:

    Expense Agent

    used by

    Finance Team

    HR Team

    Travel Team

    Audit Team

    A single specialized agent can serve multiple workflows.


    Real-world example

    Consider an online shopping scenario.

    User



    Shopping Agent



    Inventory Agent
    Check stock



    Pricing Agent
    Apply discounts



    Payment Agent
    Process payment



    Shipping Agent
    Arrange delivery



    Notification Agent
    Send confirmation

    Each agent focuses on its own domain, and A2A coordinates their interaction.


    Benefits of A2A

    | Benefit | Explanation |
    |---------|-------------|
    | Interoperability | Agents from different frameworks and vendors can work together. |
    | Reusability | Specialized agents can be reused across multiple applications. |
    | Scalability | New agents can be added without redesigning existing integrations. |
    | Dynamic discovery | Agents can discover available capabilities at runtime. |
    | Delegation | Complex tasks are split among specialized agents. |
    | Maintainability | Reduces the need for numerous custom point-to-point integrations. |
    | Vendor independence | Avoids locking systems into a single AI framework or provider. |

    A2A vs MCP

    These protocols solve different problems and are often used together.

    | A2A (Agent-to-Agent) | MCP (Model Context Protocol) |
    |----------------------|------------------------------|
    | Connects agents | Connects an AI model or agent to tools and data sources |
    | Agent ↔ Agent | Agent ↔ Tool |
    | Used for collaboration and delegation | Used for accessing capabilities like databases, APIs, files, and SaaS services |
    | Enables multi-agent workflows | Enables tool invocation and context retrieval |
    | Example: Travel Agent asks Finance Agent to approve a budget | Example: Finance Agent queries a PostgreSQL database or invokes a payment API |

    A common architecture is:

                       User
                         │
                  Planning Agent
              ┌──────────┴──────────┐
              ▼                     ▼
        Travel Agent          Finance Agent
              │                     │
            (MCP)                 (MCP)
              │                     │
         Flight API           ERP Database
          Hotel API          Budget Service
    Here, A2A allows the planning, travel, and finance agents to communicate, while MCP lets each agent interact with the external tools and data sources it needs. Together, they enable modular, interoperable, and scalable AI systems.

     


    Building an LLM can mean very different things depending on your goal. There are three common paths:

    | Goal | Time | GPUs Needed | Cost | Example |
    |------|------|-------------|------|---------|
    | Train from scratch | Months | Hundreds to thousands | Millions of dollars | GPT, Llama, DeepSeek |
    | Continue pre-training an existing model | Days to weeks | 8–128 GPUs | Thousands to tens of thousands | Domain-specific Llama |
    | Fine-tune an existing model | Hours to days | 1–8 GPUs | Tens to hundreds of dollars | Chatbot, coding assistant |

    If your goal is to understand how companies like OpenAI, Meta, or DeepSeek build an LLM, the lifecycle looks like this.

    Complete LLM Lifecycle

                     Data Collection


    Data Cleaning


    Tokenizer Training


    Dataset Tokenization


    Transformer Design


    Distributed Training


    Checkpoint Saving


    Model Evaluation


    Instruction Fine-tuning


    Preference Tuning
    (RLHF / DPO / GRPO / RFT)


    Safety Alignment


    Quantization (Optional)


    Inference Optimization
    (vLLM / SGLang / TensorRT)


    Production Deployment

    Phase 1: Collect Data

    An LLM learns from enormous amounts of text.

    Typical sources include:

    • Books
    • Wikipedia
    • GitHub repositories
    • Research papers
    • Stack Overflow
    • News
    • Government documents
    • Web pages
    • Question-answer datasets
    • Conversations

    Example (illustrative sizes, not measured figures):

    Wikipedia



    50 TB

    Books



    20 TB

    GitHub



    30 TB

    Research Papers



    10 TB

    Total



    110 TB Raw Data

    The raw data is noisy and cannot be used directly.


    Phase 2: Clean the Data

    Remove:

    • HTML
    • advertisements
    • spam
    • duplicate documents
    • corrupted files
    • offensive content (depending on policy)
    • very short documents
    • low-quality translations

    Example:

    110 TB



    Cleaning



    35 TB High Quality Data
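
    A small sketch of three of the filters listed above: HTML stripping, a minimum-length rule, and exact deduplication by hash. The threshold and function name are illustrative. Real pipelines add fuzzy deduplication (for example MinHash), language identification, and quality classifiers.

```python
import hashlib
import re

def clean_corpus(docs, min_words=5):
    seen, kept = set(), []
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc)            # strip HTML tags
        text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
        if len(text.split()) < min_words:              # drop very short documents
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:                             # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "<p>Iceberg tables support ACID transactions on object storage.</p>",
    "Iceberg tables support ACID   transactions on object storage.",
    "Buy now!",
]
cleaned = clean_corpus(docs)
print(cleaned)
```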

    Phase 3: Train a Tokenizer

    The model cannot process raw text directly.

    Instead it converts text into tokens.

    Example:

    Artificial Intelligence



    Artificial

    Intelligence

    or

    playing



    play

    ing

    Modern models typically use:

    • Byte Pair Encoding (BPE)
    • SentencePiece
    • WordPiece

    Vocabulary size:

    32,000

    or

    50,000

    or

    128,000 tokens
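
    A compact sketch of how Byte Pair Encoding learns its vocabulary: start from characters and repeatedly merge the most frequent adjacent pair. The tiny word list is made up, and ties between equally frequent pairs are broken arbitrarily, so the exact merges are not meaningful beyond illustration.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn merge rules from a list of words, starting at character level."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

merges, corpus = train_bpe(["playing", "playing", "played", "player", "singing"], num_merges=6)
print(merges)
print(list(corpus))
```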

    Phase 4: Tokenize Everything

    Every document becomes integers.

    Example (the token IDs are illustrative and depend on the tokenizer)

    Hello World



    [15496, 2787]

    The neural network only sees numbers.


    Phase 5: Build the Transformer

    Typical architecture:

    Input Tokens



    Embedding Layer



    Transformer Block



    Transformer Block



    Transformer Block



    ...



    Final Linear Layer



    Vocabulary Probabilities

    Each transformer block contains:

    LayerNorm



    Multi-Head Attention



    Residual



    LayerNorm



    Feed Forward Network



    Residual

    Phase 6: Configure Hyperparameters

    Typical configuration:

    | Parameter | Example |
    |-----------|---------|
    | Layers | 32 |
    | Hidden Size | 4096 |
    | Attention Heads | 32 |
    | Context Length | 8192 |
    | Vocabulary | 128K |
    | Parameters | ~8B |

    Larger models increase these values.
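
    A rough parameter-count estimate for a decoder-only Transformer with the values above. The table does not give the FFN width, the number of Key/Value heads, or whether embeddings are tied, so those are assumptions here (gated FFN with three projections, width 14,336, 8 KV heads, untied embeddings, no biases).

```python
def estimate_params(layers, hidden, heads, vocab, kv_heads, ffn_hidden, tied_embeddings=False):
    head_dim = hidden // heads
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim   # Q and O, plus K and V
    ffn = 3 * hidden * ffn_hidden                                   # gate, up, down projections
    norms = 2 * hidden                                              # two norm layers per block
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)     # input table + output head
    return layers * (attn + ffn + norms) + embeddings + hidden      # + final norm

total = estimate_params(layers=32, hidden=4096, heads=32, vocab=128_000,
                        kv_heads=8, ffn_hidden=14_336)
smaller = estimate_params(layers=32, hidden=4096, heads=32, vocab=32_000,
                          kv_heads=32, ffn_hidden=11_008)
print(f"{total / 1e9:.2f}B vs {smaller / 1e9:.2f}B")
```

    Under these assumptions the 128K-vocabulary configuration comes to about 8 billion parameters, while a 32K vocabulary with a narrower FFN gives about 6.7 billion. The total is sensitive to vocabulary size and FFN width, not just layers and hidden size.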


    Phase 7: Pre-training

    The objective is simple:

    Predict the next token.

    Example:

    The capital of France is



    ?



    Paris

    Training samples:

    The



    cat



    sat



    on



    the



    mat

    For every token:

    Input



    Forward Pass



    Prediction



    Loss



    Backpropagation



    Update Weights

    This process repeats over trillions of tokens.
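
    The loss in the loop above is a cross-entropy between the model's prediction at each position and the token that actually comes next. A NumPy sketch with toy token IDs and untrained (all-zero) logits, which is an assumption made purely so the result is easy to check by hand:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t."""
    inputs, targets = logits[:-1], token_ids[1:]        # shift by one position
    shifted = inputs - inputs.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 6
token_ids = np.array([0, 1, 2, 3, 4, 5])                # toy IDs for a six-token sentence
uniform_logits = np.zeros((len(token_ids), vocab_size)) # an untrained model guesses uniformly
loss = next_token_loss(uniform_logits, token_ids)
print(loss, np.log(vocab_size))
```

    An untrained model that guesses uniformly scores ln(vocabulary size). Training pushes the loss below that by putting more probability on the correct next token.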


    Phase 8: Distributed Training

    One GPU is not enough.

    Example:

    1000 GPUs



    Each GPU processes a shard of the data or holds a shard of the model



    Synchronize gradients



    Repeat

    Common techniques:

    • Data Parallelism
    • Tensor Parallelism
    • Pipeline Parallelism
    • Expert Parallelism (Mixture of Experts)

    Frameworks:

    • PyTorch Distributed
    • DeepSpeed
    • Megatron-LM
    • Fully Sharded Data Parallel (FSDP)
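
    The "synchronize gradients" step of data parallelism can be simulated on one machine. In this sketch four workers each compute a gradient on their shard of the batch, and an average (standing in for an all-reduce) reproduces the full-batch gradient. The linear model and shard count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = np.zeros(4)

def gradient(X, y, w):
    """Gradient of mean squared error for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

# Each of 4 simulated workers gets an equal shard of the batch.
shards = zip(np.split(X, 4), np.split(y, 4))
local_grads = [gradient(Xs, ys, w) for Xs, ys in shards]

# All-reduce (here: a plain average) gives every worker the same global gradient.
synced_grad = np.mean(local_grads, axis=0)
full_batch_grad = gradient(X, y, w)
print(np.allclose(synced_grad, full_batch_grad))
```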

    Phase 9: Save Checkpoints

    Every few thousand training steps:

    Model



    Checkpoint



    Resume Later

    A checkpoint contains:

    • model weights
    • optimizer state
    • learning rate scheduler
    • training step
    • random number generator state

    Phase 10: Evaluate

    Benchmark the model.

    Common evaluations:

    • MMLU
    • GSM8K
    • HumanEval
    • TruthfulQA
    • HellaSwag
    • ARC
    • GPQA

    Metrics:

    • Perplexity
    • Accuracy
    • Pass@k
    • F1 Score
    • Exact Match
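
    Two of these metrics are short enough to write out. The sketch below uses the standard unbiased pass@k estimator (n samples generated, c of them correct) and perplexity as the exponential of the average negative log-likelihood. The input numbers are made up.

```python
from math import comb, exp, log

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    return exp(-sum(token_log_probs) / len(token_log_probs))

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
print(perplexity([log(0.25)] * 4))
```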

    Phase 11: Instruction Fine-tuning (SFT)

    Pre-trained models complete text but do not reliably follow instructions.

    Train on instruction-response pairs.

    Example:

    Instruction

    Explain recursion



    Response

    Recursion is...

    Datasets:

    • Alpaca
    • Dolly
    • OpenHermes
    • ShareGPT
    • Custom enterprise datasets

    Methods:

    • Full fine-tuning
    • LoRA
    • QLoRA
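
    A NumPy sketch of the idea behind LoRA: freeze the pretrained weight W and train only a low-rank update B·A. The dimensions, rank, and alpha below are illustrative. Libraries such as PEFT wrap this pattern around real model layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01      # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init so training starts at W

def lora_forward(x):
    # Base output plus the scaled low-rank correction
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
trainable = A.size + B.size
frozen = W.size
print(f"trainable: {trainable}, frozen: {frozen}, ratio: {trainable / frozen:.3%}")
```

    QLoRA applies the same update on top of a base model whose frozen weights are stored in 4-bit precision.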

    Phase 12: Preference Tuning

    Improve response quality based on human preferences.

    Typical pipeline:

    Prompt



    Two Responses



    Human chooses better one



    Preference Dataset



    Optimization

    Methods:

    • RLHF
    • DPO
    • GRPO
    • Reinforcement Fine-Tuning (RFT)

    These help produce responses that are more helpful, harmless, and aligned with user intent.
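
    As one concrete instance, the DPO objective for a single preference pair can be written in a few lines. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The numbers are made up and beta = 0.1 is only a common default.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy to reference."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))

no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
prefers_chosen = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(no_preference, prefers_chosen)
```

    The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, without a separate reward model.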


    Phase 13: Safety Alignment

    Reduce harmful or unsafe outputs.

    Examples:

    • jailbreak resistance
    • toxicity reduction
    • hallucination mitigation
    • refusal behavior
    • bias evaluation

    Phase 14: Quantization

    Reduce model size.

    FP32



    FP16



    INT8



    INT4

    Benefits:

    • Lower GPU memory
    • Faster inference
    • Lower cost

    Phase 15: Inference Optimization

    Production serving includes:

    Model



    vLLM



    Continuous Batching



    KV Cache



    Streaming



    REST API

    Optimizations include:

    • PagedAttention
    • FlashAttention
    • Continuous batching
    • Speculative decoding
    • Tensor parallelism
    • KV cache management

    Phase 16: Deploy

    Typical architecture:

              Client





    API Gateway





    Load Balancer





    vLLM / SGLang Cluster





    GPU Servers





    Monitoring & Logs

    Technologies Used

    | Stage | Common Tools |
    |-------|--------------|
    | Data Collection | Common Crawl, Wikipedia dumps, GitHub archives |
    | Data Processing | Apache Spark, Ray, Python |
    | Tokenizer | SentencePiece, Hugging Face Tokenizers |
    | Training | PyTorch |
    | Distributed Training | DeepSpeed, Megatron-LM, FSDP |
    | Experiment Tracking | Weights & Biases, MLflow |
    | Storage | S3, HDFS |
    | Fine-tuning | PEFT, LoRA, QLoRA |
    | Evaluation | lm-evaluation-harness, custom benchmarks |
    | Inference | vLLM, SGLang, TensorRT-LLM, TGI |
    | Deployment | Docker, Kubernetes, NVIDIA GPUs |

    Skills Needed to Build an LLM

    1. Mathematics
      • Linear Algebra
      • Calculus
      • Probability
      • Statistics
    2. Machine Learning
      • Gradient Descent
      • Backpropagation
      • Loss Functions
      • Optimization
    3. Deep Learning
      • Neural Networks
      • Attention Mechanism
      • Transformers
      • Positional Embeddings
    4. Distributed Systems
      • Multi-GPU training
      • Parallelism strategies
      • High-speed networking (e.g., NVLink, InfiniBand)
    5. GPU Programming
      • CUDA basics
      • GPU memory hierarchy
      • Kernel optimization
    6. MLOps
      • Model versioning
      • Experiment tracking
      • Deployment
      • Monitoring

    Learning Roadmap

    If your goal is to build an LLM yourself rather than just use one, a practical progression is:

    1. Build a character-level language model from scratch.
    2. Implement a Transformer in PyTorch.
    3. Train a small GPT (50M–150M parameters) on a public dataset.
    4. Learn distributed training with multiple GPUs.
    5. Fine-tune an open-weight model such as Llama or Qwen using LoRA/QLoRA.
    6. Serve it efficiently with vLLM or SGLang.
    7. Build a complete chat application with retrieval, tool calling, monitoring, and scalable deployment.

    This path teaches the same core concepts used to build production LLM systems, while remaining feasible on accessible hardware before scaling up to larger models.



    High-Level AI Engineering Roles

    | Phase | Responsible Area | Typical Engineer |
    |-------|------------------|------------------|
    | Data Collection | Data pipelines | Data Engineer |
    | Data Cleaning | Data processing | Data Engineer |
    | Tokenization | Tokenizer development | ML Engineer / Research Engineer |
    | Model Architecture | Transformer design | AI Research Scientist |
    | Pre-training | Training algorithms | Research Scientist |
    | Distributed Training | Multi-GPU optimization | Training Systems Engineer |
    | Gradient Accumulation | Memory optimization | Training Systems Engineer |
    | Mixed Precision | Training optimization | Training Systems Engineer |
    | Checkpointing | Fault tolerance | Training Systems Engineer |
    | Fine-tuning | Model adaptation | ML Engineer |
    | RLHF/DPO | Alignment | Alignment Engineer |
    | Quantization | Compression | ML Systems Engineer |
    | Inference | Serving optimization | Inference Engineer |
    | Deployment | Production infrastructure | MLOps Engineer |

      

    Training Engineering

    Inference Engineering focuses on making models serve requests efficiently, while Training Engineering focuses on making models train efficiently.

    Typical responsibilities include:

    Memory Optimization

    • Gradient Accumulation
    • Gradient Checkpointing (also called Activation Checkpointing)
    • CPU Offloading
    • ZeRO Optimization
    • Optimizer State Sharding

    Precision Optimization

    • FP32
    • BF16
    • FP16
    • Mixed Precision Training
    • Dynamic Loss Scaling

    Parallel Training

    • Data Parallelism
    • Tensor Parallelism
    • Pipeline Parallelism
    • Sequence Parallelism
    • Expert Parallelism (MoE)

    Distributed Communication

    • NCCL
    • AllReduce
    • ReduceScatter
    • AllGather
    • Broadcast

    Optimizer Engineering

    • AdamW
    • Fused Adam
    • 8-bit Adam
    • Lion
    • LAMB

    Memory Management

    Managing:

    Weights



    Gradients



    Optimizer States



    Activations



    KV Cache (only when the training loop generates text, e.g. RLHF rollouts)

    Complete AI Systems Engineering Pipeline

                    Data Engineering


    Training Systems Engineering
    (Gradient Accumulation, ZeRO, FSDP)


    AI Research Engineering
    (Transformer, Attention, Losses)


    Alignment Engineering
    (RLHF, DPO, Safety, Preference)


    Model Compression Engineering
    (Quantization, Pruning, Distillation)


    Inference Engineering
    (vLLM, SGLang, FlashAttention, KV Cache)


    MLOps / Platform Engineering
    (Deployment, Autoscaling, Monitoring)

    Common Optimizations by Domain

    | Training Systems Engineering | Inference Engineering |
    |------------------------------|-----------------------|
    | Gradient Accumulation | Continuous Batching |
    | Gradient Checkpointing | KV Cache |
    | Mixed Precision Training | Quantization |
    | ZeRO Optimizer | PagedAttention |
    | FSDP | FlashAttention |
    | DeepSpeed | vLLM |
    | Megatron-LM | SGLang |
    | Activation Checkpointing | Speculative Decoding |
    | Distributed Optimizers | TensorRT-LLM |
    | Optimizer Sharding | Streaming Tokens |

    Popular Frameworks Used by Training Systems Engineers

    | Category | Common Frameworks |
    |----------|-------------------|
    | Distributed Training | PyTorch Distributed (DDP), FSDP, DeepSpeed, Megatron-LM |
    | Memory Optimization | DeepSpeed ZeRO, Activation Checkpointing, Gradient Checkpointing |
    | Precision | Automatic Mixed Precision (AMP), BF16, FP16 |
    | Communication | NCCL, Gloo |
    | Experiment Tracking | Weights & Biases, MLflow |
    | Cluster Scheduling | Kubernetes, Slurm, Ray |

     
