Table of Contents
- Fundamentals of AI and ML
- Generative AI Concepts
- Foundation Models
- Prompt Engineering
- Responsible AI
- AWS AI Services
- AWS Generative AI Services
- Security and Compliance
- AI Use Cases
- Exam Tips
1. Fundamentals of AI and ML
Artificial Intelligence (AI)
Machines performing tasks that normally require human intelligence.
Machine Learning (ML)
Subset of AI where systems learn patterns from data.
Deep Learning
Uses neural networks with multiple layers.
Generative AI
Creates new content such as:
Training vs Inference
| Term | Meaning |
|---|---|
| Training | Model learns from data |
| Inference | Model makes predictions using learned knowledge |
2. Generative AI Concepts
Large Language Model (LLM)
Examples:
- OpenAI GPT models
- Anthropic Claude
- Meta Llama
Tokens
Text is broken into small units called tokens.
Example:
May become:
Hallucination
Model generates incorrect information while sounding confident.
Context Window
Amount of information an LLM can consider at once.
3. Foundation Models
Foundation Model (FM)
Large pretrained model that can be adapted for many tasks.
Examples:
- Text generation
- Summarization
- Classification
- Translation
- Chatbots
Multi Modality
Fine Tuning
Further training a pretrained model on domain-specific data.
Retrieval Augmented Generation (RAG)
Instead of retraining:
- Retrieve documents
- Send documents to LLM
- Generate response
Benefits:
- Lower cost
- More current data
- Reduced hallucinations
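The retrieve, send, and generate steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, and `retrieve` and `build_rag_prompt` are hypothetical helper names chosen for this example. The final call to an LLM is left out.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real system uses an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(query: str, context: list[str]) -> str:
    """Step 2: place the retrieved documents in the prompt sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

docs = [
    "Amazon Bedrock provides access to foundation models.",
    "Amazon Polly converts text to speech.",
    "Amazon Textract extracts text from documents.",
]
question = "Which service converts text to speech?"
top_docs = retrieve(question, docs, k=1)
prompt = build_rag_prompt(question, top_docs)  # Step 3 would send this to an LLM
```

Because the model answers from retrieved text rather than from its weights alone, updating the document store is enough to refresh its knowledge.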
RAG Pipeline: Model Types Used at Each Step
| RAG Step | Purpose | Model Type | Example Models |
|---|---|---|---|
| 1. Document Ingestion | Read PDFs, DOCX, HTML, Images | OCR / Document AI | Tesseract, LayoutLM, Donut |
| 2. Chunking | Split documents into passages | Rule-based / NLP | Sentence Splitter, Recursive Text Splitter |
| 3. Text → Embeddings | Convert chunks into vectors | Embedding Model (Encoder-only Transformer) | BERT, Sentence-BERT, E5, BGE |
| 4. Vector Storage | Store embeddings | Vector Database | FAISS, Milvus, Weaviate, Pinecone |
| 5. Query → Embedding | Convert user query to vector | Same Embedding Model | BGE, E5, SBERT |
| 6. Retrieval | Find nearest chunks | ANN Search Algorithm | HNSW, IVF, Flat Search |
| 7. Re-ranking (Optional) | Improve retrieved results | Cross Encoder | MonoBERT, Cohere Rerank, BGE Reranker |
| 8. Context Construction | Build prompt with retrieved chunks | Prompt Builder | Template Engine |
| 9. Answer Generation | Generate final answer | Decoder-only LLM | GPT-4o, Claude Sonnet, Llama 3, Mistral |
| 10. Citation Generation (Optional) | Show sources | LLM / Metadata Layer | GPT-4o, Claude |
Transformer Architecture Used at Each Step
| Step | Transformer Type |
|---|---|
| Embedding Generation | Encoder-only |
| Re-ranking | Encoder-only (Cross Encoder) |
| Answer Generation | Decoder-only |
| Translation (optional) | Encoder-Decoder |
| Summarization (optional) | Encoder-Decoder |
| OCR Understanding | Encoder or Encoder-Decoder |
| Multimodal RAG | Vision Encoder + LLM Decoder |
Common Models by Transformer Family
| Transformer Family | Example Models | Used For |
|---|---|---|
| Encoder-only | BERT, RoBERTa, SBERT, E5, BGE | Embeddings, Retrieval |
| Decoder-only | GPT, Llama, Claude, Mistral, Qwen | Generation |
| Encoder-Decoder | T5, FLAN-T5, BART | Summarization, Translation |
| Vision Encoder | ViT, CLIP Vision Encoder | Image Embeddings |
| Vision-Language | LLaVA, Qwen-VL, GPT-4o | Multimodal RAG |
Typical Modern RAG Stack
| Layer | Common Choice |
|---|---|
| Chunking | LangChain Recursive Splitter |
| Embeddings | BGE-large, E5-large |
| Vector DB | FAISS, Milvus |
| Retrieval | HNSW |
| Re-ranker | BGE-Reranker |
| Generator | GPT-4o, Claude, Llama 3 |
| Orchestration | LangChain, LlamaIndex |
Mental Model
Mem0
- Before generation: Retrieves relevant user preferences, past interactions, and long-term memory.
- After generation: Extracts new important facts from the conversation and stores them for future use.
- Difference from RAG: RAG retrieves knowledge from documents, while Mem0 retrieves knowledge about the user or previous interactions. Both are complementary and are typically combined before prompt construction.
Hybrid Search improves retrieval recall (finding the right documents).
HyDE improves query understanding (making difficult or ambiguous queries easier to retrieve).
| Task | Model Type |
|---|---|
| Create Embeddings | Encoder-only |
| Retrieve Documents | Vector Search |
| Re-rank Results | Cross Encoder |
| Generate Answer | Decoder-only LLM |
| Summarize Documents | Encoder-Decoder |
| Multimodal Retrieval | CLIP / Vision Encoder |
| Multimodal Generation | GPT-4o / Gemini / Qwen-VL |
4. Prompt Engineering
Zero-Shot Prompting
One-Shot Prompting
Provide one example.
Few-Shot Prompting
Provide multiple examples.
Chain of Thought
Ask model to reason step-by-step.
Prompt Components
- Role
- Context
- Instructions
- Examples
- Constraints
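The prompt components above can be assembled programmatically. The sketch below is one possible layout, not a required format: `build_prompt` and its labels ("Role:", "Example input:", and so on) are assumptions made for illustration. Passing zero, one, or several examples yields zero-shot, one-shot, or few-shot prompts respectively.

```python
def build_prompt(role, context, instructions, examples, constraints, user_input):
    """Assemble a prompt from the five components plus the user's input.

    `examples` is a list of (input, output) pairs:
    empty -> zero-shot, one pair -> one-shot, several pairs -> few-shot.
    """
    parts = [
        f"Role: {role}",
        f"Context: {context}",
        f"Instructions: {instructions}",
    ]
    for example_input, example_output in examples:
        parts.append(f"Example input: {example_input}\nExample output: {example_output}")
    parts.append(f"Constraints: {constraints}")
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)

zero_shot = build_prompt(
    role="You are a support assistant.",
    context="Customers write short product reviews.",
    instructions="Classify the sentiment of the review.",
    examples=[],
    constraints="Answer with one word: positive or negative.",
    user_input="The battery died after a day.",
)

few_shot = build_prompt(
    role="You are a support assistant.",
    context="Customers write short product reviews.",
    instructions="Classify the sentiment of the review.",
    examples=[("Love it!", "positive"), ("Broke in a week.", "negative")],
    constraints="Answer with one word: positive or negative.",
    user_input="The battery died after a day.",
)
```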
LLM Architecture
Large Language Model (LLM) Architecture
At a high level, an LLM is a Transformer-based neural network that converts input text into tokens, processes them through multiple Transformer blocks, and predicts the next token repeatedly to generate text.
Components of an LLM
| Component | Purpose |
|---|---|
| Tokenizer | Converts text into tokens |
| Embedding Layer | Converts token IDs into dense vectors |
| Positional Encoding | Gives the model information about token order |
| Transformer Decoder Blocks | Learns relationships between tokens |
| Multi-Head Self-Attention | Determines which words are important |
| Feed Forward Network (FFN) | Learns complex patterns |
| Residual Connections | Prevent information loss |
| Layer Normalization | Stabilizes training |
| Output Linear Layer | Maps hidden vectors to vocabulary logits |
| Softmax | Converts logits into probabilities |
| Sampling Strategy | Chooses the next token |
Step 1: Tokenization
The tokenizer splits text into tokens.
Example:
Input
Tokenizer
Convert to IDs
The neural network only understands numbers.
Step 2: Embedding Layer
Each token ID is mapped to a dense vector.
Example
Vocabulary
Embedding
Instead of one integer,
every token becomes a high-dimensional vector.
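Steps 1 and 2 can be illustrated with a toy example. This sketch assumes a whitespace tokenizer and a randomly initialized embedding table with only 4 dimensions. Real LLMs use subword tokenizers and learned embeddings with thousands of dimensions, so the sentence and every value below are illustrative only.

```python
import random

text = "the cat sat on the mat"

# Step 1: tokenize (toy whitespace split) and map each token to an integer ID.
tokens = text.split()
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# Step 2: look up one dense vector per token ID in the embedding table.
random.seed(0)
d_model = 4  # tiny for readability
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in vocab
]
vectors = [embedding_table[token_id] for token_id in token_ids]

print(token_ids)     # one integer per token
print(vectors[0])    # the vector for "the"
```

Both occurrences of "the" share the same ID and therefore the same vector, which is why positional information has to be added separately.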
Typical embedding size
| Model | Embedding Dimension |
|---|---|
| GPT-2 Small | 768 |
| Llama 7B | 4096 |
| GPT-3 | 12288 |
Step 3: Positional Encoding
Attention alone has no concept of sequence order.
Without positional information,
these two sentences appear identical.
Positional embeddings add information like:
Final embedding
Step 4: Transformer Decoder Blocks
This is where most computation happens.
Modern LLMs stack many identical decoder blocks.
Examples:
| Model | Decoder Blocks |
|---|---|
| GPT-2 Small | 12 |
| GPT-3 175B | 96 |
| Llama 3 8B | 32 |
| DeepSeek-R1 671B | 61 (most use Mixture-of-Experts feed-forward layers) |
Each block contains:
Step 5: Multi-Head Self-Attention
This is the core innovation of Transformers.
Every word looks at every previous word.
Sentence
When processing "it"
attention may focus on
The model learns that "it" refers to animal.
Query, Key and Value
Each token produces three vectors.
Meaning
| Vector | Purpose |
|---|---|
| Query | What am I looking for? |
| Key | What information do I contain? |
| Value | Information to pass forward |
Attention score
Formula
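The standard scaled dot-product attention from "Attention Is All You Need" is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d_k}\right) V$, where $d_k$ is the key dimension. The sketch below implements it in plain Python with a causal mask, so each token attends only to itself and earlier tokens. The function names and the tiny input matrices are assumptions for illustration. Real implementations use batched tensor operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where token i sees only positions 0..i."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, query in enumerate(Q):
        # Attention scores: dot(query, key) / sqrt(d_k) for each visible key.
        scores = [
            sum(q * k for q, k in zip(query, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output: weighted sum of the visible value vectors.
        output = [
            sum(weights[j] * V[j][c] for j in range(i + 1))
            for c in range(len(V[0]))
        ]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(Q, K, V)
```

The first token can only attend to itself, so its attention weight is 1 and its output equals its own value vector.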
Step 6: Multi-Head Attention
Instead of one attention calculation,
multiple attention heads run in parallel.
Example
Outputs are concatenated.
Step 7: Feed Forward Network (FFN)
After attention,
every token passes through the same small neural network independently.
Typical structure
Purpose
- Learn nonlinear patterns
- Transform features
- Increase model capacity
Step 8: Residual Connections
Instead of replacing the input,
the block adds the original input back.
Benefits
- Prevents vanishing gradients
- Preserves information
- Enables very deep networks
Step 9: Layer Normalization
Keeps activations stable during training.
Without it,
training becomes unstable as models get deeper.
Step 10: Output Projection
Final hidden vector
↓
Linear layer
↓
Vocabulary size
Example vocabulary
50,000 tokens
Output
These are logits (unnormalized scores).
Step 11: Softmax
Softmax converts logits into probabilities.
Example
Probabilities sum to 1.
Step 12: Token Sampling
The next token is selected.
Methods include:
| Method | Description |
|---|---|
| Greedy | Choose the highest probability token |
| Beam Search | Explore multiple candidate sequences simultaneously |
| Top-k Sampling | Sample only from the top k most probable tokens |
| Top-p (Nucleus) Sampling | Sample from the smallest set of tokens whose cumulative probability exceeds p |
| Temperature Sampling | Adjust randomness by scaling logits before Softmax |
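The sampling methods in the table can be sketched in a few lines of plain Python (beam search is omitted). The logits below are made-up values and the helper names are assumptions for this example. Serving frameworks implement the same ideas on tensors.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature scaling followed by softmax; lower temperature is sharper."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]          # illustrative values
probs = softmax(logits)
greedy_token = max(range(len(probs)), key=lambda i: probs[i])
top2 = top_k_filter(probs, 2)
nucleus = top_p_filter(probs, 0.5)
sampled_token = sample(top2, random.Random(0))
```

Greedy decoding is deterministic, while the filtered distributions trade a little likelihood for more varied output.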
Generated token
↓
Append to prompt
↓
Run the decoder again
↓
Predict the next token
This repeats until an end-of-sequence token or another stopping condition is reached.
Why is it called a Decoder-only Transformer?
The original Transformer introduced in the paper "Attention Is All You Need" had two parts: an encoder stack and a decoder stack.
LLMs such as GPT, Llama, Qwen, and DeepSeek use only the decoder stack with masked (causal) self-attention, so each token can attend only to itself and previous tokens. This enables autoregressive next-token prediction.
End-to-End Example
Input:
Processing:
Architecture Summary
| Component | Function |
|---|---|
| Tokenizer | Converts text to token IDs |
| Embedding Layer | Converts token IDs to dense vectors |
| Positional Embedding | Encodes token order |
| Masked Multi-Head Self-Attention | Captures relationships with previous tokens |
| Feed Forward Network | Learns nonlinear transformations |
| Residual Connections | Preserve information and improve gradient flow |
| Layer Normalization | Stabilizes training |
| Linear Output Layer | Projects hidden states to vocabulary logits |
| Softmax | Produces probabilities over the vocabulary |
| Sampling | Selects the next token for generation |
Modern LLMs extend this core architecture with optimizations such as Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) or Multi-Query Attention (MQA), Mixture of Experts (MoE), FlashAttention, KV caching, and quantization, but the fundamental decoder-only Transformer pipeline remains the same.
Training
5. Responsible AI
Fairness
Avoid bias.
Explainability
Understand why model produced output.
Privacy
Protect user data.
Robustness
Model behaves reliably.
Transparency
Users know AI is involved.
Evaluation
| Model Group | Evaluation | Description of the Evaluation |
|---|---|---|
| Agentic Models | AgentBench | Evaluates autonomous task execution, planning, tool usage, and multi-step reasoning. |
| Agentic Models | GAIA | Tests real-world assistant capabilities like searching, tool calling, and reasoning. |
| Agentic Models | SWE-bench | Measures ability to solve real GitHub issues by editing codebases. |
| Bi-Encoder | BEIR | Evaluates embedding-based retrieval across multiple datasets and domains. |
| Bi-Encoder | MTEB | Measures embedding quality across retrieval, clustering, classification, and reranking tasks. |
| Bi-Encoder | MS MARCO | Evaluates dense retrieval and passage ranking performance. |
| Cross-Encoder | BEIR | Measures pairwise query-document relevance scoring quality. |
| Cross-Encoder | MS MARCO | Evaluates reranking precision for query-passage relevance. |
| Cross-Encoder | TREC Deep Learning Track | Measures ranking quality for search relevance tasks. |
| Decoder-only | GSM8K | Evaluates arithmetic and multi-step mathematical reasoning. |
| Decoder-only | HellaSwag | Measures commonsense reasoning through sentence completion. |
| Decoder-only | HumanEval | Evaluates code generation correctness using executable unit tests. |
| Decoder-only | MMLU | Measures broad knowledge and reasoning across many academic domains. |
| Decoder-only | MT-Bench | Tests instruction following and conversational quality. |
| Decoder-only | Needle-in-a-Haystack | Measures ability to retrieve specific information from long contexts. |
| Decoder-only | TruthfulQA | Tests factual consistency and resistance to hallucinations. |
| Encoder-decoder | BLEU | Measures overlap between generated text and reference text, mainly for translation. |
| Encoder-decoder | ROUGE | Evaluates summarization quality based on n-gram overlap. |
| Encoder-decoder | SQuAD | Measures extractive question-answering accuracy. |
| Encoder-only | GLUE | Evaluates language understanding tasks like sentiment, entailment, and similarity. |
| Encoder-only | MTEB | Measures embedding performance across multiple NLP tasks. |
| Encoder-only | STS-B | Measures how well embeddings capture sentence similarity. |
| Encoder-only | SuperGLUE | Harder version of GLUE for advanced reasoning tasks. |
| Long-context Models | InfiniteBench | Evaluates memory retention and reasoning over very long contexts. |
| Long-context Models | LongBench | Tests summarization, retrieval, and reasoning on long documents. |
| Long-context Models | Needle-in-a-Haystack | Measures retrieval accuracy from large contexts. |
| Multimodal Models | MMBench | Evaluates image understanding and multimodal reasoning. |
| Multimodal Models | MMMU | Measures multimodal reasoning across academic and professional domains. |
| Multimodal Models | MMVet | Tests advanced visual reasoning and perception. |
| RAG Systems | CRUD-RAG | Measures retrieval robustness and update handling in RAG pipelines. |
| RAG Systems | RAGAS | Evaluates faithfulness, context precision, context recall, and answer relevance in RAG. |
| Reward Models | RewardBench | Evaluates preference model quality and alignment performance. |
| Tool-use Models | ToolBench | Measures correctness in tool selection, API usage, and tool chaining. |
6. AWS AI Services
Amazon Rekognition
- Image analysis
- Face detection
- Object detection
Amazon Comprehend
- Sentiment analysis
- Entity extraction
- Language detection
Amazon Textract
- Extract text from documents
AWS DeepRacer
7. AWS Generative AI Services
Amazon Bedrock
Most important service for the exam.
Provides access to foundation models from:
- Anthropic Claude
- Meta Llama
- Amazon Nova
- Stability AI
Features:
- RAG
- Agents
- Knowledge Bases
- Guardrails
- Fine-tuning
Streaming API
- using Websocket
- control buffer size for better UX
Enterprise chatbot over company data.
Developer coding assistant.
Build, train and deploy ML models.
SageMaker Inference
SageMaker Pipelines
- Batch mode saves cost by preventing GPU underutilization.
8. Security and Compliance
Shared Responsibility Model
AWS secures:
- Infrastructure
- Hardware
- Network
Customer secures:
- Data
- Access control
- Configuration
IAM
Identity and access management.
Encryption
Data:
Lake Formation
- Up to column-level security
Adversarial Input
- Evaluate your model periodically with synthetic data to reduce the risk of adversarial inputs causing issues.
9. Common Use Cases
Classification
Sentiment Analysis
Summarization
Chatbots
Code Generation
Document Processing
Frequently Tested Comparisons
| Service | Purpose |
|---|---|
| Bedrock | Generative AI |
| SageMaker AI | Build/train ML models |
| Comprehend | NLP analysis |
| Rekognition | Image analysis |
| Textract | Document extraction |
| Transcribe | Speech to text |
| Polly | Text to speech |
| Translate | Translation |
Tips:
| Topic | Remember |
|---|---|
| Generative AI | Creates new content |
| Foundation Model | Large pretrained model |
| Hallucination | Confident wrong answer |
| RAG | Retrieve + Generate |
| Bedrock | Managed GenAI platform |
| SageMaker | ML lifecycle |
| Guardrails | Safety controls |
| Fine-Tuning | Retrain model |
| Inference | Model prediction |
| Token | Small text unit |
There are several ways to represent the position of tokens in a Transformer. Although people often say "embeddings like RoPE," RoPE is actually a positional encoding/embedding technique, not a semantic embedding like a word embedding.
Below are the major types of positional embeddings/encodings used in Transformers.
| Method | Learnable | Extrapolates to Longer Sequences | Used In |
|---|---|---|---|
| Absolute Positional Embedding (APE) | Yes | ❌ No | GPT-2, BERT |
| Sinusoidal Positional Encoding | No | ✅ Yes | Original Transformer |
| Relative Positional Embedding (RPE) | Usually Yes | Better than APE | T5, Transformer-XL |
| Rotary Positional Embedding (RoPE) | No | Limited on its own (extended by scaling methods such as NTK-aware scaling and YaRN) | Llama, Qwen, DeepSeek, GPT-NeoX |
| ALiBi (Attention with Linear Biases) | No | ✅ Excellent | BLOOM, MPT |
| xPos | No | ✅ Better than RoPE | Long-context models |
| Dynamic RoPE | No | ✅ Improved long-context support | Some Llama variants |
| NTK-aware RoPE Scaling | No | ✅ Extends context window | Llama long-context adaptations |
| YaRN (Yet another RoPE extensioN) | No | ✅ Very good | Long-context fine-tuned LLMs |
| LeX (Length Extrapolation) | Varies | ✅ Designed for long context | Research models |
1. Absolute Positional Embedding (APE)
Each position has its own learnable vector.
Example
Final input
Advantages
- Simple
- Learns position information
Disadvantages
- Cannot naturally handle sequences longer than those seen during training
- Every position needs its own learned vector
Used in:
2. Sinusoidal Positional Encoding
Introduced in the original Transformer paper.
Uses sine and cosine functions.
Formula
Advantages
- No training required
- Infinite positions can be computed
- Generalizes better to unseen sequence lengths
Disadvantages
- Less expressive than learned approaches
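The encoding from the original Transformer paper is $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$ and $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$. A minimal sketch follows. The function name and the small dimensions used here are assumptions for illustration.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Return the sinusoidal positional encoding vector for one position.

    Even indices hold sine values and odd indices hold cosine values,
    with wavelengths that grow geometrically across the dimensions.
    """
    encoding = []
    for even_index in range(0, d_model, 2):
        # even_index plays the role of 2i in the formula.
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

first = sinusoidal_encoding(0, 4)
later = sinusoidal_encoding(5, 8)
```

Because the values are computed rather than learned, any position can be encoded, including positions never seen during training.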
Used in:
3. Relative Positional Embedding (RPE)
Instead of absolute positions,
the model learns the relative distance.
Example
The model learns
instead of
Advantages
- Better captures local relationships
- Handles varying sequence lengths more naturally
Used in:
- T5
- Transformer-XL
- DeBERTa (with variants)
4. RoPE (Rotary Positional Embedding)
The most popular method in modern LLMs.
Instead of adding a positional vector,
RoPE rotates the Query and Key vectors by an angle based on position.
Advantages
- Good long-context behavior, especially when combined with scaling extensions
- Naturally preserves relative position information
- No additional learned parameters
Used in:
- Llama
- Qwen
- DeepSeek
- GPT-NeoX
- Mistral
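The rotation idea can be sketched as follows: each consecutive pair of dimensions in a Query or Key vector is rotated by an angle proportional to the token's position. This is a simplified illustration that assumes the common base of 10000 and an even vector length. The vectors and function names are made up for the example. Note that implementations differ in how they pair dimensions.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each (even, odd) pair of dimensions by a position-dependent angle."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # lower dimensions rotate faster
        x, y = vector[i], vector[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative distance (2) at different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_further_on = dot(rope(q, 12), rope(k, 10))
```

The two scores match because the dot product of two rotated vectors depends only on the difference between their positions, which is how RoPE preserves relative position information without learned parameters.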
5. ALiBi (Attention with Linear Biases)
Instead of embeddings,
add a linear bias directly to attention scores.
Farther tokens receive a progressively larger penalty.
Advantages
- Extremely simple
- Strong length extrapolation
- No positional vectors
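A small sketch of the bias matrix follows. It assumes the geometric slope schedule described in the ALiBi paper for a power-of-two number of heads, and it folds the causal mask into the same matrix for brevity (in practice the mask is usually applied separately). Function names are hypothetical.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8/n_heads)."""
    return [2 ** (-8 * (head + 1) / n_heads) for head in range(n_heads)]

def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to attention scores: -slope * distance to each earlier token.

    Future positions are set to -inf so they receive zero attention weight.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
```

The penalty grows linearly with distance, so the same rule applies unchanged to sequences longer than those seen in training.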
Used in:
6. xPos
An extension of RoPE.
Designed to improve long-sequence stability.
Instead of using a fixed rotation,
it rescales rotations.
Advantages
- Better than RoPE for very long contexts
- Preserves attention stability
7. Dynamic RoPE
Standard RoPE uses fixed frequencies.
Dynamic RoPE adjusts frequencies dynamically for longer contexts.
Advantages
- Better context extrapolation
- Improved performance on long documents
8. NTK-aware RoPE Scaling
Originally developed for extending Llama's context window.
Idea
Stretch the RoPE frequencies.
Example
No retraining is required in many implementations.
Advantages
- Very popular
- Simple
- Enables longer contexts
9. YaRN (Yet another RoPE extensioN)
Improves on NTK scaling.
Combines interpolation and scaling techniques.
Advantages
- Better long-context quality
- Less degradation than naive scaling
- Widely used for 128K+ context extensions
10. LeX (Length Extrapolation)
Research methods that explicitly optimize for longer contexts.
Goal
Train models that naturally generalize beyond the training context length.
Which models use which?
| Model | Positional Method |
|---|---|
| GPT-2 | Absolute Positional Embedding |
| BERT | Absolute Positional Embedding |
| Original Transformer | Sinusoidal Encoding |
| Transformer-XL | Relative Positional Encoding |
| T5 | Relative Positional Encoding |
| DeBERTa | Relative Position Bias |
| GPT-NeoX | RoPE |
| Llama 1/2/3 | RoPE |
| Qwen | RoPE |
| DeepSeek | RoPE |
| Mistral | RoPE |
| BLOOM | ALiBi |
| MPT | ALiBi |
Other Types of Embeddings in LLMs
In addition to positional embeddings, LLMs use several other embedding types.
| Embedding Type | Purpose |
|---|---|
| Token Embedding | Represents each token as a dense vector |
| Positional Embedding/Encoding | Represents token order (e.g., RoPE, ALiBi, APE) |
| Segment (Token Type) Embedding | Distinguishes sentence A from sentence B (used in BERT) |
| Word Embedding | Maps words/subwords to vectors |
| Character Embedding | Represents characters instead of words |
| Sentence Embedding | Represents an entire sentence with a single vector |
| Document Embedding | Represents an entire document |
| Instruction Embedding | Encodes task or instruction information in some architectures |
| Multimodal Embedding | Maps text, images, audio, etc., into a shared embedding space |
Summary
| Method | Basic Idea | Best For | Limitation |
|---|---|---|---|
| Absolute Positional Embedding | Add a learned vector for each position | Short, fixed-length sequences | Cannot extrapolate well |
| Sinusoidal Encoding | Use deterministic sine/cosine functions | General sequence modeling | Less expressive than learned methods |
| Relative Positional Embedding | Encode distances between tokens | Better relative reasoning | More complex attention computation |
| RoPE | Rotate Query/Key vectors based on position | Modern LLMs with long context | Standard RoPE still has finite context limits |
| ALiBi | Add a linear distance bias to attention scores | Efficient long-context inference | May underperform RoPE on some benchmarks |
| xPos / Dynamic RoPE / NTK Scaling / YaRN | Variants that improve or extend RoPE | Very long-context LLMs (32K–1M+ tokens) | Additional implementation complexity |
Today, RoPE has become the de facto standard for decoder-only LLMs (Llama, Qwen, Mistral, DeepSeek), while ALiBi is valued for its simplicity and excellent extrapolation, and RoPE extensions such as NTK-aware scaling and YaRN are commonly used to extend context windows without retraining the entire model.
Inference Engineering
Inference Engineering is the discipline of designing, optimizing, deploying, and scaling Large Language Models (LLMs) for efficient inference (prediction) in production environments.
While training focuses on making a model smarter, inference engineering focuses on making the model faster, cheaper, more scalable, and capable of serving millions of requests.
Why do we need Inference Engineering?
A trained model is usually huge.
Example:
| Model | Parameters | FP16 Memory |
|---|---|---|
| Llama 3 8B | 8 Billion | ~16 GB |
| Llama 3 70B | 70 Billion | ~140 GB |
| DeepSeek R1 671B | 671 Billion | >1.3 TB |
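The FP16 column follows from simple arithmetic: each parameter takes 2 bytes, so memory in GB is roughly the parameter count in billions times 2. The sketch below covers weights only and ignores the KV cache, activations, and runtime overhead. The helper name and the byte sizes for other precisions are included as assumptions for comparison.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

llama_8b = weight_memory_gb(8_000_000_000)        # 16.0 GB
llama_70b = weight_memory_gb(70_000_000_000)      # 140.0 GB
deepseek_r1 = weight_memory_gb(671_000_000_000)   # 1342.0 GB, about 1.3 TB
```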
Serving these models directly causes problems:
- High GPU memory usage
- Slow response time
- High latency
- Low throughput
- Expensive GPUs
- Poor utilization
Inference engineering solves these issues.
ML Lifecycle
Inference engineering starts after training is complete.
Responsibilities of an Inference Engineer
An inference engineer works on:
- Loading huge models efficiently
- Reducing GPU memory
- Optimizing attention computation
- Continuous batching
- KV Cache optimization
- Speculative decoding
- Quantization
- Tensor parallelism
- Pipeline parallelism
- Multi-GPU serving
- Autoscaling
- API serving
- Streaming tokens
- Monitoring GPU utilization
Architecture
Problems During Inference
1. Large Model Size
Example
Solution
- Quantization
- Tensor Parallelism
2. Slow Token Generation
Generating
One token at a time is expensive.
Solution
- KV Cache
- Flash Attention
- Speculative Decoding
3. Multiple Users
Imagine
Without batching
GPU remains underutilized.
Solution
Continuous batching.
Major Components
1. Model Loading
Instead of
Inference servers:
- Lazy loading
- Memory mapping
- Sharded checkpoints
2. Scheduler
The scheduler decides
3. KV Cache
Without cache
For every token
Entire sequence is recomputed.
With KV Cache
Huge speedup.
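The size of the saving can be estimated by counting how many key/value projections are computed during generation. This is a simplified cost model with hypothetical function names, not a measurement of any real system.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every decoding step recomputes keys/values for the whole sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))

def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step computes one new key/value."""
    return prompt_len + max(new_tokens - 1, 0)

without_cache = kv_projections_without_cache(prompt_len=100, new_tokens=50)
with_cache = kv_projections_with_cache(prompt_len=100, new_tokens=50)
print(without_cache, with_cache)
```

Without the cache the work grows quadratically with the number of generated tokens, while with the cache it grows linearly. The cost is extra GPU memory to hold the cached keys and values.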
4. Continuous Batching
Traditional batching
Bad because new users wait.
Continuous batching
Much better GPU utilization.
5. FlashAttention
Normal attention
Consumes enormous memory.
FlashAttention
- Tiles computation
- Uses fast on-chip SRAM (GPU shared memory)
- Avoids writing large intermediate matrices to HBM
- Fuses multiple attention operations into one GPU kernel
Benefits:
- Lower memory usage
- Higher throughput
- Faster inference, especially for long sequences
6. Quantization
Original weights
Benefits
- Smaller model
- Lower VRAM
- Faster inference
- Slight accuracy tradeoff
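A minimal sketch of symmetric INT8 quantization shows where both the memory saving and the accuracy tradeoff come from. It assumes one scale per tensor and made-up weights. Production schemes such as per-channel or group-wise quantization are more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]           # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Each weight now fits in one byte instead of two or four, and the rounding error per weight is bounded by half the scale.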
7. Tensor Parallelism
Suppose
Split across GPUs
Both GPUs compute simultaneously.
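The idea can be sketched with a column-wise split of one weight matrix. Here two Python lists stand in for two GPUs, and the final concatenation stands in for the all-gather communication step. The matrix values and helper names are assumptions for illustration.

```python
def matvec(x, W):
    """Compute x @ W for a vector x and a matrix W stored as a list of rows."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x)))
        for j in range(len(W[0]))
    ]

def split_columns(W, parts):
    """Split W into `parts` shards, each holding a contiguous block of columns."""
    step = len(W[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

x = [1.0, 2.0]
W = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]

shards = split_columns(W, parts=2)                  # one shard per "GPU"
partials = [matvec(x, shard) for shard in shards]   # computed independently
combined = [value for part in partials for value in part]  # gather results
```

Each device stores and multiplies only half of the matrix, and the combined result matches the single-device computation.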
8. Pipeline Parallelism
Instead of splitting tensors
Split layers.
9. Speculative Decoding
Use
Small model
↓
Guess tokens
↓
Large model verifies
↓
Accept if correct
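The flow above can be sketched for the greedy case. In this simplified version the draft and target "models" are plain functions that return the next token for a given context (both are toy stand-ins), and verification is written as a loop for clarity. Real systems verify all drafted tokens in a single forward pass of the large model, and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens with the small model, then keep the prefix the large model agrees with."""
    # 1. Small model guesses k tokens.
    proposed, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposed.append(token)
        context.append(token)

    # 2. Large model verifies the guesses in order.
    accepted, context = [], list(prefix)
    for token in proposed:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)   # replace the first wrong guess and stop
            break
        accepted.append(token)
        context.append(token)
    else:
        accepted.append(target_next(context))  # all guesses correct: one bonus token
    return accepted

# Toy models: the target counts modulo 5; the flawed draft errs at context length 3.
target_model = lambda ctx: len(ctx) % 5
flawed_draft = lambda ctx: 9 if len(ctx) == 3 else len(ctx) % 5

partial_accept = speculative_step([0], flawed_draft, target_model)
full_accept = speculative_step([0], target_model, target_model)
```

The output is always what the large model would have generated on its own, so quality is unchanged. The gain comes from producing several tokens per large-model pass when the draft is right.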
HTTP Request
↓
vLLM API Server
↓
Scheduler
↓
Continuous Batch
↓
PagedAttention
↓
GPU
↓
Generated Tokens
Application
↓
SGLang Runtime
↓
Request Scheduler
↓
Structured Generation
↓
vLLM/TensorRT Backend
↓
GPU
Agents:
A2A (Agent-to-Agent) is a communication protocol that allows AI agents built by different frameworks, vendors, or organizations to discover each other, exchange messages, delegate tasks, and collaborate in a standardized way.
Without A2A, each AI agent operates like an isolated application.
Why do we need A2A?
1. Agents are specialized
Instead of building one giant agent that knows everything, organizations build specialized agents.
Example:
- HR Agent
- Finance Agent
- Travel Agent
- Calendar Agent
- Code Generation Agent
Suppose a user asks:
"Book a flight for my business trip and make sure it fits my team's calendar and budget."
No single agent may have all the required capabilities.
Using A2A:
Each agent performs its specialty and returns the result.
2. Different teams build different agents
In large companies:
- HR develops HR agents
- Finance develops Finance agents
- IT develops Infrastructure agents
Each team may use:
- LangGraph
- CrewAI
- AutoGen
- Semantic Kernel
- OpenAI SDK
Without a common protocol, every integration requires custom APIs.
A2A provides a common language for communication.
3. Avoid custom integrations
Without A2A:
Each pair of agents needs custom integration.
If there are N agents, integrations can grow roughly as N × (N - 1) / 2 in the worst case.
With A2A:
Everyone speaks the same protocol.
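The difference in integration effort is easy to quantify. The sketch below compares the worst-case number of pairwise integrations with the number of protocol adapters needed when every agent implements one shared protocol. The function names are hypothetical and the counts are a simplified model.

```python
def point_to_point_integrations(num_agents: int) -> int:
    """Worst case: every pair of agents needs its own custom integration."""
    return num_agents * (num_agents - 1) // 2

def shared_protocol_adapters(num_agents: int) -> int:
    """With a shared protocol, each agent implements the protocol once."""
    return num_agents

for n in (5, 10, 50):
    print(n, point_to_point_integrations(n), shared_protocol_adapters(n))
```

With 10 agents that is 45 custom integrations versus 10 protocol implementations, and the gap widens quadratically as more agents are added.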
4. Agent discovery
Suppose an agent needs legal advice.
Without A2A:
- Hardcode endpoint
- Hardcode authentication
- Hardcode API
With A2A:
The planning agent discovers and uses the legal agent dynamically.
5. Task delegation
A planning agent doesn't need to solve every problem.
Example:
Each task is delegated to the most suitable agent.
6. Multi-vendor interoperability
Imagine:
- Company A builds a Finance Agent.
- Company B builds a Procurement Agent.
- Company C builds a Compliance Agent.
Without A2A, these agents need custom integration.
With A2A, they can collaborate using a shared protocol, regardless of who built them.
7. Supports distributed systems
Agents may run:
- On-premises
- AWS
- Azure
- Google Cloud
- Edge devices
A2A enables communication across these environments without tightly coupling implementations.
8. Reusability
Instead of building the same capability repeatedly:
A single specialized agent can serve multiple workflows.
Real-world example
Consider an online shopping scenario.
Each agent focuses on its own domain, and A2A coordinates their interaction.
Benefits of A2A
| Benefit | Explanation |
|---|---|
| Interoperability | Agents from different frameworks and vendors can work together. |
| Reusability | Specialized agents can be reused across multiple applications. |
| Scalability | New agents can be added without redesigning existing integrations. |
| Dynamic discovery | Agents can discover available capabilities at runtime. |
| Delegation | Complex tasks are split among specialized agents. |
| Maintainability | Reduces the need for numerous custom point-to-point integrations. |
| Vendor independence | Avoids locking systems into a single AI framework or provider. |
A2A vs MCP
These protocols solve different problems and are often used together.
| A2A (Agent-to-Agent) | MCP (Model Context Protocol) |
|---|---|
| Connects agents | Connects an AI model or agent to tools and data sources |
| Agent ↔ Agent | Agent ↔ Tool |
| Used for collaboration and delegation | Used for accessing capabilities like databases, APIs, files, and SaaS services |
| Enables multi-agent workflows | Enables tool invocation and context retrieval |
| Example: Travel Agent asks Finance Agent to approve a budget | Example: Finance Agent queries a PostgreSQL database or invokes a payment API |
A common architecture is:
Here, A2A allows the planning, travel, and finance agents to communicate, while MCP lets each agent interact with the external tools and data sources it needs. Together, they enable modular, interoperable, and scalable AI systems.
Building an LLM can mean very different things depending on your goal. There are three common paths:
| Goal | Time | GPUs Needed | Cost | Example |
|---|---|---|---|---|
| Train from scratch | Months | Hundreds to thousands | Millions of dollars | GPT, Llama, DeepSeek |
| Continue pre-training an existing model | Days to weeks | 8–128 GPUs | Thousands to tens of thousands | Domain-specific Llama |
| Fine-tune an existing model | Hours to days | 1–8 GPUs | Tens to hundreds of dollars | Chatbot, coding assistant |
If your goal is to understand how companies like OpenAI, Meta, or DeepSeek build an LLM, the lifecycle looks like this.
Complete LLM Lifecycle
Phase 1: Collect Data
An LLM learns from enormous amounts of text.
Typical sources include:
- Books
- Wikipedia
- GitHub repositories
- Research papers
- Stack Overflow
- News
- Government documents
- Web pages
- Question-answer datasets
- Conversations
Example:
The raw data is noisy and cannot be used directly.
Phase 2: Clean the Data
Remove:
- HTML
- advertisements
- spam
- duplicate documents
- corrupted files
- offensive content (depending on policy)
- very short documents
- low-quality translations
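A few of the cleaning rules above (HTML removal, dropping very short documents, and exact deduplication) can be sketched as follows. The function name, the regular expressions, and the `min_words` threshold are assumptions for illustration. Production pipelines use dedicated HTML parsers, near-duplicate detection, and quality classifiers.

```python
import hashlib
import re

def clean_corpus(docs, min_words=5):
    """Strip HTML tags, drop very short documents, and remove exact duplicates."""
    seen_hashes = set()
    cleaned = []
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc)        # remove HTML tags
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        if len(text.split()) < min_words:
            continue                               # too short to be useful
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                               # duplicate document
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned

raw_docs = [
    "<p>The quick brown fox jumps over the lazy dog.</p>",
    "The quick brown fox jumps over the lazy dog.",
    "Too short",
    "<div>Large language models learn from text data.</div>",
]
cleaned_docs = clean_corpus(raw_docs)
```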
Example:
Phase 3: Train a Tokenizer
The model does not operate on raw text directly.
Instead, text is converted into tokens.
Example:
or
Modern models typically use:
- Byte Pair Encoding (BPE)
- SentencePiece
- WordPiece
Vocabulary size:
Phase 4: Tokenize Everything
Every document becomes integers.
Example
The neural network only sees numbers.
Phase 5: Build the Transformer
Typical architecture:
Each transformer block contains:
Phase 6: Configure Hyperparameters
Typical configuration:
| Parameter | Example |
|---|---|
| Layers | 32 |
| Hidden Size | 4096 |
| Attention Heads | 32 |
| Context Length | 8192 |
| Vocabulary | 128K |
| Parameters | 7B |
Larger models increase these values.
Phase 7: Pre-training
The objective is simple:
Predict the next token.
Example:
Training samples:
For every token:
This process repeats trillions of times.
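The next-token objective can be made concrete with a short sketch: one token sequence yields a training pair at every position, and the loss for each pair is the negative log of the probability the model assigned to the correct token. The token IDs and function names below are made up for illustration.

```python
import math

def next_token_pairs(token_ids):
    """Turn one sequence into (context, target) pairs, one per position."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(prob_of_target: float) -> float:
    """Loss for one prediction: small when the correct token got high probability."""
    return -math.log(prob_of_target)

pairs = next_token_pairs([10, 11, 12, 13])
confident_loss = cross_entropy(0.9)
uncertain_loss = cross_entropy(0.1)
```

Training adjusts the weights to lower this loss averaged over all positions in the corpus.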
Phase 8: Distributed Training
One GPU is not enough.
Example:
Common techniques:
- Data Parallelism
- Tensor Parallelism
- Pipeline Parallelism
- Expert Parallelism (Mixture of Experts)
Frameworks:
- PyTorch Distributed
- DeepSpeed
- Megatron-LM
- Fully Sharded Data Parallel (FSDP)
Phase 9: Save Checkpoints
Every few thousand training steps:
A checkpoint contains:
- model weights
- optimizer state
- learning rate scheduler
- training step
- random number generator state
Phase 10: Evaluate
Benchmark the model.
Common evaluations:
- MMLU
- GSM8K
- HumanEval
- TruthfulQA
- HellaSwag
- ARC
- GPQA
Metrics:
- Perplexity
- Accuracy
- Pass@k
- F1 Score
- Exact Match
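Two of these metrics are short enough to sketch directly. Perplexity is the exponential of the average negative log-likelihood per token. Pass@k is shown using the commonly used unbiased estimator, which takes n generated samples of which c pass the unit tests. The function names and sample numbers are assumptions for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability the model gave each true token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any draw of k includes a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

uniform_over_four = perplexity([math.log(0.25)] * 4)
half_correct = pass_at_k(n=10, c=5, k=1)
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.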
Phase 11: Instruction Fine-tuning (SFT)
Pre-trained models complete text but do not reliably follow instructions.
Train on instruction-response pairs.
Example:
Datasets:
- Alpaca
- Dolly
- OpenHermes
- ShareGPT
- Custom enterprise datasets
Methods:
- Full fine-tuning
- LoRA
- QLoRA
Phase 12: Preference Tuning
Improve response quality based on human preferences.
Typical pipeline:
Methods:
- RLHF
- DPO
- GRPO
- Reinforcement Fine-Tuning (RFT)
These help produce responses that are more helpful, harmless, and aligned with user intent.
Phase 13: Safety Alignment
Reduce harmful or unsafe outputs.
Examples:
- jailbreak resistance
- toxicity reduction
- hallucination mitigation
- refusal behavior
- bias evaluation
Phase 14: Quantization
Reduce model size by storing weights at lower numeric precision, for example INT8 or INT4 instead of FP16.
Benefits:
- Lower GPU memory
- Faster inference
- Lower cost
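A minimal sketch of symmetric per-tensor INT8 quantization, one of the simplest schemes. Production methods usually quantize per channel or per group and are more careful about outliers; the weights below are made-up values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 0.731]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)  # [50, -127, 0, 73]
```

Each weight now needs 1 byte instead of 2 (FP16) or 4 (FP32), at the cost of a rounding error of at most half the scale factor per weight.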
Phase 15: Inference Optimization
Production serving includes:
Optimizations include:
- PagedAttention
- FlashAttention
- Continuous batching
- Speculative decoding
- Tensor parallelism
- KV cache management
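To see why KV cache management matters, the sketch below estimates cache size for the example configuration from Phase 6 (32 layers, hidden size 4096, context length 8192). It assumes standard multi-head attention where keys and values each have `hidden` values per token, stored in FP16. Models using grouped-query or multi-query attention need less.

```python
def kv_cache_bytes(layers, hidden, tokens, bytes_per_value=2, batch=1):
    """Keys and values (2 tensors) for every layer, token and sequence in the batch."""
    return 2 * layers * hidden * tokens * bytes_per_value * batch


per_token = kv_cache_bytes(layers=32, hidden=4096, tokens=1)
full_context = kv_cache_bytes(layers=32, hidden=4096, tokens=8192)

print(per_token / 2**20)     # 0.5 (MiB per token)
print(full_context / 2**30)  # 4.0 (GiB per full-length sequence)
```

At roughly 4 GiB per full-length sequence under these assumptions, cache memory rather than compute often limits how many requests can be batched together, which is the problem PagedAttention and continuous batching address.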
Phase 16: Deploy
Typical architecture:
Technologies Used
| Stage | Common Tools |
|---|---|
| Data Collection | Common Crawl, Wikipedia dumps, GitHub archives |
| Data Processing | Apache Spark, Ray, Python |
| Tokenizer | SentencePiece, Hugging Face Tokenizers |
| Training | PyTorch |
| Distributed Training | DeepSpeed, Megatron-LM, FSDP |
| Experiment Tracking | Weights & Biases, MLflow |
| Storage | S3, HDFS |
| Fine-tuning | PEFT, LoRA, QLoRA |
| Evaluation | lm-evaluation-harness, custom benchmarks |
| Inference | vLLM, SGLang, TensorRT-LLM, TGI |
| Deployment | Docker, Kubernetes, NVIDIA GPUs |
Skills Needed to Build an LLM
- Mathematics
  - Linear Algebra
  - Calculus
  - Probability
  - Statistics
- Machine Learning
  - Gradient Descent
  - Backpropagation
  - Loss Functions
  - Optimization
- Deep Learning
  - Neural Networks
  - Attention Mechanism
  - Transformers
  - Positional Embeddings
- Distributed Systems
  - Multi-GPU training
  - Parallelism strategies
  - High-speed networking (e.g., NVLink, InfiniBand)
- GPU Programming
  - CUDA basics
  - GPU memory hierarchy
  - Kernel optimization
- MLOps
  - Model versioning
  - Experiment tracking
  - Deployment
  - Monitoring
Learning Roadmap
If your goal is to build an LLM yourself rather than just use one, a practical progression is:
- Build a character-level language model from scratch.
- Implement a Transformer in PyTorch.
- Train a small GPT (50M–150M parameters) on a public dataset.
- Learn distributed training with multiple GPUs.
- Fine-tune an open-weight model such as Llama or Qwen using LoRA/QLoRA.
- Serve it efficiently with vLLM or SGLang.
- Build a complete chat application with retrieval, tool calling, monitoring, and scalable deployment.
This path teaches the same core concepts used to build production LLM systems, while remaining feasible on accessible hardware before scaling up to larger models.
High-Level AI Engineering Roles
| Phase | Responsible Area | Typical Engineer |
|---|---|---|
| Data Collection | Data pipelines | Data Engineer |
| Data Cleaning | Data processing | Data Engineer |
| Tokenization | Tokenizer development | ML Engineer / Research Engineer |
| Model Architecture | Transformer design | AI Research Scientist |
| Pre-training | Training algorithms | Research Scientist |
| Distributed Training | Multi-GPU optimization | Training Systems Engineer |
| Gradient Accumulation | Memory optimization | Training Systems Engineer |
| Mixed Precision | Training optimization | Training Systems Engineer |
| Checkpointing | Fault tolerance | Training Systems Engineer |
| Fine-tuning | Model adaptation | ML Engineer |
| RLHF/DPO | Alignment | Alignment Engineer |
| Quantization | Compression | ML Systems Engineer |
| Inference | Serving optimization | Inference Engineer |
| Deployment | Production infrastructure | MLOps Engineer |
Training Engineering
Training Engineering focuses on making models train efficiently, while Inference Engineering focuses on making models serve requests efficiently.
Typical responsibilities include:
Memory Optimization
- Gradient Accumulation
- Gradient Checkpointing (also called Activation Checkpointing)
- CPU Offloading
- ZeRO Optimization
- Optimizer State Sharding
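Gradient accumulation is the easiest of these to show in code. The sketch below uses a one-parameter model `y = w * x` with mean squared error, and checks that summing correctly scaled micro-batch gradients reproduces the full-batch gradient. The data and function names are illustrative.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Process the batch in small pieces, holding only one piece in memory at a time."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        total += mse_gradient(w, micro) * len(micro) / len(batch)
    return total


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = mse_gradient(0.5, batch)                                     # -22.5
accumulated = accumulated_gradient(0.5, batch, micro_batch_size=2)  # -22.5
```

The optimizer step happens once, after all micro-batches have contributed, so the effective batch size stays large while peak activation memory stays small.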
Precision Optimization
- FP32
- BF16
- FP16
- Mixed Precision Training
- Dynamic Loss Scaling
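The sketch below illustrates why loss scaling is needed with FP16. It uses the standard library `struct` module to round values to half precision, and a small `DynamicLossScaler` class whose initial scale, growth interval, and factor of 2 are illustrative choices rather than any framework's guaranteed behavior.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_gradient = 1e-8
unscaled = to_fp16(tiny_gradient)             # underflows to 0.0 in FP16

loss_scale = 2.0 ** 16
scaled = to_fp16(tiny_gradient * loss_scale)  # now large enough to represent
recovered = scaled / loss_scale               # unscale in higher precision


class DynamicLossScaler:
    """Halve the scale on overflow; double it after a run of clean steps."""

    def __init__(self, scale=2.0 ** 16, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            self.scale /= 2      # the step is skipped and retried with a smaller scale
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= 2
                self.good_steps = 0
```

BF16 has the same exponent range as FP32, so it usually does not need loss scaling; this is one reason it is often preferred over FP16 for training.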
Parallel Training
- Data Parallelism
- Tensor Parallelism
- Pipeline Parallelism
- Sequence Parallelism
- Expert Parallelism (MoE)
Distributed Communication
- NCCL
- AllReduce
- ReduceScatter
- AllGather
- Broadcast
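These collectives are related: an AllReduce can be built from a ReduceScatter followed by an AllGather. The sketch below simulates that on plain Python lists, where each inner list stands for one worker's gradient. It assumes the gradient length is divisible by the number of workers, and it only models the data movement, not real network communication.

```python
def reduce_scatter(worker_grads):
    """Worker i ends up holding the sum of chunk i across all workers."""
    n = len(worker_grads)
    chunk = len(worker_grads[0]) // n
    return [
        [sum(grad[i * chunk + j] for grad in worker_grads) for j in range(chunk)]
        for i in range(n)
    ]


def all_gather(shards):
    """Every worker receives the concatenation of all shards."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


def all_reduce(worker_grads):
    """Every worker ends up with the element-wise sum of all gradients."""
    return all_gather(reduce_scatter(worker_grads))


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
shards = reduce_scatter(grads)  # [[11.0, 22.0], [33.0, 44.0]]
synced = all_reduce(grads)      # both workers: [11.0, 22.0, 33.0, 44.0]
averaged = [[v / len(grads) for v in g] for g in synced]
```

Data parallelism uses the full AllReduce to average gradients, while sharded approaches such as ZeRO and FSDP use the two halves separately so that each worker stores only its own shard.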
Optimizer Engineering
- AdamW
- Fused Adam
- 8-bit Adam
- Lion
- LAMB
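As a reference point for the list above, here is a minimal sketch of one AdamW update for a single scalar parameter. The hyperparameter defaults are common illustrative values, and real implementations operate on whole tensors with fused kernels.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; `step` counts from 1. Returns (param, m, v)."""
    param = param * (1 - lr * weight_decay)    # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)            # bias correction for early steps
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# On the first step the update is roughly lr * sign(grad), whatever the gradient's size.
no_decay, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, step=1,
                            lr=0.1, weight_decay=0.0)
with_decay, _, _ = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, step=1,
                              lr=0.1, weight_decay=0.1)
```

The two moment estimates `m` and `v` are the optimizer state that checkpoints must save and that ZeRO-style sharding splits across devices. Variants such as 8-bit Adam shrink that state by storing it at lower precision.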
Memory Management
Managing:
Complete AI Systems Engineering Pipeline
Common Optimizations by Domain
| Training Systems Engineering | Inference Engineering |
|---|---|
| Gradient Accumulation | Continuous Batching |
| Gradient Checkpointing | KV Cache |
| Mixed Precision Training | Quantization |
| ZeRO Optimizer | PagedAttention |
| FSDP | FlashAttention |
| DeepSpeed | vLLM |
| Megatron-LM | SGLang |
| Activation Checkpointing | Speculative Decoding |
| Distributed Optimizers | TensorRT-LLM |
| Optimizer Sharding | Streaming Tokens |
Popular Frameworks Used by Training Systems Engineers
| Category | Common Frameworks |
|---|---|
| Distributed Training | PyTorch Distributed (DDP), FSDP, DeepSpeed, Megatron-LM |
| Memory Optimization | DeepSpeed ZeRO, Activation Checkpointing, Gradient Checkpointing |
| Precision | Automatic Mixed Precision (AMP), BF16, FP16 |
| Communication | NCCL, Gloo |
| Experiment Tracking | Weights & Biases, MLflow |
| Cluster Scheduling | Kubernetes, Slurm, Ray |