Transformer Architectures

Source Context

Origin: Hugging Face LLM Course - Transformer Architectures
Core Idea: Most Transformer models use one of three architecture patterns: encoder-only, decoder-only, or encoder-decoder. Choosing the right pattern depends on whether the task is mainly understanding, generation, or sequence transformation.

Raw Takeaways

Summarized directly from the source, keeping the source's structure:

Transformer architectures fall into three main variants. Encoder-only, decoder-only, and encoder-decoder models solve different task families.
Architecture choice is a model-selection tool. Understanding the difference helps decide which model to use for a specific NLP task.

Encoder Models

Encoder models use only the encoder of the Transformer. Their attention layers can access all words in the input sentence at each stage.
They use bidirectional attention. This means each token can attend to both left and right context.
They are often called auto-encoding models. Their pre-training usually corrupts an input sentence, such as by masking random words, and trains the model to reconstruct the original input.
Encoder models are best for tasks that require understanding the full sentence. Good examples include sentence classification, named entity recognition (and word classification more generally), and extractive question answering.
Representative encoder models include: BERT, DistilBERT, and ModernBERT.
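
The corruption step behind auto-encoding pre-training can be sketched in a few lines. This is a minimal illustration, not the recipe of any specific model: the function name mask_tokens, the 15% default rate, and the "[MASK]" string are assumptions chosen for the example, and real recipes are more involved.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder; the actual token is model-specific


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for a masked-reconstruction objective.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            labels.append(token)
        else:
            corrupted.append(token)
            labels.append(None)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

Because the encoder sees the whole corrupted sentence at once, it can use context on both sides of each masked position to reconstruct the original token.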

Decoder Models

Decoder models use only the decoder of the Transformer. For a given token, the attention layers can only access the tokens positioned before it in the sequence, not the ones that follow.
They are often called autoregressive models. Their pre-training usually focuses on predicting the next word or token in a sequence.
Decoder models are best for generation tasks. They are designed to produce text one token at a time.
Representative decoder model families include: SmolLM, Llama, Gemma, and DeepSeek V3.
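
The difference between encoder and decoder attention comes down to which positions each token is allowed to see. The sketch below builds both visibility patterns as boolean grids. The helper names are hypothetical, and it assumes the common convention that a decoder position may also attend to itself.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


print(show(causal_mask(4)))
# x . . .
# x x . .
# x x x .
# x x x x
```

Each row is a query position and each column a key position. The lower-triangular shape is what prevents a decoder from looking at future tokens while it predicts the next one.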

Modern Large Language Models (LLMs)

Most modern LLMs use a decoder-only architecture. This is the pattern behind many chat and generative AI systems.
Modern LLMs are usually trained in two phases: pre-training to predict the next token on large text corpora, then instruction tuning to follow user instructions and generate helpful responses.
Decoder-based LLMs support many capabilities: text generation, summarization, translation, question answering, code generation, reasoning, and few-shot learning.
This explains why one decoder model can appear broadly capable. Its next-token prediction objective can be adapted through prompting and instruction tuning to many tasks.
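
One way to see how next-token prediction adapts to many tasks is to write a task as text that the model only has to continue. The sketch below formats a few-shot prompt. The function name and the "Input:" / "Output:" layout are assumptions for illustration, not a format required by any particular model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format a task as plain text that a next-token predictor can continue."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends right where the answer should begin.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("bread", "pain")],
    "apple",
)
print(prompt)
```

The model is never told it is translating in any special way. It simply predicts the most likely continuation after the final "Output:", and the examples steer what that continuation looks like.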

Sequence-to-Sequence Models

Encoder-decoder models use both parts of the Transformer. The encoder can access the full input context, while the decoder generates output autoregressively.
They are also called sequence-to-sequence models. They are suited for tasks where one sequence must be transformed into another.
Their pre-training often involves reconstruction. For example, T5 replaces random spans of text, which can contain several words, with a single mask token and trains the model to predict the text that each mask token replaced.
Sequence-to-sequence models are best for translation, summarization, and generative question answering.
Representative encoder-decoder models include: BART, mBART, Marian, and T5.
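
The T5-style span corruption mentioned above can be sketched as follows. This is a simplified illustration: the spans are passed in by hand rather than sampled at random, the function name span_corrupt is hypothetical, and the "<extra_id_N>" sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    spans must be sorted, non-overlapping, half-open index pairs.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # hide the span in the input
        targets.append(sentinel)              # announce the span in the target
        targets.extend(tokens[start:end])     # then spell out the hidden text
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the corrupted input with full context, and the decoder generates the target one token at a time, which is the same input-to-output shape used later for translation or summarization.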

Practical Applications

Sequence-to-sequence models are useful when meaning must be preserved while form changes. Examples include machine translation, text summarization, data-to-text generation, grammar correction, and context-based question answering.
BART and T5 are strong summarization examples. Marian and T5 are common translation examples.

Choosing The Right Architecture

Use encoder models for understanding existing text. Examples include sentiment classification, topic classification, named entity recognition, and extractive question answering.
Use decoder models for generating text. Examples include creative writing, conversational AI, code generation, and general text completion.
Use encoder-decoder models for transforming one sequence into another. Examples include translation, summarization, grammar correction, and generative question answering.
Three guiding questions help architecture selection: Does the task need bidirectional understanding? Is the goal to generate new text? Does the task transform one sequence into another?

The Evolution Of LLMs

LLMs have evolved rapidly. Each generation has increased model size and training scale and has improved capability.
The dominant architecture for many modern LLMs remains decoder-only. This makes architecture knowledge important when evaluating current AI systems.

Attention Mechanisms

Full attention can become a computational bottleneck. Standard attention has quadratic complexity with respect to sequence length.
Longer inputs require more efficient attention patterns. Models such as Reformer and Longformer introduce alternatives.
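
A quick count makes the quadratic cost concrete. The sketch below compares the number of query-key scores for full attention with a windowed alternative. The function names and the window size of 256 are assumptions picked for the example, and the windowed figure is an upper bound that ignores the shorter windows at the sequence edges.

```python
def full_attention_scores(seq_len):
    """Query-key scores per head per layer when every token sees every token."""
    return seq_len * seq_len


def windowed_attention_scores(seq_len, window):
    """Upper bound when each token sees itself plus `window` tokens per side."""
    return seq_len * min(seq_len, 2 * window + 1)


for n in (512, 4096, 32768):
    print(n, full_attention_scores(n), windowed_attention_scores(n, window=256))
```

Doubling the sequence length quadruples the full-attention count, while the windowed count only doubles. That gap is the motivation for the attention variants in the following sections.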

LSH Attention

Reformer uses LSH attention. Instead of comparing every query to every key, it uses hashing to focus on keys that are likely to be close to a query.
Multiple hash rounds improve robustness. Because a single hash is somewhat random, several hash functions are used in practice and their results are averaged.
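
The bucketing idea can be illustrated with a simplified hash. Note the substitution: this sketch uses random-hyperplane sign hashing, not Reformer's actual hashing scheme, and it takes the union of matches across rounds as a stand-in for combining the rounds. All function names and parameters here are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, rng):
    """Random hyperplanes; the sign pattern of the projections is the hash."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bucket(vector, hyperplanes):
    """Map a vector to a bucket id built from one sign bit per hyperplane."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def candidate_keys(query, keys, n_rounds=4, n_bits=3, seed=0):
    """Indices of keys sharing a bucket with the query in at least one round."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(n_rounds):
        planes = make_hyperplanes(len(query), n_bits, rng)
        query_bucket = lsh_bucket(query, planes)
        for index, key in enumerate(keys):
            if lsh_bucket(key, planes) == query_bucket:
                candidates.add(index)
    return sorted(candidates)


query = [1.0, 0.2, 0.0]
keys = [[2.0, 0.4, 0.0], [-1.0, -0.2, 0.0], [0.9, 0.3, 0.1]]
print(candidate_keys(query, keys))
```

The first key points in the same direction as the query, so it always lands in the query's bucket. The second key points the opposite way and is never selected. Attention would then be computed only over the selected candidates instead of over every key.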

Local Attention

Longformer uses local attention. A token attends mainly to nearby tokens within a window.
Stacked local attention layers expand the effective receptive field. Even if each layer has a small window, deeper layers can build a broader representation.
Some tokens can receive global attention. Selected tokens can attend to all tokens and be attended to by all tokens.
Sparse attention enables longer sequence lengths. Reducing the attention matrix cost allows models to process longer inputs.
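
The combination of a local window with a few global tokens can be written as a boolean mask. This is a minimal sketch of the pattern described above, not Longformer's implementation. The function names are hypothetical, and the window is counted as the number of tokens on each side.

```python
def local_global_mask(seq_len, window, global_positions=()):
    """mask[i][j] is True when token i may attend to token j."""
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_local = abs(i - j) <= window
            # A global token sees everything and is seen by everything.
            is_global = i in global_set or j in global_set
            row.append(is_local or is_global)
        mask.append(row)
    return mask


def receptive_field(window, n_layers):
    """Tokens reachable on each side after stacking local-attention layers."""
    return window * n_layers


mask = local_global_mask(seq_len=8, window=1, global_positions=[0])
print(sum(sum(row) for row in mask), "allowed pairs out of", 8 * 8)
print(receptive_field(window=1, n_layers=6), "tokens reachable per side after 6 layers")
```

Most of the grid stays blocked, which is where the savings come from, while the global position gives every token a one-hop path to shared context.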

Axial Positional Encodings

Reformer also uses axial positional encodings. This reduces the memory cost of positional encodings for long sequences.
The large positional encoding matrix is factorized into smaller matrices. This saves memory while still preserving position information.
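
The memory saving from factorization is easy to check with arithmetic. The sketch below assumes a sequence length split as n1 * n2 and a hidden size split as d1 + d2, with each position encoded by concatenating one row from each small table. The concrete sizes, the function names, and the index convention (remainder for the first table, quotient for the second) are assumptions for illustration.

```python
def axial_parameter_counts(n1, n2, d1, d2):
    """Compare a full (n1*n2) x (d1+d2) table with the two factorized tables."""
    full = (n1 * n2) * (d1 + d2)
    factorized = n1 * d1 + n2 * d2
    return full, factorized


def axial_indices(position, n1):
    """Split a flat position into the row indices of the two small tables."""
    return position % n1, position // n1


def axial_encoding(position, table1, table2):
    """Concatenate one row from each small table to encode a position."""
    i1, i2 = axial_indices(position, len(table1))
    return table1[i1] + table2[i2]


full, factorized = axial_parameter_counts(n1=512, n2=1024, d1=256, d2=768)
print(full, factorized)
# 536870912 917504
```

Every position still gets a distinct encoding because each position maps to a unique pair of rows, yet the stored parameters shrink from hundreds of millions to under a million in this example.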

Conclusion

Architecture differences are crucial for model selection. Encoders, decoders, and encoder-decoders are suited to different task families.
Specialized attention mechanisms address long-context limitations. This matters when building systems that need to process long documents, knowledge bases, or enterprise records.

Visual Reference

Use this section only when an image, workflow, or diagram helps explain the idea.

Transformer architecture choice
  -> Encoder-only
     -> understand input
     -> BERT / DistilBERT / ModernBERT
     -> classification / NER / extractive QA
 
  -> Decoder-only
     -> generate next tokens
     -> SmolLM / Llama / Gemma / DeepSeek
     -> chat / writing / code / few-shot learning
 
  -> Encoder-decoder
     -> understand input, then generate transformed output
     -> BART / mBART / Marian / T5
     -> translation / summarization / grammar correction

Architecture selection questions:
 
Do I need to analyze existing text?
  -> encoder
 
Do I need to generate new text?
  -> decoder
 
Do I need to transform one sequence into another?
  -> encoder-decoder
 
Do I need very long context?
  -> consider efficient attention patterns

Code Blocks

Use this section when the concept has a practical technical example.

Architecture Selection Heuristic

def choose_transformer_architecture(task):
    encoder_tasks = {
        "sentiment_classification",
        "topic_classification",
        "named_entity_recognition",
        "extractive_question_answering",
    }
 
    decoder_tasks = {
        "chat",
        "creative_writing",
        "code_generation",
        "text_completion",
    }
 
    encoder_decoder_tasks = {
        "translation",
        "summarization",
        "grammar_correction",
        "generative_question_answering",
    }
 
    if task in encoder_tasks:
        return "encoder-only"
    if task in decoder_tasks:
        return "decoder-only"
    if task in encoder_decoder_tasks:
        return "encoder-decoder"
 
    return "clarify task shape before selecting a model"

What it shows: Architecture selection should start from the task shape, not from model popularity.

Pipeline Examples By Architecture

from transformers import pipeline
 
# Encoder-style task: understand and classify existing text
classifier = pipeline("sentiment-analysis")
 
# Decoder-style task: generate continuation text
generator = pipeline("text-generation")
 
# Encoder-decoder style task: transform one sequence into another
summarizer = pipeline("summarization")

What it shows: The same Hugging Face pipeline() interface can hide very different model architectures underneath.

Personal Synthesis

How does this relate to my current work?

The Connection: This chapter gives me a model-selection map. Instead of treating every model as a generic LLM, I should identify whether the workflow needs understanding, generation, transformation, or long-context processing.
Practical Application: For an enterprise AI platform, encoder models can support classification and extraction, decoder models can support assistant-style generation, and encoder-decoder models can support summarization, translation, and document transformation.
Design Reminder: The best model is not always the largest or newest model. The best model is the one whose architecture matches the task shape, data type, latency requirement, and risk profile.
RAG Connection: RAG often combines architectures: encoder-style models create embeddings or retrieve relevant context, while decoder-style models generate answers.
Long-Context Reminder: Enterprise documents can be long, so attention efficiency and context strategy matter. Longformer/Reformer-style ideas explain why naive full attention can become expensive.
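
The RAG connection above can be sketched end to end with stand-ins. Everything here is a toy under stated assumptions: embed uses bag-of-words counts in place of a real encoder model, the documents are invented, and build_prompt only prepares the text that a decoder-style model would be asked to continue.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an encoder embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents):
    """Return the document most similar to the question."""
    query_vector = embed(question)
    return max(documents, key=lambda doc: cosine(query_vector, embed(doc)))


def build_prompt(question, context):
    """Text that a decoder-style model would be asked to continue."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


documents = [
    "invoices are due in thirty days",
    "the cafeteria opens at noon",
]
question = "when are invoices due"
print(build_prompt(question, retrieve(question, documents)))
```

The split mirrors the architecture map: the retrieval half only needs to understand and compare text, while the generation half only needs to continue a prompt.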

Key Design Principle

State the reusable rule or decision principle that should survive after the details fade.

Principle:
Choose Transformer architecture by task shape:
use encoders for understanding,
decoders for generation,
encoder-decoders for sequence transformation,
and efficient-attention variants when long context becomes the bottleneck.

Related Notes

How Transformers Solve Tasks - applies the architecture map to concrete tasks, using models such as BERT, GPT, T5, Whisper, and ViT.
Transformers What Can They Do - shows the practical pipeline layer that hides architecture complexity.
Understanding NLP vs LLMs - provides the conceptual base for why LLMs are one part of the NLP toolkit.
The Molty Framework for AI Prompting - connects decoder-style model behavior to prompt protocol design.

References & Credits

“Most Transformer models use one of three architectures: encoder-only, decoder-only, or encoder-decoder.”

Hugging Face LLM Course

Source: Transformer Architectures

deanlu.ai
