Source Context

Origin: Hugging Face LLM Course - How Transformers solve tasks
Core Idea: Transformer models are matched to tasks through architecture choice: encoder models are strongest for understanding, decoder models are strongest for generation, and encoder-decoder models are strongest for transforming one sequence into another.


Raw Takeaways

Summarized directly from the source, following its structure:

  • Pipelines are easy to use but hide important architecture choices. Earlier examples showed that Transformers can solve many tasks with pipeline(), but this chapter explains why different models are better suited to different problems.
  • BERT, GPT, and T5 represent the three main Transformer families. BERT is encoder-only, GPT is decoder-only, and T5 is encoder-decoder.

The Transformer Architecture

  • The original Transformer contains both an encoder and a decoder. The encoder processes the input sequence, while the decoder generates an output sequence.
  • Modern models often use only part of the original architecture. BERT-style models use the encoder, GPT-style models use the decoder, and T5/BART-style models use both.
  • Architecture determines task fit. Encoder models are naturally suited for understanding, decoder models for generation, and encoder-decoder models for converting one sequence into another.

Encoder Models

  • Encoder models use bidirectional attention. Each token can attend to the full input sequence, both to its left and to its right.
  • BERT is the classic encoder-only example. It is useful when the task requires understanding or classifying an existing input.
  • Common encoder tasks include: sentence classification, token classification, named entity recognition, and extractive question answering.
  • Encoder models are not primarily generative. They are better at producing representations of input text than producing long new text.
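A minimal sketch of what "bidirectional" means in practice, using only the standard library. The embeddings and the function name bidirectional_attention_weights are hypothetical, and each vector serves as both query and key, whereas a real encoder applies learned projections first. The point is only that, with no mask, every token places some weight on every position, left and right.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_attention_weights(vectors):
    """Return one row of attention weights per token, with no masking.

    Simplification: each vector is used as both query and key.
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Scaled dot-product score against every position in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


# Hypothetical 2-dimensional embeddings for a three-token input.
toy_embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
attention = bidirectional_attention_weights(toy_embeddings)
for position, row in enumerate(attention):
    print(position, [round(w, 3) for w in row])
```

Every row contains a nonzero weight for every position, including positions to the right of the current token.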

Decoder Models

  • Decoder models use causal (masked) self-attention. Each token can attend only to itself and the tokens before it, which makes the architecture suitable for next-token prediction.
  • GPT-style models are decoder-only. They generate text by repeatedly predicting the next token.
  • Decoder models are suitable for text generation. They can continue prompts, produce completions, and generate free-form language.
  • The decoder output is converted into probabilities. A linear layer and softmax turn hidden states into token probabilities.
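The two mechanisms in this section can be sketched in a few lines of plain Python: a causal mask, and a linear layer plus softmax that turns a hidden state into token probabilities. The vocabulary, weights, and hidden state are made-up toy values, and the linear layer omits a bias term for brevity. This is an illustration, not how any particular library implements it.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probabilities(hidden_state, output_weights):
    """Project a hidden state onto the vocabulary, then apply softmax."""
    logits = [
        sum(h * w for h, w in zip(hidden_state, row))
        for row in output_weights
    ]
    return softmax(logits)


# Hypothetical toy values: a 3-word vocabulary and a 2-dimensional hidden state.
vocabulary = ["cat", "dog", "sat"]
output_weights = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per vocabulary entry
last_hidden_state = [1.0, 0.5]

# Print the mask: "x" marks positions a token is allowed to attend to.
for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])

probabilities = next_token_probabilities(last_hidden_state, output_weights)
predicted = vocabulary[probabilities.index(max(probabilities))]
print(predicted, [round(p, 3) for p in probabilities])
```

The printed mask is lower-triangular, which is what blocks attention to future tokens. Picking the highest-probability entry is greedy decoding; sampling strategies would choose differently from the same distribution.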

Sequence-to-Sequence Models

  • Encoder-decoder models use both sides of the Transformer. The encoder reads the input, and the decoder generates the output.
  • T5 and BART are common encoder-decoder examples. They are well suited to tasks where input text must be transformed into output text.
  • Common sequence-to-sequence tasks include: translation, summarization, and other text-to-text transformations.
  • The same general pattern appears in speech recognition. Whisper uses an encoder-decoder architecture to map audio features into text.
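The control flow of an encoder-decoder model (encode the input once, then generate output tokens one at a time until an end marker) can be sketched with a toy stand-in. Everything here is hypothetical: encode, decode_step, and the word-for-word lexicon replace the learned networks, and real models use cross-attention over the encoder states rather than a lookup by position.

```python
def encode(source_tokens):
    """Toy 'encoder': read the whole input once, return one state per token."""
    return [token.lower() for token in source_tokens]


def decode_step(encoder_states, generated, lexicon):
    """Toy 'decoder' step: choose the next output token from the encoder
    states and the tokens generated so far."""
    position = len(generated)
    if position >= len(encoder_states):
        return "<eos>"  # nothing left to transform
    return lexicon.get(encoder_states[position], "<unk>")


def greedy_generate(source_tokens, lexicon, max_steps=10):
    """Encode once, then decode step by step until <eos> or max_steps."""
    encoder_states = encode(source_tokens)  # computed a single time
    generated = []
    for _ in range(max_steps):
        token = decode_step(encoder_states, generated, lexicon)
        if token == "<eos>":
            break
        generated.append(token)  # fed back in on the next step
    return generated


# Hypothetical two-word lexicon standing in for a trained translation model.
toy_lexicon = {"hello": "bonjour", "world": "monde"}
print(greedy_generate(["Hello", "world"], toy_lexicon))
```

The encoder runs once per input, while the decoder runs once per output token and always sees what it has produced so far.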

Bias and Limitations

  • Pretrained models inherit patterns from their training data. This can include useful linguistic structure, but also social bias and harmful associations.
  • Model outputs should not be treated as automatically neutral or factual. Architecture explains capability, but production systems still need evaluation, filtering, and human judgment.

Computer Vision

  • Transformers can also process images. Vision Transformer models split an image into patches and treat those patches like tokens.
  • Image patches receive positional information. This gives the model a way to preserve spatial structure.
  • Vision Transformers show that the Transformer pattern is not limited to language. The same attention-based architecture can operate on non-text sequences.
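A small sketch of the patching step, assuming a single-channel image stored as a list of rows. The function name split_into_patches is hypothetical. The integer position index stands in for the positional information mentioned above; an actual Vision Transformer also projects each flattened patch into an embedding, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches.

    Returns (position_index, flattened_patch) pairs in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append((len(patches), patch))
    return patches


# Hypothetical 4x4 single-channel "image" whose pixel values are 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
for position, patch in split_into_patches(toy_image, patch_size=2):
    print(position, patch)
```

The result is a sequence of four "tokens", each carrying its position, which is the form a Transformer backbone expects.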

Multimodal Tasks

  • Some Transformer systems combine multiple data types. Text, image, audio, and other modalities can be processed together or connected through specialized architectures.
  • Multimodal capability matters for real-world AI systems. Enterprise knowledge is often spread across documents, screenshots, diagrams, calls, videos, and structured records.

Visual Reference

Transformer encoder and decoder architecture

This diagram explains the core architectural split behind many Transformer models:

  • BERT uses the encoder side to understand input with bidirectional context.
  • GPT uses the decoder side to generate output step by step.
  • Encoder-decoder models combine both sides when the task requires understanding an input and generating a transformed output.

Architecture-To-Task Map

Encoder-only
  -> understand input
  -> classify / extract / label
  -> examples: BERT, DistilBERT
 
Decoder-only
  -> generate next tokens
  -> complete / write / generate
  -> examples: GPT-2, GPT, Llama
 
Encoder-decoder
  -> understand input, then generate output
  -> summarize / translate / transform
  -> examples: BART, T5, Whisper

Common Task Pattern

Raw input
  -> preprocessing / tokenization / feature extraction
  -> Transformer backbone
  -> task-specific head or decoder
  -> logits / spans / labels / generated tokens
  -> usable output
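The stages above can be traced end to end with a toy classifier. All functions here are hypothetical stand-ins: the "backbone" just counts words from two small hand-written lists instead of running a Transformer. Only the shape of the flow (tokens, features, logits, label) is meant to carry over.

```python
import math


def tokenize(text):
    """Preprocessing: lowercase whitespace split (real tokenizers use subwords)."""
    return text.lower().split()


def backbone(tokens):
    """Stand-in for the Transformer backbone: returns a 2-number feature vector."""
    positive = {"good", "great", "useful"}
    negative = {"bad", "poor", "useless"}
    return [
        sum(token in positive for token in tokens),
        sum(token in negative for token in tokens),
    ]


def classification_head(features):
    """Task-specific head: map features to one logit per label."""
    positive_count, negative_count = features
    return [negative_count - positive_count, positive_count - negative_count]


def postprocess(logits, labels):
    """Convert logits to probabilities and return the best label with its score."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = probabilities.index(max(probabilities))
    return {"label": labels[best], "score": probabilities[best]}


def run_pipeline(text):
    """Chain the stages: raw input -> tokens -> features -> logits -> output."""
    labels = ["NEGATIVE", "POSITIVE"]  # index order matches the logits
    return postprocess(classification_head(backbone(tokenize(text))), labels)


print(run_pipeline("This architecture is useful"))
```

Swapping the head (for example, one logit per token instead of one per sentence) changes the task while the earlier stages stay the same, which is the reuse the pattern describes.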

Source Structure Map

Pipelines
  -> hide model architecture details
 
Transformer architecture
  -> encoder
  -> decoder
  -> encoder-decoder
 
Task families
  -> understanding tasks
  -> generation tasks
  -> sequence-to-sequence tasks
  -> vision and multimodal tasks

Code Blocks

Automatic Speech Recognition

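from transformers import pipeline

# Build an ASR pipeline around an English-only Whisper checkpoint.
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-base.en",
)

# The pipeline takes the audio location and handles loading,
# feature extraction, and decoding internally.
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)

print(result)  # a dict whose "text" key holds the transcription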

What it shows: Whisper uses an encoder-decoder structure to convert audio into text. The pipeline hides the audio preprocessing and decoding details behind one simple interface.

Encoder Model For Sentiment Classification

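from transformers import pipeline

# No model is named, so the pipeline falls back to its default checkpoint for this task.
classifier = pipeline("sentiment-analysis")
result = classifier("This architecture is useful for enterprise AI workflows.")

print(result)  # a list with one {"label": ..., "score": ...} dict per input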

What it shows: Encoder-style models are commonly used for understanding tasks such as classification.

Decoder Model For Text Generation

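from transformers import pipeline

# No model is named, so the pipeline falls back to its default checkpoint for this task.
generator = pipeline("text-generation")
result = generator(
    "In enterprise AI, Transformer models are useful because",
    max_new_tokens=40,  # caps generated tokens only; max_length would also count the prompt
)

print(result)  # a list of dicts, each with a "generated_text" key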

What it shows: Decoder-style models generate text by predicting continuation tokens.

Encoder-Decoder Model For Summarization

from transformers import pipeline
 
summarizer = pipeline("summarization")
result = summarizer(
    """
    Transformer models can be organized into encoder-only, decoder-only,
    and encoder-decoder architectures. Each architecture is suited to a
    different family of tasks.
    """
)
 
print(result)

What it shows: Encoder-decoder models are well suited to transforming an input sequence into a new output sequence.

Task Selection Heuristic

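def choose_transformer_architecture(task):
    """Map a task name to the architecture family that usually fits it."""
    # Understanding tasks: read the whole input, then label or extract from it.
    if task in {"classification", "named_entity_recognition", "extractive_question_answering"}:
        return "encoder-only"
    # Generation tasks: produce new tokens one at a time from a prompt.
    if task in {"text_generation", "code_generation", "chat_completion"}:
        return "decoder-only"
    # Transformation tasks: read one sequence, then emit a different one.
    if task in {"summarization", "translation", "speech_recognition"}:
        return "encoder-decoder"
    # Unknown task shape: no default, so the requirements need a closer look.
    return "evaluate task requirements"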

What it shows: The architecture choice should follow the task pattern: understanding, generation, or input-to-output transformation.


Personal Synthesis

How does this relate to my current work?

  • The Connection: This chapter turns Transformer architecture into practical model selection. BERT, GPT, and T5 are not just different brand names; they reflect different task patterns.
  • Practical Application: For an enterprise AI platform, I should classify every workflow by task shape before choosing a model: understanding, generation, transformation, vision, audio, or multimodal.
  • Design Reminder: A model is not just “an LLM.” It is an architecture, training objective, and task interface. Those choices define what the model is good at.
  • RAG Connection: Dense retrieval typically relies on encoder-style embedding models, while answer generation usually uses decoder-style or encoder-decoder models. RAG is therefore a system that combines architectural roles.
  • Agent Design Connection: A mature agent should route subtasks to the right model/tool type instead of sending every problem to a general chat model.
  • Multimodal Direction: Whisper and ViT show that enterprise knowledge systems can include meetings, diagrams, screenshots, scanned documents, and visual process evidence as first-class sources.
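To make the RAG point concrete, here is a rough sketch of the two roles. The vectors are hand-written stand-ins for what an encoder-style embedding model would produce, and build_prompt only assembles the text that a generator model would receive; no model is called. All names and values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, indexed_documents, top_k=1):
    """Encoder role: rank stored documents by similarity to the query vector."""
    ranked = sorted(
        indexed_documents,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:top_k]]


def build_prompt(question, passages):
    """Generator role (input side): pack retrieved passages into a prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical embeddings; a real system would compute these with a model.
documents = [
    {"text": "Whisper transcribes audio.", "vector": [0.9, 0.1, 0.0]},
    {"text": "BERT classifies text.", "vector": [0.1, 0.9, 0.1]},
]
query_vector = [0.8, 0.2, 0.1]

passages = retrieve(query_vector, documents)
print(build_prompt("Which model handles audio?", passages))
```

The retrieval half never generates text and the generation half never searches, which is why the two can be served by different architectures.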

Key Design Principle

Principle:
Choose the Transformer architecture based on the task shape:
use encoders for understanding, decoders for generation,
and encoder-decoders for input-to-output transformation.


References & Credits

“Understanding which part of the Transformer architecture (encoder, decoder, or both) is best suited for a particular NLP task is key.”

  • Hugging Face LLM Course

Source: How Transformers solve tasks