BERT

I Use This When...

I need strong language understanding rather than free-form generation. BERT-style models are useful for classification, tagging, retrieval, semantic matching, question answering, and other tasks where reading both left and right context helps.

History

Devlin et al. (Google, 2018). 'Bidirectional Encoder Representations from Transformers.' Changed NLP overnight — every benchmark was broken within months.

Why It Exists

The "why" chain is:

Many NLP tasks depend on full context, not just the prefix.
Left-to-right language models are not ideal for pure understanding tasks.
We want a model that can use both left and right context at once.
We also want pretraining so downstream tasks need less labeled data.

BERT exists because understanding language often needs bidirectional context and task transfer.

How It Works

Visual Intuition

Imagine a sentence with one token hidden.

the model sees the words before and after the blank
it uses both directions of context to infer the missing token
repeating this across huge corpora teaches rich contextual representations

Instead of learning to continue text like GPT, BERT learns to reconstruct and understand context from all sides.

The timeline node is here:

-> MLViz Node: BERT

Step by Step

Tokenize text and add special tokens
Randomly mask some input tokens during pretraining
Run the whole sequence through a Transformer encoder stack
Predict the masked tokens from bidirectional context
Fine-tune the pretrained encoder on downstream tasks

The pretrained encoder becomes a reusable language understanding backbone.

Code

# concept sketch
# mask some tokens
# encoded = bert_encoder(tokens_with_masks)
# loss = masked_lm_loss(encoded, masked_targets)

The Math Inside

Core pretraining objective:

randomly mask a subset of tokens
predict the original masked tokens from the full surrounding context

Architecture:

encoder-only Transformer stack
every token can attend to every other token
no causal mask, because bidirectional context is allowed

Historically, BERT also used next sentence prediction (NSP), but the larger lesson is the masked-language-model objective plus fine-tuning paradigm.

That design made pretrained language understanding reusable across many tasks without training a new architecture from scratch each time.

Math Prerequisites

Transformer - the encoder architecture underneath
Cross-Entropy - masked token prediction loss
GPT - the autoregressive alternative for generation

Transformer — The architecture
GPT — Decoder-only alternative
Timeline — The NLP revolution