Transformers replace recurrence and convolution with self-attention: every token can directly attend to every earlier token. This is the architecture behind modern autocomplete and GPT-style models.
Key idea
Each token forms a query, key, and value; attention weights every other token by relevance and mixes their values:
Attention(Q, K, V) = softmax(QKα΅ / βdβ) Β· V
A causal mask prevents attending to future tokens, so the model predicts the next token from the past only.
Inputs & representation
- Input: token embeddings + positional encodings (attention is order-agnostic without them).
- Model: stacked masked self-attention + feed-forward blocks.
- Output: next-token distribution at every position.
tokens + positions
β
[ masked self-attention ] ββ all earlier tokens
β
[ feed-forward ] β next-token logits
How it applies to autocomplete
Run the prefix through the decoder; the last positionβs logits rank completions. KV-caching makes incremental typing cheap β reuse past keys/values and only compute the new token. This is the strongest quality option, at higher compute cost.
Trade-offs
| Strength | Weakness |
|---|---|
| Best quality, long-range context | Attention is O(nΒ²) in sequence length |
| Fully parallel training | Larger memory / compute footprint |
| KV-cache β fast incremental decode | Needs positional encoding & more data |
Decoding strategies
How you turn logits into a suggestion matters as much as the model:
- Greedy / top-k / top-p (nucleus) β speed vs. diversity trade-off.
- Beam search β better for short, high-confidence completions.
References
- Vaswani et al. (2017), Attention Is All You Need.
- Radford et al., the GPT series (decoder-only LMs).
Notes / TODO
- Fine-tune a small decoder-only model on our corpus.
- Benchmark KV-cache latency for live typing.