Convolutional networks slide learnable filters over the token sequence, capturing local patterns (character or word n-gram-like features). All positions are computed in parallel across the whole sequence rather than step by step.
Key idea
Stack 1-D convolutions: each layer widens the receptive field, so deep stacks (or dilated convolutions) can cover long contexts. Causal convolutions mask future tokens so that each position conditions only on the current and earlier tokens, which next-token prediction requires.
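To make the masking concrete, here is a minimal sketch of a single causal, optionally dilated, 1-D convolution over a list of scalars. The function name `causal_conv1d` and the kernel values are illustrative assumptions, not taken from any library; real models apply the same idea to embedding vectors.

```python
def causal_conv1d(x, weights, dilation=1):
    """Causal 1-D convolution over a list of floats.

    weights[0] multiplies the current input and weights[j] the input
    j * dilation steps in the past. Positions before the start of the
    sequence are treated as zeros (implicit left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            src = t - j * dilation
            if src >= 0:  # only indices <= t are read, so no future leak
                acc += w * x[src]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.25, 0.25]  # hypothetical kernel of size 3

y = causal_conv1d(x, w)
print(y)  # [0.5, 1.25, 2.25, 3.25, 4.25]

# Changing the last token leaves every earlier output untouched.
x_changed = x[:4] + [100.0]
y_changed = causal_conv1d(x_changed, w)
print(y[:4] == y_changed[:4])  # True
```

Because the output at position t never reads inputs after t, the model can be trained on all positions of a sequence at once without seeing the token it is asked to predict, provided the targets are shifted by one.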
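Receptive field grows with depth, kernel size, and dilation: a stride-1 layer with kernel size k and dilation d adds (k − 1) · d tokens of context. It does not depend on sequence position.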
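The growth can be checked with a few lines of arithmetic. This sketch assumes stride-1 layers that all share one kernel size; the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in tokens, of stacked stride-1 1-D convolutions.

    One layer per entry in `dilations`; each layer adds
    (kernel_size - 1) * dilation tokens of context.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


plain = receptive_field(3, [1, 1, 1, 1])    # 9 tokens: linear in depth
dilated = receptive_field(3, [1, 2, 4, 8])  # 31 tokens: doubling dilations
print(plain, dilated)
```

With the same four layers, doubling the dilation at each layer more than triples the context, which is why dilated stacks are the usual way to reach long histories without very deep networks.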
Inputs & representation
- Input: token (or character) embeddings arranged as a 1-D signal.
- Model: stacked causal/dilated 1-D convs + residual connections.
- Output: next-token distribution at each position, all positions at once.
[emb][emb][emb][emb]
\ | / causal conv (kernel=3, masked)
[ feature ]
↓ stack/dilate → wider context
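The pipeline above (embeddings, stacked causal dilated convolutions with residual connections, then a next-token distribution at every position) can be sketched end to end in plain Python. This is an untrained toy with random weights, not a reference implementation: the vocabulary, the sizes, the ReLU nonlinearity, and every function name are assumptions made for illustration.

```python
import math
import random

random.seed(0)
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
DIM, KERNEL, DILATIONS = 8, 3, [1, 2, 4]  # illustrative sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(len(VOCAB), DIM)
# One DIM x DIM matrix per kernel tap, per layer.
layers = [[rand_matrix(DIM, DIM) for _ in range(KERNEL)] for _ in DILATIONS]
out_proj = rand_matrix(DIM, len(VOCAB))


def matvec(vec, mat):
    """Row vector (len rows) times matrix (rows x cols)."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]


def causal_layer(h, taps, dilation):
    """Causal dilated conv over vectors, then ReLU and a residual add."""
    out = []
    for t in range(len(h)):
        acc = [0.0] * DIM
        for j, tap in enumerate(taps):
            src = t - j * dilation
            if src >= 0:  # positions before the start act as zero padding
                acc = [a + b for a, b in zip(acc, matvec(h[src], tap))]
        out.append([x + max(0.0, a) for x, a in zip(h[t], acc)])
    return out


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distributions(text):
    """Return one probability distribution over VOCAB per input position."""
    h = [embedding[VOCAB.index(ch)] for ch in text]
    for taps, dilation in zip(layers, DILATIONS):
        h = causal_layer(h, taps, dilation)
    return [softmax(matvec(vec, out_proj)) for vec in h]


probs = next_token_distributions("hello wor")
print(len(probs), len(probs[0]))  # one distribution per position
```

A single forward pass yields a distribution for every prefix of the input, so in training each position contributes a loss term against the token that follows it.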
How it applies to autocomplete
Character-level CNNs are good at morphology and typo tolerance; word-level CNNs capture short phrase patterns. Training is highly parallel, but the fixed receptive field caps how much of the typed history can influence a completion.
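One practical consequence for autocomplete is that only the last R tokens of the prefix need to be fed to the model, where R is the receptive field. The sketch below checks this on a scalar stand-in for an embedded prefix; the weights, dilations, and the name `causal_stack` are assumptions chosen for illustration.

```python
def causal_stack(x, kernels, dilations):
    """Apply stacked causal dilated convolutions to a list of floats."""
    h = list(x)
    for w, d in zip(kernels, dilations):
        prev = h
        h = [
            sum(wj * prev[t - j * d] for j, wj in enumerate(w) if t - j * d >= 0)
            for t in range(len(prev))
        ]
    return h


KERNELS = [[0.5, 0.3, 0.2]] * 3  # hypothetical weights, kernel size 3
DILATIONS = [1, 2, 4]
# Receptive field: 1 + sum of (kernel size - 1) * dilation over layers.
RF = 1 + sum((len(w) - 1) * d for w, d in zip(KERNELS, DILATIONS))

history = [float(i % 7 + 1) for i in range(40)]  # stand-in for a long prefix

full = causal_stack(history, KERNELS, DILATIONS)[-1]
windowed = causal_stack(history[-RF:], KERNELS, DILATIONS)[-1]
too_short = causal_stack(history[-(RF - 1):], KERNELS, DILATIONS)[-1]

print(RF)                              # 15
print(abs(full - windowed) < 1e-12)    # True: older history is irrelevant
print(abs(full - too_short) > 1e-6)    # True: dropping one more token matters
```

Anything the user typed more than R tokens ago cannot change the prediction, so the receptive field should be sized to the amount of history the completions are expected to depend on.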
Trade-offs
| Strength | Weakness |
|---|---|
| Fully parallel (fast training) | Receptive field is bounded by depth |
| Strong at local / sub-word patterns | Long-range context needs many layers |
| Robust to typos (char-level) | Less common than transformers for LMs |
References
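- Bai, Kolter & Koltun (2018), An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (Temporal Convolutional Networks, TCN).
- van den Oord et al. (2016), WaveNet: A Generative Model for Raw Audio (dilated causal convolutions).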
Notes / TODO
- Try a char-level causal CNN for typo robustness.
- Measure how receptive-field size affects completion quality.