Four forces stand between a model and the abyss of memorization. Each panel below is a live, mathematically exact simulation — no decoration faked, every curve solved from the real equations. Drag the dials. Watch truth fight noise.
A degree-9 polynomial is fit to noisy samples of a true sine wave by ridge-regularized least squares, solved exactly via the normal equations w = (XᵀX + λI)⁻¹ Xᵀy. As you raise λ, the penalty crushes large weights and the wild curve relaxes toward the truth. The bars show each weight shrinking.
The dataset is sliced into k folds. Each round, one fold becomes the validation set (cyan) while the rest train (rose). A polynomial of chosen degree is fit on the train folds and scored on the held-out fold. The averaged validation error is the honest estimate of generalization — watch it bottom out at the right complexity.
Same noisy data, one dial: polynomial degree = capacity. Low degree underfits (too rigid). High degree overfits (memorizes noise, explodes between points). The live bias² / variance / total decomposition shows the U-shaped sweet spot of true generalization error.
A high-capacity model is trained by real gradient descent. Training loss (rose) falls forever; validation loss (cyan) falls, then turns upward as the model begins memorizing. Early stopping freezes the weights at the validation minimum — the gold star. Press Train and watch the two curves split.