The Geometry of Generalization — Four Forces Against Overfitting

Force I

Regularization · L1 & L2

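A degree-9 polynomial is fit to noisy samples of a true sine wave by ridge-regularized least squares, solved exactly via the normal equations w = (XᵀX + λI)⁻¹ Xᵀy, where X holds the polynomial features, y the noisy targets, and λ ≥ 0 the penalty strength (this closed form is specific to the L2 penalty). As you raise λ, the penalty shrinks large weights and the wild curve relaxes toward the true function; push λ too high and the fit flattens into underfitting. The bars show each weight shrinking.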

Legend: true function f(x) (sine wave) · noisy training points · fitted polynomial

Penalty λ (log) 0.0001

Norm L2 (ridge)
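
The ridge solve described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's own code: the target sin(πx), the noise level, the sample count, and the helper names poly_features and ridge_fit are all assumptions made for the example. It penalizes every weight, including the constant term, to match the λI in the formula.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x**0, x**1, ..., x**degree (a Vandermonde matrix).
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y.
    # np.linalg.solve is used instead of forming the inverse explicitly.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Assumed toy data: 20 noisy samples of sin(pi * x) on [-1, 1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)
X = poly_features(x, 9)

# A larger penalty gives a smaller weight vector.
for lam in (1e-4, 1e-2, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w||_2={np.linalg.norm(w):.3f}")
```

The printed norms fall as λ grows, which is the shrinking that the weight bars display.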

Force II

k-Fold Cross-Validation

The dataset is sliced into k folds. Each round, one fold becomes the validation set (cyan) while the remaining k − 1 folds train the model (rose). A polynomial of the chosen degree is fit on the training folds and scored on the held-out fold. Averaging the validation error over all k rounds gives a far more honest estimate of generalization than the training error does; watch it bottom out at the right complexity.

Folds k 5

Model degree 3
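
The fold rotation described in this section can be sketched as follows. The data (noisy samples of sin(πx)), the shuffling seed, and the function name kfold_mse are assumptions for illustration; the page's own splitting scheme may differ.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def kfold_mse(x, y, degree, k, seed=0):
    # Shuffle the indices once, then cut them into k nearly equal folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)
    scores = []
    for i in range(k):
        val = folds[i]  # held-out fold for this round
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = P.polyfit(x[train], y[train], degree)  # fit on the other folds
        pred = P.polyval(x[val], coeffs)                # score on the held-out fold
        scores.append(np.mean((pred - y[val]) ** 2))
    # The average over the k rounds is the cross-validated error.
    return float(np.mean(scores))

# Assumed toy data: 40 noisy samples of sin(pi * x) on [-1, 1].
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

for degree in range(1, 10):
    print(f"degree={degree}  cv_mse={kfold_mse(x, y, degree, k=5):.4f}")
```

Scanning the printed column for its minimum is the code equivalent of watching the validation error bottom out as the degree slider moves.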

Force III

Model Simplification · Capacity vs. Truth

Same noisy data, one dial: polynomial degree = capacity. Low degree underfits (too rigid, high bias). High degree overfits (memorizes noise, explodes between points, high variance). The live bias² / variance / total decomposition traces the U-shaped curve of expected generalization error; the sweet spot is its minimum.

Capacity (degree) 4
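
One way to estimate such a decomposition is by simulation: redraw the noise many times, refit, and compare the spread of the fits with their average. The sketch below assumes a known target sin(πx), fixed training inputs, and Gaussian noise; the name bias_variance and all parameter values are hypothetical and are not taken from the page.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def bias_variance(degree, n_trials=200, n_points=25, noise=0.2, seed=0):
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_test = np.linspace(-1.0, 1.0, 101)
    f_test = np.sin(np.pi * x_test)  # assumed true function
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Fresh noise each trial, same inputs, same model capacity.
        y = np.sin(np.pi * x_train) + noise * rng.standard_normal(n_points)
        preds[t] = P.polyval(x_test, P.polyfit(x_train, y, degree))
    mean_pred = preds.mean(axis=0)
    bias2 = float(np.mean((mean_pred - f_test) ** 2))  # squared bias, averaged over x
    variance = float(np.mean(preds.var(axis=0)))       # spread of fits, averaged over x
    return bias2, variance

for degree in (1, 3, 5, 9):
    b2, var = bias_variance(degree)
    # Expected error against noisy targets would add the noise variance on top.
    print(f"degree={degree}  bias2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}")
```

Under these assumptions the squared bias dominates at low degree and the variance grows with degree, so their sum traces the U shape.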

Force IV

Early Stopping · The Moment of Divergence

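A high-capacity model is trained by actual gradient descent, not a scripted animation. Training loss (rose) keeps falling; validation loss (cyan) falls, then turns upward as the model begins memorizing noise. Early stopping keeps the weights from the epoch with the lowest validation loss (the gold star) and halts training once that loss has gone the chosen number of patience epochs without improving. Press Train and watch the two curves split.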

Learning rate 0.040

Patience (epochs) 12
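
A patience-based stopping rule of this kind can be sketched with plain batch gradient descent on a degree-9 polynomial. The learning rate and patience mirror the control defaults shown above; the data, the train/validation split, the max_epochs cap, and the function names are assumptions for illustration.

```python
import numpy as np

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.04, patience=12, max_epochs=5000):
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_epoch = w.copy(), mse(X_val, y_val, w), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        # Gradient of the mean squared training error.
        grad = (2.0 / len(y_tr)) * (X_tr.T @ (X_tr @ w - y_tr))
        w = w - lr * grad
        val = mse(X_val, y_val, w)
        if val < best_val:
            # New validation minimum: remember these weights.
            best_w, best_val, best_epoch = w.copy(), val, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_w, best_val, best_epoch, epoch

# Assumed toy data: 30 noisy samples of sin(pi * x), split 20 train / 10 validation.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)
X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features
X_tr, y_tr, X_val, y_val = X[:20], y[:20], X[20:], y[20:]

best_w, best_val, best_epoch, last_epoch = train_with_early_stopping(
    X_tr, y_tr, X_val, y_val)
print(f"best epoch={best_epoch}  best val mse={best_val:.4f}  stopped at epoch={last_epoch}")
```

The returned weights come from the best validation epoch, not from the last epoch run. Whether and when the validation loss turns upward depends on the data and the split, so the sketch also caps training at max_epochs.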