Scaling Laws

A few pointers on scaling laws in neural networks.

Kaplan et al.’s original 2020 scaling-laws paper, Scaling Laws for Neural Language Models, introduced the idea that model loss scales as a power law in model size and dataset size across many orders of magnitude. One major issue is that hyperparameters such as the learning rate were kept fixed across model sizes, leading to underestimates of how well large models can perform.
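
As a rough illustration, the paper’s fits take the form L(N) = (N_c/N)^α_N and L(D) = (D_c/D)^α_D. The sketch below plugs in approximate values of the fitted constants as I recall them from the paper; treat the exact numbers as illustrative rather than authoritative.

```python
# Rough sketch of the power-law forms reported by Kaplan et al. (2020).
# The constants below are approximate values of the paper's fits; treat them
# as illustrative, not authoritative.

def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Test loss as a function of non-embedding parameter count (data not limiting)."""
    return (n_c / n_params) ** alpha_n

def loss_from_data(n_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Test loss as a function of dataset size in tokens (model size not limiting)."""
    return (d_c / n_tokens) ** alpha_d

# A power law means each 10x increase in parameters multiplies the loss by a
# constant factor, here 10**-0.076 ≈ 0.84, i.e. roughly a 16% reduction,
# regardless of the starting scale.
print(loss_from_params(1e9), loss_from_params(1e10))
```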

Greg Yang and colleagues’ 2021 paper Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer introduces μTransfer. The method allows hyperparameters to be tuned on small models and then transferred successfully to much larger ones, replacing the heuristics previously used to pick hyperparameters at large scale with something systematic. These systematically chosen hyperparameters perform much better than the heuristic ones, further suggesting that the power-law constants in the Kaplan et al. paper were overly pessimistic.
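
A minimal sketch of the workflow, assuming the authors’ microsoft/mup package and my recollection of its API (MuReadout, set_base_shapes, MuAdam); the widths and learning rate here are made up, and the repository should be checked for exact signatures and initialization details.

```python
# Sketch of the muTransfer workflow with the microsoft/mup package.
# Assumption: API names/signatures as recalled from the mup README.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width: int, d_in: int = 784, d_out: int = 10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        # The output layer uses MuReadout so its scale follows the muP rules.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(self.body(x))

# Small "base" and "delta" models let mup infer which dimensions grow with width.
base_model = MLP(width=128)
delta_model = MLP(width=256)
model = MLP(width=4096)           # the large target model
set_base_shapes(model, base_model, delta=delta_model)

# A learning rate tuned on the width-128 proxy can be reused directly:
# MuAdam rescales per-layer learning rates according to the muP rules.
optimizer = MuAdam(model.parameters(), lr=3e-4)
```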

DeepMind’s 2022 Chinchilla paper, An empirical analysis of compute-optimal large language model training, proposes a major revision of Kaplan et al.’s scaling laws. The main point is that scaling up the dataset size (and matching the learning rate schedule to the training duration) gives much more compute-efficient scaling than further increasing parameter count in the > 50B parameter range; models of that size had been substantially undertrained.
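
A back-of-the-envelope version of the resulting recipe, assuming the common approximations of C ≈ 6·N·D training FLOPs and roughly 20 training tokens per parameter at the compute-optimal point; the factor of ~20 is a rule of thumb distilled from the paper, not an exact fitted constant.

```python
# Back-of-the-envelope Chinchilla-style allocation of a FLOP budget.
# Assumptions: C ≈ 6*N*D training FLOPs, and D ≈ 20*N tokens at the
# compute-optimal point (a rule of thumb, not an exact constant).

def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust `flops` compute-optimally."""
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.8e23 FLOPs) comes out near 70B params / 1.4T tokens.
n, d = compute_optimal(5.8e23)
print(f"params ≈ {n:.3g}, tokens ≈ {d:.3g}")
```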

Gwern has some interesting notes on scaling laws.
