A few pointers on scaling laws in neural networks.
Kaplan et al.’s original 2020 scaling-laws paper, Scaling Laws for Neural Language Models, introduced the idea that model loss falls as a power law in both model size and dataset size across many orders of magnitude. One major issue is that hyperparameters such as the learning rate were kept essentially fixed across model sizes, which led to underestimates of how well large models can perform.
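For concreteness, here is a rough sketch of the joint scaling law fitted in that paper, with the approximate constants they report (treat the exact numbers as illustrative rather than authoritative):

```python
def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Kaplan et al.'s joint law L(N, D): predicted test loss in nats/token
    for N non-embedding parameters and D training tokens."""
    alpha_n, alpha_d = 0.076, 0.095   # fitted power-law exponents (approximate)
    n_c, d_c = 8.8e13, 5.4e13         # fitted scale constants (approximate)
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

# e.g. a 1B-parameter model trained on 100B tokens:
print(kaplan_loss(1e9, 100e9))   # roughly 2.4 nats/token
```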
Greg Yang and colleagues’ 2021 paper Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer introduces 𝜇Transfer. The idea is to tune hyperparameters on a smallish proxy model and then transfer them, essentially unchanged, to a much larger model that uses the same (𝜇P) parameterization. This replaces the heuristics previously used to pick hyperparameters for large models with a systematic recipe, and the resulting hyperparameters perform much better, further suggesting that the power-law constants fitted by Kaplan et al. were overly pessimistic.
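Below is a minimal sketch of that workflow using Microsoft’s open-source mup package; the toy MLP and the specific widths are made up for illustration, and the real recipe has a few extra steps (notably re-initializing weights with mup’s init helpers), so consult the repo before relying on this:

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # pip install mup

def make_mlp(width: int, d_in: int = 128, d_out: int = 10) -> nn.Module:
    # Hypothetical toy model; only the output layer is swapped for MuReadout.
    return nn.Sequential(
        nn.Linear(d_in, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, d_out),   # readout layer gets muP's width-aware scaling
    )

# Base and delta models tell mup which dimensions grow with width;
# the target model is the large one you actually train.
base, delta, target = make_mlp(64), make_mlp(128), make_mlp(4096)
set_base_shapes(target, base, delta=delta)

# MuAdam rescales per-layer learning rates with width, so an lr tuned on
# a small proxy model can be reused directly at the large width.
opt = MuAdam(target.parameters(), lr=1e-3)
```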
DeepMind’s 2022 Chinchilla paper, Training Compute-Optimal Large Language Models, proposes a major revision of Kaplan et al.’s scaling laws. The main point is that, for a fixed compute budget, increasing the dataset size (and stretching the learning rate schedule to match the longer training run) gives much more compute-efficient scaling than further increasing parameter count once models are in the > 50B parameter range.
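A back-of-the-envelope sketch of the resulting prescription, using the common approximation C ≈ 6·N·D for training FLOPs and the widely quoted ~20-tokens-per-parameter rule of thumb derived from the paper’s fits (both are approximations, not exact values from the paper):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal (parameters, tokens) for a given FLOP budget."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.8e23 FLOPs) comes out to roughly 70B params and 1.4T tokens:
print(chinchilla_optimal(5.8e23))
```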