Deep Learning Theory

Linked from Annotated Bibliography / Deep Learning Theory

Wide Neural Networks
- Neural networks tend to be become tractable to theoretical analysis as their widths become very wide.
  - One intuitive picture of this is that as the width of the network becomes large, parameter changes become small. For small changes in parameters, the network will be close to its linearization. This linearization is best represented as a kernel, and is known as the neural tangent kernel, or NTK.
  - Once the NTK has been understood, a next step is to look at what happens when we include additional terms of the Taylor-series expansion of the model, such as the quadratic. We can look at how these terms update or “rotate” the NTK over the course of training, and try to understand the associated inductive bias.
  - There are also exist other ways of approach the wide network regime, such as the mean field limit.
- NTK Limit
  - Neural Tangent Kernel: Convergence and Generalization in Neural Networks
    - The og paper
  - On the linearity of large non-linear models: when and why the tangent kernel is constant, Belkin
    - Takes the approach of upper-bounding the Hessian.
- Loss Landscapes and Convergence
  - Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Belkin 2021
    - Shows that wide neural networks satisfy PL* conditions almost everywhere, which is a sufficient condition for reaching a global minimum when randomly initialized. Uses the result in linearity of large nl models.
  - Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
    - shows that “in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (something that comes for free in the highly overparameterized case), SMD with sufficiently small step size converges to a global minimum that is approximately the closest one in Bregman divergence”
  - THE LOSS LANDSCAPE OF OVERPARAMETERIZED NEURAL NETWORKS 2018
  - Over-Parameterized Deep Neural Networks Have No Strict Local Minima For Any Continuous Activations
Feature Learning
- Different ways of trying to go beyond the linear limit to capture dynamics
- NTK Alignment
  - How the NTK aligns to the task. These papers show that the kernel aligns to the task labels for the training set. It’s much more difficult to show what’s going on off of the training data, and very few of them attempt to address this generalization component.
  - Implicit Regularization via Neural Feature Alignment Baratin 2021
    - Maybe the first paper I found on this. Uses the centered kernel alignment to track alignment between kernel and training labels over the course of training.
  - What can linearized neural networks actually say about generalization? 2021
    - Motivates discussion by introducing a Rademacher complexity bound related to RKHS norm, and then showing that target-kernel alignment lower-bounds this norm. Doesn’t really address the fact that his bound does not hold when the kernel has been selected from the training data.
  - Geometric compression of invariant manifolds in neural nets Geiger 2021
    - Looks at how the the uninformative weights of the network are compressed, resulting in superior performance to Isotropic kernels
  - A Theory of Neural Tangent Kernel Alignment and Its Influence on Training Bordelon 2022
    - Contributes an analysis of alignment in deep linear networks.
  - Neural Spectrum Alignment: Empirical Study
  - NEURAL NETWORKS AS KERNEL LEARNERS: THE SILENT ALIGNMENT EFFECT
- Quadratic models or general Taylor series
  - Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks Jason Lee 2020
    - Uses a randomization trick to show how to make quadratic term come into play.
  - Quadratic models for understanding neural network dynamics Belkin 2023
    - Uses quadratic approximation to explain a learning dynamic known as “catapult phase”.
  - The Principles of Deep Learning TheoryDaniel A. Roberts and Sho Yaida 2021
    - Performs a perturbation theory analysis of neural network learning. By carefully excluding high-order terms, we can get a closed-form approximation of learning dynamics of quadratic models, which are themselves an approximation of actual neural networks.
    - Useful explicatory variants
      - https://deeplearningtheory.com/Princeton-School/Lecture-1.pdf
      - https://deeplearningtheory.com/Princeton-School/Lecture-2.pdf
- Feature Learning
  - Papers looking how neural networks perform feature learning. Closely related to NTK alignment, but not necessarily taking a kernel perspective.
  - Neural Networks can Learn Representations with Gradient Descent Jason Lee 2022
    - A bound for learning polynomials that is much better than can be done without learned features.
  - Neural Networks Efficiently Learn Low-Dimensional Representations with SGD 2023
    - Very similar idea to Geometric compression paper.
Physics/Stat Mech-Inspired
- Scaling description of generalization with number of parameters in deep learning Geiger 2020
- Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks Bordelon
Self-stabilization
Implicit Regularization / Inductive Bias of SGD
- Rethinking SGD’s noise – II: Implicit Bias Bach 2022
Function Spaces + Regularization