High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

NeurIPS 2022 · Nov 28, 2022

About

We study the first gradient descent step on the first-layer parameters W in a two-layer neural network f(x) = (1/√N) a^⊤σ(W^⊤x), where W ∈ ℝ^{d×N} and a ∈ ℝ^N are randomly initialized, and the training objective is the empirical MSE loss (1/n)∑_{i=1}^n (f(x_i) − y_i)². In the proportional asymptotic limit where n, d, N → ∞ at the same rate, and in an idealized student–teacher setting where the teacher f^* is a single-index model, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on W with learning rate η. We consider two scalings of this first-step learning rate. For small η, we establish a Gaussian equivalence property for the trained feature map and prove that the learned kernel improves upon the initial random-features model but cannot defeat the best linear model on the input. For sufficiently large η, in contrast, we prove that for certain f^*, the same ridge estimator on trained features can go beyond this “linear regime” and outperform a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and they highlight the role of learning-rate scaling in the initial phase of training.
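The NumPy sketch below is a minimal illustration of the pipeline described in the abstract, not the authors' code: it assumes a tanh activation for both student and teacher, and placeholder values for n, d, N, the learning rate η, and the ridge penalty λ. It initializes (W, a) at random, takes one full-batch gradient step on W for the empirical MSE loss, and then fits ridge regression on the trained features σ(W₁^⊤x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 2000, 500, 500      # proportional regime: n, d, N of the same order (placeholder values)
eta, lam = 1.0, 1e-2          # placeholder first-step learning rate and ridge penalty

sigma = np.tanh                                # student activation (assumed)
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2  # its derivative

# Single-index teacher f*(x) = sigma_star(<beta, x> / sqrt(d))  (assumed form)
beta = rng.standard_normal(d)
f_star = lambda X: np.tanh(X @ beta / np.sqrt(d))

# Training data and random initialization of (W, a)
X = rng.standard_normal((n, d))
y = f_star(X)
W = rng.standard_normal((d, N)) / np.sqrt(d)
a = rng.standard_normal(N)

def f(X, W):
    """Two-layer network f(x) = (1/sqrt(N)) a^T sigma(W^T x); rows of X are inputs."""
    return sigma(X @ W) @ a / np.sqrt(N)

# One full-batch gradient step on W for the empirical MSE loss
# L(W) = (1/n) * sum_i (f(x_i) - y_i)^2
resid = f(X, W) - y                                                   # shape (n,)
grad_W = (2.0 / (n * np.sqrt(N))) * X.T @ (np.outer(resid, a) * sigma_prime(X @ W))
W1 = W - eta * grad_W

# Ridge regression on the trained (conjugate-kernel) features phi(x) = sigma(W1^T x)
Phi = sigma(X @ W1)                                                   # shape (n, N)
coef = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

# Prediction risk on fresh test data
X_test = rng.standard_normal((n, d))
risk = np.mean((sigma(X_test @ W1) @ coef - f_star(X_test)) ** 2)
print(f"test risk of ridge on features after one gradient step: {risk:.4f}")
```

Sweeping eta over different scales (and comparing against ridge on the untrained features sigma(W^⊤x)) is one way to probe numerically the two learning-rate regimes contrasted in the abstract.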
