Compute Better Spent: Replacing Dense Layers with Structured Matrices
Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson

In this paper, the authors seek more compute-efficient alternatives to dense linear layers, investigating structured substitutes such as low-rank matrices, Monarch matrices, and Kronecker products.

The authors argue that these approaches have largely failed in the past because hyperparameters were chosen poorly when structured matrices were used in place of dense layers. To address this, they extend the initialisation scheme derived from the maximal update parametrization (µP) work to these structured matrices, and use it to tune basic hyperparameters such as the learning rate. They show that with this correction, structured layers achieve better test performance per FLOP on a number of tasks.
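To make the compute savings concrete, here is a minimal sketch (not the paper's code; the dimensions and rank are illustrative assumptions) of replacing a dense d x d layer with a rank-r factorisation, one of the structured alternatives the paper studies:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 32  # layer width and chosen rank (illustrative values)
x = rng.standard_normal(d)

# Dense layer: d*d parameters and ~d*d multiply-adds per forward pass.
W = rng.standard_normal((d, d)) / np.sqrt(d)
dense_params = d * d

# Low-rank layer W ~= U @ V: U is (d x r), V is (r x d),
# giving 2*d*r parameters and ~2*d*r multiply-adds.
U = rng.standard_normal((d, r)) / np.sqrt(r)
V = rng.standard_normal((r, d)) / np.sqrt(d)
lowrank_params = 2 * d * r

# Forward pass through the structured layer: never materialise U @ V.
y = U @ (V @ x)

print(dense_params, lowrank_params, dense_params / lowrank_params)
# 262144 32768 8.0
```

With d = 512 and r = 32 the structured layer uses 8x fewer parameters and multiply-adds; the paper's point is that realising this saving in practice requires rescaling initialisation and learning rates for the structured factors rather than reusing the dense layer's hyperparameters.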