G-Research were headline sponsors at NeurIPS 2022, in New Orleans.

ML is a fast-evolving discipline; attending conferences like NeurIPS and keeping up-to-date with the latest developments is key to the success of our quantitative researchers and machine learning engineers.

Our NeurIPS 2022 paper review series gives you the opportunity to hear about the research and papers that our quants and ML engineers found most interesting from the conference.

Here, Leo M, Quantitative Researcher at G-Research, discusses two papers from NeurIPS which both focus on data pruning:

**Paper 1: Beyond neural scaling laws: beating power law scaling via data pruning**

*Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos*

**Paper 2: The curse of (non)convexity: The case of an Optimization-Inspired Data Pruning algorithm**

*Fadhel Ayed, Soufiane Hayou*

I chose to discuss these two papers simultaneously because they both show interesting but apparently opposite results on the data pruning problem.

### Paper 1

Data pruning aims to make the training of large machine learning models faster by training on only a subset of selected "most relevant" data points. As developed in paper one, *Beyond neural scaling laws: beating power law scaling via data pruning*, it has been observed in various contexts that the error of a model decays as a power law in the number of training samples: Loss ∝ (number of examples)^{-a}, where a ≈ 0.1 for large language models.

Hence, such models need the addition of a huge number of new examples in order to get any significant performance improvement.

The first paper aims at breaking this slow power-law decay, hoping that a pruning method can achieve, for instance, exponential decay: Loss ∝ exp(−b × (number of selected examples)).
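To make the contrast concrete, here is a small numerical illustration; the exponent a = 0.1 matches the figure quoted above, while the rate b is an arbitrary value chosen purely for illustration:

```python
import numpy as np

# Illustrative constants only: a ~ 0.1 is the exponent quoted for large
# language models; b is an arbitrary rate picked to make the contrast visible.
a, b = 0.1, 1e-4
n = np.array([1e4, 1e5, 1e6])

power_law_loss = n ** (-a)         # Loss ~ n^(-a): very slow decay
exponential_loss = np.exp(-b * n)  # Loss ~ exp(-b n): fast decay

# 100x more data improves the power-law loss only by a factor 100^0.1 ~ 1.6,
# while the exponential loss collapses to essentially zero.
power_gain = power_law_loss[0] / power_law_loss[-1]
```

Under power-law scaling, multiplying the dataset size by 100 barely moves the loss, which is exactly why pruning methods that change the scaling regime are so attractive.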

The authors propose to study a simple noiseless perceptron model (i.e. learning a linear classifier from noiseless observations) in order to get intuition. This model can be solved analytically using the famous replica method from statistical physics, which allows us to get precise insights in that simple scenario.

Namely, they observe the following:

The optimal pruning strategy depends on the number of data points available. With limited data, one should keep the easiest examples, whereas when data is abundant, one should give priority to hard examples.

This is very intuitive, since with limited data one can only hope to learn a rough approximation of the optimal decision function, and for this we can ignore the very difficult examples that give very fine-grained information.

As the number of used training data points grows, we should therefore use more and more hard data points. Doing this in an optimal way allows us to get the exponential decay of the test error mentioned above.

In order to assess the difficulty of an example, they assume access to an oracle that gives the distance of the example to the separating hyperplane. When such an oracle is not available, the authors propose an approximate one, for instance a hyperplane learnt on a small subset of the data. They can then compute the error as a function of both the fraction of data kept and the angle theta between the hyperplane used to assess the difficulty of each data point and the true one.
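As a rough sketch of this approximate-oracle idea (my own toy version, not the authors' exact procedure), one can learn a probe hyperplane on a small subset with the perceptron algorithm and rank examples by their unsigned margin to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless perceptron data: labels come from a true hyperplane w_true.
d, n = 20, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)

# Approximate oracle: a hyperplane learnt on a small probe subset via the
# perceptron algorithm (a stand-in for whatever probe model one prefers).
w_hat = np.zeros(d)
for i in range(200):                     # probe subset: first 200 examples
    if np.sign(X[i] @ w_hat) != y[i]:
        w_hat += y[i] * X[i]

# Difficulty score: unsigned distance to the learnt (not the true) hyperplane.
margins = np.abs(X @ w_hat) / np.linalg.norm(w_hat)

# Abundant-data regime: keep the hardest (smallest-margin) fraction.
keep_frac = 0.3
k = int(keep_frac * n)
hard_idx = np.argsort(margins)[:k]
X_pruned, y_pruned = X[hard_idx], y[hard_idx]
```

In the scarce-data regime one would instead keep the largest-margin (easiest) examples; and since `w_hat` only approximates `w_true`, the ranking is imperfect, which is exactly the non-zero-theta situation discussed next.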

Interestingly, they show that for imperfect ranking of the examples (i.e. non-zero theta), there exists a minimum value of the fraction of kept data points, below which pruning becomes ineffective. In other words, with imperfect ranking of the examples one cannot prune more than some fixed fraction of the original dataset without compromising on accuracy, and we do not get an asymptotic exponential scaling.

While it is instructive to see these phenomena on a toy model, I found it particularly striking that the authors were able to observe them with ResNets on real-world image classification tasks.

### Paper 2

The first paper gave essentially positive results in favour of data pruning. In contrast, the second paper that I would like to discuss, *The curse of (non)convexity: The case of an Optimization-Inspired Data Pruning algorithm*, sheds a different light on this problem.

The authors here study stochastic gradient descent (SGD) for minimising a finite sum of functions, sum_i f_i, where each f_i corresponds to the loss of the model on the i-th example of the dataset.

They are interested in finding a way of sampling the SGD data points that performs better than uniform selection, and that would potentially use only a subset of the original dataset (hence resulting in data pruning).

In the strongly convex case, if the functions f_i are L_i-smooth (i.e. the gradient of f_i is L_i-Lipschitz), the authors propose to use L_i as a measure of the hardness of optimising f_i.

Indeed, large L_i leads to large condition numbers and hence slow convergence speed for gradient descent.

It therefore makes sense to sample the hard-to-optimise f_i more often: the authors propose to sample each example i with probability proportional to L_i and show that this outperforms other sampling schemes. From there, they propose a data pruning algorithm that keeps the data points with the largest L_i. For practical applications, where these Lipschitz constants are generally unknown, they propose approximations using either the gradients or the Hessians of the cost functions.
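To give a flavour of L_i-proportional sampling (a toy least-squares sketch in the spirit of the paper, not its exact algorithm), note that for squared losses the smoothness constants are known in closed form, L_i = ||x_i||², so importance-sampled SGD can use them directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem: f_i(w) = 0.5 * (x_i . w - y_i)^2, for which the
# smoothness constant is exactly L_i = ||x_i||^2.
n, d = 1000, 10
scales = rng.uniform(0.5, 3.0, size=(n, 1))   # heterogeneous example "hardness"
X = scales * rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                                 # noiseless targets

L = np.sum(X ** 2, axis=1)    # per-example smoothness constants L_i
p = L / L.sum()               # sample example i with probability proportional to L_i

# SGD with importance sampling: reweight each gradient by 1/(n * p_i) so the
# update stays an unbiased estimate of the full-batch gradient.
w = np.zeros(d)
step = 0.5 / L.mean()
for _ in range(5000):
    i = rng.choice(n, p=p)
    grad_i = (X[i] @ w - y[i]) * X[i]
    w -= step * grad_i / (n * p[i])
```

The reweighting by 1/(n p_i) keeps the gradient estimate unbiased, and with p_i ∝ L_i the effective smoothness of every sampled term becomes the same (the mean of the L_i), which is the intuition for why this sampling helps on poorly conditioned convex problems.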

They then show, on a toy convex model, that their strategy outperforms uniform sampling. However, they observe that on non-convex problems it does not perform better than uniform sampling.

A possible explanation given in the paper is that pruning modifies the data distribution of the examples that one uses (e.g. creating a bias towards points with large L_i), and leads to a biased estimate of the population loss. The authors illustrate this on a simple example where the biased population loss turns out to be very different from the actual population loss.

### Comparing the papers

I think that this last remark forces us to look again at the positive results of the first paper: how can data pruning lead to better results when it is using an incorrect data distribution?

When trying to predict y from x, one would ideally like to learn the conditional distribution P(y|x) (or E[y|x]). Hence, if the pruning does not modify the conditional distribution P(y|x) of the labels given the features, it should be possible to learn P(y|x) without any bias, even if the joint distribution of (x,y) in the pruned dataset does not match the “true distribution”.

For instance, any pruning scheme that is only based on x (independently of y) will preserve P(y|x). This echoes an observation made in the first paper, where the authors noticed that self-supervised pruning performs as well as supervised pruning.
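This can be checked on a toy simulation (my own, not from either paper): draw y from a fixed conditional P(y = 1 | x) = sigmoid(x), prune based on x alone, and verify that conditional means of y given x are unchanged up to noise even though the marginal of x shifts.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: y is drawn from a fixed conditional P(y = 1 | x) = sigmoid(x),
# so the quantity we want to learn, E[y | x], is known by construction.
n = 200_000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))).astype(float)

# Prune based on x only: keep points near the boundary with high probability.
# This changes the marginal of x but cannot change P(y | x).
keep_prob = np.where(np.abs(x) < 0.5, 0.8, 0.2)
kept = rng.uniform(size=n) < keep_prob
x_kept, y_kept = x[kept], y[kept]

# Conditional mean of y on a small x-bin, before and after pruning.
bin_full = y[(x >= 0.0) & (x < 0.25)].mean()
bin_pruned = y_kept[(x_kept >= 0.0) & (x_kept < 0.25)].mean()
```

The marginal of x in the pruned set is heavily skewed towards the boundary region, yet within any x-bin the label frequencies agree with the full dataset, which is the sense in which x-only pruning is unbiased for learning P(y|x).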