*This article is one of a series of paper reviews from our researchers and machine learning engineers – view more*

Machine Learning (ML) is a fast evolving discipline, which means conference attendance and hearing about the very latest research is key to the ongoing development and success of our quantitative researchers and ML engineers.

As part of their attendance at ICML 2023 in Hawaii, we asked our quants and machine learning practitioners to write about some of the papers and research that they found most interesting.

Here, Maria R, Quantitative Researcher at G-Research, discusses four papers.

__Learning to Maximize Mutual Information for Dynamic Feature Selection__

*Ian Covert, Wei Qiu, Mingyu Lu, Nayoon Kim, Nathan White, Su-In Lee*

Dynamic Feature Selection (DFS) aims to find the features in a dataset that carry most of the task-relevant information. Traditional methods for DFS rely on reinforcement learning; they are difficult to train and do not always perform better than static feature selection. In this paper, the authors introduce a simple method which selects features greedily based on their conditional mutual information (CMI) with the response.

Their method allows you to directly predict the optimal selection at each step by leveraging a variational approach to the CMI: when trained to optimality, the learned selection policy converges to the greedy CMI selection.

Given a prediction function F, the metric of interest is the expected loss of F evaluated on a subset of features. In the greedy algorithm, features are added to the set sequentially; at each step, the feature chosen is the one with the largest CMI with the response, conditional on the features already selected.
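The greedy CMI objective can be illustrated with a minimal sketch. This is not the paper's method (which trains a network to predict the selection via a variational bound); it is a plug-in estimator for discrete data, where CMI is computed directly from empirical joint entropies, purely to show the selection rule.

```python
import numpy as np

def joint_entropy(*cols):
    # Plug-in entropy of the empirical joint distribution of discrete columns.
    arr = np.stack(cols, axis=1)
    _, counts = np.unique(arr, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def cmi(x, y, z_cols):
    # I(x; y | z) = H(x, z) + H(y, z) - H(z) - H(x, y, z)
    if not z_cols:
        return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)
    return (joint_entropy(x, *z_cols) + joint_entropy(y, *z_cols)
            - joint_entropy(*z_cols) - joint_entropy(x, y, *z_cols))

def greedy_dfs(X, y, budget):
    # At each step, pick the feature with the largest CMI with the
    # response, conditioned on the features already selected.
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(budget):
        z = [X[:, k] for k in selected]
        scores = {j: cmi(X[:, j], y, z) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy dataset where one feature fully determines the response, the greedy rule picks that feature first, since its CMI with the response equals the full response entropy.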

__Iterative Approximate Cross-Validation__

*Yuetian Luo, Zhimei Ren, Rina Foygel Barber*

Leave-one-out cross-validation (LOOCV) is a robust and widely used method for evaluating and comparing machine learning models. It is, however, computationally expensive, as a different model must be trained for each example in the dataset.

Methods exist to approximate the LOOCV score using the parameters that solve the optimisation problem. However, the exact solution is often unknown, either due to the complexity of the learning problem or to the use of early stopping to avoid overfitting.

The authors propose an iterative algorithm for approximating the LOOCV parameters. The iteration step used by the algorithm is summarised in Eq. 11 of the paper. In essence, the method approximates the exact LOOCV parameters by expanding the leave-one-out loss in the difference between the LOOCV parameters and the full-data (risk-minimising) parameters; the exact parameters solving the leave-one-out optimisation problem are then replaced by their approximation from the previous step. The authors further include formal guarantees of convergence of their method.
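The flavour of the iteration can be sketched for ridge regression trained by full-batch gradient descent (the paper covers more general losses and optimisers; the function name and hyperparameters here are illustrative). Alongside the full-data iterate, one approximate leave-one-out iterate per sample is updated using a second-order expansion of the leave-one-out gradient around the full-data iterate. For a quadratic loss this expansion is exact, so the approximate iterates coincide with exact leave-one-out training.

```python
import numpy as np

def iacv_ridge(X, y, lam=1.0, lr=0.05, steps=100):
    """Iterative approximate LOOCV for ridge regression trained by gradient
    descent. theta is the full-data iterate; theta_loo[i] approximates the
    parameters obtained by training with sample i left out."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_loo = np.zeros((n, d))
    for _ in range(steps):
        for i in range(n):
            Xi = np.delete(X, i, axis=0)
            yi = np.delete(y, i)
            # Leave-one-out gradient and Hessian, evaluated at the
            # full-data iterate theta, then corrected to first order.
            g_i = Xi.T @ (Xi @ theta - yi) / (n - 1) + lam * theta
            H_i = Xi.T @ Xi / (n - 1) + lam * np.eye(d)
            theta_loo[i] = theta_loo[i] - lr * (g_i + H_i @ (theta_loo[i] - theta))
        grad = X.T @ (X @ theta - y) / n + lam * theta
        theta = theta - lr * grad
    return theta, theta_loo
```

The point of the scheme is that it reuses quantities already computed at the full-data iterate; for non-quadratic losses the update is only approximate, which is where the paper's convergence guarantees come in.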

__“Why did the model fail?”: Attributing Model Performance Changes to Distributional Shifts__

*Haoran Zhang, Harvineet Singh, Marzyeh Ghassemi, Shalmali Joshi*

Distribution shifts in a dataset often impact the performance of a machine learning model, but identifying which changes in distribution underlie such a performance change is not trivial. A shift in a marginal or conditional distribution of the inputs might have no impact on the loss, degrade performance, or even improve it.

Using a game-theoretic approach based on Shapley values, the authors are able to attribute the performance change to a fixed set of distributions that may shift across environments. Given such a set, they develop a model-agnostic approach to quantify the contribution of each distribution shift.
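The attribution machinery is the standard exact Shapley value, sketched below for a small set of "players" (candidate distribution shifts). In the paper, `value_fn(coalition)` would measure the model's performance change when only the distributions in the coalition are shifted to the target environment; here it is an abstract callable, and the player names in the usage example are illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values for a small player set.

    value_fn takes a tuple of players (a coalition) and returns the
    performance change when exactly those distributions are shifted.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(others, k):
                # Weight of coalition S in the Shapley average.
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value_fn(S + (p,)) - value_fn(S))
    return phi
```

For an additive value function the Shapley values recover each shift's standalone effect, and in general they sum to the total performance change between environments (the efficiency property), which is what makes them a natural attribution.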

__Scaling up Dataset Distillation to ImageNet-1K with constant memory__

*Justin Cui, Ruochen Wang, Si Si, Cho-Jui Hsieh*

The aim of dataset distillation is to generate a small dataset of synthetic samples from a large dataset without deteriorating performance too much.

Various methods exist for dataset distillation. Among these, “Matching Training Trajectories” (MTT) achieves state-of-the-art (SOTA) performance on smaller datasets. However, MTT is extremely memory-expensive, as it requires unrolling T SGD steps and thus storing T gradients. For large datasets this becomes infeasible.

The authors propose a memory-efficient version of MTT which only requires the computation graph of a single gradient, even when unrolling SGD for T steps. Consequently, they reduce the GPU memory complexity with respect to T from linear in MTT to constant, while computing an identical solution. This allows them to scale MTT to ImageNet-1K.

Key to performance is the use of soft labels when the number of classes is large. The idea behind the MTT method is to match the training trajectory of a model trained on the synthetic dataset to that of a model trained on the true dataset.
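The trajectory-matching objective itself can be sketched for a linear model (a toy stand-in for the neural networks used in the paper): train a student on the synthetic set for T SGD steps starting from a teacher checkpoint, then compare the result to a later teacher checkpoint, normalised by the distance between checkpoints. Note this sketch unrolls the T steps naively; the paper's contribution is computing the gradient with respect to the synthetic data one step at a time, so the full unrolled graph never needs to be stored.

```python
import numpy as np

def mtt_loss(syn_X, syn_y, theta_start, theta_target, lr=0.1, T=5):
    """Trajectory-matching objective for a linear regression student.

    theta_start / theta_target play the role of two teacher checkpoints;
    the student is trained on the synthetic data for T steps in between.
    """
    theta = theta_start.copy()
    for _ in range(T):  # unrolled inner loop of T SGD steps
        grad = syn_X.T @ (syn_X @ theta - syn_y) / len(syn_y)
        theta = theta - lr * grad
    # Squared parameter distance, normalised by the checkpoint gap so the
    # loss is comparable across stages of teacher training.
    num = np.sum((theta - theta_target) ** 2)
    den = np.sum((theta_start - theta_target) ** 2)
    return num / den
```

Minimising this loss over `syn_X` and `syn_y` (by differentiating through the unrolled steps) is what drives the synthetic data towards reproducing the teacher's trajectory.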

Finally, the authors verify that the synthetic dataset can be used to train models with different architectures, and thus carries the important information contained in the original inputs.