King's College London NLP
Seminar Details
Learning and Data in the Age of LLMs: Theoretical Insights into Knowledge Distillation, Test-Time Training and Synthetic Data Selection

👨‍🏫 Speaker: Professor Marco Mondelli

📅 Time: 2025/10/22

🎥 Recording: Will be available after the talk (2025/10/22)

📄 Abstract:
The availability of powerful models pre-trained on a vast corpus of data has spurred research on alternative training paradigms, and this talk presents three vignettes giving theoretical insights through the lens of high-dimensional regression.

The first vignette is about knowledge distillation, where one uses the output of a surrogate model as labels to supervise the training of a target model. I will be particularly interested in the phenomenon of weak-to-strong generalization, in which a strong student outperforms the weak teacher from which the task is learned. More precisely, I will provide a sharp characterization of the risk of the target model when the surrogate model is either arbitrary or obtained via empirical risk minimization. This shows that weak-to-strong training, with the surrogate as the weak model, provably outperforms training with strong labels under the same data budget, but it is unable to improve the data scaling law.

The second vignette is about test-time training (TTT), where one explicitly updates the weights of a model to adapt to the specific test instance. I will investigate a gradient-based TTT algorithm for in-context learning, where a linear transformer model is trained via a single gradient step on the in-context demonstrations provided in the test prompt. This shows how TTT can significantly reduce the sample size required for in-context learning, and it delineates the role of the alignment between the pre-training distribution and the target task.

Finally, the third vignette is about synthetic data selection, where one uses data obtained from a generative model to augment the training dataset. I will prove that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Remarkably, selecting synthetic data that match the covariance of the target distribution is optimal not only theoretically in the context of linear models, but also empirically: this procedure performs well against the state of the art across training paradigms, architectures, datasets, and generative models used for augmentation.
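As a toy illustration of the first vignette, the sketch below (a hypothetical Python setup, not the speaker's: the dimensions, noise level, sample sizes and ridge estimator are all assumptions) trains a ridge-regression student either on labels produced by a weak teacher or on the true noisy labels, under the same data budget. It only sets up the comparison and is not meant to reproduce the precise regime in which the theoretical result holds.

# Minimal sketch: weak-to-strong supervision vs. strong labels in linear regression.
# All constants are illustrative assumptions, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
d, n_teacher, n_student, n_test = 50, 60, 400, 2000
beta = rng.normal(size=d) / np.sqrt(d)          # ground-truth coefficients
noise = 0.5

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ beta + noise * rng.normal(size=n)

def ridge(X, y, lam=1e-2):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Weak teacher: fitted on a small sample, so it is a noisy surrogate of beta.
X_t, y_t = sample(n_teacher)
beta_teacher = ridge(X_t, y_t)

# Student data budget: the same inputs, labelled either by the teacher
# (weak-to-strong) or by the true noisy labels (strong supervision).
X_s, y_s = sample(n_student)
beta_w2s    = ridge(X_s, X_s @ beta_teacher)    # surrogate (teacher) labels
beta_strong = ridge(X_s, y_s)                   # true labels

X_te, y_te = sample(n_test)
for name, b in [("teacher", beta_teacher), ("weak-to-strong", beta_w2s),
                ("strong labels", beta_strong)]:
    print(f"{name:>15s} test risk: {np.mean((X_te @ b - y_te) ** 2):.3f}")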

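For the second vignette, the following hedged sketch mimics gradient-based test-time training with a plain linear predictor standing in for the linear transformer analysed in the talk; the pre-trained weights, task distribution, learning rate and prompt length are all assumed for illustration. A single gradient step on the in-context demonstrations adapts the predictor to each test prompt before it answers the query.

# Minimal sketch: one-step test-time training (TTT) for in-context regression.
# A plain linear model is used as a stand-in; all constants are assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, n_prompts, k_demos, lr = 20, 500, 10, 0.05

# "Pre-trained" weights: aligned with, but not equal to, the test-time tasks.
w_pre = rng.normal(size=d) / np.sqrt(d)

def mean_query_risk(use_ttt):
    errs = []
    for _ in range(n_prompts):
        # Each prompt carries its own task vector, correlated with w_pre.
        w_task = 0.8 * w_pre + 0.2 * rng.normal(size=d) / np.sqrt(d)
        X = rng.normal(size=(k_demos, d))
        y = X @ w_task + 0.1 * rng.normal(size=k_demos)
        w = w_pre.copy()
        if use_ttt:
            # Single gradient step on the squared loss over the demonstrations.
            grad = 2.0 / k_demos * X.T @ (X @ w - y)
            w = w - lr * grad
        x_q = rng.normal(size=d)
        errs.append((x_q @ w - x_q @ w_task) ** 2)
    return float(np.mean(errs))

print("query risk without TTT:", round(mean_query_risk(False), 4))
print("query risk with one-step TTT:", round(mean_query_risk(True), 4))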

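For the third vignette, here is a small hypothetical sketch of covariance-matched synthetic data selection for a linear model: a greedy rule (a simple heuristic chosen for brevity, not the procedure from the talk) picks synthetic points whose empirical second moment tracks the target covariance, and the result is compared with a random subset of the same size. The pool sizes, the shift applied to the synthetic distribution and the greedy criterion are all assumptions; the sketch illustrates the selection idea rather than the quantitative claims above.

# Minimal sketch: selecting synthetic data whose covariance matches the target.
# Everything below is an illustrative toy setup.
import numpy as np

rng = np.random.default_rng(2)
d, n_real, n_pool, n_select, n_test = 10, 40, 1000, 150, 5000
beta = rng.normal(size=d)

def sample_target(n):
    # Target distribution: standard Gaussian features, noisy linear labels.
    X = rng.normal(size=(n, d))
    return X, X @ beta + 0.5 * rng.normal(size=n)

# Synthetic pool: shifted mean and anisotropic covariance, labelled by the
# same linear rule (a simplification of a real generative model).
A = np.diag(rng.uniform(0.5, 2.0, size=d))
X_pool = rng.normal(size=(n_pool, d)) @ A + 1.0
y_pool = X_pool @ beta + 0.5 * rng.normal(size=n_pool)

X_real, y_real = sample_target(n_real)
target_cov = np.eye(d)  # the target covariance is known here by construction

def greedy_select(X, k):
    # Greedily add points so the running second-moment matrix stays close
    # (in Frobenius norm) to the target covariance.
    chosen, S, remaining = [], np.zeros((d, d)), list(range(len(X)))
    for t in range(1, k + 1):
        scores = [np.linalg.norm((S + np.outer(X[i], X[i])) / t - target_cov)
                  for i in remaining]
        best = remaining.pop(int(np.argmin(scores)))
        chosen.append(best)
        S += np.outer(X[best], X[best])
    return np.array(chosen)

def risk_with(X_extra, y_extra):
    # Ridge regression on real + selected synthetic data, tested on target data.
    X = np.vstack([X_real, X_extra])
    y = np.concatenate([y_real, y_extra])
    b = np.linalg.solve(X.T @ X + 1e-2 * np.eye(d), X.T @ y)
    X_te, y_te = sample_target(n_test)
    return np.mean((X_te @ b - y_te) ** 2)

idx_cov = greedy_select(X_pool, n_select)
idx_rand = rng.choice(n_pool, size=n_select, replace=False)
for name, idx in [("covariance-matched", idx_cov), ("random", idx_rand)]:
    mismatch = np.linalg.norm(np.cov(X_pool[idx], rowvar=False) - target_cov)
    print(f"{name:>18s}: covariance mismatch {mismatch:.2f}, "
          f"test risk {risk_with(X_pool[idx], y_pool[idx]):.3f}")
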
👨‍🎓 Biography:
Marco Mondelli received the B.S. and M.S. degrees in Telecommunications Engineering from the University of Pisa, Italy, in 2010 and 2012, respectively. In 2016, he obtained his Ph.D. degree in Computer and Communication Sciences at EPFL. From 2017 to 2019, he was a Postdoctoral Scholar in the Department of Electrical Engineering at Stanford University. In 2018, he was also a Research Fellow with the Simons Institute for the Theory of Computing, for the program on "Foundations of Data Science". He has been a faculty member at the Institute of Science and Technology Austria (ISTA) since 2019, first as an Assistant Professor and, since 2025, as a Professor. His research interests include data science, machine learning, high-dimensional statistics, information theory, and coding theory. He is the recipient of a number of fellowships and awards, including the Jack K. Wolf ISIT Student Paper Award in 2015, the STOC Best Paper Award in 2016, the EPFL Doctorate Award in 2018, the Simons-Berkeley Research Fellowship in 2018, the Lopez-Loreta Prize in 2019, the Information Theory Society Best Paper Award in 2021, and an ERC Starting Grant in 2024.




© 2025 Copyright: KCL NLP Group