King's College London NLP
Seminar Details
Extracting and Utilizing Interpretation in Large Language Models

👨‍🏫 Speaker: Xuansheng Wu

📅 Time: 2025/10/15

🎥 Recording: TBD (will be available after the talk)

📄 Abstract:
Large language models (LLMs) show strong promise across many scenarios, yet still fall short in others. Addressing these shortcomings requires explanations that not only reveal failure modes but also guide concrete fixes. We present a unified framework that extracts and utilizes interpretations at four stages of the LLM life cycle: data preparation, training, inference, and post-processing. Our preliminary results demonstrate that explanations can (i) identify hallucinated responses via post-generation filtering, (ii) steer LLMs away from jailbreaks during inference, and (iii) regularize reliance on unintended features during training. Ongoing work extends these ideas to diversity-driven synthetic data generation for model training. Collectively, these contributions chart a practical path toward developing and deploying LLMs that are safer, more transparent, and broadly trustworthy.


👨‍🎓 Biography:
Xuansheng Wu is a fifth-year PhD candidate in Computer Science at the University of Georgia. His research focuses on the mechanistic interpretability of large language models, with an emphasis on understanding hidden representations to enable fine-grained model control, such as instruction following, hallucination detection, and logical reasoning. He has published 18 peer-reviewed papers, including 9 as first author and 2 as co-first author, and his work has received over 900 citations on Google Scholar. He also contributed to a project proposal that secured a $10 million grant from the U.S. Institute of Education Sciences to establish the National Center for AI-Generated Content. Xuansheng has interned with leading industry teams, including Baidu NLP, Tencent AI Lab, and Amazon Rufus, and his open-source projects on GitHub have received over 800 stars.

© 2025 Copyright: KCL NLP Group