Tongshuang (Sherry) Wu

Principles and Tools for Evaluating and Improving NLP Models

You think you understood your model, but you really didn’t :)

While the accuracy of Natural Language Processing (NLP) models has been going up, users have expectations beyond what accuracy alone captures (e.g., chatbot assistants should reply fairly). To guard the end-user experience, experts are responsible for deploying trustworthy models and iterating on those with clear weak spots. However, state-of-the-art models are inherently complex, making it hard to decipher whether a model is behaving as expected, let alone to make targeted improvements. In this talk, I will discuss the importance of comprehensive analyses of model inputs and outputs for guiding practitioners toward systematic model understanding and improvement. Along the way, I will describe pitfalls in the status-quo analysis process: it relies on insufficient local context, which can lead to false trust or suboptimal interventions. I will introduce two essential building blocks for informative analysis: (1) scaling up the analysis by grouping similar instances, and (2) isolating important components by generating counterfactuals. I will also illustrate and compare tools and models I built on top of these two building blocks (Errudite, CheckList, Polyjuice), and discuss the importance of balancing domain expertise and inherent NLP task properties in the design of such tools.


Tongshuang Wu is a fifth-year PhD student at the University of Washington, co-advised by Jeffrey Heer and Daniel S. Weld. Her research focuses on helping humans more effectively and systematically evaluate and interact with their models through Human-Computer Interaction (HCI) and Natural Language Processing (NLP).
