Holistic Evaluation of Language Models

MATH dataset scenario

Percy Liang†, Rishi Bommasani†, Tony Lee†, Dimitris Tsipras*, Dilara Soylu*, Michihiro Yasunaga*, Yian Zhang*, Deepak Narayanan*, Yuhuai Wu*, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Chris Ré, Christian Cosgrove, Christopher D. Manning, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Kathleen McKeown, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda; CRFM
†: lead authors
*: major contributors

Accepted to TMLR 2023.


How well do large language models perform on the MATH dataset?
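To make the question concrete: MATH (Hendrycks et al., 2021) consists of competition mathematics problems paired with step-by-step reference solutions whose final answers are marked with \boxed{...}. The sketch below shows one minimal way to score a model on this data: extract the boxed answer from each reference solution, query the model, and compute exact-match accuracy. It is an illustration only; the local file layout, the query_model callable, and the strict string comparison are assumptions, and a full evaluation such as HELM's MATH scenario additionally handles prompting (e.g., chain of thought) and more forgiving answer equivalence checks.

```python
import json
from pathlib import Path


def extract_boxed_answer(solution: str) -> str:
    """Return the contents of the last \\boxed{...} expression in a solution string."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return ""
    i = start + len(r"\boxed{")
    depth = 1
    chars = []
    # Walk forward matching braces so nested {} inside the answer are kept.
    while i < len(solution) and depth > 0:
        c = solution[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(c)
        i += 1
    return "".join(chars).strip()


def evaluate(problems, query_model):
    """Exact-match accuracy of a model's final answers over a list of MATH problems."""
    correct = 0
    for ex in problems:
        gold = extract_boxed_answer(ex["solution"])
        prediction = query_model(ex["problem"])               # model's full answer text
        pred = extract_boxed_answer(prediction) or prediction.strip()
        correct += int(pred == gold)                          # strict string comparison (simplification)
    return correct / len(problems)


if __name__ == "__main__":
    # Assumes the official MATH release layout: one JSON file per problem,
    # each with "problem", "level", "type", and "solution" fields.
    problems = [json.loads(p.read_text()) for p in Path("MATH/test").rglob("*.json")]
    dummy_model = lambda problem: r"The answer is \boxed{42}."  # placeholder for a real LM call
    print(f"Exact-match accuracy: {evaluate(problems, dummy_model):.3f}")
```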