Holistic Evaluation of Language Models

MATH dataset scenario

@Percy Liang†, @Rishi Bommasani†, @Tony Lee†, @Dimitris Tsipras*, @Dilara Soylu*, @Michihiro Yasunaga*, @Yian Zhang*, @Deepak Narayanan*, @Yuhuai Wu*, @Ananya Kumar, @Benjamin Newman, @Binhang Yuan, @Bobby Yan, @Ce Zhang, @Chris Ré, @Christian Cosgrove, @Christopher D. Manning, @Diana Acosta-Navas, @Drew A. Hudson, @Eric Zelikman, @Esin Durmus, @Faisal Ladhak, @Frieda Rong, @Hongyu Ren, @Huaxiu Yao, @Jue Wang, @Kathleen McKeown, @Keshav Santhanam, @Laurel Orr, @Lucia Zheng, @Mert Yuksekgonul, @Mirac Suzgun, @Nathan Kim, @Neel Guha, @Niladri Chatterji, @Omar Khattab, @Peter Henderson, @Qian Huang, @Ryan Chi, @Sang Michael Xie, @Shibani Santurkar, @Surya Ganguli, @Tatsunori Hashimoto, @Thomas Icard, @Tianyi Zhang, @Vishrav Chaudhary, @William Wang, @Xuechen Li, @Yifan Mai, @Yuhui Zhang, @Yuta Koreeda; CRFM
†: lead authors
*: major contributors

Accepted to TMLR 2023.


How well do large language models perform on the MATH dataset, a benchmark of competition mathematics problems whose reference solutions mark the final answer with \boxed{...}?