Who Judges the Judge: An Empirical Study on Online Judge Tests
Online Judge platforms play a pivotal role in education, competitive programming, recruitment, career training, and large language model training. They rely on predefined test suites to judge the correctness of submitted solutions. It is therefore important that the solution judgement is reliable and free from potentially misleading false positives (i.e., incorrect solutions that are judged as correct).
In this paper, we conduct an empirical study of 939 coding problems with 541,552 solutions, all of which are judged to be correct according to the test suites used by the platform, finding that 43.4% of the problems include false positive solutions (3,440 bugs are revealed in total).
We also find that test suites are, nevertheless, of high quality according to widely-studied test effectiveness measurements: 88.2% of false positives have perfect (100%) line coverage, 78.9% have perfect branch coverage, and 32.5% have a perfect mutation score.
Our findings indicate that more work is required to weed out false positive solutions and to further improve test suite effectiveness. We have released the detected false positive solutions and the generated test inputs to facilitate future research.
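To make the notion of a false positive concrete, the following is a minimal sketch and is not taken from the paper. It assumes a hypothetical problem ("return the largest of three integers"), a hypothetical two-case test suite, and hypothetical helper names (`judge`, `find_counterexample`). The buggy solution passes both tests, and those two tests execute every line and both outcomes of each `if`, yet a small exhaustive comparison against a reference solution still reveals the bug.

```python
from itertools import product


def max_of_three_reference(a, b, c):
    """Reference solution for the hypothetical problem."""
    return max(a, b, c)


def max_of_three_buggy(a, b, c):
    """Incorrect solution: the second comparison uses `a` instead of `m`."""
    m = a
    if b > m:
        m = b
    if c > a:  # bug: should be `c > m`
        m = c
    return m


# Hypothetical weak test suite: (inputs, expected output).
# Together the two cases run every line and take both outcomes of each `if`.
WEAK_SUITE = [
    ((1, 2, 3), 3),
    ((3, 2, 1), 3),
]


def judge(solution, suite):
    """Accept a solution if it matches the expected output on every test."""
    return all(solution(*args) == expected for args, expected in suite)


def find_counterexample(solution, reference, values=range(4)):
    """Compare against the reference on all small inputs; return the first mismatch."""
    for args in product(values, repeat=3):
        if solution(*args) != reference(*args):
            return args
    return None


if __name__ == "__main__":
    print("accepted by weak suite:", judge(max_of_three_buggy, WEAK_SUITE))
    print("counterexample:", find_counterexample(max_of_three_buggy, max_of_three_reference))
```

Any input with a < c < b, such as (0, 2, 1), exposes the error. Exhaustive comparison is used here only because the input space is tiny; it is an illustrative choice, not a description of how the study generated its test inputs.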
Tue 18 Jul, displayed time zone: Pacific Time (US & Canada)
15:30 - 17:00 | ISSTA Online 3: Empirical Studies | Technical Papers at Bezos Seminar Room (Gates G04) | Chair(s): Jordan Samhi (University of Luxembourg)
15:30 10m Talk | Understanding Breaking Changes in the Wild | Technical Papers | Dhanushka Jayasuriya (University of Auckland), Valerio Terragni (University of Auckland), Jens Dietrich (Victoria University of Wellington), Samuel Ou (University of Auckland), Kelly Blincoe (University of Auckland)
15:40 10m Talk | LiResolver: License Incompatibility Resolution for Open Source Software | Technical Papers | Sihan Xu (Nankai University), Ya Gao (Nankai University), Lingling Fan (Nankai University), Linyu Li (Nankai University), Xiangrui Cai (Nankai University), Zheli Liu (Nankai University)
15:50 10m Talk | An Empirical Study on Concurrency Bugs in Interrupt-Driven Embedded Software | Technical Papers | Chao Li (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Rui Chen (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Boxiang Wang (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Zhixuan Wang (Xidian University), Tingting Yu (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Yunsong Jiang (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Mengfei Yang (China Academy of Space Technology)
16:00 10m Talk | An Empirical Study of Functional Bugs in Android Apps | ACM SIGSOFT Distinguished Paper | Technical Papers | Yiheng Xiong (East China Normal University), Mengqian Xu (East China Normal University), Ting Su (East China Normal University), Jingling Sun (East China Normal University), Jue Wang (Nanjing University), He Wen (East China Normal University), Geguang Pu (East China Normal University), Jifeng He (East China Normal University), Zhendong Su (ETH Zurich)
16:10 10m Talk | Testing the Compiler for a New-Born Programming Language: An Industrial Case Study (Experience Paper) | Technical Papers | Yingquan Zhao (Tianjin University), Junjie Chen (Tianjin University), Ruifeng Fu (Tianjin University), Haojie Ye (Huawei), Zan Wang (Tianjin University)
16:20 10m Talk | An Empirical Study on the Effects of Obfuscation on Static Machine Learning-Based Malicious JavaScript Detectors | Technical Papers | Kunlun Ren (Huazhong University of Science and Technology), Qiang Weizhong (Huazhong University of Science and Technology), Yueming Wu (Nanyang Technological University), Yi Zhou (Huazhong University of Science and Technology), Deqing Zou (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)
16:30 10m Talk | Security Checking of Trigger-Action-Programming Smart Home Integrations | Technical Papers | Lei Bu (Nanjing University), Qiuping Zhang (Nanjing University), Suwan Li (Nanjing University), Jinglin Dai (Nanjing University), Guangdong Bai (University of Queensland), Kai Chen (Institute of Information Engineering at Chinese Academy of Sciences), Xuandong Li (Nanjing University)
16:40 10m Talk | Third-Party Library Dependency for Large-Scale SCA in the C/C++ Ecosystem: How Far Are We? | Technical Papers | Ling Jiang (Southern University of Science and Technology), Hengchen Yuan (Southern University of Science and Technology), Qiyi Tang (Tencent Security Keen Lab), Sen Nie (Tencent Security Keen Lab), Shi Wu (Tencent Security Keen Lab), Yuqun Zhang (Southern University of Science and Technology)
16:50 10m Talk | Who Judges the Judge: An Empirical Study on Online Judge Tests | Technical Papers | Kaibo Liu (Peking University), Yudong Han (Peking University), Jie M. Zhang (King’s College London), Zhenpeng Chen (University College London), Federica Sarro (University College London), Mark Harman (University College London), Gang Huang (Peking University; National Key Laboratory of Data Space Technology and System), Yun Ma (Peking University)