Deep learning (DL) models for analyzing source code have shown immense promise in the past few years.
More recently, self-supervised pre-training has gained traction for learning generic code representations that are valuable for many downstream software engineering (SE) tasks, such as clone detection and bug detection.

While previous work has successfully learned from different code abstractions (e.g., tokens, ASTs, graphs), we argue that learning general-purpose representations also requires factoring in how developers code day-to-day. On the one hand, developers tend to write repetitive programs by referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow) rather than implementing functions from scratch; this behavior results in a vast number of code clones. On the other hand, a clone that deviates by mistake might trigger malicious program behaviors.

Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
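The idea of pulling benign clones together while pushing deviants apart can be illustrated with a generic InfoNCE-style contrastive loss. This is a minimal sketch under stated assumptions, not CONCORD's actual objective: the function names (`cosine`, `clone_aware_loss`), the temperature value, and the two-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clone_aware_loss(anchor, clone, deviants, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    The benign clone acts as the positive example and the deviant
    (buggy) variants act as negatives. The loss is small when the
    anchor is closer to its clone than to any deviant.
    """
    pos = math.exp(cosine(anchor, clone) / temperature)
    neg = sum(math.exp(cosine(anchor, d) / temperature) for d in deviants)
    return -math.log(pos / (pos + neg))

# Toy embeddings (hypothetical): the anchor program and two variants.
anchor = [1.0, 0.0]

# Desired layout: the clone sits near the anchor, the deviant far away.
good_loss = clone_aware_loss(anchor, clone=[0.9, 0.1], deviants=[[0.0, 1.0]])

# Undesired layout: the deviant sits near the anchor, the clone far away.
bad_loss = clone_aware_loss(anchor, clone=[0.0, 1.0], deviants=[[0.9, 0.1]])

print(good_loss < bad_loss)
```

In this toy setup the desired layout yields a much lower loss than the undesired one, so minimizing such a loss would move an encoder toward representations where clones cluster and deviants separate.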

Tue 18 Jul

Displayed time zone: Pacific Time (US & Canada)

13:30 - 15:00
ISSTA 3: Deep-Learning for Software Analysis (Technical Papers) at Amazon Auditorium (Gates G20)
Chair(s): Shiyi Wei University of Texas at Dallas
13:30
15m
Talk
API2Vec: Learning Representations of API Sequences for Malware Detection
Technical Papers
Lei Cui Zhongguancun Laboratory, Jiancong Cui University of Chinese Academy of Sciences; Institute of Information Engineering at Chinese Academy of Sciences, Yuede Ji University of North Texas, Zhiyu Hao Zhongguancun Laboratory, Lun Li Institute of Information Engineering at Chinese Academy of Sciences, Zhenquan Ding Institute of Information Engineering at Chinese Academy of Sciences
13:45
15m
Talk
CONCORD: Clone-Aware Contrastive Learning for Source Code (ACM SIGSOFT Distinguished Paper)
Technical Papers
Yangruibo Ding Columbia University, Saikat Chakraborty Microsoft Research, Luca Buratti IBM Research, Saurabh Pujar IBM, Alessandro Morari IBM Research, Gail Kaiser Columbia University, Baishakhi Ray Columbia University
14:00
15m
Talk
Type Batched Program Reduction
Technical Papers
Golnaz Gharachorlu Simon Fraser University, Nick Sumner Simon Fraser University
14:15
15m
Talk
CodeGrid: A Grid Representation of Code
Technical Papers
Abdoul Kader Kaboré University of Luxembourg, Earl T. Barr University College London; Google DeepMind, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg
14:30
15m
Talk
Guided Retraining to Enhance the Detection of Difficult Android Malware
Technical Papers
Nadia Daoudi University of Luxembourg, Kevin Allix CentraleSupélec, Tegawendé F. Bissyandé University of Luxembourg, Jacques Klein University of Luxembourg
14:45
15m
Talk
Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning
Technical Papers
Zhaoxu Zhang University of Southern California, Robert Winn University of Southern California, Yu Zhao University of Central Missouri, Tingting Yu University of Cincinnati, William G.J. Halfond University of Southern California