Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis (ISSTA 2023 - Technical Papers)

Who

Xiangzhe Xu, Shiwei Feng, Yapeng Ye, Guangyu Shen, Zian Su, Siyuan Cheng, Guanhong Tao, Qingkai Shi, Zhuo Zhang, Xiangyu Zhang

Track

ISSTA 2023 Technical Papers

When

Wed 19 Jul 2023 10:45 - 11:00 at Smith Classroom (Gates G10) - ISSTA 5: Improving Deep Learning Systems Chair(s): Michael Pradel

Abstract

Given a function in binary executable form, binary code similarity analysis retrieves a set of similar functions from a large pool of candidate functions. These similar functions are usually compiled from the same source code under different compilation setups. Such analysis has many applications, such as malware detection, code clone detection, and automatic software patching. State-of-the-art methods use complex deep learning models such as Transformers. We observe that these models suffer from undesirable instruction distribution biases caused by specific compiler conventions. We develop a novel technique to detect such biases and repair them by removing the corresponding instructions from the dataset and finetuning the models. This entails synergy between deep learning model analysis and program analysis. Our results show that we can substantially improve the state-of-the-art models' performance by up to 14.4% in the most challenging cases, where test data may be out of the distribution of the training data.
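To make the task in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each function has already been mapped to an embedding vector by some model (the vectors here are made up), ranks pool candidates by cosine similarity to a query, and shows what removing a chosen set of instructions from a function's instruction list could look like. The names `rank_candidates` and `drop_instructions` and the opcode set are hypothetical; how the paper detects which instructions are biased is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, pool, top_k=3):
    """Return the top_k (name, score) pairs from the pool, most similar first.

    `pool` maps a function name to its embedding (assumed to be precomputed).
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in pool.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def drop_instructions(instructions, opcodes_to_remove):
    """Remove instructions whose opcode is in `opcodes_to_remove`.

    `opcodes_to_remove` is a hypothetical, hand-picked set for illustration.
    """
    kept = []
    for ins in instructions:
        opcode = ins.split()[0] if ins.strip() else ""
        if opcode not in opcodes_to_remove:
            kept.append(ins)
    return kept


# Toy data: made-up 2-d embeddings for three candidate functions.
pool = {"f_O0": [1.0, 0.0], "g": [0.0, 1.0], "f_O3": [0.9, 0.1]}
top = rank_candidates([1.0, 0.0], pool, top_k=2)

cleaned = drop_instructions(
    ["push rbp", "mov rbp, rsp", "add eax, 1", "endbr64"],
    {"push", "endbr64"},
)
```

In this toy setup the two variants of `f` rank above `g`. In practice the filtered dataset would then be used to finetune the model, as the abstract describes.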

DOI

https://doi.org/10.1145/3597926.3598121

Xiangzhe Xu, Purdue University, United States
Shiwei Feng, Purdue University, United States
Yapeng Ye, Purdue University, United States
Guangyu Shen, Purdue University, United States
Zian Su, Purdue University, United States
Siyuan Cheng, Purdue University, United States
Guanhong Tao, Purdue University, United States
Qingkai Shi, Purdue University, United States
Zhuo Zhang, Purdue University, United States
Xiangyu Zhang, Purdue University, United States

Session Program

Wed 19 Jul
Displayed time zone: Pacific Time (US & Canada)

10:30 - 12:00	ISSTA 5: Improving Deep Learning Systems (Technical Papers) at Smith Classroom (Gates G10). Chair(s): Michael Pradel, University of Stuttgart

10:30 15m Talk	Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper). Xu Nie (Huazhong University of Science and Technology; Beijing University of Posts and Telecommunications), Ningke Li (Huazhong University of Science and Technology), Kailong Wang (Huazhong University of Science and Technology), Shangguang Wang (Beijing University of Posts and Telecommunications), Xiapu Luo (Hong Kong Polytechnic University), Haoyu Wang (Huazhong University of Science and Technology)
10:45 15m Talk	Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis. Xiangzhe Xu, Shiwei Feng, Yapeng Ye, Guangyu Shen, Zian Su, Siyuan Cheng, Guanhong Tao, Qingkai Shi, Zhuo Zhang, Xiangyu Zhang (all Purdue University)
11:00 15m Talk	CILIATE: Towards Fairer Class-Based Incremental Learning by Dataset and Training Refinement. Xuanqi Gao (Xi’an Jiaotong University), Juan Zhai (University of Massachusetts Amherst), Shiqing Ma (UMass Amherst), Chao Shen (Xi’an Jiaotong University), Yufei Chen (Xi’an Jiaotong University; City University of Hong Kong), Shiwei Wang (Xi’an Jiaotong University)
11:15 15m Talk	DeepAtash: Focused Test Generation for Deep Learning Systems. Tahereh Zohdinasab (USI Lugano), Vincenzo Riccio (University of Udine), Paolo Tonella (USI Lugano)
11:30 15m Talk	Systematic Testing of the Data-Poisoning Robustness of KNN. Yannan Li (University of Southern California), Jingbo Wang (University of Southern California), Chao Wang (University of Southern California)
11:45 15m Talk	Semantic-Based Neural Network Repair. Richard Schumi (Singapore Management University), Jun Sun (Singapore Management University)