Efficient Cross-Architecture Binary Function Embeddings through Knowledge Distillation

by Dominik Bayerl, Thomas Hutzelmann, and Hans-Joachim Hof (Technische Hochschule Ingolstadt)

Abstract

Deep learning has recently been shown to be effective in various tasks related to static binary analysis. One important analysis task is the binary function similarity problem: given the binary code of two functions compiled with different compilers, with different settings, and for different processor architectures, the goal is to decide whether the functions are semantically equivalent (i.e., "similar") or not. This problem has numerous applications for embedded systems, for example plagiarism detection, validating compliance with software license restrictions, more efficient reverse engineering of existing binary codebases, or vulnerability scanning by detecting known vulnerable functions. In this paper, we propose a novel training scheme for the popular transformer neural network architecture to learn function embeddings directly from instruction listings. Unlike existing approaches, our solution explicitly considers the cross-architecture scenario: we propose a training method to adapt the model to different instruction set architectures (ISAs) without having to train a new model from scratch. This allows the model to be used efficiently for embedded systems, where a wide variety of processor architectures is in use. We show that our solution achieves a similarity classification accuracy of 89.6% on a dataset consisting of several real-world open source software projects. Finally, we conduct extensive experiments to demonstrate the effectiveness of knowledge distillation in increasing the computational efficiency of the embedding model. We demonstrate a reduction in the number of parameters from 87M to 23M, while still maintaining a classification accuracy of 87.8%. Our code and artifacts are available as open source.
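
To make the two central ideas concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that similarity is decided by thresholding the cosine similarity of two function embeddings, and that distillation trains the smaller (student) model to reproduce the embeddings of the larger (teacher) model under a mean squared error. The threshold value, the toy vectors, and the names `is_similar` and `distillation_loss` are hypothetical; the abstract does not specify the similarity measure or the distillation objective.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(emb_a, emb_b, threshold=0.8):
    """Classify two functions as similar if their embeddings are close.

    The threshold of 0.8 is an illustrative assumption, not a value
    taken from the paper.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


def distillation_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher embeddings.

    Assumed objective for illustration; other distillation losses
    are possible.
    """
    squared = [(s - t) ** 2 for s, t in zip(student_emb, teacher_emb)]
    return sum(squared) / len(squared)


# Hypothetical toy embeddings: the same function compiled for two ISAs,
# and an unrelated function.
x86_strlen = [0.9, 0.1, 0.3]
arm_strlen = [0.8, 0.2, 0.35]
x86_memcpy = [-0.2, 0.9, 0.1]

same_function = is_similar(x86_strlen, arm_strlen)
different_function = is_similar(x86_strlen, x86_memcpy)

# Relative parameter reduction reported in the abstract (87M to 23M).
reduction = 1 - 23e6 / 87e6
```

With the parameter counts stated above, `reduction` evaluates to roughly 0.74, i.e. the distilled model has about a quarter of the original parameters.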