Source: “Comparing Deep Learning and Shallow Learning Techniques for API Calls Malware Prediction: A Study”, 4 February 2022, Paper, https://www.mdpi.com/2076-3417/12/3/1645/htm
Abstract
Recognizing malware is critical in cybersecurity because it allows both the download and the execution of malicious files to be prevented. One possible approach is to analyze the Application Programming Interface (API) calls an executable makes, which can be recorded with sandbox tools such as Cuckoo or CAPEv2. The resulting chain of calls can then be used to classify the file as benign or malicious. This work compares six modern shallow learning and deep learning techniques for tabular data on two datasets of malware and goodware, in which each instance is described by its chain of API calls. The results show the strength of shallow learning approaches based on tree ensembles, such as CatBoost, in terms of F1-macro score, Area Under the ROC Curve (AUC ROC), and training time, which makes them well suited to inference in Edge AI solutions. The results are then analyzed with SHAP, an explainable AI technique, to identify the API calls that most influence the prediction, i.e., those most strongly associated with malware and with goodware.
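To make the abstract's terms concrete, the sketch below shows one way a chain of API calls could become a tabular row, and how the F1-macro score is computed. The count-based encoding, the function names, and the sample call chains are illustrative assumptions. The abstract does not specify how the datasets encode the chains, so this should not be read as the authors' method.

```python
from collections import Counter


def encode_call_chain(chain, vocabulary):
    """Turn one API-call chain into a fixed-length row of call counts.

    Assumption: a bag-of-calls encoding, chosen only for illustration.
    """
    counts = Counter(chain)
    return [counts[name] for name in vocabulary]


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical call chains, invented for this example.
chains = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
]
vocabulary = sorted({name for chain in chains for name in chain})
rows = [encode_call_chain(chain, vocabulary) for chain in chains]

# Hypothetical labels (1 = malware, 0 = goodware) and predictions.
score = f1_macro([1, 1, 0, 0], [1, 0, 0, 0])
```

Because F1-macro averages the per-class scores without weighting by class size, a classifier that ignores the minority class is penalized even when overall accuracy looks high.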
Introduction
Protecting a computer system, especially in a smart enterprise, is a key factor in the company’s survival: the damage caused by cyberattacks can have a significant economic impact. Malware attacks are particularly relevant, because they either damage the system directly or intercept information that matters to the company. It is therefore necessary to deploy techniques that can detect and recognize malware within enterprise networks. Most malware detection systems rely on signature verification, which is effective against known malicious software but inadequate for new malware, for which no signatures exist yet. This limitation is a problem: a signature-based system can intercept attacks assembled from prefabricated elements, but it cannot protect against a purpose-built attack. One way to overcome it is to use the sandbox software CAPEv2 [1], the evolution of Cuckoo [2], to analyze suspected threats. Machine learning and classification techniques that operate on information about the behavior of executables make it possible to detect malware, especially zero-day malware, i.e., malware that is not yet known.
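The limitation of signature verification described above can be illustrated with a deliberately simplified sketch. It assumes a signature is just the SHA-256 hash of the whole file. Real engines also use byte patterns and heuristics, so this is a toy model, and the payload bytes and function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used here as a stand-in for a malware signature."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(sample: bytes, signatures: set) -> bool:
    """Signature check: flags a sample only if its hash is already listed."""
    return sha256_hex(sample) in signatures


# Hypothetical sample that has already been analyzed and catalogued.
known_sample = b"hypothetical malicious payload"
signature_db = {sha256_hex(known_sample)}

# A trivially modified variant: same behavior in principle, different bytes.
variant = known_sample + b"\x00"

known_detected = is_known_malware(known_sample, signature_db)
variant_detected = is_known_malware(variant, signature_db)
```

In this toy model the catalogued sample is flagged while the one-byte variant is not, which is the gap that behavior-based classification on sandbox output is meant to close.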