A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

Overview

A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

Datasets

Because of copyright issues, both the MalwareBazaar dataset and the MalwareDrift dataset just contain the malware SHA-256 hash and all of the related information which can be find in the Datasets folder. You can download raw malware samples from the open-source malware release website by applying an api-key, and use disassembly tool to convert the malware into binary and disassembly files.

  • The MalwareBazaar dataset : you can download the samples from MalwareBazaar.
  • The MalwareDrift dataset : you can download the samples from VirusShare.

Experimental Settings

Model Training Strategy Optimizer Learning Rate Batch Size Input Format
ResNet-50 From Scratch Adam 1e-3 64 224*224 color image
ResNet-50 Transfer Adam 1e-3 All data* 224*224 color image
VGG-16 From Scratch SGD 5e-6** 64 224*224 color image
VGG-16 Transfer SGD 5e-6 64 224*224 color image
Inception-V3 From Scratch Adam 1e-3 64 224*224 color image
Inception-V3 Transfer Adam 1e-3 All data 224*224 color image
IMCFN From Scratch SGD 5e-6*** 32 224*224 color image
IMCFN Transfer SGD 5e-6*** 32 224*224 color image
CBOW+MLP - SGD 1e-3 128 CBOW: byte sequences; MLP: 256*256 matrix
MalConv - SGD 1e-3 32 2MB raw byte values
MAGIC - Adam 1e-4 10 ACFG
Word2Vec+KNN - - - - Word2Vec: Opcode sequences; KNN distance measure: WMD
MCSC - SGD 5e-3 64 Opcode sequences

* The batch size is set to 128 for the MalwareBazaar dataset
** The learning rate is set to 5e-5 for the Malimg dataset and 1e-5 for the MalwareBazaar dataset
*** The learning rate is set to 1e-5 for the MalwareBazaar dataset
CBOW is with default parameters in the Word2Vec package in the Gensim library of Python

Graphically Analysis of Table 4 and Table 5

Here is a more detailed figure analysis for Table 4 and Table 5 in order to make the raw information in the paper easier to digest.

Table 4

  • The classification performance (F1-Score) of each approach on three datasets classification performance

    The figure shows the classification performance (F1-Score) of each methods on three datasets. It is noteworthy that the Malimg dataset only contains malware images, and thus it can only be used to evaluate the 4 image-based methods.

  • The average classification performance (F1-Score) of each approach for three datasets average classification performance

    The figure shows the average classification performance (F1-Score) of each method for the three datasets. Among them, the F1-score corresponding to each model is obtained by averaging the F1-score of the model on three datasets, which represents the average performance.

  • The train time and resource overhead of each method on three datasets
    resource consumption

    The figure shows the train time (left subgraph) and resource overhead (right subgraph) needed for every method on three datasets. The bar immediately to the right of the train time bar is the memory overhead of this model. Similarly, there are only 4 image-based models for the Malimg dataset.

Table 5

  • The classification performance (F1-Score) of transfer learning for image-based approaches on three datasets transfer learning

    This figure shows the F1-Score obtained by every image-based model using the strategy of training from scratch, 10% transfer learning, 50% transfer learning, 80% transfer learning, and 100% transfer learning, respectively. Every subgraph correspond to the BIG-15, Malimg, and MalwareBazaar dataset, respectively.

  • The train time and resource overhead of transfer learning for image-based approaches on three datasets
    resource consumption

    Each row correspond to the BIG-15, Mmalimg, and MalwareBazaar dataset, respectively. For each row, there are 4 models (ResNet-50, VGG-16, Inception-V3 and IMCFN). For each model, there are 8 bars on the right, the left 4 bars stands for the train time under 10%, 50%, 80% and 100% transfer learning, and the right 4 bars are the memory overhead under 10%, 50%, 80% and 100% transfer learning.

An Industrial Grade Federated Learning Framework

DOC | Quick Start | 中文 FATE (Federated AI Technology Enabler) is an open-source project initiated by Webank's AI Department to provide a secure comput

Federated AI Ecosystem 4.8k Jan 09, 2023
Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation, CVPR 2018

Learning Pixel-level Semantic Affinity with Image-level Supervision This code is deprecated. Please see https://github.com/jiwoon-ahn/irn instead. Int

Jiwoon Ahn 337 Dec 15, 2022
Libraries, tools and tasks created and used at DeepMind Robotics.

Libraries, tools and tasks created and used at DeepMind Robotics.

DeepMind 270 Nov 30, 2022
Torchyolo - Yolov3 ve Yolov4 modellerin Pytorch uygulamasıdır

TORCHYOLO : Yolo Modellerin Pytorch Uygulaması Yapılacaklar: Yolov3 model.py ve

Kadir Nar 3 Aug 22, 2022
A collection of IPython notebooks covering various topics.

ipython-notebooks This repo contains various IPython notebooks I've created to experiment with libraries and work through exercises, and explore subje

John Wittenauer 2.6k Jan 01, 2023
Potato Disease Classification - Training, Rest APIs, and Frontend to test.

Potato Disease Classification Setup for Python: Install Python (Setup instructions) Install Python packages pip3 install -r training/requirements.txt

codebasics 95 Dec 21, 2022
Out-of-Town Recommendation with Travel Intention Modeling (AAAI2021)

TrainOR_AAAI21 This is the official implementation of our AAAI'21 paper: Haoran Xin, Xinjiang Lu, Tong Xu, Hao Liu, Jingjing Gu, Dejing Dou, Hui Xiong

Jack Xin 13 Oct 19, 2022
TACTO: A Fast, Flexible and Open-source Simulator for High-Resolution Vision-based Tactile Sensors

TACTO: A Fast, Flexible and Open-source Simulator for High-Resolution Vision-based Tactile Sensors This package provides a simulator for vision-based

Facebook Research 255 Dec 27, 2022
EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration

EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration Ruikang Xu, Zeyu Xiao, Jie Huang, Yueyi Zhang, Zhiwei Xiong. EDPN: Enhanced Deep Pyra

69 Dec 15, 2022
Augmented CLIP - Training simple models to predict CLIP image embeddings from text embeddings, and vice versa.

Train aug_clip against laion400m-embeddings found here: https://laion.ai/laion-400-open-dataset/ - note that this used the base ViT-B/32 CLIP model. S

Peter Baylies 55 Sep 13, 2022
Code for the paper "Graph Attention Tracking". (CVPR2021)

SiamGAT 1. Environment setup This code has been tested on Ubuntu 16.04, Python 3.5, Pytorch 1.2.0, CUDA 9.0. Please install related libraries before r

122 Dec 24, 2022
Single object tracking and segmentation.

Single/Multiple Object Tracking and Segmentation Codes and comparison of recent single/multiple object tracking and segmentation. News 💥 AutoMatch is

ZP ZHANG 385 Jan 02, 2023
Image Segmentation and Object Detection in Pytorch

Image Segmentation and Object Detection in Pytorch Pytorch-Segmentation-Detection is a library for image segmentation and object detection with report

Daniil Pakhomov 732 Dec 10, 2022
[3DV 2021] A Dataset-Dispersion Perspective on Reconstruction Versus Recognition in Single-View 3D Reconstruction Networks

dispersion-score Official implementation of 3DV 2021 Paper A Dataset-dispersion Perspective on Reconstruction versus Recognition in Single-view 3D Rec

Yefan 7 May 28, 2022
Code for PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Relighting and Material Editing

PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Relighting and Material Editing CVPR 2021. Project page: https://kai-46.github.io/

Kai Zhang 141 Dec 14, 2022
This is the official released code for our paper, The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos

The-Emergence-of-Objectness This is the official released code for our paper, The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos

44 Oct 08, 2022
Fuzzer for Linux Kernel Drivers

difuze: Fuzzer for Linux Kernel Drivers This repo contains all the sources (including setup scripts), you need to get difuze up and running. Tested on

seclab 344 Dec 27, 2022
Convnext-tf - Unofficial tensorflow keras implementation of ConvNeXt

ConvNeXt Tensorflow This is unofficial tensorflow keras implementation of ConvNe

29 Oct 06, 2022
Code for AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network (ICCV 2021).

AA-RMVSNet Code for AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network (ICCV 2021) in PyTorch. paper link: arXiv | CVF Change Log Ju

Qingtian Zhu 97 Dec 30, 2022