A machine learning malware analysis framework for Android apps.

Overview

šŸ•µļø A machine learning malware analysis framework for Android apps. ā˜¢ļø


DroidDetective is a Python tool for analysing Android applications (APKs) for potential malware related behaviour and configurations. When provided with a path to an application (APK file) Droid Detective will make a prediction (using it's ML model) of if the application is malicious. Features and qualities of Droid Detective include:

  • Analysing which of ~330 permissions are specified in the application's AndroidManifest.xml file. šŸ™…
  • Analysing the number of standard and proprietary permissions in use in the application's AndroidManifest.xml file. 🧮
  • Using a RandomForest machine learning classifier, trained off the above data, from ~14 malware families and ~100 Google Play Store applications. šŸ’»

šŸ¤– Getting Started

Installation

All DroidDetective dependencies can be installed manually or via the requirements file, with

pip install -r REQUIREMENTS.txt

DroidDetective has been tested on both Windows 10 and Ubuntu 18.0 LTS.

Usage

DroidDetective can be run by providing the Python file with an APK as a command line parameter, such as:

python DroidDetective.py myAndroidApp.apk

If an apk_malware.model file is not present, then the tooling will first train the model and will require a training set of APKs in both a folder at the root of the project called malware and another called normal. Once run successfully a result will be printed onto the CLI on if the model has identified the APK to be malicious or benign. An example of this output can be seen below:

>> Analysed file 'com.android.camera2.apk', identified as not malware.

An additional parameter can be provided to DroidDetective.py as a Json file to save the results to. If this Json file already exists the results of this run will be appended to the Json file.

python DroidDetective.py myAndroidApp.apk output.json

An example of this output Json is as follows:

{
    "com.android.camera2": false,
}

āš—ļø Data Science | The ML Model

DroidDetective is a Python tool for analyzing Android applications (APKs) for potential malware related behaviour. This works by training a Random Forest classifier on information derived from both known malware APKs and standard APKs available on the Android app store. This tooling comes pre-trained, however, the model can be re-trained on a new dataset at any time. āš™ļø

This model currently uses permissions from an APKs AndroidManifest.xml file as a feature set. This works by creating a dictionary of each standard Android permission and setting the feature to 1 if the permission is present in the APK. Similarly, a feature is added for the amount of permissions in use in the manifest and for the amount of unidentified permissions found in the manifest.

The pre-trained model was trained off approximately 14 malware families (each with one or more APK files), located from ashisdb's repository, and approximately 100 normal applications located from the Google Play Store.

The below denotes the statistics for this ML model:

Accuracy: 0.9310344827586207
Recall: 0.9166666666666666
Precision: 0.9166666666666666
F-Measure: 0.9166666666666666

The top 10 highest weighted features (i.e. Android permissions) used by this model, for identifying malware, can be seen below:

"android.permission.SYSTEM_ALERT_WINDOW": 0.019091367939223395,
"android.permission.ACCESS_NETWORK_STATE": 0.021001765263234648,
"android.permission.ACCESS_WIFI_STATE": 0.02198962579120518,
"android.permission.RECEIVE_BOOT_COMPLETED": 0.026398914436102188,
"android.permission.GET_TASKS": 0.03595458598076517,
"android.permission.WAKE_LOCK": 0.03908212881520419,
"android.permission.WRITE_SMS": 0.057041576632290585,
"android.permission.INTERNET": 0.08816028225034145,
"android.permission.WRITE_EXTERNAL_STORAGE": 0.09835914154294739,
"other_permission": 0.10189463965313218,
"num_of_permissions": 0.12392224814084198

šŸ“œ License

GNU General Public License v3.0

Owner
James Stevenson
I’m a Software Engineer and Security Researcher, with a background of over five years in the computer security industry.
James Stevenson
Code and models used in "MUSS Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Multilingual Unsupervised Sentence Simplification Code and pretrained models to reproduce experiments in "MUSS: Multilingual Unsupervised Sentence Sim

Facebook Research 81 Dec 29, 2022
LoFTR:Detector-Free Local Feature Matching with Transformers CVPR 2021

LoFTR-with-train-script LoFTR:Detector-Free Local Feature Matching with Transformers CVPR 2021 (with train script --- unofficial ---). About Megadepth

Nan Xiaohu 15 Nov 04, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2020 Links Doc

Sebastian Raschka 4.2k Jan 02, 2023
PyGRANSO: A PyTorch-enabled port of GRANSO with auto-differentiation

PyGRANSO PyGRANSO: A PyTorch-enabled port of GRANSO with auto-differentiation Please check https://ncvx.org/PyGRANSO for detailed instructions (introd

SUN Group @ UMN 26 Nov 16, 2022
Official Repo for ICCV2021 Paper: Learning to Regress Bodies from Images using Differentiable Semantic Rendering

[ICCV2021] Learning to Regress Bodies from Images using Differentiable Semantic Rendering Getting Started DSR has been implemented and tested on Ubunt

Sai Kumar Dwivedi 83 Nov 27, 2022
A fuzzing framework for SMT solvers

yinyang A fuzzing framework for SMT solvers. Given a set of seed SMT formulas, yinyang generates mutant formulas to stress-test SMT solvers. yinyang c

Project Yin-Yang for SMT Solver Testing 145 Jan 04, 2023
StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking

StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking Datasets You can download datasets that have been pre-pr

25 May 29, 2022
URIE: Universal Image Enhancementfor Visual Recognition in the Wild

URIE: Universal Image Enhancementfor Visual Recognition in the Wild This is the implementation of the paper "URIE: Universal Image Enhancement for Vis

Taeyoung Son 43 Sep 12, 2022
Assessing syntactic abilities of BERT

BERT-Syntax Assesing the syntactic abilities of BERT. What Evaluate Google's BERT-Base and BERT-Large models on the syntactic agreement datasets from

Yoav Goldberg 147 Aug 02, 2022
This repository contains the code to replicate the analysis from the paper "Moving On - Investigating Inventors' Ethnic Origins Using Supervised Learning"

Replication Code for 'Moving On' - Investigating Inventors' Ethnic Origins Using Supervised Learning This repository contains the code to replicate th

Matthias Niggli 0 Jan 04, 2022
Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation This repository contains code and data f

Zoey Liu 0 Jan 07, 2022
Modular Probabilistic Programming on MXNet

MXFusion | | | | Tutorials | Documentation | Contribution Guide MXFusion is a modular deep probabilistic programming library. With MXFusion Modules yo

Amazon 100 Dec 10, 2022
Learning-Augmented Dynamic Power Management

Learning-Augmented Dynamic Power Management This repository contains source code accompanying paper Learning-Augmented Dynamic Power Management with M

Adam 0 Feb 22, 2022
ML From Scratch

ML from Scratch MACHINE LEARNING TOPICS COVERED - FROM SCRATCH Linear Regression Logistic Regression K Means Clustering K Nearest Neighbours Decision

Tanishq Gautam 66 Nov 02, 2022
CTF Challenge for CSAW Finals 2021

Terminal Velocity Misc CTF Challenge for CSAW Finals 2021 This is a challenge I've had in mind for almost 15 years and never got around to building un

Jordan 6 Jul 30, 2022
NasirKhusraw - The TSP solved using genetic algorithm and show TSP path overlaid on a map of the Iran provinces & their capitals.

Nasir Khusraw : Travelling Salesman Problem The TSP solved using genetic algorithm. This project show TSP path overlaid on a map of the Iran provinces

J Brave 2 Sep 01, 2022
LeetCode Solutions https://t.me/tenvlad

leetcode LeetCode Solutions groupped by common patterns YouTube: https://www.youtube.com/c/vladten Telegram: https://t.me/nilinterface Problems source

Vlad Ten 158 Dec 29, 2022
Reinforcement Learning for finance

Reinforcement Learning for Finance We apply reinforcement learning for stock trading. Fetch Data Example import utils # fetch symbols from yahoo fina

Tomoaki Fujii 159 Jan 03, 2023
CIFAR-10_train-test - training and testing codes for dataset CIFAR-10

CIFAR-10_train-test - training and testing codes for dataset CIFAR-10

Frederick Wang 3 Apr 26, 2022
vit for few-shot classification

Few-Shot ViT Requirements PyTorch (= 1.9) TorchVision timm (latest) einops tqdm numpy scikit-learn scipy argparse tensorboardx Pretrained Checkpoints

Martin Dong 26 Nov 30, 2022