AI-based, context-driven network device ranking

Related tags

Deep Learningbatea
Overview

Python package

logo

Batea

A batea is a large shallow pan of wood or iron traditionally used by gold prospectors for washing sand and gravel to recover gold nuggets.

Batea is a context-driven network device ranking framework based on the anomaly detection family of machine learning algorithms. The goal of Batea is to allow security teams to automatically filter interesting network assets in large networks using nmap scan reports. We call those Gold Nuggets.

For more information about Gold Nuggeting and the science behind Batea, check out our whitepaper here

You can try Batea on your nmap scan data without downloading the software, using Batea Live: https://batea.delvesecurity.com/

How it works

Batea works by constructing a numerical representation (numpy) of all devices from your nmap reports (XML) and then applying anomaly detection methods to uncover the gold nuggets. It is easily extendable by adding specific features, or interesting characteristics, to the numerical representation of the network elements.

The numerical representation of the network is constructed using features, which are inspired by the expertise of the security community. The features act as elements of intuition, and the unsupervised anomaly detection methods allow the context of the network asset, or the total description of the network, to be used as the central building block of the ranking algorithm. The exact algorithm used is Isolation Forest (https://en.wikipedia.org/wiki/Isolation_forest)

Machine learning models are the heart of Batea. Models are algorithms trained on the whole dataset and used to predict a score on the same (and other) data points (network devices). Batea also allows for model persistence. That is, you can re-use pretrained models and export models trained on large datasets for further use.

Usage

# Complete info
$ sudo nmap -A 192.168.0.0/16 -oX output.xml

# Partial info
$ sudo nmap -O -sV 192.168.0.0/16 -oX output.xml


$ batea -v output.xml

Installation

$ git clone [email protected]:delvelabs/batea.git
$ cd batea
$ python3 setup.py sdist
$ pip3 install -r requirements.txt
$ pip3 install -e .

Developers Installation

$ git clone [email protected]:delvelabs/batea.git
$ cd batea
$ python3 -m venv batea/
$ source batea/bin/activate
$ python3 setup.py sdist
$ pip3 install -r requirements-dev.txt
$ pip3 install -e .
$ pytest

Example usage

# simple use (output top 5 gold nuggets with default format)
$ batea nmap_report.xml

# Output top 3
$ batea -n 3 nmap_report.xml

# Output all assets
$ batea -A nmap_report.xml

# Using multiple input files
$ batea -A nmap_report1.xml nmap_report2.xml

# Using wildcards (default xsl)
$ batea ./nmap*.xml
$ batea -f csv ./assets*.csv

# You can use batea on pretrained models and export trained models.

# Training, output and dumping model for persistence
$ batea -D mymodel.batea nmap_report.xml

# Using pretrained model
$ batea -L mymodel.batea nmap_report.xml

# Using preformatted csv along with xml files
$ batea -x nmap_report.xml -c portscan_data.csv

# Adjust verbosity
$ batea -vv nmap_report.xml

How to add a feature

Batea works by assigning numerical features to every host in the report (or series of report). Hosts are python objects derived from the nmap report. They consist of the following list of attributes: [ipv4, hostname, os_info, ports] where ports is a list of ports objects. Each port has the following list of attributes : [port, protocol, state, service, software, version, cpe, scripts], all defaulting to None.

Features are objects inherited from the FeatureBase class that instantiate a specific _transform method. This method always takes the list of all hosts as input and returns a lambda function that maps each host to a numpy column of numeric values (host order is conserved). The column is then appended to the matrix representation of the report. Features must output correct numerical values (floats or integers) and nothing else.

Most feature transformations are implemented using a simple lambda function. Just make sure to default a numeric value to every host for model compatibility.

Ex:

class CustomInterestingPorts(FeatureBase):
    def __init__(self):
        super().__init__(name="some_custom_interesting_ports")

    def _transform(self, hosts):
      """This method takes a list of hosts and returns a function that counts the number
      of host ports member from a predefined list of "interesting" ports, defaulting to 0.

      Parameters
      ----------
      hosts : list
          The list of all hosts

      Returns
      -------
      f : lambda function
          Counts the number of ports in the defined list.
      """
        member_ports = [21, 22, 25, 8080, 8081, 1234]
        f = lambda host: len([port for port in host.ports if port.port in member_ports])
        return f

You can then add the feature to the report by using the NmapReport.add_feature method in batea/__init__.py

from .features.basic_features import CustomInterestingPorts

def build_report():
    report = NmapReport()
    #[...]
    report.add_feature(CustomInterestingPorts())

    return report

Using precomputed tabular data (CSV)

It is possible to use preprocessed data to train the model or for prediction. The data has to be indexed by (ipv4, port) with one unique combination per row. The type of data should be close to what you expect from the XML version of an nmap report. A column has to use one of the following names, but you don't have to use all of them. The parser defaults to null values if a column is absent.

  'ipv4',
  'hostname',
  'os_name',
  'port',
  'state',
  'protocol',
  'service',
  'software_banner',
  'version',
  'cpe',
  'other_info'

Example:

ipv4,hostname,os_name,port,state,protocol,service,software_banner
10.251.53.100,internal.delvesecurity.com,Linux,110,open,tcp,rpcbind,"program version   port/proto  service100000  2,3,4        111/tcp  rpcbind100000  2,3,4    "
10.251.53.100,internal.delvesecurity.com,Linux,111,open,tcp,rpcbind,
10.251.53.188,serious.delvesecurity.com,Linux,6000,open,tcp,X11,"X11Probe: CentOS"

Outputing numerical representation

For the data scientist in you, or just for fun and profit, you can output the numerical matrix along with the score column instead of the regular output. This can be useful for further data analysis and debug purpose.

$ batea -oM network_matrix nmap_report.xml
Owner
Secureworks Taegis VDR
Automatically identify and prioritize vulnerabilities for intelligent remediation.
Secureworks Taegis VDR
Implicit Model Specialization through DAG-based Decentralized Federated Learning

Federated Learning DAG Experiments This repository contains software artifacts to reproduce the experiments presented in the Middleware '21 paper "Imp

Operating Systems and Middleware Group 5 Oct 16, 2022
A simple python library for fast image generation of people who do not exist.

Random Face A simple python library for fast image generation of people who do not exist. For more details, please refer to the [paper](https://arxiv.

Sergei Belousov 170 Dec 15, 2022
Determined: Deep Learning Training Platform

Determined: Deep Learning Training Platform Determined is an open-source deep learning training platform that makes building models fast and easy. Det

Determined AI 2k Dec 31, 2022
use tensorflow 2.0 to tell a dog and cat from a specified picture

dog_or_cat use tensorflow 2.0 to tell a dog and cat from a specified picture This is one of the classic experiments for the introduction of deep learn

你这个代码我看不懂 1 Oct 22, 2021
SegNet-Basic with Keras

SegNet-Basic: What is Segnet? Deep Convolutional Encoder-Decoder Architecture for Semantic Pixel-wise Image Segmentation Segnet = (Encoder + Decoder)

Yad Konrad 81 Jun 30, 2022
Semantic Segmentation of images using PixelLib with help of Pascalvoc dataset trained with Deeplabv3+ framework.

CARscan- Approach 1 - Segmentation of images by detecting contours. It failed because in images with elements along with cars were also getting detect

Padmanabha Banerjee 5 Jul 29, 2021
Joint-task Self-supervised Learning for Temporal Correspondence (NeurIPS 2019)

Joint-task Self-supervised Learning for Temporal Correspondence Project | Paper Overview Joint-task Self-supervised Learning for Temporal Corresponden

Sifei Liu 167 Dec 14, 2022
Repository of best practices for deep learning in Julia, inspired by fastai

FastAI Docs: Stable | Dev FastAI.jl is inspired by fastai, and is a repository of best practices for deep learning in Julia. Its goal is to easily ena

FluxML 532 Jan 02, 2023
Unofficial PyTorch implementation of Attention Free Transformer (AFT) layers by Apple Inc.

aft-pytorch Unofficial PyTorch implementation of Attention Free Transformer's layers by Zhai, et al. [abs, pdf] from Apple Inc. Installation You can i

Rishabh Anand 184 Dec 12, 2022
Code for "Intra-hour Photovoltaic Generation Forecasting based on Multi-source Data and Deep Learning Methods."

pv_predict_unet-lstm Code for "Intra-hour Photovoltaic Generation Forecasting based on Multi-source Data and Deep Learning Methods." IEEE Transactions

FolkScientistInDL 8 Oct 08, 2022
This repo is developed for Strong Baseline For Vehicle Re-Identification in Track 2 Ai-City-2021 Challenges

A STRONG BASELINE FOR VEHICLE RE-IDENTIFICATION This paper is accepted to the IEEE Conference on Computer Vision and Pattern Recognition Workshop(CVPR

Cybercore Co. Ltd 78 Dec 29, 2022
Twin-deep neural network for semi-supervised learning of materials properties

Deep Semi-Supervised Teacher-Student Material Synthesizability Prediction Citation: Semi-supervised teacher-student deep neural network for materials

MLEG 3 Dec 14, 2022
Audio Visual Emotion Recognition using TDA

Audio Visual Emotion Recognition using TDA RAVDESS database with two datasets analyzed: Video and Audio dataset: Audio-Dataset: https://www.kaggle.com

Combinatorial Image Analysis research group 3 May 11, 2022
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

TL;DR Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. Click on the image to

4.2k Jan 01, 2023
Attentive Implicit Representation Networks (AIR-Nets)

Attentive Implicit Representation Networks (AIR-Nets) Preprint | Supplementary | Accepted at the International Conference on 3D Vision (3DV) teaser.mo

29 Dec 07, 2022
UltraGCN: An Ultra Simplification of Graph Convolutional Networks for Recommendation

UltraGCN This is our Pytorch implementation for our CIKM 2021 paper: Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. UltraGCN: A

XUEPAI 93 Jan 03, 2023
SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

The SLIDE package contains the source code for reproducing the main experiments in this paper. Dataset The Datasets can be downloaded in Amazon-

Intel Labs 72 Dec 16, 2022
Temporal Knowledge Graph Reasoning Triggered by Memories

MTDM Temporal Knowledge Graph Reasoning Triggered by Memories To alleviate the time dependence, we propose a memory-triggered decision-making (MTDM) n

4 Sep 25, 2022
Koç University deep learning framework.

Knet Knet (pronounced "kay-net") is the Koç University deep learning framework implemented in Julia by Deniz Yuret and collaborators. It supports GPU

1.4k Dec 31, 2022
Package for extracting emotions from social media text. Tailored for financial data.

EmTract: Extracting Emotions from Social Media Text Tailored for Financial Contexts EmTract is a tool that extracts emotions from social media text. I

13 Nov 17, 2022