Full-featured Decision Trees and Random Forests learner.

Last update: Aug 15, 2022

Overview

CID3

This is a full-featured Decision Trees and Random Forests learner. It can save trees or forests to disk for later use, query saved trees and Random Forests, and fill out an unlabeled file with the predicted classes. Documentation is not yet available, but the program options can be displayed with the command:

% java -jar cid3.jar -h

usage: java -jar cid3.jar
 -a,--analysis <name>    show causal analysis report
 -c,--criteria <name>    input criteria: c[Certainty], e[Entropy], g[Gini]
 -f,--file <name>        input file
 -h,--help               print this message
 -o,--output <name>      output file
 -p,--partition          partition train/test data
 -q,--query <type>       query model, enter: t[Tree] or r[Random forest]
 -r,--forest <amount>    create random forest, enter # of trees
 -s,--save               save tree/random forest
 -t,--threads <amount>   maximum number of threads (default is 500)
 -v,--validation         create 10-fold cross-validation
 -ver,--version          version
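
The -c option above selects among three splitting criteria. The Certainty formula is not given in this README, so it is not reproduced here. Entropy and Gini are standard impurity measures, and the sketch below shows their textbook definitions in Python. The function names (entropy, gini, weighted_impurity) are hypothetical and chosen for illustration only; this is not CID3's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def weighted_impurity(groups, impurity):
    """Size-weighted impurity of the non-empty groups produced by a candidate split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# A perfectly mixed two-class node, and a split that separates it cleanly.
parent = ["0", "0", "1", "1"]
split = [["0", "0"], ["1", "1"]]

print(entropy(parent))  # 1.0
print(gini(parent))     # 0.5
# The impurity reduction of the split is the parent impurity minus
# the weighted impurity of the children (here the children are pure).
print(entropy(parent) - weighted_impurity(split, entropy))
```

A tree learner using these criteria would typically pick, at each node, the split with the largest impurity reduction.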

List of features

Uses a new Certainty formula as the splitting criterion.
Provides a causal analysis report, which shows how some attribute values cause a particular classification.
Creates full trees, showing error rates for train and test data, attribute importance, causes and false positives/negatives.
If no test data is provided, it can split the training dataset into 80% for training and 20% for testing.
Creates random forests, showing error rates for train and test data, attribute importance, causes and false positives/negatives. Random forests are built in parallel, which speeds up training.
Runs 10-fold cross-validation for trees and random forests, showing error rates, their mean and standard error, and false positives/negatives. Cross-validation folds are processed in parallel.
Saves trees and random forests to disk as compressed files (e.g. model.tree, model.forest).
Queries trees and random forests from saved files. Queries can contain missing values: just enter the character “?”.
Makes predictions and fills out case files with those predictions, using either single trees or random forests.
Imputes missing values in train and test data. Continuous attributes are imputed with the mean value; discrete attributes are imputed with the mode, that is, the most frequent value.
Supports ignoring attributes: in the .names file, set the attribute type to ignore.
Supports three splitting criteria: Certainty, Entropy and Gini. If no criterion is specified, Certainty is used.
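
The mean/mode imputation mentioned in the list above can be sketched in a few lines of Python. This is an illustration of the general technique, not CID3's implementation. The function name impute_column is hypothetical, and using "?" as the missing-value marker in data files is an assumption borrowed from the query syntax and from the C4.5 convention.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # assumed marker for a missing value


def impute_column(values, kind):
    """Return a copy of one column with missing entries filled in.

    kind == "continuous": fill with the mean of the present values.
    anything else:        fill with the mode (most frequent present value).
    """
    present = [v for v in values if v != MISSING]
    if not present:
        # Nothing to learn a fill value from; leave the column unchanged.
        return list(values)
    if kind == "continuous":
        fill = mean(float(v) for v in present)
        return [fill if v == MISSING else float(v) for v in values]
    # On ties, Counter.most_common keeps the value seen first.
    fill = Counter(present).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


print(impute_column(["22", "?", "38"], "continuous"))   # [22.0, 30.0, 38.0]
print(impute_column(["S", "C", "?", "S"], "discrete"))  # ['S', 'C', 'S', 'S']
```

Whether the fill values for test data are computed from the training data or from the test data itself is not stated in this README, so the sketch treats each column in isolation.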

Example run with titanic dataset

% java -jar cid3.jar -f titanic

CID3 [Version 1.1]              Saturday October 30, 2021 06:34:11 AM
------------------
[ ✓ ] Read data: 891 cases for training. (10 attributes)
[ ✓ ] Decision tree created.

Rules: 276
Nodes: 514

Importance Cause   Attribute Name
---------- -----   --------------
      0.57   yes ············ Sex
      0.36   yes ········· Pclass
      0.30   yes ··········· Fare
      0.28   yes ······· Embarked
      0.27   yes ·········· SibSp
      0.26   yes ·········· Parch
      0.23    no ············ Age


[==== TRAIN DATA ====] 

Correct guesses:  875
Incorrect guesses: 16 (1.8%)

# Of Cases  False Pos  False Neg   Class
----------  ---------  ---------   -----
       549         14          2 ····· 0
       342          2         14 ····· 1

Time: 0:00:00
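
The figures in the TRAIN DATA table can be cross-checked with a short tally. The Python sketch below assumes that a false positive for a class is a case predicted as that class but belonging to another, and a false negative is a case of that class predicted as another; that reading is consistent with the table above, but it is an assumption. The function name error_report is hypothetical, and the label lists are synthetic, arranged only to reproduce the counts shown, not the real predictions.

```python
from collections import Counter


def error_report(actual, predicted):
    """Tally errors overall and false positives/negatives per class.

    Assumes both lists have the same length and at least one case.
    """
    cases = Counter(actual)
    false_pos = Counter()
    false_neg = Counter()
    for a, p in zip(actual, predicted):
        if a != p:
            false_pos[p] += 1  # wrongly assigned to class p
            false_neg[a] += 1  # missed for its true class a
    incorrect = sum(false_neg.values())
    return {
        "correct": len(actual) - incorrect,
        "incorrect": incorrect,
        "error_rate": incorrect / len(actual),
        # class -> (number of cases, false positives, false negatives)
        "per_class": {c: (cases[c], false_pos[c], false_neg[c]) for c in sorted(cases)},
    }


# Synthetic labels matching the table: 549 cases of class 0 (2 misclassified),
# 342 cases of class 1 (14 misclassified).
actual = ["0"] * 549 + ["1"] * 342
predicted = ["0"] * 547 + ["1"] * 2 + ["0"] * 14 + ["1"] * 328

report = error_report(actual, predicted)
print(report["correct"], report["incorrect"])  # 875 16
print(round(report["error_rate"] * 100, 1))    # 1.8
print(report["per_class"])  # {'0': (549, 14, 2), '1': (342, 2, 14)}
```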

Requirements

CID3 requires JDK 15 or higher.

The data format is similar to that of C4.5 and C5.0. Data files are in CSV format and can be split into two separate files, such as titanic.data and titanic.test. The class attribute must be the last column of the file. The other required file is the "names" file, named like titanic.names, which contains the names and types of the attributes. Its first line lists the possible values of the class attribute; this line can be left empty with just a dot (.). Below is an example of the titanic.names file:

0,1.  
PassengerId: ignore.  
Pclass: 1,2,3.  
Sex : male,female.  
Age: continuous.  
SibSp: discrete.  
Parch: discrete.  
Ticket: ignore.  
Fare: continuous.  
Cabin: ignore.  
Embarked: discrete.
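
To make the layout of the names file concrete, here is a minimal Python sketch that reads the format shown above. It is inferred only from this example and is not CID3's parser. The function name parse_names and the "values" kind used for attributes with an explicit value list (such as Pclass) are hypothetical.

```python
def parse_names(text):
    """Parse the text of a .names file into (class_values, attributes).

    attributes is a list of (name, kind, values) tuples, where kind is
    "continuous", "discrete", "ignore", or "values" for an explicit list.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: class values, terminated by a dot. A lone "." yields [].
    first = lines[0].rstrip(".").strip()
    class_values = [v.strip() for v in first.split(",") if v.strip()]

    attributes = []
    for ln in lines[1:]:
        name, _, spec = ln.partition(":")
        spec = spec.strip().rstrip(".").strip()
        if spec in ("continuous", "discrete", "ignore"):
            attributes.append((name.strip(), spec, None))
        else:
            values = [v.strip() for v in spec.split(",")]
            attributes.append((name.strip(), "values", values))
    return class_values, attributes


example = """0,1.
PassengerId: ignore.
Pclass: 1,2,3.
Sex : male,female.
Age: continuous.
"""

classes, attrs = parse_names(example)
print(classes)   # ['0', '1']
print(attrs[1])  # ('Pclass', 'values', ['1', '2', '3'])
print(attrs[3])  # ('Age', 'continuous', None)
```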

Example of causal analysis

% java -jar cid3.jar -f adult -a education

From this example we can see that the attribute "education" is a cause, which is determined by the certainty-raising inequality. Once we know that it is a cause, we compare the causal certainties of its values. When its value is "Doctorate", it causes earnings to be greater than $50,000, with a probability of 0.73. A paper will soon be published with all the formulas used to calculate the Certainty for splitting nodes and the certainty-raising inequality used for causal analysis.

Importance Cause   Attribute Name
---------- -----   --------------
      0.56   yes ······ education

Report of causal certainties
----------------------------

[ Attribute: education ]

    1st-4th --> <=50K  (0.97)

    5th-6th --> <=50K  (0.95)

    7th-8th --> <=50K  (0.94)

    9th --> <=50K  (0.95)

    10th --> <=50K  (0.94)

    11th --> <=50K  (0.95)

    12th --> <=50K  (0.93)

    Assoc-acdm --> <=50K  (0.74)

    Assoc-voc --> <=50K  (0.75)

    Bachelors --> Non cause.

    Doctorate --> >50K  (0.73)

    HS-grad --> <=50K  (0.84)

    Masters --> >50K  (0.55)

    Preschool --> <=50K  (0.99)

    Prof-school --> >50K  (0.74)

    Some-college --> <=50K  (0.81)

Releases(v1.2.4)

v1.2.4(Apr 28, 2022)

Fixed a bug when entering an attribute name for the causal analysis report.
v1.2.3(Mar 10, 2022)

Implemented progress animation when option -s is invoked.
v1.2.2(Mar 2, 2022)

Added progress animation to the analysis report.
v1.2.1(Jan 21, 2022)

Replaced a problematic character.
v1.2(Nov 9, 2021)

This version includes the correct calculation of causal certainties and the certainty-raising inequality. The analysis report is now also sorted by attribute values.
v1.1.5(Nov 7, 2021)

Correctly implemented the causal analysis, using the certainty-raising inequality and the causal certainties.
v1.1.3(Nov 7, 2021)

Implemented causes for specific attribute values.
v1.1.2(Nov 6, 2021)

Minor patch.
v1.1.1(Oct 31, 2021)

This is a hurried patch to fix a problem in the causal analysis report. Now the report works as it was intended.
v1.1(Oct 30, 2021)

Release v1.1 contains many new features and fixes. Implemented the report of causal certainties, which shows how certain attribute values cause a particular classification.
v1.0.7(Oct 28, 2021)

Code cleanup and new features implemented. Querying a tree now checks for invalid input and asks for correct input. This will be the last patch until version v1.1.
v1.0.6(Oct 28, 2021)

Correctly aligned text on console.
v1.0.5(Oct 27, 2021)

Reintroduced attribute importance for Entropy and Gini criteria.
v1.0.4(Oct 27, 2021)

Removed causal analysis from Entropy and Gini criteria. It only makes sense with Certainty.
v1.0.3(Oct 23, 2021)

Rolled back the parallel tests of Random Forests. It is much faster now.
v1.0.2(Oct 23, 2021)

Minor changes.
v1.0.1(Oct 23, 2021)

Random Forests are now tested in parallel.
v1.0(Oct 18, 2021)

Releasing version v1.0


Owner

Alejandro Penate-Diaz
