AutoX is an efficient automated machine learning tool aimed mainly at tabular data mining competitions. Its highlights: strong results, ease of use, generality, automation, and flexibility.

Overview

English | 简体中文

What is AutoX?

AutoX is an efficient automated machine learning tool aimed mainly at tabular data mining competitions. Its features include:

  • Strong results: AutoX significantly outperforms other solutions on multiple Kaggle datasets (see Performance comparison).
  • Easy to use: AutoX's interface is similar to sklearn's, so it is easy to get started.
  • Generic: works for both classification and regression problems.
  • Automated: data cleaning, feature engineering, model tuning, and the other steps run fully automatically, with no manual intervention.
  • Flexible: the components are decoupled and can be used independently; wherever the automated result is unsatisfactory, expert knowledge can be plugged in through AutoX's flexible interfaces.
  • Competition scoring tips: key scoring tricks from past competitions are collected and published.


Installation

1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install

Architecture

├── autox
│   ├── ensemble
│   ├── feature_engineer
│   ├── feature_selection
│   ├── file_io
│   ├── join_tables
│   ├── metrics
│   ├── models
│   ├── process_data
│   ├── util.py
│   ├── CONST.py
│   └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md

Quick Start

  • Fully automatic: suited for users who want a decent result quickly. Only the minimal dataset information needs to be configured to build the complete machine learning pipeline.
from autox import AutoX
path = data_dir  # directory containing train.csv and test.csv
autox = AutoX(target = 'loss', train_name = 'train.csv', test_name = 'test.csv', 
               id = ['id'], path = path)
sub = autox.get_submit()
sub.to_csv("submission.csv", index = False)
  • Semi-automatic: run_demo.ipynb
Suited for users who want even better predictions. AutoX provides easy-to-use, rich interfaces, so users can configure the pipeline for their actual data scenario and obtain better results.

Performance comparison:

index  data_type              data_name (link)                       AutoX    AutoGluon  H2o
1      regression             zhidemai                               1.1231   1.9466     1.1927
2      regression             Tabular Playground Series - Aug 2021   7.87731  10.3944    7.8895
3      binary classification  Titanic                                x        0.78229    0.79186

Data types

  • cat: Categorical, unordered categorical variable
  • ord: Ordinal, ordered categorical variable
  • num: Numeric, continuous variable
  • datetime: Datetime-type time variable
  • timestamp: Timestamp-type time variable

Pipeline logic

  • 1. Initialize the AutoX class
1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column in the tables
1.4 Preprocess the data
  • 2. Feature engineering
Feature engineering covers single-table and multi-table features.
Every feature-engineering class provides the following:
    1) automatically select the columns the operation applies to;
    2) inspect the selected columns;
    3) modify the columns the operation applies to;
    4) compute the features, returning them with the same number of rows and the same order as the main table.
  • 3. Feature concatenation
The constructed features are concatenated: the row count stays the same while the column count grows, producing one wide table.
  • 4. Train/test split
The wide table is split back into a training set and a test set.
  • 5. Feature filtering
The constructed features are filtered based on how their distributions differ between train and test, to avoid overfitting (an illustrative sketch follows this list).
  • 6. Model training
The model is trained on the filtered wide-table features.
The model classes provide:
    1) inspect the model's default parameters;
    2) train the model;
    3) tune the model;
    4) inspect the model's feature importances;
    5) predict with the model.
  • 7. Model prediction
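Step 5 (feature filtering) can be pictured with a minimal, generic sketch: drop constructed features whose train and test distributions differ sharply. This is only an illustration under assumptions, not AutoX's internal code; the function name, the two-sample KS criterion, and the 0.2 threshold are all made up for the example.

from scipy.stats import ks_2samp

def filter_features_by_distribution(train, test, feature_cols, ks_threshold=0.2):
    # keep only the numeric features whose train/test distributions are similar
    kept = []
    for col in feature_cols:
        stat, _ = ks_2samp(train[col].dropna(), test[col].dropna())
        if stat <= ks_threshold:  # small KS statistic -> similar distributions -> keep
            kept.append(col)
    return kept

# hypothetical usage: kept = filter_features_by_distribution(train_df, test_df, new_feature_cols)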

The AutoX class

The AutoX class automatically manages the datasets and dataset metadata for the user.
When an AutoX instance is initialized, it performs the following steps:
1) read the data;
2) concatenate train and test;
3) identify the type of each column in the tables;
4) preprocess the data.

Attributes

info_: the info_ attribute stores information about the datasets.

  • info_['id']: List, the key column(s) that uniquely identify a row
  • info_['target']: String, the label column of the table
  • info_['shape_of_train']: Int, the number of samples in the train set
  • info_['shape_of_test']: Int, the number of samples in the test set
  • info_['feature_type']: Dict of Dict, the data type of each feature column in each table
  • info_['train_name']: String, the name of the training-set main table
  • info_['test_name']: String, the name of the test-set main table

dfs_: the dfs_ attribute stores all DataFrames, including the original tables and the constructed ones.

  • dfs_['train_test']: the train and test data concatenated together
  • dfs_['FE_feature_name']: data produced by a feature-engineering step, e.g. FE_count, FE_groupby
  • dfs_['FE_all']: the dataset obtained by merging the original features with all engineered features

Methods

  • concat_train_test: concatenate the training and test sets; usually called right after reading the data
  • split_train_test: split the data back into training and test sets; usually called after feature engineering
  • get_submit: obtain the predictions (this runs the complete machine learning pipeline: data preprocessing, feature engineering, model training, model tuning, model ensembling, and model prediction); a brief usage sketch follows this list
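As a quick illustration of the attributes and methods above, the sketch below inspects info_ and dfs_ after initialization. It assumes a data_dir variable pointing at the folder that holds train.csv and test.csv, and the exact dictionary keys may vary between AutoX versions.

from autox import AutoX

autox = AutoX(target='loss', train_name='train.csv', test_name='test.csv',
              id=['id'], path=data_dir)   # data_dir: folder with train.csv and test.csv

# dataset information collected during initialization
print(autox.info_['shape_of_train'], autox.info_['shape_of_test'])
print(autox.info_['feature_type'])        # per-table dict: column -> data type

# DataFrames managed by AutoX, original and constructed
train_test = autox.dfs_['train_test']     # concatenated train + test table

# run the full pipeline and produce predictions
sub = autox.get_submit()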

The operations in the AutoX pipeline in more detail:

Reading data

All files under the given path are read. By default, the train and test main tables are concatenated before the subsequent preprocessing and feature engineering steps, and split back into train and test right before model prediction begins.

Data preprocessing

- Parse year, month, day, hour, day-of-week, etc. from time columns
- Before each training run, drop uninformative features (nunique == 1) from the data fed to the model
- Remove abnormal samples and samples whose label is NaN (a rough pandas sketch of these steps follows)
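A rough pandas sketch of the three steps above, applied to the training table; the column arguments (time_col, target) are assumptions for the example, and this is not AutoX's internal code.

import pandas as pd

def preprocess(train, time_col, target):
    df = train.copy()

    # parse year / month / day / hour / day-of-week from the time column
    t = pd.to_datetime(df[time_col])
    for part in ['year', 'month', 'day', 'hour', 'dayofweek']:
        df[f'{time_col}_{part}'] = getattr(t.dt, part)

    # drop uninformative features (only one distinct value)
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant_cols)

    # drop samples whose label is missing
    df = df[df[target].notna()]
    return df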

Feature engineering

  • 1-1 join features
  • 1-M join features
- time diff features
- aggregation statistics features
  • count features
For each selected column, the feature value is the number of samples in the whole dataset that share the current sample's value for that column (see the sketch after this list).
  • target encoding features

  • statistical features

Statistical features are extracted with two nested for loops.
The outer loop iterates over the selected grouping columns (group_col),
and the inner loop iterates over the selected aggregation columns (agg_col).
Inside the inner loop, categorical columns are aggregated with nunique,
while numeric columns are aggregated with [median, std, sum, max, min, mean].
  • shift features
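The count features and the group-by statistical features described above can be sketched generically in pandas as follows. The helper names and column arguments are assumptions for the example, not AutoX's implementation; both helpers return frames with the same number of rows and order as the input table, as required above.

import pandas as pd

def count_features(df, cat_cols):
    out = pd.DataFrame(index=df.index)
    for col in cat_cols:
        # number of samples in the whole dataset that share this sample's value
        out[f'count_{col}'] = df[col].map(df[col].value_counts())
    return out

def groupby_agg_features(df, group_cols, agg_cols, cat_cols):
    out = pd.DataFrame(index=df.index)
    for group_col in group_cols:          # outer loop: grouping column
        for agg_col in agg_cols:          # inner loop: aggregated column
            if agg_col == group_col:
                continue
            stats = ['nunique'] if agg_col in cat_cols else \
                    ['median', 'std', 'sum', 'max', 'min', 'mean']
            for stat in stats:
                out[f'{group_col}__{agg_col}__{stat}'] = (
                    df.groupby(group_col)[agg_col].transform(stat))
    return out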

Model training

AutoX currently supports the following models; by default, a LightGBM model is trained:
1. LightGBM;
2. AutoX deep neural network.

Model ensembling

AutoX supports the following two ensembling methods; by default, no ensembling is performed (a generic sketch of the bagging variant follows the list):
1. Stacking;
2. Bagging.
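As a generic illustration of the bagging variant (not AutoX's internal code), the sketch below averages the test-set predictions of LightGBM models trained on different K-fold splits; all names and parameter values are assumptions.

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def bagging_predict(X, y, X_test, n_splits=5, params=None):
    params = params or {'objective': 'regression', 'verbosity': -1}
    test_pred = np.zeros(len(X_test))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for trn_idx, val_idx in kf.split(X):
        trn = lgb.Dataset(X.iloc[trn_idx], y.iloc[trn_idx])
        val = lgb.Dataset(X.iloc[val_idx], y.iloc[val_idx])
        model = lgb.train(params, trn, num_boost_round=2000, valid_sets=[val],
                          callbacks=[lgb.early_stopping(stopping_rounds=100)])
        test_pred += model.predict(X_test) / n_splits   # average the fold models
    return test_pred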

Competition scoring tips:

  • kaggle criteo: bucket feature columns with a very large nunique. For example, for features whose nunique exceeds 10000, hash the values, truncate to 4 digits, and then label-encode the result.
  • zhidemai: article_id implicitly carries time information, so add a rank feature over article_id, e.g. groupby(['date'])['article_id'].rank().
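A hedged pandas sketch of both tricks; the column names (c1, date, article_id) are hypothetical, and hashlib is used instead of the built-in hash so the buckets are reproducible across runs.

import hashlib
import pandas as pd

def hash_bucket(series, n_digits=4):
    # hash each value, keep n_digits digits, then label-encode the buckets
    def bucket(v):
        return int(hashlib.md5(str(v).encode()).hexdigest(), 16) % (10 ** n_digits)
    return series.map(bucket).astype('category').cat.codes

# criteo-style: bucket a categorical column whose nunique exceeds 10000
# df['c1_bucket'] = hash_bucket(df['c1'])

# zhidemai-style: article_id carries time order, so rank it within each date
# df['article_id_rank'] = df.groupby('date')['article_id'].rank()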

Troubleshooting

Error message    Solution
Open issues
  • AutoX_Recommend, dataset processing: KDD Cup 2020

    Original data: https://tianchi.aliyun.com/competition/entrance/231785/introduction
    Data processing references: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon product data

    Original data: http://jmcauley.ucsd.edu/data/amazon/
    Data processing references: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon electronic product recommendation

    Original data: https://www.kaggle.com/datasets/prokaggler/amazon-electronic-product-recommendation
    Data processing references: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend
    opened by poteman 1
  • ModuleNotFoundError: No module named 'autox.autox_server'

    Steps to reproduce:

    git clone https://github.com/4paradigm/AutoX.git
    pip install pytorch_tabnet
    pip install ./AutoX
    python
    >>> from autox import AutoX
    ModuleNotFoundError: No module named 'autox.autox_server'

    opened by utopianet 1
  • lightgbm.train bug (lightgbm==3.3.2.99)

    On macOS with lightgbm==3.3.2.99, lightgbm.train no longer accepts the verbose_eval and early_stopping_rounds keyword arguments (the callbacks interface must be used instead), so calling the lgb model raises an error. A callbacks-based call is sketched after this entry.

    File ~/miniforge3/envs/lx/lib/python3.9/site-packages/autox/autox_competition/models/regressor_ts.py:231, in LgbRegressionTs.fit(self, train, test, used_features, target, time_col, ts_unit, Early_Stopping_Rounds, N_round, Verbose, log1p, custom_metric, weight_for_mae)
        226     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
        227                       verbose_eval=self.Verbose,
        228                       early_stopping_rounds=self.Early_Stopping_Rounds,
        229                       feval=weighted_mae_lgb(weight=weight_for_mae))
        230 else:
    --> 231     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
    ...
        233                     early_stopping_rounds=self.Early_Stopping_Rounds)
        234 val = model.predict(train.iloc[valid_idx][used_features])
        235 if log1p:

    TypeError: train() got an unexpected keyword argument 'verbose_eval'
    opened by LXlearning 0
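With recent lightgbm releases the fix is to move early stopping and evaluation logging into callbacks. A hedged sketch of the adjusted call, reusing the variable names from the traceback above (not a tested patch to regressor_ts.py):

import lightgbm as lgb

# names mirror the traceback above (self.params_, trn_data, val_data, self.N_round, ...)
model = lgb.train(
    self.params_, trn_data,
    num_boost_round=self.N_round,
    valid_sets=[trn_data, val_data],
    callbacks=[
        lgb.early_stopping(stopping_rounds=self.Early_Stopping_Rounds),
        lgb.log_evaluation(period=self.Verbose),
    ],
)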
  • AutoX_NLP/nlp_feature.py: GloVe environment compatibility

    opened by DHengW 0
  • AutoX_NLP/nlp_feature.py: optimize handling of OOV (out-of-vocabulary) words

    opened by DHengW 0
  • AutoX_NLP/nlp_feature.py: improve fastText processing efficiency

    opened by DHengW 0