Code for DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents

Overview

DeepXML

Code for DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents


Architectures and algorithms

DeepXML supports multiple feature architectures such as Bag-of-embedding/Astec, RNN, CNN etc. The code uses a json file to construct the feature architecture. Features could be computed using following encoders:

  • Bag-of-embedding/Astec: As used in the DeepXML paper [1].
  • RNN: RNN based sequential models. Support for RNN, GRU, and LSTM.
  • XML-CNN: CNN architecture as proposed in the XML-CNN paper [4].

Best Practices for features creation


  • Adding sub-words on top of unigrams to the vocabulary can help in training more accurate embeddings and classifiers.

Setting up


Expected directory structure

+-- 
   
    
|  +-- programs
|  |  +-- deepxml
|  |    +-- deepxml
|  +-- data
|    +-- 
    
     
|  +-- models
|  +-- results


    
   

Download data for Astec

* Download the (zipped file) BoW features from XML repository.  
* Extract the zipped file into data directory. 
* The following files should be available in 
   
    /data/
    
      for new datasets (ignore the next step)
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
* The following files should be available in 
     
      /data/
      
        if the dataset is in old format (please refer to next step to convert the data to new format)
    - train.txt
    - test.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy 

      
     
    
   

Convert to new data format

# A perl script is provided (in deepxml/tools) to convert the data into new format as expected by Astec
# Either set the $data_dir variable to the data directory of a particular dataset or replace it with the path
perl convert_format.pl $data_dir/train.txt $data_dir/trn_X_Xf.txt $data_dir/trn_X_Y.txt
perl convert_format.pl $data_dir/test.txt $data_dir/tst_X_Xf.txt $data_dir/tst_X_Y.txt

Example use cases


A single learner with DeepXML framework

The DeepXML framework can be utilized as follows. A json file is used to specify architecture and other arguments. Please refer to the full documentation below for more details.

./run_main.sh 0 DeepXML EURLex-4K 0 108

An ensemble of multiple learners with DeepXML framework

An ensemble can be trained as follows. A json file is used to specify architecture and other arguments.

./run_main.sh 0 DeepXML EURLex-4K 0 108,666,786

Full Documentation

./run_main.sh 
    
     
      
       
       
         * gpu_id: Run the program on this GPU. * framework - DeepXML: Divides the XML problems in 4 modules as proposed in the paper. - DeepXML-OVA: Train the architecture in 1-vs-all fashion [4][5], i.e., loss is computed for each label in each iteration. - DeepXML-ANNS: Train the architecture using a label shortlist. Support is available for a fixed graph or periodic training of the ANNS graph. * dataset - Name of the dataset. - Astec expects the following files in 
        
         /data/
         
           - trn_X_Xf.txt - trn_X_Y.txt - tst_X_Xf.txt - tst_X_Y.txt - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy - You can set the 'embedding_dims' in config file to switch between 300d and 512d embeddings. * version - different runs could be managed by version and seed. - models and results are stored with this argument. * seed - seed value as used by numpy and PyTorch. - an ensemble is learned if multiple comma separated values are passed. 
         
        
       
      
     
    
   

Notes

* Other file formats such as npy, npz, pickle are also supported.
* Initializing with token embeddings (computed from FastText) leads to noticible accuracy gain in Astec. Please ensure that the token embedding file is available in data directory, if 'init=token_embeddings', otherwise it'll throw an error.
* Config files are made available in deepxml/configs/
   
    /
    
      for datasets in XC repository. You can use them when trying out Astec/DeepXML on new datasets.
* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM with a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
* Astec make use of CPU (mainly for nmslib) as well as GPU. 

    
   

Cite as

@InProceedings{Dahiya21,
    author = "Dahiya, K. and Saini, D. and Mittal, A. and Shaw, A. and Dave, K. and Soni, A. and Jain, H. and Agarwal, S. and Varma, M.",
    title = "DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents",
    booktitle = "Proceedings of the ACM International Conference on Web Search and Data Mining",
    month = "March",
    year = "2021"
}

YOU MAY ALSO LIKE

References


[1] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. Deepxml: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[2] pyxclib: https://github.com/kunaldahiya/pyxclib

[3] H. Jain, V. Balasubramanian, B. Chunduri and M. Varma, Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches, In WSDM 2019.

[4] J. Liu, W.-C. Chang, Y. Wu and Y. Yang, XML-CNN: Deep Learning for Extreme Multi-label Text Classification, In SIGIR 2017.

[5] R. Babbar, and B. Schölkopf, DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification In WSDM, 2017.

[6] P., Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. In TACL, 2017.

Owner
Extreme Classification
Extreme Classification
Miscellaneous and lightweight network tools

Network Tools Collection of miscellaneous and lightweight network tools to simplify daily operations, administration, and troubleshooting of networks.

Nicholas Russo 22 Mar 22, 2022
Educational API for 3D Vision using pose to control carton.

Educational API for 3D Vision using pose to control carton.

41 Jul 10, 2022
Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms

AdvancedHMC.jl AdvancedHMC.jl provides a robust, modular and efficient implementation of advanced HMC algorithms. An illustrative example for Advanced

The Turing Language 167 Jan 01, 2023
Code for KDD'20 "An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph"

Heterogeneous INteract and aggreGatE (GraphHINGE) This is a pytorch implementation of GraphHINGE model. This is the experiment code in the following w

Jinjiarui 69 Nov 24, 2022
One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking

One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking This is an official implementation for NEAS presented in CVPR

Multimedia Research 19 Sep 08, 2022
Deploy optimized transformer based models on Nvidia Triton server

Deploy optimized transformer based models on Nvidia Triton server

Lefebvre Sarrut Services 1.2k Jan 05, 2023
Implementation of Sequence Generative Adversarial Nets with Policy Gradient

SeqGAN Requirements: Tensorflow r1.0.1 Python 2.7 CUDA 7.5+ (For GPU) Introduction Apply Generative Adversarial Nets to generating sequences of discre

Lantao Yu 2k Dec 29, 2022
This library is a location of the LegacyLogger for PyTorch Lightning.

neptune-contrib Documentation See neptune-contrib documentation site Installation Get prerequisites python versions 3.5.6/3.6 are supported Install li

neptune.ai 26 Oct 07, 2021
This is the source code of the solver used to compete in the International Timetabling Competition 2019.

ITC2019 Solver This is the source code of the solver used to compete in the International Timetabling Competition 2019. Building .NET Core (2.1 or hig

Edon Gashi 8 Jan 22, 2022
DANet for Tabular data classification/ regression.

Deep Abstract Networks A PyTorch code implemented for the submission DANets: Deep Abstract Networks for Tabular Data Classification and Regression. Do

Ronnie Rocket 55 Sep 14, 2022
This is a package for LiDARTag, described in paper: LiDARTag: A Real-Time Fiducial Tag System for Point Clouds

LiDARTag Overview This is a package for LiDARTag, described in paper: LiDARTag: A Real-Time Fiducial Tag System for Point Clouds (PDF)(arXiv). This wo

University of Michigan Dynamic Legged Locomotion Robotics Lab 159 Dec 21, 2022
A tool to estimate time varying instantaneous reproduction number during epidemics

EpiEstim A tool to estimate time varying instantaneous reproduction number during epidemics. It is described in the following paper: @article{Cori2013

MRC Centre for Global Infectious Disease Analysis 78 Dec 19, 2022
Repositório criado para abrigar os notebooks com a listas de exercícios propostos pelo professor Gustavo Guanabara do canal Curso em Vídeo do YouTube durante o Curso de Python 3

Curso em Vídeo - Exercícios de Python 3 Sobre o repositório Este repositório contém os notebooks com a listas de exercícios propostos pelo professor G

João Pedro Pereira 9 Oct 15, 2022
Model Zoo for AI Model Efficiency Toolkit

We provide a collection of popular neural network models and compare their floating point and quantized performance.

Qualcomm Innovation Center 137 Jan 03, 2023
In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy

PixMix Introduction In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard te

Andy Zou 79 Dec 30, 2022
Ensembling Off-the-shelf Models for GAN Training

Vision-aided GAN video (3m) | website | paper Can the collective knowledge from a large bank of pretrained vision models be leveraged to improve GAN t

345 Dec 28, 2022
Code for ACL2021 long paper: Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases

LANKA This is the source code for paper: Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases (ACL 2021, long paper) Referen

Boxi Cao 30 Oct 24, 2022
Using modified BiSeNet for face parsing in PyTorch

face-parsing.PyTorch Contents Training Demo References Training Prepare training data: -- download CelebAMask-HQ dataset -- change file path in the pr

zll 1.6k Jan 08, 2023
A hyperparameter optimization framework

Optuna: A hyperparameter optimization framework Website | Docs | Install Guide | Tutorial Optuna is an automatic hyperparameter optimization software

7.4k Jan 04, 2023
The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".

Codebase for learning control flow in transformers The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformer

Csordás Róbert 24 Oct 15, 2022