Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Related tags

Text Data & NLPgensen
Overview

GenSen

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Sandeep Subramanian, Adam Trischler, Yoshua Bengio & Christopher Pal

ICLR 2018

About

GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. These representations are useful for transfer and low-resource learning. For details please refer to our ICLR paper.

Code

We provide a PyTorch implementation of our paper along with pre-trained models as well as code to evaluate these models on a variety of transfer learning benchmarks.

Requirements

  • Python 2.7 (Python 3 compatibility coming soon)
  • PyTorch 0.2 or 0.3
  • nltk
  • h5py
  • numpy
  • scikit-learn

Usage

Setting up Models & pre-trained word vecotrs

You download our pre-trained models and set up pre-trained word vectors for vocabulary expansion by

cd data/models
bash download_models.sh
cd ../embedding
bash glove2h5.sh
Using a pre-trained model to extract sentence representations.

You can use our pre-trained models to extract the last hidden state or all hidden states of our multi-task GRU. Additionally, you can concatenate the output of multiple models to replicate the numbers in our paper.

from gensen import GenSen, GenSenSingle

gensen_1 = GenSenSingle(
    model_folder='./data/models',
    filename_prefix='nli_large_bothskip',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
reps_h, reps_h_t = gensen_1.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)
print reps_h.shape, reps_h_t.shape
  • The input to get_representation is sentences, which should be a list of strings. If your strings are not pre-tokenized, then set tokenize=True to use the NLTK tokenizer before computing representations.
  • reps_h (batch_size x seq_len x 2048) contains the hidden states for all words in all sentences (padded to the max length of sentences)
  • reps_h_t (batch_size x 2048) contains only the last hidden state for all sentences in the minibatch

GenSenSingle will return the output of a single model nli_large_bothskip (+STN +Fr +De +NLI +L +STP). You can concatenate the output of multiple models by creating a GenSen instance with multiple GenSenSingle instances, as follows:

gensen_2 = GenSenSingle(
    model_folder='./data/models',
    filename_prefix='nli_large_bothskip_parse',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
gensen = GenSen(gensen_1, gensen_2)
reps_h, reps_h_t = gensen.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)
  1. reps_h (batch_size x seq_len x 4096) contains the hidden states for all words in all sentences (padded to the max length of sentences)
  2. reps_h_t (batch_size x 4096) contains only the last hidden state for all sentences in the minibatch

The model will produce a fixed-length vector for each sentence as well as the hidden states corresponding to each word in every sentence (padded to max sentence length). You can also return a numpy array instead of a torch.FloatTensor by setting return_numpy=True.

Vocabulary Expansion

If you have a specific domain for which you want to compute representations, you can call vocab_expansion on instances of the GenSenSingle or GenSen class simply by gensen.vocab_expansion(vocab) where vocab is a list of unique words in the new domain. This will learn a linear mapping from the provided pretrained embeddings (which have a significantly larger vocabulary) provided to the space of gensen's word vectors. For an example of how this is used in an actual setting, please refer to gensen_senteval.py.

Training a model from scratch

To train a model from scratch, simply run train.py with an appropriate JSON config file. An example config is provided in example_config.json. To continue training, just relaunch the same scripy with load_dir=auto in the config file.

To download some of the data required to train a GenSen model, run:

bash get_data.sh

Note that this script can take a while to complete since it downloads, tokenizes and lowercases a fairly large En-Fr corpus. If you already have these parallel corpora processed, you can replace the paths to these files in the provided example_config.json

Some of the data used in our work is no longer publicly available (BookCorpus - see http://yknzhu.wixsite.com/mbweb) or has an LDC license associated (Penn Treebank). As a result, the example_config.json script will only train on Multilingual NMT and NLI, since they are publicly available. To use models trained on all tasks, please use our available pre-trained models.

Additional Sequence-to-Sequence transduction tasks can be added trivally to the multi-task framework by editing the json config file with more tasks.

python train.py --config example_config.json

To use the default settings in example_config.json you will need a GPU with atleast 16GB of memory (such as a P100), to train on smaller GPUs, you may need to reduce the batch size.

Note that if "load_dir" is set to auto, the script will resume from the last saved model in "save_dir".

Creating a GenSen model from a trained multi-task model

Once you have a trained model, we can throw away all of the decoders and just retain the encoder used to compute sentence representations.

You can do this by running

python create_gensen.py -t <path_to_trained_model> -s <path_to_save_encoder> -n <name_of_encoder>

Once you have done this, you can load this model just like any of the pre-trained models by specifying the model_folder as path_to_save_encoder and filename_prefix as name_of_encoder in the above command.

your_gensen = GenSenSingle(
    model_folder='<path_to_save_encoder>',
    filename_prefix='<name_of_encoder>',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)

Transfer Learning Evaluations

We used the SentEval toolkit to run most of our transfer learning experiments. To replicate these numbers, clone their repository and follow setup instructions. Once complete, copy gensen_senteval.py and gensen.py into their examples folder and run the following commands to reproduce different rows in Table 2 of our paper. Note: Please set the path to the pretrained glove embeddings (glove.840B.300d.h5) and model folder as appropriate.

(+STN +Fr +De +NLI +L +STP)      python gensen_senteval.py --prefix_1 nli_large --prefix_2 nli_large_bothskip
(+STN +Fr +De +NLI +2L +STP)     python gensen_senteval.py --prefix_1 nli_large_bothskip --prefix_2 nli_large_bothskip_2layer
(+STN +Fr +De +NLI +L +STP +Par) python gensen_senteval.py --prefix_1 nli_large_bothskip_parse --prefix_2 nli_large_bothskip

Reference

@article{subramanian2018learning,
title={Learning general purpose distributed sentence representations via large scale multi-task learning},
author={Subramanian, Sandeep and Trischler, Adam and Bengio, Yoshua and Pal, Christopher J},
journal={arXiv preprint arXiv:1804.00079},
year={2018}
}
Owner
Maluuba Inc.
A @Microsoft company
Maluuba Inc.
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁

TGCLOUD 🪁 Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁 Features Easy to Deploy Heroku Supp

Mr.Acid dev 6 Oct 18, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

R²SQL The PyTorch implementation of paper Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. (AAAI 2021) Requirement

huybery 60 Dec 31, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API

gpt3-instruct-sandbox Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API Description This project updates an existing GPT-3 san

312 Jan 03, 2023
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
Open source code for AlphaFold.

AlphaFold This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP

DeepMind 9.7k Jan 02, 2023
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
The Classical Language Toolkit

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.clt

Classical Language Toolkit 754 Jan 09, 2023
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

1 Nov 24, 2021
BeautyNet is an AI powered model which can tell you whether you're beautiful or not.

BeautyNet BeautyNet is an AI powered model which can tell you whether you're beautiful or not. Download Dataset from here:https://www.kaggle.com/gpios

Ansh Gupta 0 May 06, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023
A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

K-BERT Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework. R

Weijie Liu 834 Jan 09, 2023
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
用Resnet101+GPT搭建一个玩王者荣耀的AI

基于pytorch框架用resnet101加GPT搭建AI玩王者荣耀 本源码模型主要用了SamLynnEvans Transformer 的源码的解码部分。以及pytorch自带的预训练模型"resnet101-5d3b4d8f.pth"

冯泉荔 2.2k Jan 03, 2023