Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Related tags

Text Data & NLPgensen
Overview

GenSen

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Sandeep Subramanian, Adam Trischler, Yoshua Bengio & Christopher Pal

ICLR 2018

About

GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. These representations are useful for transfer and low-resource learning. For details please refer to our ICLR paper.

Code

We provide a PyTorch implementation of our paper along with pre-trained models as well as code to evaluate these models on a variety of transfer learning benchmarks.

Requirements

  • Python 2.7 (Python 3 compatibility coming soon)
  • PyTorch 0.2 or 0.3
  • nltk
  • h5py
  • numpy
  • scikit-learn

Usage

Setting up Models & pre-trained word vecotrs

You download our pre-trained models and set up pre-trained word vectors for vocabulary expansion by

cd data/models
bash download_models.sh
cd ../embedding
bash glove2h5.sh
Using a pre-trained model to extract sentence representations.

You can use our pre-trained models to extract the last hidden state or all hidden states of our multi-task GRU. Additionally, you can concatenate the output of multiple models to replicate the numbers in our paper.

from gensen import GenSen, GenSenSingle

gensen_1 = GenSenSingle(
    model_folder='./data/models',
    filename_prefix='nli_large_bothskip',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
reps_h, reps_h_t = gensen_1.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)
print reps_h.shape, reps_h_t.shape
  • The input to get_representation is sentences, which should be a list of strings. If your strings are not pre-tokenized, then set tokenize=True to use the NLTK tokenizer before computing representations.
  • reps_h (batch_size x seq_len x 2048) contains the hidden states for all words in all sentences (padded to the max length of sentences)
  • reps_h_t (batch_size x 2048) contains only the last hidden state for all sentences in the minibatch

GenSenSingle will return the output of a single model nli_large_bothskip (+STN +Fr +De +NLI +L +STP). You can concatenate the output of multiple models by creating a GenSen instance with multiple GenSenSingle instances, as follows:

gensen_2 = GenSenSingle(
    model_folder='./data/models',
    filename_prefix='nli_large_bothskip_parse',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
gensen = GenSen(gensen_1, gensen_2)
reps_h, reps_h_t = gensen.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)
  1. reps_h (batch_size x seq_len x 4096) contains the hidden states for all words in all sentences (padded to the max length of sentences)
  2. reps_h_t (batch_size x 4096) contains only the last hidden state for all sentences in the minibatch

The model will produce a fixed-length vector for each sentence as well as the hidden states corresponding to each word in every sentence (padded to max sentence length). You can also return a numpy array instead of a torch.FloatTensor by setting return_numpy=True.

Vocabulary Expansion

If you have a specific domain for which you want to compute representations, you can call vocab_expansion on instances of the GenSenSingle or GenSen class simply by gensen.vocab_expansion(vocab) where vocab is a list of unique words in the new domain. This will learn a linear mapping from the provided pretrained embeddings (which have a significantly larger vocabulary) provided to the space of gensen's word vectors. For an example of how this is used in an actual setting, please refer to gensen_senteval.py.

Training a model from scratch

To train a model from scratch, simply run train.py with an appropriate JSON config file. An example config is provided in example_config.json. To continue training, just relaunch the same scripy with load_dir=auto in the config file.

To download some of the data required to train a GenSen model, run:

bash get_data.sh

Note that this script can take a while to complete since it downloads, tokenizes and lowercases a fairly large En-Fr corpus. If you already have these parallel corpora processed, you can replace the paths to these files in the provided example_config.json

Some of the data used in our work is no longer publicly available (BookCorpus - see http://yknzhu.wixsite.com/mbweb) or has an LDC license associated (Penn Treebank). As a result, the example_config.json script will only train on Multilingual NMT and NLI, since they are publicly available. To use models trained on all tasks, please use our available pre-trained models.

Additional Sequence-to-Sequence transduction tasks can be added trivally to the multi-task framework by editing the json config file with more tasks.

python train.py --config example_config.json

To use the default settings in example_config.json you will need a GPU with atleast 16GB of memory (such as a P100), to train on smaller GPUs, you may need to reduce the batch size.

Note that if "load_dir" is set to auto, the script will resume from the last saved model in "save_dir".

Creating a GenSen model from a trained multi-task model

Once you have a trained model, we can throw away all of the decoders and just retain the encoder used to compute sentence representations.

You can do this by running

python create_gensen.py -t <path_to_trained_model> -s <path_to_save_encoder> -n <name_of_encoder>

Once you have done this, you can load this model just like any of the pre-trained models by specifying the model_folder as path_to_save_encoder and filename_prefix as name_of_encoder in the above command.

your_gensen = GenSenSingle(
    model_folder='<path_to_save_encoder>',
    filename_prefix='<name_of_encoder>',
    pretrained_emb='./data/embedding/glove.840B.300d.h5'
)

Transfer Learning Evaluations

We used the SentEval toolkit to run most of our transfer learning experiments. To replicate these numbers, clone their repository and follow setup instructions. Once complete, copy gensen_senteval.py and gensen.py into their examples folder and run the following commands to reproduce different rows in Table 2 of our paper. Note: Please set the path to the pretrained glove embeddings (glove.840B.300d.h5) and model folder as appropriate.

(+STN +Fr +De +NLI +L +STP)      python gensen_senteval.py --prefix_1 nli_large --prefix_2 nli_large_bothskip
(+STN +Fr +De +NLI +2L +STP)     python gensen_senteval.py --prefix_1 nli_large_bothskip --prefix_2 nli_large_bothskip_2layer
(+STN +Fr +De +NLI +L +STP +Par) python gensen_senteval.py --prefix_1 nli_large_bothskip_parse --prefix_2 nli_large_bothskip

Reference

@article{subramanian2018learning,
title={Learning general purpose distributed sentence representations via large scale multi-task learning},
author={Subramanian, Sandeep and Trischler, Adam and Bengio, Yoshua and Pal, Christopher J},
journal={arXiv preprint arXiv:1804.00079},
year={2018}
}
Owner
Maluuba Inc.
A @Microsoft company
Maluuba Inc.
Unsupervised Language Model Pre-training for French

FlauBERT and FLUE FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the n

GETALP 212 Dec 10, 2022
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
(ACL-IJCNLP 2021) Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models.

BERT Convolutions Code for the paper Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models. Contains expe

mlpc-ucsd 21 Jul 18, 2022
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
The swas programming language

The Swas programming language This is a language that was made for fun. Installation Step 0: Make sure you have python installed Step 1. Clone this re

Swas.py 19 Jul 18, 2022
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

This file contains the following documents sumbited for Baruch CIS9665 group 9 fall 2021. 1. Dataset: drug_reviews.csv 2. python codes for text classi

Aarif Munwar Jahan 2 Jan 04, 2023
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 07, 2023
मराठी भाषा वाचविण्याचा एक प्रयास. इंग्रजी ते मराठीचा शब्दकोश. An attempt to preserve the Marathi language. A lightweight and ad free English to Marathi thesaurus.

For English, scroll down मराठी शब्द मराठी भाषा वाचवण्यासाठी मी हा ओपन सोर्स प्रोजेक्ट सुरू केला आहे. माझ्या मते, आपली भाषा हळूहळू आणि कोणाचाही लक्षात

मुक्त स्त्रोत 20 Oct 11, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
Semi-automated vocabulary generation from semantic vector models

vec2word Semi-automated vocabulary generation from semantic vector models This script generates a list of potential conlang word forms along with asso

9 Nov 25, 2022
Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

MLP Singer Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis. Audio samples are available on our demo page.

Neosapience 103 Dec 23, 2022
BeautyNet is an AI powered model which can tell you whether you're beautiful or not.

BeautyNet BeautyNet is an AI powered model which can tell you whether you're beautiful or not. Download Dataset from here:https://www.kaggle.com/gpios

Ansh Gupta 0 May 06, 2022
Utility for Google Text-To-Speech batch audio files generator. Ideal for prompt files creation with Google voices for application in offline IVRs

Google Text-To-Speech Batch Prompt File Maker Are you in the need of IVR prompts, but you have no voice actors? Let Google talk your prompts like a pr

Ponchotitlán 1 Aug 19, 2021
LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transf

Studio Ousia 587 Dec 30, 2022