Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Overview

This is the official repository for the EMNLP 2021 long paper Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration. We provide code for training and evaluating Phrase-BERT in addition to the datasets used in the paper.

Update: the model is now also available on Hugging Face, thanks to help from whaleloops and nreimers!

Setup

This repository depends on sentence-BERT version 0.3.3, which you can install from source using:

>>> git clone https://github.com/UKPLab/sentence-transformers.git --branch v0.3.3
>>> cd sentence-transformers/
>>> pip install -e .

Alternatively, you can install sentence-BERT with pip:

>>> pip install sentence-transformers==0.3.3

Quick Start

The following example shows how to use a trained Phrase-BERT model to embed phrases into dense vectors.

First download and unzip our model.

>>> cd 
>>> wget https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
>>> unzip phrase-bert-model.zip -d phrase-bert-model/
>>> rm phrase-bert-model.zip

Then load the Phrase-BERT model through the sentence-BERT interface:

from sentence_transformers import SentenceTransformer
# Set this to the directory that holds the unzipped Phrase-BERT model.
model_path = ''
model = SentenceTransformer(model_path)

You can compute phrase embeddings using Phrase-BERT as follows:

phrase_list = ['play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode(phrase_list)
p1, p2, p3 = phrase_embs

As in sentence-BERT, the default output is a list of numpy arrays:

for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")

An example of computing the dot product of phrase embeddings:

import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')

An example of computing cosine similarity of phrase embeddings:

import torch 
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')

The output should look like:

The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759

Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect five phrase semantics evaluation tasks (Turney, BiRD, PPDB, PPDB-filtered, and PAWS-short), which are described further in our paper.

Set the model path in config/model_path.py to match your directory layout, then:

  • For evaluation on Turney, run python eval_turney.py

  • For evaluation on BiRD, run python eval_bird.py

  • For evaluation on PPDB / PPDB-filtered / PAWS-short, run eval_ppdb_paws.py, replacing each bracketed placeholder with your own value:

    nohup python -u eval_ppdb_paws.py \
        --full_run_mode \
        --task <task> \
        --data_dir <data_dir> \
        --result_dir <result_dir> \
        > ./output.txt 2>&1 &

Train your own Phrase-BERT

If you would like to go beyond using the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from the domain you are interested in. Please refer to phrase-bert/phrase_bert_finetune.py.

The datasets we used to fine-tune Phrase-BERT are available as a training data CSV file and a validation data CSV file.

To reproduce the trained Phrase-BERT, fill in the empty variables below with your own paths and run:

export INPUT_DATA_PATH=
export TRAIN_DATA_FILE=
export VALID_DATA_FILE=
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=


python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
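
Because the fine-tuning command reads all of its paths from these variables, a variable left blank may only surface as an error once the script is already running. The following is a minimal sketch of a pre-flight check, assuming only the variable names from the export lines above. The helper missing_vars and the example environment are hypothetical and not part of this repository:

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = [
    "INPUT_DATA_PATH",
    "TRAIN_DATA_FILE",
    "VALID_DATA_FILE",
    "INPUT_MODEL_PATH",
    "OUTPUT_MODEL_PATH",
]

def missing_vars(env, names=REQUIRED_VARS):
    """Return the names that are unset or blank in the mapping env."""
    return [name for name in names if not env.get(name, "").strip()]

# Hypothetical, partially filled environment used only for illustration.
example_env = {
    "INPUT_MODEL_PATH": "bert-base-nli-stsb-mean-tokens",
    "TRAIN_DATA_FILE": " ",   # blank values count as missing
}
print(missing_vars(example_env))
```

To check the real shell environment instead, call missing_vars(os.environ) before launching phrase_bert_finetune.py.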

      
     
    
   

Citation

Please cite us if you find this useful:

@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021}
}