Source code for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Last update: Dec 17, 2022

WhiteningBERT

Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Preparation

git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
cd WhiteningBERT
pip install -r requirements.txt
cd examples/evaluation

Usage

Datasets

We use seven STS datasets: STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, and STS16.

The processed data can be found in ./examples/datasets/.

Run

To run a quick demo:

python evaluation_stsbenchmark.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased

Set --pooling to cls to use the [CLS] token embedding, or to aver to average over all token embeddings. Use --layer_num to choose which layers to combine, given as a comma-separated list (for example, 1,12).
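The flags of the demo command can be read as three steps: combine the selected layers, pool over tokens, and whiten the resulting sentence embeddings. The following is a minimal NumPy sketch of those steps, not the repository's implementation. It assumes that "combining" layers means averaging them element-wise, that hidden states are laid out as (layers, batch, tokens, dim) with index 0 holding the embedding layer output, and that whitening is the usual SVD-based transform of the covariance matrix. The function names pool and whiten are hypothetical.

```python
import numpy as np


def pool(hidden_states, attention_mask, layers, pooling="aver"):
    """Combine layers, then pool tokens into one vector per sentence.

    hidden_states: array of shape (num_layers + 1, batch, seq_len, dim);
        index 0 is assumed to be the embedding layer output.
    attention_mask: array of shape (batch, seq_len), 1 for real tokens.
    layers: layer indices to combine, e.g. (1, 12).
    pooling: "cls" (first token) or "aver" (mean over real tokens).
    """
    # Assumption: layers are combined by element-wise averaging.
    combined = np.mean([hidden_states[i] for i in layers], axis=0)
    if pooling == "cls":
        return combined[:, 0]
    if pooling != "aver":
        raise ValueError(f"unknown pooling mode: {pooling}")
    mask = attention_mask[..., None].astype(combined.dtype)
    summed = (combined * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def whiten(embeddings, eps=1e-9):
    """Center the embeddings and map their covariance to the identity."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / embeddings.shape[0]
    u, s, _ = np.linalg.svd(cov)
    # Scaling column j of u by 1 / sqrt(s[j]) equals u @ diag(s ** -0.5).
    w = u / np.sqrt(s + eps)
    return centered @ w
```

In this sketch, whitening is fitted on the same set of sentence embeddings it transforms, so it needs more sentences than embedding dimensions for the covariance matrix to be well conditioned.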

To enumerate all possible combinations of two layers and evaluate each combination in turn:

python evaluation_stsbenchmark_layer2.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased

To enumerate all possible combinations of N layers:

python evaluation_stsbenchmark_layerN.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased \
			--combination_num 4
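To get a feel for how many runs such an enumeration implies, the sketch below lists every N-layer combination with itertools.combinations. It assumes a 12-layer encoder whose layers are numbered 1 to 12 and formats each combination like a --layer_num value; the helper layer_combinations is hypothetical, and the script itself may number or order layers differently.

```python
from itertools import combinations


def layer_combinations(num_layers, combination_num):
    """Yield comma-separated layer lists such as "1,2,3,4"."""
    # Assumption: layers are numbered from 1 to num_layers inclusive.
    layer_ids = range(1, num_layers + 1)
    for combo in combinations(layer_ids, combination_num):
        yield ",".join(str(i) for i in combo)


if __name__ == "__main__":
    combos = list(layer_combinations(num_layers=12, combination_num=4))
    print(len(combos))            # C(12, 4) = 495 combinations
    print(combos[0], combos[-1])  # first and last combination
```

Under these assumptions, four layers out of twelve give 495 evaluations and two layers give 66, so the cost grows quickly with --combination_num.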

You can also save the sentence embeddings:

python evaluation_stsbenchmark_save_embed.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased \
			--summary_dir ./save_embeddings

A list of PLMs you can select:

bert-base-uncased, bert-large-uncased
roberta-base, roberta-large
bert-base-multilingual-uncased
sentence-transformers/LaBSE
albert-base-v1, albert-large-v1
microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
microsoft/deberta-base, microsoft/deberta-large
google/electra-base-discriminator
google/mobilebert-uncased
microsoft/DialogRPT-human-vs-rand
distilbert-base-uncased
......

Acknowledgements

The code is adapted from the repositories of the EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP 2020 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.
