Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Last update: Dec 27, 2022

SIF

This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

The code is written in Python and requires numpy, scipy, pickle, sklearn, theano, and the lasagne library. Some functions and classes are based on John Wieting's code for the paper "Towards Universal Paraphrastic Sentence Embeddings" (Thanks John!). The example data sets were also preprocessed using that code.

Install

To install all dependencies, using virtualenv is suggested:

$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt

Get started

To get started, cd into the directory examples/ and run demo.sh. It downloads the pretrained GloVe word embeddings, and then runs the scripts:

sif_embedding.py is a demo of how to generate sentence embeddings using the SIF weighting scheme,
sim_sif.py and sim_tfidf.py are for the textual similarity tasks in the paper,
supervised_sif_proj.sh is for the supervised tasks in the paper.

Check these files to see the options.

Source code

The code is separated into the following parts:

SIF embedding: involves SIF_embedding.py. The SIF weighting scheme is very simple and is implemented in a few lines.
textual similarity tasks: involves data_io.py, eval.py, and sim_algo.py. data_io provides the code for reading the data, eval is for evaluating the performance, and sim_algo provides the code for our sentence embedding algorithm.
supervised tasks: involves data_io.py, eval.py, train.py, proj_model_sim.py, and proj_model_sentiment.py. train provides the entry for training the models (proj_model_sim is for the similarity and entailment tasks, and proj_model_sentiment is for the sentiment task). Check train.py to see the options.
utilities: includes lasagne_average_layer.py, params.py, and tree.py. These provide utility functions and classes for the above two parts.
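
To give a feel for how little code the SIF weighting scheme needs, here is a minimal sketch based on the description in the paper. It is not the code in SIF_embedding.py. The function names, the dictionary-based inputs, and the value a=1e-3 are assumptions made for illustration. Each word w gets the weight a / (a + p(w)), where p(w) is its estimated unigram probability. A sentence embedding is the weighted average of its word vectors, and the projection onto the first singular vector of the embedding matrix is then removed.

```python
import numpy as np


def sif_weights(word_probs, a=1e-3):
    """Map each word to its SIF weight a / (a + p(w)).

    `word_probs` (hypothetical input) maps a word to its estimated
    unigram probability; frequent words receive smaller weights.
    """
    return {w: a / (a + p) for w, p in word_probs.items()}


def sif_embedding(sentences, word_vectors, word_probs, a=1e-3):
    """Return one embedding per sentence as rows of a 2-D array.

    `sentences` is a list of token lists and `word_vectors` maps a word
    to its vector (both hypothetical input formats for this sketch).
    """
    weights = sif_weights(word_probs, a)
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))

    # Step 1: weighted average of the word vectors in each sentence.
    # Words missing from either dictionary are skipped (an assumption).
    for i, sent in enumerate(sentences):
        known = [w for w in sent if w in word_vectors and w in weights]
        if known:
            vecs = [weights[w] * np.asarray(word_vectors[w], dtype=float)
                    for w in known]
            emb[i] = np.mean(vecs, axis=0)

    # Step 2: remove the projection onto the first singular vector.
    # Rows of `emb` are sentences, so the first right singular vector
    # lives in the word-vector space.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)
```

Sentence similarity can then be scored with the cosine similarity between two rows of the returned array.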

References

For technical details and full experimental results, see the paper.

@inproceedings{arora2017asimple,
	author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma}, 
	title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings}, 
	booktitle = {International Conference on Learning Representations},
	year = {2017}
}
