NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Related tags

Text Data & NLPMCSE
Overview

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multimodal Contrastive Learning of Sentence Embeddings. If you find this reposity useful, please consider citing our paper.

Contact: Miaoran Zhang ([email protected])

Pre-trained Models & Results

Model Avg. STS
flickr-mcse-bert-base-uncased [Google Drive] 77.70
flickr-mcse-roberta-base [Google Drive] 78.44
coco-mcse-bert-base-uncased [Google Drive] 77.08
coco-mcse-roberta-base [Google Drive] 78.17

Note: flickr indicates that models are trained on wiki+flickr, and coco indicates that models are trained on wiki+coco.

Quickstart

Setup

  • Python 3.9.5
  • Pytorch 1.7.1
  • Install other packages:
pip install -r requirements.txt

Data Preparation

Please organize the data directory as following:

REPO ROOT
|
|--data    
|  |--wiki1m_for_simcse.txt  
|  |--flickr_random_captions.txt    
|  |--flickr_resnet.hdf5    
|  |--coco_random_captions.txt    
|  |--coco_resnet.hdf5  

Wiki1M

wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt

Flickr30k & MS-COCO
You can either download the preprocessed data we used:
(annotation sources: flickr30k-entities and coco).

Or preprocess the data by yourself (take Flickr30k as an example):

  1. Download the flickr30k-entities.
  2. Request access to the flickr-images from here. Note that the use of the images much abide by the Flickr Terms of Use.
  3. Run script:
    unzip ${path_to_flickr-entities}/annotations.zip
    
    python preprocess/prepare_flickr.py \
        --flickr_entities_dir ${path_to_flickr-entities}  \  
        --flickr_images_dir ${path_to_flickr-images} \
        --output_dir data/
        --batch_size 32
    

Train & Evaluation

  1. Prepare the senteval datasets for evaluation:

    cd SentEval/data/downstream/
    bash download_dataset.sh
    
  2. Run scripts:

    # For example:  (more examples are given in scripts/.)
    sh scripts/run_wiki_flickr.sh

    Note: In the paper we run experiments with 5 seeds (0,1,2,3,4). You can find the detailed parameter settings in Appendix.

Acknowledgements

  • The extremely clear and well organized codebase: SimCSE
  • SentEval toolkit
Owner
Saarland University Spoken Language Systems Group
Saarland University Spoken Language Systems Group
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5: Towards a token-free future with pre-trained byte-to-byte models ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword

Google Research 409 Jan 06, 2023
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
To be a next-generation DL-based phenotype prediction from genome mutations.

Sequence -----------+-- 3D_structure -- 3D_module --+ +-- ? | |

Eric Alcaide 18 Jan 11, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
Suite of 500 procedurally-generated NLP tasks to study language model adaptability

TaskBench500 The TaskBench500 dataset and code for generating tasks. Data The TaskBench dataset is available under wget http://web.mit.edu/bzl/www/Tas

Belinda Li 20 May 17, 2022
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Implementation of some unbalanced loss for NLP task like focal_loss, dice_loss, DSC Loss, GHM Loss et.al Summary Here is a loss implementation reposit

121 Jan 01, 2023
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

Reddit text to speech generator A basic reddit tts video generator Current functionality Generate videos for subs based on comments,(askreddit) so rea

Aadvik 17 Dec 19, 2022
Help you discover excellent English projects and get rid of disturbing by other spoken language

GitHub English Top Charts 「Help you discover excellent English projects and get

GrowingGit 544 Jan 09, 2023
Implementation of legal QA system based on SentenceKoBART

LegalQA using SentenceKoBART Implementation of legal QA system based on SentenceKoBART How to train SentenceKoBART Based on Neural Search Engine Jina

Heewon Jeon(gogamza) 75 Dec 27, 2022
超轻量级bert的pytorch版本,大量中文注释,容易修改结构,持续更新

bert4pytorch 2021年8月27更新: 感谢大家的star,最近有小伙伴反映了一些小的bug,我也注意到了,奈何这个月工作上实在太忙,更新不及时,大约会在9月中旬集中更新一个只需要pip一下就完全可用的版本,然后会新添加一些关键注释。 再增加对抗训练的内容,更新一个完整的finetune

muqiu 317 Dec 18, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
ChatterBot is a machine learning, conversational dialog engine for creating chat bots

ChatterBot ChatterBot is a machine-learning based conversational dialog engine build in Python which makes it possible to generate responses based on

Gunther Cox 12.8k Jan 03, 2023
Use Tensorflow2.7.0 Build OpenAI'GPT-2

TF2_GPT-2 Use Tensorflow2.7.0 Build OpenAI'GPT-2 使用最新tensorflow2.7.0构建openai官方的GPT-2 NLP模型 优点 使用无监督技术 拥有大量词汇量 可实现续写(堪比“xx梦续写”) 实现对话后续将应用于FloatTech的Bot

Watermelon 9 Sep 13, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022