使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

Related tags

Text Data & NLPSimCSE
Overview

SimCSE复现

项目描述

SimCSE是一种简单但是很巧妙的NLP对比学习方法,创新性地引入Dropout的方式,对样本添加噪声,从而达到对正样本增强的目的。 该框架的训练目的为:对于batch中的每个样本,拉近其与正样本之间的距离,拉远其与负样本之间的距离,使得模型能够在大规模无监督语料(也可以使用有监督的语料)中学习到文本相似关系。 详见论文:Simple Contrastive Learning of Sentence EmbeddingsSimCSE官方代码仓库

本项目使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法,并且在STS-B数据集上进行消融实验,评价指标为Spearman相关系数,预训练模型为Bert-base-uncased, 验证了SimCSE的有效性。在STS-B数据集上,有监督训练和无监督训练的复现效果如下表。

在无监督训练中,dropout=0.1时,复现效果比原文略差,但也比较接近。当dropout=0.2时,复现效果比原文略高。 ** 但在有监督训练中,不知是否由于batch size过小(原论文使用512),复现效果与论文的效果相差较远,后续会进行排查。 **

训练方法 learning rate batch size dropout Spearman’s correlation
原论文 无监督 3e-5 64 0.1 0.763
复现 无监督 3e-5 64 0.2 0.771
复现 无监督 3e-5 64 0.1 0.748
原论文 有监督 5e-5 512 0.1 0.816
复现 有监督 5e-5 64 0.1 0.764

运行环境

python==3.6、transformers==3.1.0、torch==1.6.0

项目结构

  • data:存放训练数据
    • stsbenchmark:STS-B数据集
      • sts-dev.csv:STS-B验证集
      • sts-test.csv:STS-B验测试集
    • nli_for_simcse.csv:数量275601为的NLI数据集
    • wiki1m_for_simcse.txt:维基百科上获取的100w的文本
  • output:输出目录
  • pretrain_model:预训练模型存放位置
  • script:脚本存放位置。
  • dataset.py
  • model.py:模型代码,包含有监督和无监督损失函数的计算方式
  • train.py:训练代码

使用方法

Quick Start

下载训练数据:

bash script/download_nli.sh
bash script/download_wiki.sh

无监督训练,运行脚本

bash script/run_unsup_train.sh

有监督训练,运行脚本

bash script/run_sup_train.sh

实验

无监督训练

从前四条实验数据中可以看到,较大的batch size在一定程度上可以增加模型的泛化性。

dropout为0.2的时候,训练效果比0.1与0.3更好,有可能dropout=0.1加入的噪声过小,而dropout=0.3加入的噪声过大,增强得到的样本与原始样本差异较大。

learning rate batch size dropout 在哪一步得到best checkpoint 验证集上的得分 测试集上的得分
3e-5 256 0.1 6000 0.800 0.761
3e-5 128 0.1 4200 0.799 0.747
3e-5 64 0.1 10900 0.803 0.748
3e-5 32 0.1 21300 0.787 0.714
3e-5 64 0.2 11200 0.811 0.771
3e-5 64 0.3 6300 0.781 0.745
1e-5 64 0.1 16400 0.798 0.751

有监督训练

有监督实验的复现结果未达到预期,超参数相同时,在验证集上的得分略高于无监督,但是在测试集上,得分基本没有差异。增大有监督训练的学习率,有监督的训练的得分略高于无监督训练, 但还是与论文声称的0.816相差较远,原论文使用512的batch size, 不知是否由于batch size的设置有关,后续会对有监督的训练代码进一步排查。

不过从训练曲线可以看到,有监督训练的收敛速度明显快于无监督训练,这也符合我们的认知。

训练方法 learning rate batch size dropout 在哪一步得到best checkpoint 验证集上的得分 测试集上的得分
无监督 3e-5 64 0.1 10900 0.803 0.748
有监督 3e-5 64 0.1 200 0.810 0.748
有监督 5e-5 64 0.1 2300 0.809 0.764
有监督 3e-5 32 0.1 200 0.808 0.743
有监督 5e-5 32 0.1 200 0.806 0.746

无监督训练过程中,验证集得分的变化曲线: avatar

有监督训练过程中,验证集得分的变化曲线: avatar

REFERENCE

TODO

  • 排查有监督学习的效果不符合预期的原因
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录 为了促进中文自然语言处理研究的发展,本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现,可以支持数据并行、模型加速、流水并行的代码。 安装 1、首先安装pytorch等基础依赖,再安装APEX以支持fp16。 p

Tsinghua AI 37 Dec 06, 2022
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
Shared, streaming Python dict

UltraDict Sychronized, streaming Python dictionary that uses shared memory as a backend Warning: This is an early hack. There are only few unit tests

Ronny Rentner 192 Dec 23, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 2.2k Jan 03, 2023
Natural Language Processing at EDHEC, 2022

Natural Language Processing Here you will find the teaching materials for the "Natural Language Processing" course at EDHEC Business School, 2022 What

1 Feb 04, 2022
A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

wav2vec-toolkit A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models This repository accompanies the

Anton Lozhkov 29 Oct 23, 2022
Rootski - Full codebase for rootski.io (without the data)

📣 Welcome to the Rootski codebase! This is the codebase for the application run

Eric 20 Nov 18, 2022
jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

izuna385 10 Jan 06, 2023
Voice Assistant inspired by Google Assistant, Cortana, Alexa, Siri, ...

author: @shival_gupta VoiceAI This program is an example of a simple virtual assitant It will listen to you and do accordingly It will begin with wish

Shival Gupta 1 Jan 06, 2022
Chinese segmentation library

What is loso? loso is a Chinese segmentation system written in Python. It was developed by Victor Lin ( Fang-Pen Lin 82 Jun 28, 2022

Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
profile tools for pytorch nn models

nnprof Introduction nnprof is a profile tool for pytorch neural networks. Features multi profile mode: nnprof support 4 profile mode: Layer level, Ope

Feng Wang 42 Jul 09, 2022
AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022
This repository contains the codes for LipGAN. LipGAN was published as a part of the paper titled "Towards Automatic Face-to-Face Translation".

LipGAN Generate realistic talking faces for any human speech and face identity. [Paper] | [Project Page] | [Demonstration Video] Important Update: A n

Rudrabha Mukhopadhyay 438 Dec 31, 2022
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022