Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

Overview

KR-BERT-SimCSE

Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision

I used Korean Wikipedia Corpus that is divided into sentences in advance. (Check out tfds-korean catalog page for details)

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • Total steps: 25,000
    • 0.05 warmup rate, and linear decay learning rate scheduler
    • temperature 0.05
    • evalaute on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • Use pooled outputs for training, and [CLS] token's representations for inference

The hyperparameters were not tuned and mostly followed the values in the paper.

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (Check out tfds-korean catalog page)

  • Settings
    • KR-BERT character
    • batch size 128
    • epoch 3
    • peak learning rate 5e-5
    • 0.05 warmup rate, and linear decay learning rate scheduler
    • temperature 0.05
    • evalaute on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • Use pooled outputs for training, and [CLS] token's representations for inference

The hyperparameters were not tuned and mostly followed the values in the paper.

Results

KorSTS (dev set results)

model 100 X Spearman correlation
KR-BERT base
SimCSE
unsupervised bi encoding 79.99
KR-BERT base
SimCSE-supervised
trained on KorNLI bi encoding 84.88
SRoBERTa base* unsupervised bi encoding 63.34
SRoBERTa base* trained on KorNLI bi encoding 76.48
SRoBERTa base* trained on KorSTS bi encoding 83.68
SRoBERTa base* trained on KorNLI -> KorSTS bi encoding 83.54
SRoBERTa large* trained on KorNLI bi encoding 77.95
SRoBERTa large* trained on KorSTS bi encoding 84.74
SRoBERTa large* trained on KorNLI -> KorSTS bi encoding 84.21

KorSTS (test set results)

model 100 X Spearman correlation
KR-BERT base
SimCSE
unsupervised bi encoding 73.25
KR-BERT base
SimCSE-supervised
trained on KorNLI bi encoding 80.72
SRoBERTa base* unsupervised bi encoding 48.96
SRoBERTa base* trained on KorNLI bi encoding 74.19
SRoBERTa base* trained on KorSTS bi encoding 78.94
SRoBERTa base* trained on KorNLI -> KorSTS bi encoding 80.29
SRoBERTa large* trained on KorNLI bi encoding 75.46
SRoBERTa large* trained on KorSTS bi encoding 79.55
SRoBERTa large* trained on KorNLI -> KorSTS bi encoding 80.49
SRoBERTa base* trained on KorSTS cross encoding 83.00
SRoBERTa large* trained on KorSTS cross encoding 85.27

KLUE STS (dev set results)

model 100 X Pearson's correlation
KR-BERT base
SimCSE
unsupervised bi encoding 74.45
KR-BERT base
SimCSE-supervised
trained on KorNLI bi encoding 79.42
KR-BERT base* supervised cross encoding 87.50

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Owner
Jeong Ukjae
Jeong Ukjae
Contract Understanding Atticus Dataset

Contract Understanding Atticus Dataset This repository contains code for the Contract Understanding Atticus Dataset (CUAD), a dataset for legal contra

The Atticus Project 273 Dec 17, 2022
AMUSE - financial summarization

AMUSE AMUSE - financial summarization Unzip data.zip Train new model: python FinAnalyze.py --task train --start 0 --count how many files,-1 for all

1 Jan 11, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Vincent Hellendoorn 947 Dec 28, 2022
Active learning for text classification in Python

Active Learning allows you to efficiently label training data in a small-data scenario.

Webis 375 Dec 28, 2022
Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

ProphetNet-X This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained language model call

Microsoft 394 Dec 17, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
Finetune gpt-2 in google colab

gpt-2-colab finetune gpt-2 in google colab sample result (117M) from retraining on A Tale of Two Cities by Charles Di

212 Jan 02, 2023
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

Twitter-News-Summarizer Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline 1.) Extracts all tweets fr

Rohit Govindan 1 Jan 27, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

Dani El-Ayyass 23 Sep 05, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 05, 2022
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)

Language Models are Few-shot Multilingual Learners Paper This is the source code of the paper [Arxiv] [ACL Anthology]: This code has been written usin

Genta Indra Winata 45 Nov 21, 2022
Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Lau 1 Dec 17, 2021
pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Python binding for Morfologik Morfologik is Polish morphological analyzer. For more information see http://github.com/morfologik/morfologik-stemming/

Damian Mirecki 18 Dec 29, 2021
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode.

WordleSolver An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode. How to use the program Copy this proje

Akil Selvan Rajendra Janarthanan 3 Mar 02, 2022