Korean Sentence Embedding Repository

Overview

Korean-Sentence-Embedding

🍭 Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and it also provides an environment for training your own models.

Baseline Models

Baseline models used for Korean sentence embedding: the KLUE pre-trained language models (PLMs).

| Model | Embedding size | Hidden size | # Layers | # Heads |
|-------|----------------|-------------|----------|---------|
| KLUE-BERT-base | 768 | 768 | 12 | 12 |
| KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |

NOTE: All the pretrained baseline models are available on the Hugging Face Model Hub; see https://huggingface.co/klue.
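
For reference, a baseline can also be loaded straight from the Hub with the transformers library. A minimal sketch (not one of this repository's scripts; klue/bert-base is the public Hub model ID):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")

# Encode one sentence and inspect the token-level hidden states.
inputs = tokenizer("한 남자가 음식을 먹는다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)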

How to start

  • Get the datasets for training and testing.
bash get_model_dataset.sh
  • To run inference right away, download the pre-trained checkpoints and then start the downstream tasks.
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021]

KoSentenceBERT

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv (first phase: NLI training), sts-train.tsv (second phase: continued training on STS); see the training sketch after this list
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
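
This two-phase recipe follows the original Sentence-BERT setup: first train on NLI with a softmax classification loss over sentence-pair embeddings, then continue training on STS with a cosine-similarity regression loss. A minimal sketch with the sentence-transformers library (illustrative only; the examples, labels, and hyperparameters are placeholders, not the repository's exact training script):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build an SBERT model from a KLUE encoder plus mean pooling.
word_emb = models.Transformer("klue/bert-base", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Phase 1: NLI classification (labels: entailment / neutral / contradiction).
nli_examples = [InputExample(texts=["한 남자가 음식을 먹는다.", "한 남자가 빵 한 조각을 먹는다."], label=1)]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1)

# Phase 2: continued training on STS with cosine-similarity regression (gold scores scaled to [0, 1]).
sts_examples = [InputExample(texts=["한 남자가 말을 탄다.", "한 남자가 백마를 타고 있다."], label=0.8)]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4)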

KoSimCSE

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
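
KoSimCSE uses SimCSE's in-batch contrastive (InfoNCE) objective over the NLI pairs above: each anchor embedding should be closest to its own positive among all sentences in the batch under a temperature-scaled cosine similarity. A minimal sketch of that loss in PyTorch (temperature and shapes are illustrative, not the repository's exact implementation):

import torch
import torch.nn.functional as F

def simcse_loss(anchor_emb, positive_emb, temperature=0.05):
    # anchor_emb, positive_emb: (batch_size, hidden_size) sentence embeddings.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine similarity of every anchor against every positive in the batch.
    sim = anchor @ positive.T / temperature  # (batch_size, batch_size)
    # Diagonal entries are the true pairs; all other entries act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)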

Performance

  • Semantic Textual Similarity test set results
| Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|-------|----------------|-----------------|-------------------|--------------------|-------------------|--------------------|-------------|--------------|
| KoSBERT†-SKT | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT-base | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa-base | 80.70 | 81.03 | 80.97 | 81.06 | 80.84 | 80.97 | 79.20 | 78.93 |
| KoSimCSE-BERT†-SKT | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT-base | 82.73 | 83.51 | 82.32 | 82.78 | 82.43 | 82.88 | 77.86 | 76.70 |
| KoSimCSE-RoBERTa-base | 83.64 | 84.05 | 83.32 | 83.84 | 83.33 | 83.79 | 80.92 | 79.84 |
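
The scores above are Pearson and Spearman correlations (×100) between the model's similarity scores and the gold STS labels, reported for cosine, Euclidean, Manhattan, and dot-product similarity. A minimal sketch of the cosine case (illustrative sentence pairs and gold scores, not the repository's evaluation script):

from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('../Checkpoint/KoSBERT/kosbert-klue-bert-base')

sents1 = ['한 남자가 음식을 먹는다.', '한 여자가 바이올린을 연주한다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']
sents2 = ['한 남자가 빵 한 조각을 먹는다.', '한 남자가 말을 탄다.', '치타 한 마리가 먹이 뒤에서 달리고 있다.']
gold = [4.2, 0.5, 4.0]  # hypothetical gold scores on the 0-5 STS scale

emb1 = model.encode(sents1, convert_to_tensor=True)
emb2 = model.encode(sents2, convert_to_tensor=True)
cos = util.pytorch_cos_sim(emb1, emb2).diagonal().cpu().numpy()

print('Pearson:  %.4f' % pearsonr(cos, gold)[0])
print('Spearman: %.4f' % spearmanr(cos, gold)[0])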

Downstream Tasks

  • KoSBERT: Semantic Search, Clustering
python SemanticSearch.py
python Clustering.py
  • KoSimCSE: Semantic Search
python SemanticSearch.py

Semantic Search (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
  • Results are as follows:

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.0486)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)

Clustering (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
  • Results are as follows:
Cluster  1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster  3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 속으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']

References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}