Utilize Korean BERT model in sentence-transformers library

Overview

ko-sentence-transformers

이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-transformers 에서 활용할 수 있도록 하였습니다. 하지만 설치 과정에 약간의 번거로움이 있었고, 라이브러리 코드를 직접 수정하기 때문에 허깅페이스 허브를 활용하기 어려웠습니다. ko-sentence-transformers 는 간단한 설치만으로 한국어 사전학습 모델을 문장 임베딩에 활용할 수 있도록 합니다.

Installation

pip install 을 통해 설치할 수 있습니다.

pip install ko-sentence-transformers

Examples

사전학습된 KoBERT 모델을 가져와 sentence-transformers API 에서 활용할 수 있습니다. training_nli_v2.py, training_sts.py 파일에서 모델 파인튜닝 예시를 확인할 수 있습니다.

from sentence_transformers import SentenceTransformer, models
from ko_sentence_transformers.models import KoBertTransformer
word_embedding_model = KoBertTransformer("monologg/kobert", max_seq_length=75)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

허깅페이스 허브에 업로드된 모델 역시 간단히 불러와 활용할 수 있습니다.

from sentence_transformers import SentenceTransformer, util
import numpy as np

embedder = SentenceTransformer("jhgan/ko-sbert-sts")

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    #We use np.argpartition, to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
======================


Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.7417)
한 남자가 빵 한 조각을 먹는다. (Score: 0.6684)
한 남자가 말을 탄다. (Score: 0.1089)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0717)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.0244)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.7057)
한 여자가 바이올린을 연주한다. (Score: 0.3154)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2171)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1294)
그 여자가 아이를 돌본다. (Score: 0.0979)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7986)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3255)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.2688)
한 남자가 말을 탄다. (Score: 0.1530)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.0913)

KorSTS Benchmarks

카카오브레인의 KorNLU 데이터셋을 활용하여 sentence-BERT 모델을 학습시킨 후 다국어 모델의 성능과 비교한 결과입니다. ko-sbert-nli 모델은 KorNLI 데이터셋을 활용하여 학습되었고, ko-sbert-sts 모델은 KorSTS 데이터셋을 활용하여 학습되었습니다. ko-sbert-multitask 모델은 두 데이터셋을 모두 활용하여 멀티태스크로 학습되었습니다. 학습 및 성능 평가 과정은 training_*.py, benchmark.py 에서 확인할 수 있습니다. 학습된 모델은 허깅페이스 모델 허브에 공개되어있습니다.

모델 Cosine Pearson Cosine Spearman Manhattan Pearson Manhattan Spearman Euclidean Pearson Euclidean Spearman Dot Pearson Dot Spearman
ko-sbert-multitask 83.78 84.02 81.61 81.72 81.68 81.81 79.16 78.69
ko-sbert-nli 82.03 82.36 80.08 79.91 80.06 79.85 75.76 74.72
ko-sbert-sts 80.79 79.91 78.08 77.35 78.03 77.31 75.96 75.20
paraphrase-multilingual-mpnet-base-v2 80.69 82.00 80.33 80.39 80.48 80.61 70.30 68.48
distiluse-base-multilingual-cased-v1 75.50 74.83 73.05 73.15 73.67 73.86 74.79 73.95
distiluse-base-multilingual-cased-v2 75.62 74.83 73.03 72.87 73.68 73.62 63.80 62.35
paraphrase-multilingual-MiniLM-L12-v2 73.87 74.44 72.55 71.95 72.45 71.85 55.86 55.26

References

  • Ham, J., Choe, Y. J., Park, K., Choi, I., & Soh, H. (2020). Kornli and korsts: New benchmark datasets for korean natural language understanding. arXiv preprint arXiv:2004.03289
  • Reimers, Nils and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” ArXiv abs/1908.10084 (2019)
  • Ko-Sentence-BERT-SKTBERT
  • KoBERT
Owner
Junghyun
Junghyun
Unofficial PyTorch implementation of Google AI's VoiceFilter system

VoiceFilter Note from Seung-won (2020.10.25) Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-sour

MINDs Lab 881 Jan 03, 2023
Backend for the Autocomplete platform. An AI assisted coding platform.

Introduction A custom predictor allows you to deploy your own prediction implementation, useful when the existing serving implementations don't fit yo

Tatenda Christopher Chinyamakobvu 1 Jan 31, 2022
Turkish Stop Words Türkçe Dolgu Sözcükleri

trstop Turkish Stop Words Türkçe Dolgu Sözcükleri In this repository I put Turkish stop words that is contained in the first 10 thousand words with th

Ahmet Aksoy 103 Nov 12, 2022
STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch.

st3 STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch. Currently it supports converting pbmm models to pt scripts with integra

Vlad Ki 8 Oct 18, 2021
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023
Arabic speech recognition, classification and text-to-speech.

klaam Arabic speech recognition, classification and text-to-speech using many advanced models like wave2vec and fastspeech2. This repository allows tr

ARBML 177 Dec 27, 2022
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (S

InstaDeep Ltd 72 Dec 09, 2022
Words-per-minute - A terminal app written in python utilizing the curses module that tests the user's ability to type

words-per-minute A terminal app written in python utilizing the curses module th

Tanim Islam 1 Jan 14, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
A method for cleaning and classifying text using transformers.

NLP Translation and Classification The repository contains a method for classifying and cleaning text using NLP transformers. Overview The input data

Ray Chamidullin 0 Nov 15, 2022
[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022
A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

A2T: Towards Improving Adversarial Training of NLP Models This is the source code for the EMNLP 2021 (Findings) paper "Towards Improving Adversarial T

QData 17 Oct 15, 2022
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
A highly sophisticated sequence-to-sequence model for code generation

CoderX A proof-of-concept AI system by Graham Neubig (June 30, 2021). About CoderX CoderX is a retrieval-based code generation AI system reminiscent o

Graham Neubig 39 Aug 03, 2021
Dust model dichotomous performance analysis

Dust-model-dichotomous-performance-analysis Using a collated dataset of 90,000 dust point source observations from 9 drylands studies from around the

1 Dec 17, 2021
wxPython app for converting encodings, modifying and fixing SRT files

Subtitle Converter Program za obradu srt i txt fajlova. Requirements: Python version 3.8 wxPython version 4.1.0 or newer Libraries: srt, PyDispatcher

4 Nov 25, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021
Continuously update some NLP practice based on different tasks.

NLP_practice We will continuously update some NLP practice based on different tasks. prerequisites Software pytorch = 1.10 torchtext = 0.11.0 sklear

0 Jan 05, 2022
GNES enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content form

GNES is Generic Neural Elastic Search, a cloud-native semantic search system based on deep neural network.

GNES.ai 1.2k Jan 06, 2023
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022