KoBART model on huggingface transformers

Overview

KoBART-Transformers

  • SKT에서 공개한 KoBART를 편리하게 사용할 수 있게 transformers로 포팅하였습니다.

Install (Optional)

  • BartModelPreTrainedTokenizerFast를 이용하면 설치하실 필요 없습니다.
pip install kobart-transformers

Tokenizer

  • PreTrainedTokenizerFast를 이용하여 구현되었습니다.
  • PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']

Model

  • BartModel을 이용하여 구현되었습니다.
  • BartModel.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<TransposeBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<TransposeBackward0>), encoder_hidden_states=None, encoder_attentions=None)

For Seq2Seq Training

  • seq2seq 학습시에는 아래와 같이 get_kobart_for_conditional_generation()을 이용합니다.
  • BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")

Updates Notes

version 0.1

  • pad 토큰이 설정되지 않은 에러를 해결하였습니다.
from kobart import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["한국어", "BART 모델을", "소개합니다."], truncation=True, padding=True)
{
'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]], 
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 
'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}

version 0.1.3

  • get_kobart_for_conditional_generation()__init__.py에 등록하였습니다.

version 0.1.4

  • 누락되었던 special_tokens_map.json을 추가하였습니다.
  • 이제 pip install 없이 KoBART를 이용할 수 있습니다.
  • thanks to bernardscumm

Reference

Owner
Hyunwoong Ko
be a creator not a programmer.
Hyunwoong Ko
Python wrapper for Stanford CoreNLP tools v3.4.1

Python interface to Stanford Core NLP tools v3.4.1 This is a Python wrapper for Stanford University's NLP group's Java-based CoreNLP tools. It can eit

Dustin Smith 610 Sep 07, 2022
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python) 日本語は以下に続きます (Japanese follows) English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
KoBERT - Korean BERT pre-trained cased (KoBERT)

KoBERT KoBERT Korean BERT pre-trained cased (KoBERT) Why'?' Training Environment Requirements How to install How to use Using with PyTorch Using with

SK T-Brain 1k Jan 02, 2023
JaQuAD: Japanese Question Answering Dataset

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)

SkelterLabs 84 Dec 27, 2022
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca A spaCy wrapper of OpenTapioca for named entity linking on Wikidata. Table of contents Installation How to use Local OpenTapioca Vizu

Universitätsbibliothek Mannheim 80 Jan 03, 2023
Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with build-in vectorization a

SeMI Technologies 191 Dec 26, 2022
Random Directed Acyclic Graph Generator

DAG_Generator Random Directed Acyclic Graph Generator verison1.0 简介 工作流通常由DAG(有向无环图)来定义,其中每个计算任务$T_i$由一个顶点(node,task,vertex)表示。同时,任务之间的每个数据或控制依赖性由一条加权

Livion 17 Dec 27, 2022
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
GSoC'2021 | TensorFlow implementation of Wav2Vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Vasudev Gupta 73 Nov 28, 2022
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
The (extremely) naive sentiment classification function based on NBSVM trained on wisesight_sentiment

thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment วิธีติดตั้ง pip install thai_sentiment==0.1.3

Charin 7 Dec 08, 2022
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022
Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

Sidharth Pal 20 Dec 14, 2022