๐Ÿฆ… Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview

Pretrained BigBird Model for Korean

What is BigBird โ€ข How to Use โ€ข Pretraining โ€ข Evaluation Result โ€ข Docs โ€ข Citation

ํ•œ๊ตญ์–ด | English

Apache 2.0 Issues linter DOI

What is BigBird?

BigBird: Transformers for Longer Sequences์—์„œ ์†Œ๊ฐœ๋œ sparse-attention ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋กœ, ์ผ๋ฐ˜์ ์ธ BERT๋ณด๋‹ค ๋” ๊ธด sequence๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿฆ… Longer Sequence - ์ตœ๋Œ€ 512๊ฐœ์˜ token์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” BERT์˜ 8๋ฐฐ์ธ ์ตœ๋Œ€ 4096๊ฐœ์˜ token์„ ๋‹ค๋ฃธ

โฑ๏ธ Computational Efficiency - Full attention์ด ์•„๋‹Œ Sparse Attention์„ ์ด์šฉํ•˜์—ฌ O(n2)์—์„œ O(n)์œผ๋กœ ๊ฐœ์„ 

How to Use

  • ๐Ÿค— Huggingface Hub์— ์—…๋กœ๋“œ๋œ ๋ชจ๋ธ์„ ๊ณง๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:)
  • ์ผ๋ถ€ ์ด์Šˆ๊ฐ€ ํ•ด๊ฒฐ๋œ transformers>=4.11.0 ์‚ฌ์šฉ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. (MRC ์ด์Šˆ ๊ด€๋ จ PR)
  • BigBirdTokenizer ๋Œ€์‹ ์— BertTokenizer ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (AutoTokenizer ์‚ฌ์šฉ์‹œ BertTokenizer๊ฐ€ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.)
  • ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ BigBird Tranformers documentation์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer

Pretraining

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Pretraining BigBird] ์ฐธ๊ณ 

Hardware Max len LR Batch Train Step Warmup Step
KoBigBird-BERT-Base TPU v3-8 4096 1e-4 32 2M 20k
  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
  • ITC (Internal Transformer Construction) ๋ชจ๋ธ๋กœ ํ•™์Šต (ITC vs ETC)

Evaluation Result

1. Short Sequence (<=512)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Short Sequence Dataset] ์ฐธ๊ณ 

NSMC
(acc)
KLUE-NLI
(acc)
KLUE-STS
(pearsonr)
Korquad 1.0
(em/f1)
KLUE MRC
(em/rouge-w)
KoELECTRA-Base-v3 91.13 86.87 93.14 85.66 / 93.94 59.54 / 65.64
KLUE-RoBERTa-Base 91.16 86.30 92.91 85.35 / 94.53 69.56 / 74.64
KoBigBird-BERT-Base 91.18 87.17 92.61 87.08 / 94.71 70.33 / 75.34

2. Long Sequence (>=1024)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Long Sequence Dataset] ์ฐธ๊ณ 

TyDi QA
(em/f1)
Korquad 2.1
(em/f1)
Fake News
(f1)
Modu Sentiment
(f1-macro)
KLUE-RoBERTa-Base 76.80 / 78.58 55.44 / 73.02 95.20 42.61
KoBigBird-BERT-Base 79.13 / 81.30 67.77 / 82.03 98.85 45.42

Docs

Citation

KoBigBird๋ฅผ ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird๋Š” Tensorflow Research Cloud (TFRC) ํ”„๋กœ๊ทธ๋žจ์˜ Cloud TPU ์ง€์›์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ๋ฉ‹์ง„ ๋กœ๊ณ ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  Seyun Ahn๋‹˜๊ป˜ ๊ฐ์‚ฌ๋ฅผ ์ „ํ•ฉ๋‹ˆ๋‹ค.

You might also like...
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Generating Korean Slogans with phonetic and structural repetition
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ
Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ

korean extractive summarization 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ Leaderboard Notice Text Summarization with Pretrained Encoders์— ๋‚˜์˜ค๋Š” bertsumext๋ชจ๋ธ(ext

Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis ์™œ ํ•œ๊ตญ์–ด ๊ฐ์ • ๋‹ค์ค‘๋ถ„๋ฅ˜ ๋ชจ๋ธ์€ ๊ฑฐ์˜ ์—†๋Š” ๊ฒƒ์ผ๊นŒ?์—์„œ ์‹œ์ž‘๋œ ํ”„๋กœ์ ํŠธ Environment: Pytorch, Da

Korean Sentence Embedding Repository

Korean-Sentence-Embedding ๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet ๐Ÿฆ ๐Ÿ‡ฎ๐Ÿ‡ฉ 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

Crie tokens de autenticaรงรฃo รญntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) รฉ uma bilioteca criada para ser utilizada na geraรงรฃo de tokens seguros e รญntegros, ou seja, nรฃ

Comments
  • Pretraining Epoch ์งˆ๋ฌธ

    Pretraining Epoch ์งˆ๋ฌธ

    Checklist

    • [x] I've searched the project's issues

    โ“ Question

    ์•ˆ๋…•ํ•˜์„ธ์š” ์ €๋Š” ํ˜„์žฌ ์นœ๊ตฌ๋“ค๊ณผ ํ•จ๊ป˜ 4096 ํ† ํฐ์„ ์ž…๋ ฅ๋ฐ›์•„ ์š”์•ฝ ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ์—” ๋น…๋ฒ„๋“œ + ๋ฒ„ํŠธ ์กฐํ•ฉ์œผ๋กœ ํ•ด๋ณด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ์ด๋ฏธ monologg ๋‹˜๊ป˜์„œ ๋งŒ๋“ค์–ด์ฃผ์…จ๋”๋ผ๊ตฌ์š” ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ๋กฑํฌ๋จธ + ๋ฐ”ํŠธ + ํŽ˜๊ฐ€์ˆ˜์Šค ์กฐํ•ฉ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. pretrained๋œ KoBart๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์–ดํ…์…˜์„ ๋กฑํฌ๋จธ๋กœ ๋ฐ”๊พผ ํ›„, ํŽ˜๊ฐ€์ˆ˜์Šค task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌ์กฐ๋กœ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

    ํ˜„์žฌ 13GB์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ ์ „์ฒ˜๋ฆฌ์™€ ๋ฐ์ดํ„ฐ๋กœ๋” ์ž‘์„ฑ, ๋ชจ๋ธ ์ฝ”๋“œ๊นŒ์ง€๋Š” ์™„๋ฃŒํ•œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ์ฃผ ๋‚ด๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

    ์ €ํฌ๊ฐ€ ๊ฐ€์ง„ GPU๋กœ๋Š” ๋Œ€๋žต ์ดํ‹€์ด๋ฉด 1 ์—ํฌํฌ๋ฅผ ๋Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์€๋ฐ, monologg๋‹˜๊ป˜์„œ๋Š” KoBirBird ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋„์…จ๋Š”์ง€ ์—ฌ์ญค๋ณด๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.

    ์•„๋ฌด๋ž˜๋„ pretrained ๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ๋‹ค ์“ฐ๋‹ค๋ณด๋‹ˆ ์—ํฌํฌ๋ฅผ ๋งŽ์ด ๋Œ ํ•„์š”๋Š” ์—†์„ ๊ฒƒ ๊ฐ™์€๋ฐ, ๊ธฐ์ค€์ ์œผ๋กœ ์‚ผ๊ณ  ์‹ถ์–ด์„œ์š”!

    ๋ง์ด ๊ธธ์–ด์กŒ๋Š”๋ฐ ์š”์•ฝํ•˜์ž๋ฉด, KoBirBird ํ•™์Šต ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฃผ์…จ๋Š”์ง€ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ทธ ๊ธฐ์ค€์€ ๋ฌด์—‡์œผ๋กœ ์‚ผ์œผ์…จ๋Š”์ง€๋„ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค.

    question 
    opened by KimJaehee0725 2
  • Specific information about this model.

    Specific information about this model.

    Checklist

    • [ x ] I've searched the project's issues

    โ“ Question

    • You mentioned "๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต" and I want to know the size of total corpus for pre-training.

    • Also I want to know the vocab size of this model.

    ๐Ÿ“Ž Additional context

    question 
    opened by midannii 2
  • Fix some minors

    Fix some minors

    Description

    ์ฝ”๋“œ์™€ ์ฃผ์„ ๋“ฑ์„ ์ฝ๋‹ค๊ฐ€ ๋ณด์ธ ์ž‘์€ ์˜คํƒ€ ๋“ฑ์„ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค

    ๋‹ค์–‘ํ•œ ๋…ธํ•˜์šฐ๋ฅผ ์•„๋‚Œ์—†์ด ๊ณต์œ ํ•ด์ฃผ์‹  @monologg , @donggyukimc ์—๊ฒŒ ๊ฐ์‚ฌ์˜ ๋ง์”€๋“œ๋ฆฝ๋‹ˆ๋‹ค.

    ์ดํ›„์—๋Š” GPU ํ™˜๊ฒฝ์—์„œ finetuning์„ ํ…Œ์ŠคํŠธํ•ด ๋ณผ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค.

    Related Issue

    chore 
    opened by sackoh 0
Releases(v1.0.0)
Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker Earlier this year we announced a strategic collaboration with Amazon to make it ea

Philipp Schmid 161 Dec 16, 2022
ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

Description: ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes a total of 39

GOKHAN OZSARI 5 Dec 16, 2022
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologรญas del Lenguaje" (Plan-TL).

Spanish Language Models ๐Ÿ’ƒ๐Ÿป A repository part of the MarIA project. Corpora ๐Ÿ“ƒ Corpora Number of documents Number of tokens Size (GB) BNE 201,080,084

Plan de Tecnologรญas del Lenguaje - Gobierno de Espaรฑa 203 Dec 20, 2022
An Explainable Leaderboard for NLP

ExplainaBoard: An Explainable Leaderboard for NLP Introduction | Website | Download | Backend | Paper | Video | Bib Introduction ExplainaBoard is an i

NeuLab 319 Dec 20, 2022
NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

This file contains the following documents sumbited for Baruch CIS9665 group 9 fall 2021. 1. Dataset: drug_reviews.csv 2. python codes for text classi

Aarif Munwar Jahan 2 Jan 04, 2023
Collection of scripts to pinpoint obfuscated code

Obfuscation Detection (v1.0) Author: Tim Blazytko Automatically detect control-flow flattening and other state machines Description: Scripts and binar

Tim Blazytko 230 Nov 26, 2022
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
Tools for curating biomedical training data for large-scale language modeling

Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop 242 Dec 25, 2022
Milaan Parmar / ะœะธะปะฐะฝ ะฟะฐั€ะผะฐั€ / _็ฑณๅ…ฐ ๅธ•ๅฐ”้ฉฌ 170 Dec 13, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
nlpๅŸบ็ก€ไปปๅŠก

NLP็ฎ—ๆณ• ่ฏดๆ˜Ž ๆญค็ฎ—ๆณ•ไป“ๅบ“ๅŒ…ๆ‹ฌๆ–‡ๆœฌๅˆ†็ฑปใ€ๅบๅˆ—ๆ ‡ๆณจใ€ๅ…ณ็ณปๆŠฝๅ–ใ€ๆ–‡ๆœฌๅŒน้…ใ€ๆ–‡ๆœฌ็›ธไผผๅบฆๅŒน้…่ฟ™ไบ”ไธชไธปๆตNLPไปปๅŠก๏ผŒๆถ‰ๅŠๅˆฐ22ไธช็›ธๅ…ณ็š„ๆจกๅž‹็ฎ—ๆณ•ใ€‚ ๆก†ๆžถ็ป“ๆž„ ๆ–‡ไปถ็ป“ๆž„ all_models โ”œโ”€โ”€ Base_line โ”‚ย ย  โ”œโ”€โ”€ __init__.py โ”‚ย ย  โ”œโ”€โ”€ base_data_process.

zuxinqi 23 Sep 22, 2022
Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Twitter-Sentiment-Analysis The hands-on project is in Python 3 Programming class offered by University of Michigan via Coursera. The task is to build

Eszter Pai 1 Jan 03, 2022
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
๐Ÿฆ† Contextually-keyed word vectors

sense2vec: Contextually-keyed word vectors sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detaile

Explosion 1.5k Dec 25, 2022