🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Last update: Dec 14, 2022

Overview

Pretrained BigBird Model for Korean

What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation


What is BigBird?

BigBird is a sparse-attention-based model introduced in BigBird: Transformers for Longer Sequences. It can handle longer sequences than a standard BERT.

🦅 Longer Sequence - Handles up to 4096 tokens, 8 times the 512-token maximum of BERT

⏱️ Computational Efficiency - Uses sparse attention instead of full attention, reducing the attention cost from O(n²) to O(n)
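
To make the O(n²) versus O(n) comparison concrete, the sketch below counts query-key pairs under a simplified block-sparse pattern. The block size of 64 and the 3 sliding-window, 2 global, and 3 random blocks per query block are assumptions chosen for illustration; they are not stated in this README. The real BigBird pattern also treats global tokens specially, so read the numbers as rough orders of magnitude only.

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n * n query-key pairs."""
    return n * n


def sparse_attention_pairs(
    n: int,
    block_size: int = 64,    # assumed block size
    window_blocks: int = 3,  # assumed sliding-window blocks per query block
    global_blocks: int = 2,  # assumed global blocks per query block
    random_blocks: int = 3,  # assumed random blocks per query block
) -> int:
    """Each token attends to a fixed number of key blocks, so cost grows linearly in n."""
    keys_per_query = (window_blocks + global_blocks + random_blocks) * block_size
    # A query can never see more keys than the sequence contains.
    return n * min(keys_per_query, n)


for n in (512, 1024, 4096):
    full = full_attention_pairs(n)
    sparse = sparse_attention_pairs(n)
    print(f"n={n}: full={full}, sparse={sparse}, ratio={full / sparse:.1f}")
```

Under these assumptions each query sees at most 512 keys, so the two costs coincide at 512 tokens and the sparse pattern needs 8 times fewer pairs at 4096 tokens.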

How to Use

You can use the model uploaded to the 🤗 Huggingface Hub right away :)
We recommend transformers>=4.11.0, in which some issues have been fixed. (PR related to the MRC issue)
Use BertTokenizer instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer.)
For detailed usage, see the BigBird Transformers documentation.

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
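
As a follow-up to the loading snippet above, here is a minimal sketch of encoding one text and running a forward pass. The helper name `embed` and the constant `MAX_LEN` are hypothetical names introduced for illustration. The sketch assumes `torch` and `transformers` are installed and that the model can be downloaded from the Hub.

```python
MODEL_NAME = "monologg/kobigbird-bert-base"
MAX_LEN = 4096  # maximum sequence length stated for this model


def embed(text: str):
    """Return the last hidden states for `text` (hypothetical helper).

    Imports live inside the function so that defining the helper does not
    require the libraries; the model is downloaded on first use.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # loads BertTokenizer
    model = AutoModel.from_pretrained(MODEL_NAME)          # loads BigBirdModel
    model.eval()

    # Truncate anything longer than the model's maximum length.
    inputs = tokenizer(text, truncation=True, max_length=MAX_LEN, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state
```

Calling `embed("안녕하세요")` returns a tensor of shape (1, sequence_length, hidden_size). For repeated calls you would normally load the tokenizer and model once and reuse them rather than reloading inside the function.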

Pretraining

See [Pretraining BigBird] for details.

	Hardware	Max len	LR	Batch	Train Step	Warmup Step
KoBigBird-BERT-Base	TPU v3-8	4096	1e-4	32	2M	20k

Trained on a variety of data, including 모두의 말뭉치 (Modu Corpus), Korean Wikipedia, Common Crawl, and news data
Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC)
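
The hyperparameters in the table above translate into a few useful back-of-the-envelope numbers, sketched below. Two things here are assumptions rather than facts from this README: the token count is an upper bound that treats every sequence as filled to 4096 tokens, and the warmup is taken to be linear. The schedule after warmup is not stated, so the hypothetical `warmup_lr` helper simply holds the peak value as a placeholder.

```python
MAX_LEN = 4096           # Max len
BATCH_SIZE = 32          # Batch
TRAIN_STEPS = 2_000_000  # Train Step (2M)
WARMUP_STEPS = 20_000    # Warmup Step (20k)
PEAK_LR = 1e-4           # LR


def tokens_per_step() -> int:
    """Upper bound on tokens per step, assuming full-length sequences."""
    return MAX_LEN * BATCH_SIZE


def warmup_lr(step: int) -> float:
    """Linear warmup to PEAK_LR (assumed shape); held constant afterwards as a placeholder."""
    return PEAK_LR * min(step / WARMUP_STEPS, 1.0)


print("tokens per step (upper bound):", tokens_per_step())
print("tokens over training (upper bound):", tokens_per_step() * TRAIN_STEPS)
print("warmup fraction of training:", WARMUP_STEPS / TRAIN_STEPS)
print("lr halfway through warmup:", warmup_lr(WARMUP_STEPS // 2))
```

So each step covers at most 131,072 tokens, and warmup spans 1% of the training steps.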

Evaluation Result

1. Short Sequence (<=512)

See [Finetune on Short Sequence Dataset] for details.

	NSMC (acc)	KLUE-NLI (acc)	KLUE-STS (pearsonr)	Korquad 1.0 (em/f1)	KLUE MRC (em/rouge-w)
KoELECTRA-Base-v3	91.13	86.87	93.14	85.66 / 93.94	59.54 / 65.64
KLUE-RoBERTa-Base	91.16	86.30	92.91	85.35 / 94.53	69.56 / 74.64
KoBigBird-BERT-Base	91.18	87.17	92.61	87.08 / 94.71	70.33 / 75.34

2. Long Sequence (>=1024)

See [Finetune on Long Sequence Dataset] for details.

	TyDi QA (em/f1)	Korquad 2.1 (em/f1)	Fake News (f1)	Modu Sentiment (f1-macro)
KLUE-RoBERTa-Base	76.80 / 78.58	55.44 / 73.02	95.20	42.61
KoBigBird-BERT-Base	79.13 / 81.30	67.77 / 82.03	98.85	45.42
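
To read the long-sequence table at a glance, the short sketch below computes the per-metric gain of KoBigBird-BERT-Base over KLUE-RoBERTa-Base. The scores are copied from the table above; the dictionary layout and the `score_gains` helper are hypothetical names used only for illustration.

```python
# (KLUE-RoBERTa-Base, KoBigBird-BERT-Base) scores copied from the long-sequence table.
LONG_SEQUENCE_SCORES = {
    "TyDi QA em": (76.80, 79.13),
    "TyDi QA f1": (78.58, 81.30),
    "Korquad 2.1 em": (55.44, 67.77),
    "Korquad 2.1 f1": (73.02, 82.03),
    "Fake News f1": (95.20, 98.85),
    "Modu Sentiment f1-macro": (42.61, 45.42),
}


def score_gains(scores: dict) -> dict:
    """Return KoBigBird minus KLUE-RoBERTa for each metric, rounded to 2 decimals."""
    return {
        task: round(kobigbird - roberta, 2)
        for task, (roberta, kobigbird) in scores.items()
    }


gains = score_gains(LONG_SEQUENCE_SCORES)
for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The largest gain is on Korquad 2.1 exact match, at 12.33 points.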

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

We also thank Seyun Ahn for providing the wonderful logo.

Comments

Pretraining Epoch 질문
Checklist

[x] I've searched the project's issues

❓ Question

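Hello, my friends and I are currently building a model that takes 4096 tokens as input and performs a summarization task. At first we planned to try a BigBird + BERT combination, but monologg had already built it, haha. So we are going to train a Longformer + BART + PEGASUS combination instead. The setup starts from the pretrained KoBart, swaps its attention for Longformer attention, and then trains on the PEGASUS task.

So far we have collected 13GB of data and finished the preprocessing, the data loader, and the model code. We plan to start training within this week.

With our GPUs, one epoch should take roughly two days. I would like to ask how many epochs you ran when developing the KoBigBird model.

Since we are starting from a pretrained model, we probably do not need many epochs, but I would like to use your number as a reference point!

That got long, so to summarize: I am curious how many epochs you used when training KoBigBird, and what criterion you used to decide on that number.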
question
opened by KimJaehee0725 2
Specific information about this model.
Checklist

[ x ] I've searched the project's issues

❓ Question

You mentioned that the model was "trained on a variety of data, including 모두의 말뭉치, Korean Wikipedia, Common Crawl, and news data," and I want to know the total size of the pre-training corpus.

I also want to know the vocab size of this model.

📎 Additional context
question
opened by midannii 2
Fix some minors

Description

Fixed small typos and similar issues that I noticed while reading the code and comments.

My thanks to @monologg and @donggyukimc for generously sharing their know-how.

Next, I plan to test finetuning in a GPU environment. Thank you.

Related Issue
chore

opened by sackoh 0

Releases(v1.0.0)

v1.0.0(Nov 8, 2021)

Initial release for KoBigBird - Pretrained BigBird Model for Korean
Source code(tar.gz)
Source code(zip)

Owner

Jangwon Park

GitHub Repository https://huggingface.co/monologg/kobigbird-bert-base
