Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Last update: Jan 07, 2023

Overview

japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. The same code was used to produce japanese-gpt2-medium, which rinna released on the Huggingface model hub.

Please open an issue (in English or Japanese) if you encounter any problem using the code or using our models via Huggingface.

Train a Japanese GPT-2 from scratch on your own machine

Download the Japanese CC-100 training corpus and extract the ja.txt file.
Either move ja.txt or edit src/corpus/jp_cc100/config.py so that self.raw_data_dir in the config file matches the location of ja.txt.
Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files

Train a medium-sized GPT-2 on 4 GPUs by running:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
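
To give a feel for what the splitting step does, here is a minimal sketch of cutting a large text file into smaller ones. It is not the repository's corpus.jp_cc100.split_to_small_files implementation; the function name split_text_file, the lines_per_file chunk size, and the numbered output file names are all assumptions made for illustration.

```python
from pathlib import Path


def split_text_file(src_path, out_dir, lines_per_file=100_000):
    """Split a large text file into chunks of at most lines_per_file lines.

    Hypothetical helper: the chunk size and the "<index>.txt" naming are
    assumptions, not the repository's actual behavior.
    Returns the list of chunk paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    # Stream line by line so the whole corpus is never held in memory.
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # Start a new chunk file every lines_per_file lines.
                if out is not None:
                    out.close()
                chunk_path = out_dir / f"{len(written)}.txt"
                out = open(chunk_path, "w", encoding="utf-8")
                written.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return written
```

Streaming the input keeps memory use flat regardless of corpus size, which matters for a file as large as ja.txt.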

Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to data/model/gpt2-medium-xxx.checkpoint. Run the following command from src/ to complete text on one GPU, sampling with top-k filtering (k=40) combined with nucleus (top-p) filtering (p=0.95):

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
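
The --top_k and --top_p options restrict which tokens may be sampled at each step. The sketch below shows one common way to combine the two filters on a plain list of probabilities. It assumes top-k is applied before top-p, and the function names filter_top_k_top_p and sample_token are hypothetical; the repository's own sampling code may differ.

```python
import random


def filter_top_k_top_p(probs, top_k=40, top_p=0.95):
    """Return {token_id: renormalized probability} after top-k, then top-p.

    Assumption: top-k is applied first, then nucleus (top-p) filtering.
    """
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(probs, top_k=40, top_p=0.95, rng=random):
    """Draw one token id from the filtered distribution."""
    filtered = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]
```

Lower values of either parameter make completions more conservative; higher values admit less likely tokens and produce more varied text.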

Prepare files for uploading to Huggingface

Create a Huggingface account, create a model repo, and clone it to your local machine.
Create model and config files from a checkpoint by running:

python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}

Validate the created files by running:

python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}

Add files, commit, and push to your Huggingface repo.

Customize your training script

Check available arguments by running:

python -m task.pretrain.train --help
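
The training command shown earlier passes booleans as explicit values (--save_model True). As an illustration only, the sketch below shows one way such flags can be declared with argparse. The parser, the defaults, and the str2bool helper are assumptions, not the repository's code; only the flag names --n_gpus, --save_model, and --enable_log come from the command above.

```python
import argparse


def str2bool(value):
    """Convert command-line strings such as "True" or "false" to a bool."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser: defaults and help texts are illustrative only.
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    parser.add_argument("--n_gpus", type=int, default=1, help="number of GPUs")
    parser.add_argument("--save_model", type=str2bool, default=False,
                        help="whether to save checkpoints")
    parser.add_argument("--enable_log", type=str2bool, default=False,
                        help="whether to write training logs")
    return parser


# Parse the same flags used in the training command above.
args = build_parser().parse_args(
    ["--n_gpus", "4", "--save_model", "True", "--enable_log", "True"]
)
```

A converter like str2bool is needed because type=bool would treat any non-empty string, including "False", as True.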

License

The MIT license

Owner

rinna Co., Ltd.
