Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

Related tags

Text Data & NLPtta
Overview

T-TA (Transformer-based Text Auto-encoder)

This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning) using TensorFlow 2.

How to train T-TA using custom dataset

  1. Prepare datasets. You need text line files.

    Example:

    Sentence 1.
    Sentence 2.
    Sentence 3.
    
  2. Train the sentencepiece tokenizer. You can use the train_sentencepiece.py or train sentencepiece model by yourself.

  3. Train T-TA model. Run train.py with customizable arguments. Here's the usage.

    $ python train.py --help
    usage: train.py [-h] [--train-data TRAIN_DATA] [--dev-data DEV_DATA] [--model-config MODEL_CONFIG] [--batch-size BATCH_SIZE] [--spm-model SPM_MODEL]
                    [--learning-rate LEARNING_RATE] [--target-epoch TARGET_EPOCH] [--steps-per-epoch STEPS_PER_EPOCH] [--warmup-ratio WARMUP_RATIO]
    
    optional arguments:
        -h, --help            show this help message and exit
        --train-data TRAIN_DATA
        --dev-data DEV_DATA
        --model-config MODEL_CONFIG
        --batch-size BATCH_SIZE
        --spm-model SPM_MODEL
        --learning-rate LEARNING_RATE
        --target-epoch TARGET_EPOCH
        --steps-per-epoch STEPS_PER_EPOCH
        --warmup-ratio WARMUP_RATIO

    I want to train models until the designated steps, so I added the steps_per_epoch and target_epoch arguments. The total steps will be the steps_per_epoch * target_epoch.

  4. (Optional) Test your model using KorSTS data. I trained my model with the Korean corpus, so I tested it using KorSTS data. You can evaluate KorSTS score (Spearman correlation) using evaluate_unsupervised_korsts.py. Here's the usage.

    $ python evaluate_unsupervised_korsts.py --help
    usage: evaluate_unsupervised_korsts.py [-h] --model-weight MODEL_WEIGHT --dataset DATASET
    
    optional arguments:
        -h, --help            show this help message and exit
        --model-weight MODEL_WEIGHT
        --dataset DATASET
    $ # To evaluate on dev set
    $ # python evaluate_unsupervised_korsts.py --model-weight ./path/to/checkpoint --dataset ./path/to/dataset/sts-dev.tsv

Training details

  • Training data: lovit/namuwikitext
  • Peak learning rate: 1e-4
  • learning rate scheduler: Linear Warmup and Linear Decay.
  • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
  • Vocab size: 15000
  • num layers: 3
  • intermediate size: 2048
  • hidden size: 512
  • attention heads: 8
  • activation function: gelu
  • max sequence length: 128
  • tokenizer: sentencepiece
  • Total steps: 1M
  • Final validation accuracy of auto encoding task (ignores padding): 0.5513
  • Final validation loss: 2.1691

Unsupervised KorSTS

Model Params development test
My Implementation 17M 65.98 56.75
- - - -
Korean SRoBERTa (base) 111M 63.34 48.96
Korean SRoBERTa (large) 338M 60.15 51.35
SXLM-R (base) 270M 64.27 45.05
SXLM-R (large) 550M 55.00 39.92
Korean fastText - - 47.96

KorSTS development and test set scores (100 * Spearman Correlation). You can check the details of other models on this paper (KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding).

How to use pre-trained weight using tensorflow-hub

>>> import tensorflow as tf
>>> import tensorflow_text as text
>>> import tensorflow_hub as hub
>>> # load model
>>> model = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/model.tar.gz")
>>> preprocess = hub.KerasLayer("https://github.com/jeongukjae/tta/releases/download/0/preprocess.tar.gz")
>>> # inference
>>> input_tensor = preprocess(["이 모델은 나무위키로 학습되었습니다.", "근데 이 모델 어디다가 쓸 수 있을까요?", "나는 고양이를 좋아해!", "나는 강아지를 좋아해!"])
>>> representation = model(input_tensor)
>>> representation = tf.reduce_sum(representation * tf.cast(input_tensor["input_mask"], representation.dtype)[:, :, tf.newaxis], axis=1)
>>> representation = tf.nn.l2_normalize(representation, axis=-1)
>>> similarities = tf.tensordot(representation, representation, axes=[[1], [1]])
>>> # results
>>> similarities
<tf.Tensor: shape=(4, 4), dtype=float32, numpy=
array([[0.9999999 , 0.76468784, 0.7384633 , 0.7181306 ],
       [0.76468784, 1.        , 0.81387675, 0.79722893],
       [0.7384633 , 0.81387675, 0.9999999 , 0.96217746],
       [0.7181306 , 0.79722893, 0.96217746, 1.        ]], dtype=float32)>

References


짧은 영어를 뒤로 하고, 대부분의 독자분이실 한국분들을 위해 적어보자면, 단순히 "회사에서 구상중인 모델 구조가 좋을까?"를 테스트해보기 위해 개인적으로 학습해본 모델입니다. 어느정도로 잘 나오는지 궁금해서 작성한 코드이기 때문에 하이퍼 파라미터 튜닝이라던가, 데이터셋을 신중히 골랐다던가 하는 것은 없었습니다. 단지 학습해보다보니 생각보다 값이 잘 나와서 결과와 함께 공개하게 되었습니다. 커밋 로그를 보시면 짐작하실 수 있겠지만, 하루 정도에 후다닥 짜서 작은 GPU로 약 50시간 가량 돌린 모델입니다.

원 논문에 나온 값들을 최대한 따라가려 했으며, 밤에 작성했던 코드라 조금 명확하지 않은 부분이 있을 수도 있고, 원 구현과 다를 수도 있습니다. 해당 부분은 이슈로 달아주신다면 다시 확인해보겠습니다.

트러블 슈팅에 도움을 주신 백영민님(@baekyeongmin)께 감사드립니다.

You might also like...
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Making text a first-class citizen in TensorFlow.
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in Tensorflow IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow.  This is part of the CASL project: http://casl-project.ai/
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

Releases(0)
  • 0(Feb 6, 2021)

    • Training data: lovit/namuwikitext
    • Peak learning rate: 1e-4
    • learning rate scheduler: Linear Warmup and Linear Decay.
    • Warmup ratio: 0.05 (warmup steps: 1M * 0.05 = 50k)
    • Vocab size: 15000
    • num layers: 3
    • intermediate size: 2048
    • hidden size: 512
    • attention heads: 8
    • activation function: gelu
    • max sequence length: 128
    • tokenizer: sentencepiece
    • Total steps: 1M
    • Final validation accuracy of auto encoding task (ignores padding): 0.5513
    • Final validation loss: 2.1691
    Source code(tar.gz)
    Source code(zip)
    model.tar.gz(60.93 MB)
    preprocess.tar.gz(507.45 KB)
Owner
Jeong Ukjae
Machine Learning Engineer
Jeong Ukjae
A simple version of DeTR

DeTR-Lite A simple version of DeTR Before you enjoy this DeTR-Lite The purpose of this project is to allow you to learn the basic knowledge of DeTR. P

Jianhua Yang 11 Jun 13, 2022
Programme de chiffrement et de déchiffrement inverse d'un message en python3.

Chiffrement Inverse En Python3 Programme de chiffrement et de déchiffrement inverse d'un message en python3. Explication du chiffrement inverse avec c

Malik Makkes 2 Mar 26, 2022
Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

Sentiment Analysis Project This project contains two sentiment analysis programs for Hotel Reviews using a Hotel Reviews dataset from Datafiniti. The

Simran Farrukh 0 Mar 28, 2022
KoBERT - Korean BERT pre-trained cased (KoBERT)

KoBERT KoBERT Korean BERT pre-trained cased (KoBERT) Why'?' Training Environment Requirements How to install How to use Using with PyTorch Using with

SK T-Brain 1k Jan 02, 2023
A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex)

CodeJ A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex) Install requirements pip install -r

TheProtagonist 1 Dec 06, 2021
Need: Image Search With Python

Need: Image Search The problem is that a user needs to search for a specific ima

Surya Komandooru 1 Dec 30, 2021
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Graph4nlp is the library for the easy use of Graph Neural Networks for NLP

Graph4NLP Graph4NLP is an easy-to-use library for R&D at the intersection of Deep Learning on Graphs and Natural Language Processing (i.e., DLG4NLP).

Graph4AI 1.5k Dec 23, 2022
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)

Language Models are Few-shot Multilingual Learners Paper This is the source code of the paper [Arxiv] [ACL Anthology]: This code has been written usin

Genta Indra Winata 45 Nov 21, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Rebiber: A tool for normalizing bibtex with official info. We often cite papers using their arXiv versions without noting that they are already PUBLIS

(Bill) Yuchen Lin 2k Jan 01, 2023
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

deepset 1.6k Dec 27, 2022
📔️ Generate a text-based journal from a template file.

JGen 📔️ Generate a text-based journal from a template file. Contents Getting Started Example Overview Usage Details Reserved Keywords Gotchas Getting

Harrison Broadbent 21 Sep 25, 2022
State of the art faster Natural Language Processing in Tensorflow 2.0 .

tf-transformers: faster and easier state-of-the-art NLP in TensorFlow 2.0 ****************************************************************************

74 Dec 05, 2022
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 01, 2023
My implementation of Safaricom Machine Learning Codility test. The code has bugs, logical I guess I made errors and any correction will be appreciated.

Safaricom_Codility Machine Learning 2022 The test entails two questions. Question 1 was on Machine Learning. Question 2 was on SQL I ran out of time.

Lawrence M. 1 Mar 03, 2022
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
Question and answer retrieval in Turkish with BERT

trfaq Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉 What is this? At this repo, I'm

M. Yusuf Sarıgöz 13 Oct 10, 2022
Official PyTorch implementation of Time-aware Large Kernel (TaLK) Convolutions (ICML 2020)

Time-aware Large Kernel (TaLK) Convolutions (Lioutas et al., 2020) This repository contains the source code, pre-trained models, as well as instructio

Vasileios Lioutas 28 Dec 07, 2022