SAS: Self-Augmentation Strategy for Language Model Pre-training

This repository contains the official PyTorch implementation of the paper "SAS: Self-Augmentation Strategy for Language Model Pre-training", based on Hugging Face transformers version 4.3.0.

Only the SAS variant without the disentangled attention mechanism is released for now. The rest is to be updated.

File structure

train.py: The pre-training script.
run_glue.py: The fine-tuning script.
models
- modeling_sas.py: The main SAS algorithm.
- trainer_sas.py: Inherited from the Hugging Face transformers trainer and modified mainly for data processing.
utils: All utilities.
- data_collator_sas.py: The details of the self-augmentations.
The rest of the code is supporting code.

How to

Download and Install

Clone this repository.
Download the wiki-corpus dataset and store it in the data folder. Currently, we only provide trial data with 1 million sentences. The full dataset can be pre-processed in the same way as for BERT. Details are to be released.

(Optional) Create a conda environment from the provided environment.yml.
- You can also install the packages manually:
  - Python==3.9, pytorch==1.10.0, transformers==4.3.0, etc.

    # Clone package
    git clone git@github.com:fei960922/SAS-Self-Augmentation-Strategy.git
    cd SAS-Self-Augmentation-Strategy

    # Establish the environment.
    conda env create -f environment.yml 
    conda activate cssl

    # Download dataset and checkpoint
    wget http://www.stat.ucla.edu/~yifeixu/sas/wiki_corpus_1M.npy
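
After a manual install, it can help to confirm that the pinned versions are the ones actually in the environment. The following is a minimal sketch, not part of this repository: the helper name find_mismatches and the PINNED table are made up for illustration, and the pins are copied from the package list above. It relies on pytorch being distributed under the name "torch".

```python
from importlib import metadata

# Pins copied from the install notes above; "torch" is the distribution name of pytorch.
PINNED = {"torch": "1.10.0", "transformers": "4.3.0"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates from its pin."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        # A local build tag such as "1.10.0+cu113" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run inside the activated environment, the script prints nothing when both pins are satisfied.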

Train from scratch

    # Run default setting 
    bash script/pretrain.sh

    # Run custom setting
    python train.py

    # Start from a checkpoint
    python train.py --start_from_checkpoint 1 --pretrain_path {PATH_TO_CHECKPOINT}
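
To resume from the most recent checkpoint without typing the path by hand, the path can be looked up first. This is a rough sketch under one assumption that this README does not confirm: that checkpoints are saved as checkpoint-<step> sub-directories, which is the naming the stock Hugging Face Trainer uses. The helpers latest_checkpoint and resume_command are hypothetical and not part of this repository.

```python
import re
from pathlib import Path

# Assumed naming: "checkpoint-<global_step>", as written by the stock Hugging Face Trainer.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint sub-directory with the highest step, or None if there is none."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


def resume_command(output_dir):
    """Build the resume command from this section for the newest checkpoint."""
    checkpoint = latest_checkpoint(output_dir)
    if checkpoint is None:
        raise FileNotFoundError(f"no checkpoint-* directory in {output_dir}")
    return ["python", "train.py", "--start_from_checkpoint", "1",
            "--pretrain_path", str(checkpoint)]
```

The returned list can be passed to subprocess.run, or joined with spaces and pasted into a shell.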

Calculate GLUE scores

    # Running this script downloads the GLUE dataset automatically.
    bash finetune.sh MNLI 0 sas-base output_dir 5e-5 32 4 42
    bash finetune.sh MNLI 0 sas-small output_dir 1e-4 32 4 42

Owner

Alibaba
