Source code for the ACL-IJCNLP 2021 paper entitled "T-DNA: Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation" by Shizhe Diao et al.

Related tags

Deep LearningT-DNA
Overview

T-DNA

Source code for the ACL-IJCNLP 2021 paper entitled Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation.

Our implementation is built on the source code from huggingface transformers.

Model

We aim to adapt a generic pretrained model with a relatively small amount of domain-specific data. We demonstrate that by explicitly incorporating the multi-granularity information of unseen and domain-specific words via the adaptation of (word based) n-grams, the performance of a generic pretrained model can be greatly improved. Specifically, we introduce a Transformer-based Domain-aware N-gram Adaptor, T-DNA, to effectively learn and incorporate the semantic representation of different combinations of words in the new domain. T-DNA is able to achieve significant improvements compared to existing methods on most tasks using limited data with lower computational costs.

The overall architechture of T-DNA is shown in the figure below. image info

Requirements

Our code works with the following environment.

  • python=3.7.9
  • pytorch=1.4.0

To install the necessary packages for the project, please run: pip install -r requirements.txt.

Quick Start (For reproducing results)

  1. To do RoBERTa+T-DNA+FT, please refer to auto_FT.sh and you can simply run CUDA_VISIBLE_DEVICES=<GPU_ID> bash auto_FT.sh and get the expected results:
09/08/2021 19:56:58 - INFO - __main__ -   ***** Test results ag *****
09/08/2021 19:56:58 - INFO - __main__ -     eval_loss = 0.4393280267715454
09/08/2021 19:56:58 - INFO - __main__ -     eval_acc_and_f1 = {'acc': 0.8889473684210526, 'f1': 0.8889374532466023, 'acc_and_f1': 0.8889424108338275}
  1. To do RoBERTa+T-DNA+TAPT, please refer to auto_TAPT.sh and you can simply run CUDA_VISIBLE_DEVICES=<GPU_ID> bash auto_TAPT.sh and get the expected results:
09/08/2021 19:47:03 - INFO - __main__ -   ***** Test results ag *****
09/08/2021 19:47:03 - INFO - __main__ -     eval_loss = 0.48006332549609637
09/08/2021 19:47:03 - INFO - __main__ -     eval_acc_and_f1 = {'acc': 0.8943421052631579, 'f1': 0.8939718422143115, 'acc_and_f1': 0.8941569737387347}
  1. Important arguments:
    • task_name: ag, amazon, citation_intent, chemprot, hyperpartisan_news, imdb, rct-20k, sciie
    • data_dir: path of processed data
    • output_dir: path of saved results

Datasets

Following Gururangan et al. (2020), we conduct our experiments on eight classification tasks from four domains including biomedical sciences, computer scie nce, news and reviews. They are:

  • ChemProt: a manually annotated chemical–protein interaction dataset extracted from 5,031 abstracts for relation classification;
  • RCT: contains approximately 200,000 abstracts from public medicine with the role of each sentence clearly identified;
  • CitationIntent: contains around 2,000 citations annotated for their function;
  • SciERC: consists of 500 scientific abstracts annotated for relation classification;
  • HyperPartisan: which contains 645 articles from Hyperpartisan news with either extreme left-wing or right-wing stand-point used for partisanship classification;
  • AGNews: consists of 127,600 categorized articles from more than 2000 news source for topic classification;
  • Amazon: consists of 145,251 reviews on Women’s and Men’s Clothing & Accessories, each representing users’ implicit feedback on items with a binary label signifying whether the majority of customers found the review helpful;
  • IMDB: 50,000 balanced positive and negative reviews from the Internet Movie Database for sentiment classification

The datasets can be downloaded from the code associated with the Don't Stop Pretraining ACL 2020 paper. Please create a folder ./data in the root directory and put the downloaded datasets into it. After downloading, please convert them to *.tsv files referring to the script convert_dont_stop_corpus.py. Note that to create a low-resource setting, we constrain the size of all datasets into thousand-level. To do so, we randomly select a subset for RCT, AG, Amazon, IMDB with the ratio 1%, 1%, 1%, 10%, respectively.

To extract n-grams for datasets, please run pmi_ngram.py with the following parameters:

  • --dataset: the path of training data file
  • --output_dir: the path of output directory

Use with your own data

In this repo, we conducted experiments on eight classification tasks as described in the paper. In addition, it supports any classification task with just a little adjustment on your dataset. Here are the instructions to conduct experiments with your own data.

Firstly, please adjust your data format as following and put your data into the corresponding path.

Task adaptive pre-training:

Input dataset (./data/):

  • train: text \t label per line
  • dev: text \t label per line

Output: it will save the trained models to results folder automatically, and print out loss.

Fine-tuning dataset:

Input dataset (./data/tapt_data/):

  • train: text \t label per line
  • dev: text \t label per line
  • test: text \t label per line

Then, please modify the configuration file at ./TDNA/config.py

  1. define the desired evaluation metric in glue_compute_metrics(), e.g.,
elif task_name == "ag":
   return {"acc_and_f1": acc_and_f1(preds, labels)}
  1. create a new processor specifying the labels, e.g.,
class agProcessor(generalProcessor):
    def get_labels(self):
        return ['1', '2', '3', '4']
  1. specify the number of labels, e.g.,
glue_tasks_num_labels = {
    "citation_intent": 6,
    "ag": 4,
    "amazon": 2,
    "chemprot": 13,
    "hyperpartisan_news": 2,
    "imdb": 2,
    "rct-20k": 5,
    "sciie": 7,
    "SST2": 2
}
  1. include the new processor into glue_processors, e.g.,
glue_processors = {
    "citation_intent": citation_intentProcessor,
    "ag": agProcessor,
    "amazon": amazonProcessor,
    "chemprot": chemprotProcessor,
    "hyperpartisan_news": hyperpartisan_newsProcessor,
    "imdb": imdbProcessor,
    "rct-20k": rct_20kProcessor,
    "sciie": sciieProcessor,
    "SST2": SST2Processor
}
  1. specify the output mode in glue_output_modes, e.g.,
glue_output_modes = {
    "citation_intent": "classification",
    "ag": "classification",
    "amazon": "classification",
    "chemprot": "classification",
    "hyperpartisan_news": "classification",
    "imdb": "classification",
    "rct-20k": "classification",
    "sciie": "classification",
    "SST2": "classification"
}

Run

For FT,

python ./examples/run_classification.py --model_name_or_path roberta-base \
--task_name <task_name> --max_seq_length 256 --per_device_train_batch_size 16 \
--learning_rate 4e-5 --num_train_epochs 3.0 --output_dir ./results/<task_name>_FT/ \
--data_dir ./data/<task_name>/ --Ngram_path ./ngram/pmi_<task_name>_ngram.txt \
--fasttext_model_path ./ngram/<task_name>.npy --overwrite_output_dir

For TAPT + FT,

python ./examples/run_language_modeling.py \
--output_dir=./models/<task_name>_TAPT/ --model_type=roberta  --overwrite_output_dir \
--model_name_or_path=roberta-base --train_data_file=./data/tapt_data/<task_name>/train.tsv \
--eval_data_file=./data/tapt_data/<task_name>/dev.tsv --mlm --line_by_line \
--Ngram_path ./ngram/pmi_<task_name>_ngram.txt --num_train_epochs 10.0 \
--fasttext_model_path ./ngram/<task_name>.npy --learning_rate 4e-5

python ./examples/run_classification.py \
--model_name_or_path ./models/<task_name>_TAPT \
--task_name <task_name> --max_seq_length 256 --per_device_train_batch_size 16 \
--learning_rate 2e-5 --num_train_epochs 5.0 --output_dir ./results/<task_name>_TAPT_FT/ \
--data_dir ./data/<task_name>/ --Ngram_path ./ngram/pmi_<task_name>_ngram.txt --overwrite_output_dir --save_steps 5000

Output:

The run_classification.py program will save the trained models to results folder automatically, and print out loss, accuracy, f1 score. In addition, you can get the prediction results in args.output_dir/test_pred_{task_name}.txt. Take test_pred_ag.txt as an example:

input   label   pred
Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul. 3       3
SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.      4       4
...

Contact information

For help or issues using T-DNA, please submit a GitHub issue.

For personal communication related to T-DNA, please contact Shizhe Diao ([email protected]).

Citation

If you use or extend our work, please cite the following paper:

@inproceedings{DXSJSZ2021,
    title = "Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation",
    author = "Diao, Shizhe  and
      Xu, Ruijia  and
      Su, Hongjin  and
      Jiang, Yilei  and
      Song, Yan  and
      Zhang, Tong",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.259",
    doi = "10.18653/v1/2021.acl-long.259",
    pages = "3336--3349",
}
Owner
shizhediao
shizhediao
This repo holds the code of TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation

TransFuse This repo holds the code of TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation Requirements Pytorch=1.6.0, 1.9.0 (=1.

Rayicer 93 Dec 19, 2022
Distributed Asynchronous Hyperparameter Optimization better than HyperOpt.

UltraOpt : Distributed Asynchronous Hyperparameter Optimization better than HyperOpt. UltraOpt is a simple and efficient library to minimize expensive

98 Aug 16, 2022
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
Distilled coarse part of LoFTR adapted for compatibility with TensorRT and embedded divices

Coarse LoFTR TRT Google Colab demo notebook This project provides a deep learning model for the Local Feature Matching for two images that can be used

Kirill 46 Dec 24, 2022
[AAAI 2021] MVFNet: Multi-View Fusion Network for Efficient Video Recognition

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

Wenhao Wu 114 Nov 27, 2022
Contains supplementary materials for reproduce results in HMC divergence time estimation manuscript

Scalable Bayesian divergence time estimation with ratio transformations This repository contains the instructions and files to reproduce the analyses

Suchard Research Group 1 Sep 21, 2022
A Python library for working with arbitrary-dimension hypercomplex numbers following the Cayley-Dickson construction of algebras.

Hypercomplex A Python library for working with quaternions, octonions, sedenions, and beyond following the Cayley-Dickson construction of hypercomplex

7 Nov 04, 2022
Ensemble Knowledge Guided Sub-network Search and Fine-tuning for Filter Pruning

Ensemble Knowledge Guided Sub-network Search and Fine-tuning for Filter Pruning This repository is official Tensorflow implementation of paper: Ensemb

Seunghyun Lee 12 Oct 18, 2022
Writeups for the challenges from DownUnderCTF 2021

cloud Challenge Author Difficulty Release Round Bad Bucket Blue Alder easy round 1 Not as Bad Bucket Blue Alder easy round 1 Lost n Found Blue Alder m

DownUnderCTF 161 Dec 31, 2022
This repository contains the code for the binaural-detection model used in the publication arXiv:2111.04637

This repository contains the code for the binaural-detection model used in the publication arXiv:2111.04637 Dependencies The model depends on the foll

Jörg Encke 2 Oct 14, 2022
Malware Env for OpenAI Gym

Malware Env for OpenAI Gym Citing If you use this code in a publication please cite the following paper: Hyrum S. Anderson, Anant Kharkar, Bobby Fila

ENDGAME 563 Dec 29, 2022
Code release for Local Light Field Fusion at SIGGRAPH 2019

Local Light Field Fusion Project | Video | Paper Tensorflow implementation for novel view synthesis from sparse input images. Local Light Field Fusion

1.1k Dec 27, 2022
ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge (ManiSkill Challenge), a large-scale learning-from-demonstrations benchmark for object manipulation.

ManiSkill-Learn ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge, a large-scale learning-from-dem

Hao Su's Lab, UCSD 48 Dec 30, 2022
[AAAI22] Reliable Propagation-Correction Modulation for Video Object Segmentation

Reliable Propagation-Correction Modulation for Video Object Segmentation (AAAI22) Preview version paper of this work is available at: https://arxiv.or

Xiaohao Xu 70 Dec 04, 2022
Code for "MetaMorph: Learning Universal Controllers with Transformers", Gupta et al, ICLR 2022

MetaMorph: Learning Universal Controllers with Transformers This is the code for the paper MetaMorph: Learning Universal Controllers with Transformers

Agrim Gupta 50 Jan 03, 2023
GAN-generated image detection based on CNNs

GAN-image-detection This repository contains a GAN-generated image detector developed to distinguish real images from synthetic ones. The detector is

Image and Sound Processing Lab 17 Dec 15, 2022
RoMa: A lightweight library to deal with 3D rotations in PyTorch.

RoMa: A lightweight library to deal with 3D rotations in PyTorch. RoMa (which stands for Rotation Manipulation) provides differentiable mappings betwe

NAVER 90 Dec 27, 2022
Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr\"om Method (NeurIPS 2021)

Skyformer This repository is the official implementation of Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr"om Method (NeurIPS 2021).

Qi Zeng 46 Sep 20, 2022
In this project, two programs can help you take full agvantage of time on the model training with a remote server

In this project, two programs can help you take full agvantage of time on the model training with a remote server, which can push notification to your phone about the information during model trainin

GrayLee 8 Dec 27, 2022
RefineGNN - Iterative refinement graph neural network for antibody sequence-structure co-design (RefineGNN)

Iterative refinement graph neural network for antibody sequence-structure co-des

Wengong Jin 83 Dec 31, 2022