EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

Overview

Pre-train or Annotate? Domain Adaptation with a Constrained Budget

This repo contains code and data associated with EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

@inproceedings{bai-etal-2021-pre,
    title = "Pre-train or Annotate? Domain Adaptation with a Constrained Budget",
    author = "Bai, Fan  and
              Ritter, Alan  and
              Xu, Wei",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}

Installment

git clone https://github.com/bflashcp3f/ProcBERT.git
cd ProcBERT
conda env create -f environment.yml
conda activate procbert

Data & Model Checkpoints

Three procedural-text datasets (WLP, PubMed and ChemSyn) can be downloaded here, and model checkpoints (ProcBERT and Proc-RoBERTa) are accessible through HuggingFace.

Experiment

Setup

# After downloading the data, update the DATA_PATH variable in code/utils.py
DATA_PATH=<DATA_PATH>

Budget-aware Domain Adaptation Experiments (with EasyAdapt)

# Named Entity Recognition (NER) 
python code/ner_da_budget.py     \
  --lm_model procbert     \
  --src_data pubmed     \
  --tgt_data chemsyn     \
  --gpu_ids 0,1   \
  --output_dir ./output/da/pubmed_chemsyn     \
  --learning_rate 1e-5     \
  --task_name fa_ner     \
  --batch_size 16     \
  --max_len 512    \
  --epochs 25 \
  --budget 700 \
  --alpha 1   \
  --save_model

# Relation Extraction (RE)
python code/rel_da_budget.py \
  --lm_model procbert \
  --src_data pubmed     \
  --tgt_data chemsyn     \
  --gpu_ids 0,1  \
  --output_dir ./output/da/pubmed_chemsyn \
  --learning_rate 1e-5 \
  --task_name fa_rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --alpha 1 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results with different budgets under six domain adaptation settings:

# NER
sh script/ner/run_ner_da_budget_all.sh

# RE
sh script/rel/run_rel_da_budget_all.sh

Budget-aware Target-domain-only Experiments

# Named Entity Recognition (NER) 
python code/ner_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 25 \
  --budget 700 \
  --save_model

# Relation Extraction (RE)
python code/rel_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results with different budgets on three datasets:

# NER
sh script/ner/run_ner_budget_all.sh

# RE
sh script/rel/run_rel_budget_all.sh

Auxiliary Experiments

# Named Entity Recognition (NER) 
python code/ner.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 20 \
  --save_model

# Relation Extraction (RE)
python code/rel.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results on all three datasets:

# NER
sh script/ner/run_ner_all.sh

# RE
sh script/rel/run_rel_all.sh
Owner
Fan Bai
Fan Bai
pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

297 Dec 29, 2022
This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Common Voice Utils This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project. It aims t

Francis Tyers 40 Dec 20, 2022
kochat

Kochat 챗봇 빌더는 성에 안차고, 자신만의 딥러닝 챗봇 애플리케이션을 만드시고 싶으신가요? Kochat을 이용하면 손쉽게 자신만의 딥러닝 챗봇 애플리케이션을 빌드할 수 있습니다. # 1. 데이터셋 객체 생성 dataset = Dataset(ood=True) #

1 Oct 25, 2021
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

GPT Neo 🎉 1T or bust my dudes 🎉 An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. If you're just here t

EleutherAI 6.7k Dec 28, 2022
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
neural network based speaker embedder

Content What is deepaudio-speaker? Installation Get Started Model Architecture How to contribute to deepaudio-speaker? Acknowledge What is deepaudio-s

20 Dec 29, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Malware-Related Sentence Classification

Malware-Related Sentence Classification This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Clas

Chau Nguyen 1 Mar 26, 2022
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
Yet Another Neural Machine Translation Toolkit

YANMTT YANMTT is short for Yet Another Neural Machine Translation Toolkit. For a backstory how I ended up creating this toolkit scroll to the bottom o

Raj Dabre 121 Jan 05, 2023
Snowball compiler and stemming algorithms

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algori

Snowball Stemming language and algorithms 613 Jan 07, 2023
LewusBot - Twitch ChatBot built in python with twitchio library

LewusBot Twitch ChatBot built in python with twitchio library. Uses twitch/leagu

Lewus 25 Dec 04, 2022
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 01, 2023
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Bort Companion code for the paper "Optimal Subarchitecture Extraction for BERT." Bort is an optimal subset of architectural parameters for the BERT ar

Alexa 461 Nov 21, 2022
Text Normalization(文本正则化)

Text Normalization(文本正则化) 任务描述:通过机器学习算法将英文文本的“手写”形式转换成“口语“形式,例如“6ft”转换成“six feet”等 实验结果 XGBoost + bag-of-words: 0.99159 XGBoost+Weights+rules:0.99002

Jason_Zhang 0 Feb 26, 2022
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
Codes for processing meeting summarization datasets AMI and ICSI.

Meeting Summarization Dataset Meeting plays an essential part in our daily life, which allows us to share information and collaborate with others. Wit

xcfeng 39 Dec 14, 2022