Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Last update: Oct 28, 2022

ood-text-emnlp

Files

fine_tune.py is used to fine-tune the GPT-2 models, and roberta_fine_tune.py is used to fine-tune the RoBERTa models.
perplexity.py and msp_eval.py are used to compute the perplexities (PPLs) and maximum softmax probabilities (MSPs) of a dataset pair's examples using the fine-tuned models.

How to run

These steps show how to train both the density estimation and calibration models on the MNLI dataset and evaluate them against SNLI.

A different dataset pair can be used by updating the appropriate dataset_name or id_data/ood_data values as shown below:

Training the Density Estimation Model (GPT-2)

Two options:

Using HF Datasets -

python fine_tune.py --dataset_name glue --dataset_config_name mnli --key premise --key2 hypothesis

This also generates a txt train file corresponding to the dataset's text.

Using previously generated txt file -

python fine_tune.py --train_file data/glue_mnli_train.txt --fname glue_mnli

Finding Perplexity (PPL)

This uses the txt files generated after running fine_tune.py to find the perplexity of the ID model on both ID and OOD validation sets -

id_data="glue_mnli"
ood_data="snli"
python perplexity.py --model_path ckpts/gpt2-$id_data/ --dataset_path data/${ood_data}_val.txt --fname ${id_data}_$ood_data

python perplexity.py --model_path ckpts/gpt2-$id_data/ --dataset_path data/${id_data}_val.txt --fname ${id_data}_$id_data
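
For reference, perplexity is the exponential of the average negative log-likelihood per token. The snippet below is a minimal sketch of that formula only; it is not the code in perplexity.py, and the function name and the toy input are assumptions made for illustration. It takes per-token natural-log probabilities, which a language model such as GPT-2 would supply in practice.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sequence from its per-token natural-log probabilities.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy input: a model that assigns probability 0.25 to each of 4 tokens.
uniform = [math.log(0.25)] * 4
# A sequence the model finds less likely gets a higher perplexity.
unlikely = [math.log(0.01)] * 4

print(perplexity(uniform))   # approximately 4
print(perplexity(unlikely))  # approximately 100
```

Higher perplexity under the ID model means the text is less likely under the training distribution, which is why PPL can be used directly as an OOD score.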

Training the Calibration Model (RoBERTa)

Two options:

Using HF Datasets -

id_data="mnli"
python roberta_fine_tune.py --task_name $id_data --output_dir roberta_ckpts/roberta-$id_data --fname ${id_data}_$id_data

Using txt file generated earlier -

id_data="mnli"
python roberta_fine_tune.py --train_file data/mnli/${id_data}_conditional_train.txt --val_file data/mnli/${id_data}_val.txt --output_dir roberta_ckpts/roberta-$id_data --fname ${id_data}_$id_data

The *_conditional_train.txt file contains both the labels as well as the text.

Finding Maximum Softmax Probability (MSP)

Two options:

Using HF Datasets -

id_data="mnli"
ood_data="snli"
python msp_eval.py --model_path roberta_ckpts/roberta-$id_data --dataset_name $ood_data --fname ${id_data}_$ood_data

Using txt file generated earlier -

id_data="mnli"
ood_data="snli"
python msp_eval.py --model_path roberta_ckpts/roberta-$id_data --val_file data/${ood_data}_val.txt --fname ${id_data}_$ood_data --save_msp True
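
For reference, the MSP of an example is the largest class probability after applying softmax to the classifier's logits. The snippet below is a minimal sketch of that definition only; it is not the code in msp_eval.py, and the function names and toy logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_probability(logits):
    """MSP: the probability of the most likely class.

    Hypothetical helper for illustration; not part of this repository.
    """
    return max(softmax(logits))


# Toy 3-class logits (e.g. entailment / neutral / contradiction).
confident = max_softmax_probability([6.0, 0.5, -1.0])
uncertain = max_softmax_probability([0.2, 0.1, 0.0])

print(confident, uncertain)
```

A low MSP suggests the classifier is unsure about the example, so lower values are treated as more likely to be OOD.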

Evaluating AUROC

Compute AUROC of PPL using compute_auroc in utils.py -

import utils

id_data = 'glue_mnli'
ood_data = 'snli'
# Perplexities of the ID model on the ID and OOD validation sets
id_pps = utils.read_model_out(f'output/gpt2/{id_data}_{id_data}_pps.npy')
ood_pps = utils.read_model_out(f'output/gpt2/{id_data}_{ood_data}_pps.npy')
score = utils.compute_auroc(id_pps, ood_pps)
print(score)

Compute AUROC of MSP -

id_data = 'mnli'
ood_data = 'snli'
id_msp = utils.read_model_out(f'output/roberta/{id_data}_{id_data}_msp.npy')
ood_msp = utils.read_model_out(f'output/roberta/{id_data}_{ood_data}_msp.npy')
# MSP is negated so that higher scores indicate OOD, as with PPL
score = utils.compute_auroc(-id_msp, -ood_msp)
print(score)
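
To make the negation of MSP concrete, here is a minimal sketch of AUROC for OOD detection. It is not the compute_auroc implementation in utils.py; the function name, the convention that OOD is the positive class with higher scores meaning "more OOD", and the toy numbers are all assumptions made for illustration. Under that convention, AUROC equals the probability that a random OOD example scores higher than a random ID example, with ties counted as half.

```python
def auroc(id_scores, ood_scores):
    """Pairwise AUROC with OOD as the positive class.

    Hypothetical helper for illustration; expects plain Python lists.
    Quadratic in the number of examples, so only suitable for small inputs.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy perplexities: OOD text tends to have higher PPL, so use it as is.
id_ppl = [10.0, 12.0, 15.0]
ood_ppl = [14.0, 30.0, 45.0]
ppl_auroc = auroc(id_ppl, ood_ppl)

# Toy MSPs: OOD text tends to have lower MSP, so negate before scoring.
id_msp = [0.90, 0.80, 0.95]
ood_msp = [0.50, 0.85, 0.60]
msp_auroc = auroc([-s for s in id_msp], [-s for s in ood_msp])

print(ppl_auroc, msp_auroc)
```

Without the negation, the MSP score would be ranked in the wrong direction and a good detector would report an AUROC below 0.5.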

Owner

Udit Arora
