Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Last update: Dec 22, 2022

Related tags

Overview

Dataset Cartography

Code for the paper Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics at EMNLP 2020.

This repository contains implementation of data maps, as well as other data selection baselines, along with notebooks for data map visualizations.

If using, please cite:

@inproceedings{swayamdipta2020dataset,
    title={Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics},
    author={Swabha Swayamdipta and Roy Schwartz and Nicholas Lourie and Yizhong Wang and Hannaneh Hajishirzi and Noah A. Smith and Yejin Choi},
    booktitle={Proceedings of EMNLP},
    url={https://arxiv.org/abs/2009.10795},
    year={2020}
}

This repository can be used to build Data Maps, like this one for SNLI using a RoBERTa-Large classifier.

Pre-requisites

This repository is based on the HuggingFace Transformers library.

Train GLUE-style model and compute training dynamics

To train a GLUE-style model using this repository:

python -m cartography.classification.run_glue \
    -c configs/$TASK.jsonnet \
    --do_train \
    --do_eval \
    -o $MODEL_OUTPUT_DIR

The best configurations for our experiments for each of the $TASKs (SNLI, MNLI, QNLI or WINOGRANDE) are provided under configs.

This produces a training dynamics directory $MODEL_OUTPUT_DIR/training_dynamics, see a sample here.

Note: you can use any other set up to train your model (independent of this repository) as long as you produce the dynamics_epoch_$X.jsonl for plotting data maps, and filtering different regions of the data. The .jsonl file must contain the following fields for every training instance:

guid : instance ID matching that in the original data file, for filtering,
logits_epoch_$X : logits for the training instance under epoch $X,
gold : index of the gold label, must match the logits array.

Plot Data Maps

To plot data maps for a trained $MODEL (e.g. RoBERTa-Large) on a given $TASK (e.g. SNLI, MNLI, QNLI or WINOGRANDE):

python -m cartography.selection.train_dy_filtering \
    --plot \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --model $MODEL_NAME

Data Selection

To select (different amounts of) data based on various metrics from training dynamics:

python -m cartography.selection.train_dy_filtering \
    --filter \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --metric $METRIC \
    --data_dir $PATH_TO_GLUE_DIR_WITH_ORIGINAL_DATA_IN_TSV_FORMAT

Supported $TASKs include SNLI, QNLI, MNLI and WINOGRANDE, and $METRICs include confidence, variability, correctness, forgetfulness and threshold_closeness; see paper for more details.

To select hard-to-learn instances, set $METRIC as "confidence" and for ambiguous, set $METRIC as "variability". For easy-to-learn instances: set $METRIC as "confidence" and use the flag --worst.

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Related tags

Overview

Dataset Cartography

Pre-requisites

Train GLUE-style model and compute training dynamics

Plot Data Maps

Data Selection

Owner

AI2

Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP

PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

Discovering Interpretable GAN Controls [NeurIPS 2020]

PyTorch implementation of Weak-shot Fine-grained Classification via Similarity Transfer

PyTorch implementation of image classification models for CIFAR-10/CIFAR-100/MNIST/FashionMNIST/Kuzushiji-MNIST/ImageNet

Monocular Depth Estimation - Weighted-average prediction from multiple pre-trained depth estimation models

[CVPR 21] Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.

Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

Chinese Advertisement Board Identification(Pytorch)

PyTorch Implementation of the paper Learning to Reweight Examples for Robust Deep Learning

Code for paper entitled "Improving Novelty Detection using the Reconstructions of Nearest Neighbours"

Adversarial Learning for Semi-supervised Semantic Segmentation, BMVC 2018

[TIP 2021] SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction

Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch

Noise Conditional Score Networks (NeurIPS 2019, Oral)

Official Pytorch implementation of 'GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network' (NeurIPS 2020)

Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation

This is the repository for The Machine Learning Workshops, published by AI DOJO