Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Last update: Jan 08, 2023

Overview

OFA

OFA is a unified multimodal pretrained model that brings modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation) into a simple sequence-to-sequence learning framework. For more information, please refer to our paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
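To make the "one sequence-to-sequence format for every task" idea concrete, here is a minimal sketch of how samples from different tasks could be cast into a shared (instruction, target) pair. It is an illustration under assumptions, not OFA's actual data pipeline: the `Seq2SeqExample` class, the `make_example` helper, and the instruction templates are all hypothetical names and wordings chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq2SeqExample:
    """One training pair: an instruction (plus optional image) and a target sequence."""
    instruction: str
    image_path: Optional[str]
    target: str


# Hypothetical instruction templates, written for illustration only.
TEMPLATES = {
    "caption": "What does the image describe?",
    "vqa": "{question}",
    "text_generation": "Summarize: {text}",
}


def make_example(task, target, image_path=None, **fields):
    """Cast a task-specific sample into the shared (instruction, target) form."""
    if task not in TEMPLATES:
        raise ValueError("unknown task: %s" % task)
    instruction = TEMPLATES[task].format(**fields)
    return Seq2SeqExample(instruction, image_path, target)
```

With this shape, a captioning sample and a VQA sample differ only in their instruction text, so a single encoder-decoder model can in principle be trained on both with the same loss.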

News

2022.2.13: Released the demo of image captioning. Have fun!
2022.2.11: Released the Colab notebook for image captioning. Enjoy!
2022.2.11: Released the pretrained checkpoint of OFA-Large and the complete (two-stage) finetuning code for image captioning.
2022.2.10: Released the inference code and finetuned checkpoint for image captioning, which reproduce the results on the COCO Karpathy test split (149.6 CIDEr).

TODO

To release finetuning and inference code for multimodal downstream tasks soon, including image captioning, VQA, text-to-image generation, SNLI-VE, and referring expression comprehension.
To release the pretraining code soon.

Approach

Requirements

python 3.7.4
pytorch 1.8.1
torchvision 0.9.1
Java 1.8 (for COCO evaluation)
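A quick way to compare an environment against the versions listed above is sketched below. It is only a convenience check under assumptions: the helper names (`version_tuple`, `find_mismatches`, `installed_versions`) are made up for this example, the comparison is an exact match on the first three version numbers, and other versions may well work even if they are reported as mismatches. Java is not checked.

```python
import re
import sys

# Versions pinned in the Requirements list above.
REQUIRED = {"python": "3.7.4", "torch": "1.8.1", "torchvision": "0.9.1"}


def version_tuple(version):
    """Turn '1.8.1+cu111' into (1, 8, 1), ignoring any local build suffix."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def find_mismatches(installed, required=REQUIRED):
    """Return {name: (installed, required)} for every entry that differs."""
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(name)
        if found is None or version_tuple(found) != version_tuple(wanted):
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions():
    """Collect the versions that are importable in the current environment."""
    versions = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in ("torch", "torchvision"):
        try:
            versions[name] = __import__(name).__version__
        except ImportError:
            pass  # left out here, so find_mismatches reports it as missing
    return versions


# Example: list everything that differs from the README.
# for name, (found, wanted) in find_mismatches(installed_versions()).items():
#     print("%s: found %s, README lists %s" % (name, found, wanted))
```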

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA  # requirements.txt lives in the repository root
pip install -r requirements.txt

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Pretraining

To be released soon :)

Finetuning & Inference

Below we provide methods for finetuning and inference on different downstream tasks.

Caption

Download data and files and put them in the correct directory
Train

cd run_scripts/caption
nohup sh train_caption_stage1.sh &  # stage1, train with cross-entropy loss
nohup sh train_caption_stage2.sh &  # stage2, load the best ckpt of stage1 and train with CIDEr optimization
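Because stage 2 loads the best checkpoint of stage 1, the two scripts have to run one after the other rather than both in the background at once. The sketch below shows one way to enforce that ordering from Python. It assumes a fresh clone with the scripts at `run_scripts/caption`, and that the stage 2 script already points at the stage 1 checkpoint; `build_commands` and `run_stages` are hypothetical helper names, not part of the repository.

```python
import subprocess
from pathlib import Path

# Script names come from the commands above; the directory layout is assumed
# to match a fresh clone of the repository.
STAGE_SCRIPTS = ["train_caption_stage1.sh", "train_caption_stage2.sh"]


def build_commands(script_dir="run_scripts/caption", scripts=STAGE_SCRIPTS):
    """Return one (argv, cwd) pair per stage, in the order they must run."""
    cwd = Path(script_dir)
    return [(["sh", name], cwd) for name in scripts]


def run_stages(commands, runner=subprocess.run):
    """Run each stage in turn and stop at the first failure.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so stage 2 never starts after a failed stage 1.
    """
    for argv, cwd in commands:
        runner(argv, cwd=str(cwd), check=True)


# From the repository root:
# run_stages(build_commands())
```

The `runner` parameter exists only so the ordering can be checked without launching a real training job; in normal use the default `subprocess.run` is enough.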

Inference

cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate

Gallery

Below we provide examples of OFA on text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (Grounded QA) and on an unseen domain (Visual Grounding on images from unseen domains).

Text-to-Image Generation (normal query)

Text-to-Image Generation (counterfactual query)

Open-Ended VQA

Grounded QA (unseen task)

Visual Grounding (unseen domain)

Citation

Please cite our paper if you find it helpful :)

@article{wang2022OFA,
  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
  journal={arXiv e-prints},
  pages={arXiv--2202},
  year={2022}
}

Related Codebase

Fairseq

License

Apache-2.0
