Text-to-Image generation

Overview

Generate vivid Images for Any (Chinese) text

teaser

CogView is a pretrained (4B-param) transformer for text-to-image generation in general domain.

  • Read our paper CogView: Mastering Text-to-Image Generation via Transformers on ArXiv for a formal introduction. The PB-relax and Sandwich-LN can also help you train large and deep transformers stably (e.g. eliminating NaN losses).
  • Visit our demo at Github Page or Wudao! (Without post-selection or super-resolution, currently only supports simplified Chinese input, but one can translate text from other languages into Chinese for input. Note: Wudao provides faster access for users from China mainland.)
  • Download our pretrained models from Project Wudao-Wenhui(悟道-文汇).
  • Cite our paper if you find our work is helpful~
@article{ding2021cogview,
  title={CogView: Mastering Text-to-Image Generation via Transformers},
  author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and Tang, Jie},
  journal={arXiv preprint arXiv:2105.13290},
  year={2021}
  • Google Colab Two contributors successfully setup up CogView on Colab Links to Colab!

Getting Started

Setup

  • Hardware: Linux servers with Nvidia V100s or A100s are recommended, but it is also okay to run the pretrained models with smaller --max-inference-batch-size or training smaller models on less powerful GPUs.

  • Environment (Option 1): Please first install PyTorch (>=1.7.0) and apex, and then install other dependencies via pip install -r requirements.txt.

  • Environment (Option 2): We prepare a docker image in case that you fail to handle the environments. Pull the image, create a (background) container and get into it via:

    docker pull cogview/cuda111_torch181_deepspeed040
    ./env/start_docker.sh && docker exec -it bg-cogview bash
    
    cd /root/cogview # in the container
    

Download

  1. Download the image tokenizer vqvae_hard_biggerset_011.pt from BAAI website or Tsinghua Cloud. Place the file under pretrained/vqvae.
wget https://cloud.tsinghua.edu.cn/f/71607a5dca69417baa8c/?dl=1 -O pretrained/vqvae/vqvae_hard_biggerset_011.pt
  1. Download models from Project Wudao-Wenhui.

    FileName Discription
    cogview-base.tar The pretrained text-to-image model.
    cogview-caption.tar Finetuned image-to-text model, also used for reranking.
    cogview-sr.tar Finetuned super-resolution model. (warning: it runs slow.)

    Uncompress them into pretrained/cogview/. The following command should be modified based on the model name.

    tar -xvf cogview-{base, sr, caption}.tar -C pretrained/cogview/
    
  2. (Only for training tutorial, skip it for inference.) Download a small "bird-and-animal" example dataset from our link at Tsinghua Cloud.

wget https://cloud.tsinghua.edu.cn/f/1e4963ec8ac84941ba68/?dl=1 -O data/bird_animal.bin

Run CogView! (Model Inference)

We encapsulate the generation functions into scripts. See generate_samples.py and arguments.py for details.

Text-to-Image Generation

Write text queries (one per line) into input.txt and run:

./scripts/text2image.sh --debug

The results will in a new folder samples_text2image/.

Arguments useful in inference are mainly:

  • --input-source [path or "interactive"]. The path of the input file, can also be "interactive", which will launch a CLI.
  • --output-path [path]. The folder containing the results.
  • --batch-size [int]. The number of samples will be generated per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward. Reduce it if OOM.
  • --debug. Only save concatenated images for all generated samples, and name them by input text and date.
  • --with-id. When it toggled, you must specify an "id" before each input, e.g. 001\t一个漂亮的女孩, \t denoting TAB (NOT space). It will generate batch-size split images in a folder named "id" for each input. Confict with --debug.
  • --device [int]. Running on which GPU.

Super-resolution

Run the following script and input text\t{image_path}, where {image_path} means the path of a previously generated image.

./scripts/super_resolution.sh

Note: It is only effective for generated images from our Image Tokenizer (due to the token distribution).

Image-to-Text

The input is "one image path per line", and will print the results to stdout.

./scripts/image2text.sh

Note: Not optimized for this task, so it might not very competitive (but okay). We will consider to release a version funetuning for a longer period on this task in the future. (TODO)

Post-selection

This application only takes file inputs, where each line is {text}\t{image_path1}\t{image_path2}\t{image_path3}.... The output is {output_path}/scores.txt, a line of a list of scores, following a line from inputs.

./scripts/post_selection.sh

Note: In the released codes, for simplicity, we did not expose the raw API , which supports some advanced generation modes, e.g. text and part of image.

Training

Here we use a subset of our dataset from bird-and-animal for tutorial. The binary dataset is generated by our cogdata toolkit. Please wait for a formal release with tutorials of cogdata (although it is available now).

Single Node

After downloading the dataset, directly run

./scripts/pretrain_single_node.sh

Multiple Nodes

If you want to train the models on multiple servers inter-connected by infiniband without a shared file system (you may need pdsh to accelerate this process):

  1. On each server, use git clone to download this repo, and make sure the data (LMDB format) are moved into the data subfolder.
  2. On each server, echo "ip1 ip2 <other IPs>" > ./docker/ip_list.txt, and then start the docker by ./env/start_docker.sh.
  3. Get into the docker on the first node container via docker exec -it bg-cogview bash.
  4. Get into /root/cogview and run ./scripts/pretrain_multiple_nodes.sh. You may need to change the config (especially OPTIONS_NCCL) in the shell script.

See the arguments.py for advanced functions for training. TODO

Gallery

more_samples

Owner
THUDM
Data Mining Research Group at Tsinghua University
THUDM
A Pytorch loader for MVTecAD dataset.

MVTecAD A Pytorch loader for MVTecAD dataset. It strictly follows the code style of common Pytorch datasets, such as torchvision.datasets.CIFAR10. The

Jiyuan 1 Dec 27, 2021
Pytorch implementation for Semantic Segmentation/Scene Parsing on MIT ADE20K dataset

Semantic Segmentation on MIT ADE20K dataset in PyTorch This is a PyTorch implementation of semantic segmentation models on MIT ADE20K scene parsing da

MIT CSAIL Computer Vision 4.5k Jan 08, 2023
Code for the paper: Adversarial Training Against Location-Optimized Adversarial Patches. ECCV-W 2020.

Adversarial Training Against Location-Optimized Adversarial Patches arXiv | Paper | Code | Video | Slides Code for the paper: Sukrut Rao, David Stutz,

Sukrut Rao 32 Dec 13, 2022
Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks

Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks Abstract Facial expression recognition in video

Bogireddy Sai Prasanna Teja Reddy 103 Dec 29, 2022
Deep Learning Models for Causal Inference

Extensive tutorials for learning how to build deep learning models for causal inference using selection on observables in Tensorflow 2.

Bernard J Koch 151 Dec 31, 2022
HODEmu, is both an executable and a python library that is based on Ragagnin 2021 in prep.

HODEmu HODEmu, is both an executable and a python library that is based on Ragagnin 2021 in prep. and emulates satellite abundance as a function of co

Antonio Ragagnin 1 Oct 13, 2021
This program will stylize your photos with fast neural style transfer.

Neural Style Transfer (NST) Using TensorFlow Demo TensorFlow TensorFlow is an end-to-end open source platform for machine learning. It has a comprehen

Ismail Boularbah 1 Aug 08, 2022
Source code, datasets and trained models for the paper Learning Advanced Mathematical Computations from Examples (ICLR 2021), by François Charton, Amaury Hayat (ENPC-Rutgers) and Guillaume Lample

Maths from examples - Learning advanced mathematical computations from examples This is the source code and data sets relevant to the paper Learning a

Facebook Research 171 Nov 23, 2022
Mix3D: Out-of-Context Data Augmentation for 3D Scenes (3DV 2021)

Mix3D: Out-of-Context Data Augmentation for 3D Scenes (3DV 2021) Alexey Nekrasov*, Jonas Schult*, Or Litany, Bastian Leibe, Francis Engelmann Mix3D is

Alexey Nekrasov 189 Dec 26, 2022
Offical implementation for "Trash or Treasure? An Interactive Dual-Stream Strategy for Single Image Reflection Separation".

Trash or Treasure? An Interactive Dual-Stream Strategy for Single Image Reflection Separation (NeurIPS 2021) by Qiming Hu, Xiaojie Guo. Dependencies P

Qiming Hu 31 Dec 20, 2022
Code for classifying international patents based on the text of their titles/abstracts

Patent Classification Goal: To train a machine learning classifier that can automatically classify international patents downloaded from the WIPO webs

Prashanth Rao 1 Nov 08, 2022
DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021)

DeepLM DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021) Run Please install th

Jingwei Huang 130 Dec 02, 2022
Implementation of "Glancing Transformer for Non-Autoregressive Neural Machine Translation"

GLAT Implementation for the ACL2021 paper "Glancing Transformer for Non-Autoregressive Neural Machine Translation" Requirements Python = 3.7 Pytorch

117 Jan 09, 2023
A deep learning library that makes face recognition efficient and effective

Distributed Arcface Training in Pytorch This is a deep learning library that makes face recognition efficient, and effective, which can train tens of

Sajjad Aemmi 10 Nov 23, 2021
Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective

Unofficial pytorch implementation of the paper "Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective"

16 Nov 21, 2022
The Most Efficient Temporal Difference Learning Framework for 2048

moporgic/TDL2048+ TDL2048+ is a highly optimized temporal difference (TD) learning framework for 2048. Features Many common methods related to 2048 ar

Hung Guei 5 Nov 23, 2022
Official Repsoitory for "Activate or Not: Learning Customized Activation." [CVPR 2021]

CVPR 2021 | Activate or Not: Learning Customized Activation. This repository contains the official Pytorch implementation of the paper Activate or Not

184 Dec 27, 2022
Official Implementation for HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing Yuval Alaluf*, Omer Tov*, Ron Mokady, Rinon Gal, Amit H. Bermano *Denotes equ

885 Jan 06, 2023
Kernel Point Convolutions

Created by Hugues THOMAS Introduction Update 27/04/2020: New PyTorch implementation available. With SemanticKitti, and Windows supported. This reposit

Hugues THOMAS 584 Jan 07, 2023
Safe Control for Black-box Dynamical Systems via Neural Barrier Certificates

Safe Control for Black-box Dynamical Systems via Neural Barrier Certificates Installation Clone the repository: git clone https://github.com/Zengyi-Qi

Zengyi Qin 3 Oct 18, 2022