CLIPort: What and Where Pathways for Robotic Manipulation

Overview

CLIPort

Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2021

CLIPort is an end-to-end imitation-learning agent that can learn a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations.

For the latest updates, see: cliport.github.io

Guides

Installation

Clone Repo:

git clone https://github.com/cliport/cliport.git

Setup virtualenv and install requirements:

# setup virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages cliport_env  
source cliport_env/bin/activate
pip install --upgrade pip

cd cliport
pip install -r requirements.txt

export CLIPORT_ROOT=$(pwd)
python setup.py develop

Note: You might need versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA and hardware.
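To see which versions ended up in your environment, something like the following can help. It is a minimal sketch that only uses the standard library; the helper names (installed_version, compare_versions) are made up for this example and are not part of the repo. A mismatch is not necessarily a problem, since the right build depends on your CUDA and hardware.

```python
from importlib import metadata

# Versions named in the note above. Treat them as a starting point, not a hard requirement.
SUGGESTED = {"torch": "1.7.1", "torchvision": "0.8.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(suggested=SUGGESTED):
    """Map each package name to (installed, suggested, matches)."""
    report = {}
    for package, wanted in suggested.items():
        found = installed_version(package)
        # CUDA-specific builds often carry a local tag such as "1.7.1+cu110";
        # compare only the part before the "+".
        base = found.split("+")[0] if found else None
        report[package] = (found, wanted, base == wanted)
    return report


if __name__ == "__main__":
    for package, (found, wanted, ok) in compare_versions().items():
        status = "matches" if ok else "differs"
        print(f"{package}: installed={found} suggested={wanted} ({status})")
```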

Quickstart

A quick tutorial on evaluating a pre-trained multi-task model.

Download a multi-language-conditioned checkpoint pre-trained with 1000 demos:

python scripts/quickstart_download.py

Generate a small test set of 10 instances for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:

python cliport/demos.py n=10 \
                        task=stack-block-pyramid-seq-seen-colors \
                        mode=test 

This will take a few minutes to finish.

Evaluate the best validation checkpoint for stack-block-pyramid-seq-seen-colors on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=10 \
                       train_demos=1000 \
                       exp_folder=cliport_quickstart \
                       checkpoint_type=test_best \
                       update_results=True \
                       disp=True

If you are on a headless machine, turn off the visualization with disp=False.

You can evaluate the same multi-language-conditioned model on other tasks. First generate a val set for the task and then specify eval_task=<task_name> with mode=val and checkpoint_type=val_missing (the quickstart doesn't include validation results for all tasks; download all task results from here).

Download

Google Scanned Objects

Download center-of-mass (COM) corrected Google Scanned Objects:

python scripts/google_objects_download.py

Credit: Google.

Pre-trained Checkpoints and Result JSONs

This Google Drive Folder contains pre-trained multi-language-conditioned checkpoints for n=1,10,100,1000 and validation/test result JSONs for all tasks. The *val-results.json files contain the name of the best checkpoint (from validation) to be evaluated on the test set.

Note: Google Drive might complain about bandwidth restrictions. I recommend using rclone with API access enabled.

Evaluate the best validation checkpoint on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=10 \
                       train_demos=100 \
                       exp_folder=cliport_exps \
                       checkpoint_type=test_best \
                       update_results=True \
                       disp=True

Training and Evaluation

The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:

  1. Generate train, val, test datasets with demos.py
  2. Train agents with train.py
  3. Run validation with eval.py to find the best checkpoint on val tasks and save *val-results.json
  4. Evaluate the best checkpoint in *val-results.json on test tasks with eval.py
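The four phases above can be sketched as a list of commands for a single task. This is only an illustration: workflow_commands is a hypothetical helper (not in the repo), it assembles the same overrides used in the single-task examples in this guide, and the val/test set sizes of 100 are taken from those examples. It builds the commands but does not run them.

```python
import shlex


def workflow_commands(task, n_demos=1000, exp_folder="exps"):
    """Return the argv lists for the four phases, in the order they should run."""
    commands = []

    # Phase 1: generate train, val, and test datasets with demos.py.
    for mode, n in (("train", n_demos), ("val", 100), ("test", 100)):
        commands.append(
            ["python", "cliport/demos.py", f"n={n}", f"task={task}", f"mode={mode}"]
        )

    # Phase 2: train an agent with train.py (overrides copied from the training example).
    commands.append([
        "python", "cliport/train.py",
        f"train.task={task}",
        "train.agent=cliport",
        "train.attn_stream_fusion_type=add",
        "train.trans_stream_fusion_type=conv",
        "train.lang_fusion_type=mult",
        f"train.n_demos={n_demos}",
        "train.n_steps=201000",
        f"train.exp_folder={exp_folder}",
        "dataset.cache=False",
    ])

    # Phase 3 (val, all checkpoints) and phase 4 (test, best checkpoint) with eval.py.
    for mode, checkpoint_type in (("val", "val_missing"), ("test", "test_best")):
        commands.append([
            "python", "cliport/eval.py",
            f"eval_task={task}",
            "agent=cliport",
            f"mode={mode}",
            "n_demos=100",
            f"train_demos={n_demos}",
            f"checkpoint_type={checkpoint_type}",
            f"exp_folder={exp_folder}",
        ])
    return commands


if __name__ == "__main__":
    for argv in workflow_commands("stack-block-pyramid-seq-seen-colors"):
        print(shlex.join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) from $CLIPORT_ROOT if you want to execute the phases one after another.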

Dataset Generation

Single Task

Generate a train set of 1000 demonstrations for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:

python cliport/demos.py n=1000 \
                        task=stack-block-pyramid-seq-seen-colors \
                        mode=train 

You can also do a sequential sweep with -m and comma-separated params task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors. Use disp=True to visualize the data generation.
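As a rough illustration of how such a sweep command is put together, the snippet below joins several task names into one comma-separated task= override and adds -m. sweep_command is a hypothetical helper written for this example, and placing -m directly after the script name is an assumption about argument order.

```python
import shlex


def sweep_command(tasks, n, mode):
    """Build a demos.py command that sweeps sequentially over `tasks`."""
    task_list = ",".join(tasks)  # comma-separated values, one run per task
    return [
        "python", "cliport/demos.py", "-m",
        f"n={n}", f"task={task_list}", f"mode={mode}",
    ]


if __name__ == "__main__":
    argv = sweep_command(
        ["towers-of-hanoi-seq-seen-colors", "stack-block-pyramid-seq-seen-colors"],
        n=1000,
        mode="train",
    )
    print(shlex.join(argv))
```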

Full Dataset

Run generate_dataset.sh to generate the full dataset and save it to $CLIPORT_ROOT/data:

sh scripts/generate_dataset.sh data

Note: This script is not parallelized and will take a long time (maybe days) to finish. The full dataset requires ~1.6TB of storage, which includes both language-conditioned and demo-conditioned (original TransporterNets) tasks. It's recommended that you start with single-task training if you don't have enough storage space.

Single-Task Training & Evaluation

Make sure you have a train (n demos) and val (100 demos) set for the task you want to train on.

Training

Train a cliport agent with 1000 demonstrations on the stack-block-pyramid-seq-seen-colors task for 200K iterations:

python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
                        train.agent=cliport \
                        train.attn_stream_fusion_type=add \
                        train.trans_stream_fusion_type=conv \
                        train.lang_fusion_type=mult \
                        train.n_demos=1000 \
                        train.n_steps=201000 \
                        train.exp_folder=exps \
                        dataset.cache=False 

Validation

Iteratively evaluate all the checkpoints on val and save the results in exps/<task>-train/checkpoints/<task>-val-results.json:

python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=val \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=val_missing \
                       exp_folder=exps 

Test

Choose the best checkpoint from validation to run on the test set and save the results in exps/<task>-train/checkpoints/<task>-test-results.json:

python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=test_best \
                       exp_folder=exps 

Multi-Task Training & Evaluation

Training

Train multi-task models by specifying task=multi-language-conditioned, task=multi-loo-packing-box-pairs-unseen-colors (loo stands for leave-one-out or multi-attr tasks) etc.

python cliport/train.py train.task=multi-language-conditioned \
                        train.agent=cliport \
                        train.attn_stream_fusion_type=add \
                        train.trans_stream_fusion_type=conv \
                        train.lang_fusion_type=mult \
                        train.n_demos=1000 \
                        train.n_steps=601000 \
                        dataset.cache=False \
                        train.exp_folder=exps \
                        dataset.type=multi 

Important: You need to generate the full dataset of tasks specified in dataset.py before multi-task training or modify the list of tasks here.

Validation

Run validation with a trained multi-language-conditioned multi-task model on stack-block-pyramid-seq-seen-colors:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=val \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=val_missing \
                       type=single \
                       exp_folder=exps 

Test

Evaluate the best checkpoint on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=test_best \
                       type=single \
                       exp_folder=exps 

Disclaimers

  • Code Quality Level: Tired grad student.
  • Scaling: The code only works for batch size 1. See #issue1 for reference. In theory, there is nothing preventing larger batch sizes other than GPU memory constraints.
  • Memory and Storage: There are lots of places where memory usage can be reduced. You don't need 3 copies of the same CLIP ResNet50 and you don't need to save its weights in checkpoints since it's frozen anyway. Dataset sizes could be dramatically reduced with better storage formats and compression.
  • Frameworks: There are lots of leftover NumPy bits from when I was trying to reproduce the TransporterNets results. I'll try to clean up when I get some time.
  • Rotation Augmentation: All tasks use the same distribution for sampling SE(2) rotation perturbations. This obviously leads to issues with tasks that involve spatial relationships like 'left' or 'forward'.
  • Evaluation Runs: In an ideal setting, the evaluation metrics should be averaged over 3 or more repetitions with different seeds. This might be feasible if you are working just with multi-task models.
  • Duplicate Training Sets: The train sets of some *seen and *unseen tasks are identical, and only the val and test sets differ for purposes of evaluating generalization performance. So you might not need two duplicate train sets or train two separate models.
  • Other Limitations: Check out Appendix I in the paper.

Notebooks

Check out Kevin Zakka's Colab for zero-shot detection with CLIP. This notebook might be a good way of gauging what sort of visual attributes CLIP can ground with language. Note, however, that CLIPort does NOT do "object detection"; it directly "detects actions".

Other Todos

  • Dataset Visualizer
  • Affordance Heatmap Visualizer
  • Evaluation Results Plot

Docker Guide

Install Docker and NVIDIA Docker.

Modify docker_build.py and docker_run.py to your needs.

Build

Build the image:

python scripts/docker_build.py 

Run

Start container:

python scripts/docker_run.py --nvidia_docker
 
  cd ~/cliport

Use scripts/docker_run.py --headless if you are on a headless machine like a remote server or cloud instance.

Real-Robot Training FAQ

How much training data do I need?

It depends on the complexity of the task. With 5-10 demonstrations the agent should start to do something useful, but it will often make mistakes by picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 1, 10, 100, and 1000 demonstrations.

Why doesn't the agent follow my language instruction?

This means either there is some sort of bias in the dataset that the agent is exploiting, or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the input, then it's going to be hard for the agent to figure out what you mean.

Does CLIPort predict height (z-values) of the end-effector?

CLIPort does not predict height values. You can either: (1) come up with a heuristic based on the heightmap to determine the height position, or (2) train a simple MLP like in TransporterNets-6DOF to predict z-values.

Shouldn't CLIP help in zero-shot detection of things? Why do I need to collect more data?
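One possible version of option (1), as a sketch: read the tallest heightmap value in a small window around the predicted pixel and add a clearance. The function name, the window size, the clearance parameter, and the [row][col] layout of the heightmap are all assumptions made for this example; tune them to your gripper and camera setup.

```python
def height_at_pixel(heightmap, pixel, window=2, clearance=0.0):
    """Estimate a z-value from a heightmap around a predicted (row, col) pixel.

    heightmap: 2D sequence of heights indexed as heightmap[row][col]
    pixel:     (row, col) predicted by the agent
    window:    half-width of the square neighborhood to inspect, in pixels
    clearance: offset added to the tallest value found (same units as the heightmap)
    """
    rows, cols = len(heightmap), len(heightmap[0])
    r0, c0 = pixel
    # Clip the window so it stays inside the heightmap.
    r_lo, r_hi = max(0, r0 - window), min(rows, r0 + window + 1)
    c_lo, c_hi = max(0, c0 - window), min(cols, c0 + window + 1)
    tallest = max(
        heightmap[r][c] for r in range(r_lo, r_hi) for c in range(c_lo, c_hi)
    )
    return tallest + clearance


if __name__ == "__main__":
    # Toy 5x5 heightmap with one raised cell standing in for an object.
    toy = [[0.0] * 5 for _ in range(5)]
    toy[2][3] = 0.04
    print(height_at_pixel(toy, (2, 2), window=1, clearance=0.01))
```

Taking the maximum over a window rather than the single pixel value makes the estimate less sensitive to small localization errors and to holes in the depth data, at the cost of overestimating the height near taller neighbors.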

Note that CLIPort is not doing "object detection". CLIPort fine-tunes CLIP's representations to "detect actions" in SE(2). CLIP by itself has no understanding of actions or affordances; recognizing and localizing objects (e.g. detecting hammer) does not tell you anything about how to manipulate them (e.g. grasping the hammer by the handle).

What are the best hyperparams for real-robot training?

The default settings should work well. Recently, though, I have been experimenting with FiLM (Perez et al., 2017) to fuse language features, inspired by BC-0 (Jang et al., 2021). Qualitatively, FiLM seems better for reading text etc., but I haven't conducted a full quantitative analysis. Try it out yourself with train.agent=two_stream_clip_film_lingunet_lat_transporter (non-residual FiLM).

How to pick the best checkpoint for real-robot tasks?

Ideally, you should create a validation set with heldout instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals.
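A minimal sketch of that selection step is shown below. Everything here is hypothetical scaffolding rather than repo code: the function names, the ((x, y), theta_deg) action format, and especially the way the two errors are combined into one score with rot_weight, which is an assumption because the text above does not say how to trade them off.

```python
import math


def translation_error(pred_xy, true_xy):
    """Euclidean distance between predicted and ground-truth positions."""
    return math.dist(pred_xy, true_xy)


def rotation_error(pred_deg, true_deg):
    """Smallest absolute angle between two headings, in degrees (0 to 180)."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_errors(predictions, targets):
    """Average both errors over paired lists of ((x, y), theta_deg) actions."""
    t_errs = [translation_error(p[0], t[0]) for p, t in zip(predictions, targets)]
    r_errs = [rotation_error(p[1], t[1]) for p, t in zip(predictions, targets)]
    return sum(t_errs) / len(t_errs), sum(r_errs) / len(r_errs)


def best_checkpoint(errors_by_checkpoint, rot_weight=1.0):
    """Pick the checkpoint with the lowest combined error.

    errors_by_checkpoint: name -> (mean translation error, mean rotation error)
    rot_weight: assumed trade-off between the two errors; adjust for your units.
    """
    def score(name):
        t_err, r_err = errors_by_checkpoint[name]
        return t_err + rot_weight * r_err

    return min(errors_by_checkpoint, key=score)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; these are not measured results.
    errors = {"ckpt_a": (4.0, 12.0), "ckpt_b": (3.0, 5.0)}
    print(best_checkpoint(errors))
```

If translation is measured in pixels and rotation in degrees, the two terms are not directly comparable, so it is worth checking that the chosen checkpoint does not change much across a few values of rot_weight.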

Why is the agent confusing directions like 'forward' and 'left'?

By default, training samples are augmented with SE(2) rotations sampled from N(0, 60 deg). For tasks with rotational symmetries (like moving pieces on a chessboard) you need to be careful with this rotation augmentation parameter.
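To get a feel for how large these perturbations are, the sketch below samples angles from a zero-mean Gaussian and counts how often a sample rotates the scene by more than 45 degrees. It assumes that the 60 deg in N(0, 60 deg) is a standard deviation and that "forward" is the +x axis with "left" at +y; both are assumptions for illustration, and the code is independent of the actual augmentation code in the repo.

```python
import math
import random


def sample_rotation_deg(rng, sigma_deg=60.0):
    """Draw one rotation perturbation in degrees (sigma assumed to be a std)."""
    return rng.gauss(0.0, sigma_deg)


def rotate(vec, theta_deg):
    """Rotate a 2D vector counter-clockwise by theta_deg."""
    t = math.radians(theta_deg)
    x, y = vec
    return (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))


def fraction_beyond(threshold_deg, n=100_000, sigma_deg=60.0, seed=0):
    """Monte Carlo estimate of P(|rotation| > threshold_deg)."""
    rng = random.Random(seed)
    hits = sum(
        abs(sample_rotation_deg(rng, sigma_deg)) > threshold_deg for _ in range(n)
    )
    return hits / n


if __name__ == "__main__":
    forward = (1.0, 0.0)  # assumed convention: +x is forward, +y is left
    print("forward rotated by 90 deg:", rotate(forward, 90.0))
    print("fraction of samples beyond 45 deg:", fraction_beyond(45.0))
```

Under these assumptions, a little under half of the augmented samples are rotated by more than 45 degrees, at which point a "forward" motion in the original image points closer to "left" or "right" in the augmented one while the instruction text stays the same. Reducing the spread for direction-sensitive tasks is one way to limit this.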

Acknowledgements

This work uses code from the following open-source projects and datasets:

Google Ravens (TransporterNets)

Original: https://github.com/google-research/ravens
License: Apache 2.0
Changes: All PyBullet tasks are directly adapted from the Ravens codebase. The original TransporterNets models were reimplemented in PyTorch.

OpenAI CLIP

Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to CLIP-ResNet50 to save intermediate features for skip connections.

Google Scanned Objects

Original: Dataset
License: Creative Commons BY 4.0
Changes: Fixed center-of-mass (COM) to be geometric-center for selected objects.

U-Net

Original: https://github.com/milesial/Pytorch-UNet/
License: GPL 3.0
Changes: Used as is in unet.py. Note: This part of the code is GPL 3.0.

Citations

CLIPort

@inproceedings{shridhar2021cliport,
  title     = {CLIPort: What and Where Pathways for Robotic Manipulation},
  author    = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
  booktitle = {Proceedings of the 5th Conference on Robot Learning (CoRL)},
  year      = {2021},
}

CLIP

@article{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  journal={arXiv preprint arXiv:2103.00020},
  year={2021}
}

TransporterNets

@inproceedings{zeng2020transporter,
  title={Transporter networks: Rearranging the visual world for robotic manipulation},
  author={Zeng, Andy and Florence, Pete and Tompson, Jonathan and Welker, Stefan and Chien, Jonathan and Attarian, Maria and Armstrong, Travis and Krasin, Ivan and Duong, Dan and Sindhwani, Vikas and others},
  booktitle={Proceedings of the 4th Conference on Robot Learning (CoRL)},
  year= {2020},
}

Questions or Issues?

Please file an issue with the issue tracker.
