Following Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we provide additional information on PerAct.
- Developed by Shridhar et al. at the University of Washington and NVIDIA. PerAct is an end-to-end behavior cloning agent that learns to perform a wide variety of language-conditioned manipulation tasks. PerAct uses a Transformer that exploits the 3D structure of voxel patches to learn policies with just a few demonstrations per task.
- Architecture: Transformer trained from scratch with end-to-end supervised learning (a minimal illustrative sketch follows this list).
- Trained for 6-DoF manipulation tasks with objects that appear in tabletop scenes.
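
To make the architecture description above concrete, here is a minimal, illustrative PyTorch sketch of feeding 3D voxel patches and language tokens to a Transformer that scores a discretized action. This is not the repository's implementation; every name and dimension below (e.g. `VoxelPatchPolicy`, the patch size and rotation-bin count) is a hypothetical simplification.

```python
import torch
import torch.nn as nn

class VoxelPatchPolicy(nn.Module):
    """Illustrative only: embeds 3D voxel patches plus language tokens and
    scores a discretized action with a standard Transformer encoder."""

    def __init__(self, voxel_size=32, patch_size=8, voxel_feat_dim=10,
                 lang_dim=512, d_model=256, num_rot_bins=72):
        super().__init__()
        # Split the voxel grid into non-overlapping 3D patches and embed each one.
        self.patch_embed = nn.Conv3d(voxel_feat_dim, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        num_patches = (voxel_size // patch_size) ** 3
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.lang_proj = nn.Linear(lang_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Heads: per-voxel translation scores, plus rotation/gripper bins.
        self.trans_head = nn.Linear(d_model, patch_size ** 3)
        self.rot_grip_head = nn.Linear(d_model, 3 * num_rot_bins + 2)

    def forward(self, voxel_grid, lang_tokens):
        # voxel_grid: (B, C, D, H, W); lang_tokens: (B, L, lang_dim)
        patches = self.patch_embed(voxel_grid).flatten(2).transpose(1, 2)  # (B, P, d)
        patches = patches + self.pos_embed
        lang = self.lang_proj(lang_tokens)                                  # (B, L, d)
        x = self.encoder(torch.cat([lang, patches], dim=1))
        patch_out = x[:, lang.shape[1]:]                                    # (B, P, d)
        trans_logits = self.trans_head(patch_out)        # scores for voxels in each patch
        rot_grip_logits = self.rot_grip_head(patch_out.mean(dim=1))
        return trans_logits, rot_grip_logits
```

The actual PerAct agent uses a Perceiver-style Transformer with a small set of latent vectors to keep attention over the full voxel grid tractable; the dense encoder above is only meant to convey the input/output structure.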
- Model date: November 2022
- Primary intended use case: PerAct is intended for robotic manipulation research. We hope the benchmark and pre-trained models will enable researchers to study the capabilities of Transformers for end-to-end 6-DoF manipulation. Specifically, we hope the setup serves as a reproducible framework for evaluating the robustness and scaling capabilities of manipulation agents.
- Primary intended users: Robotics researchers.
- Out-of-scope use cases: Deployment in real-world autonomous systems without human supervision at test time is currently out of scope. Use cases that involve manipulating novel objects, or observations that include people, are not recommended for safety-critical systems. The agent is also intended to be trained and evaluated with English-language instructions.
- Pre-training Data for CLIP's language encoder: See OpenAI's Model Card for full details. Note: We do not use CLIP's vision encoders for any agents in the repo.
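
As a concrete illustration of the language-encoding step, the snippet below uses OpenAI's open-source `clip` package to embed an instruction with CLIP's text encoder. The model variant and which text features the agent actually consumes are assumptions here, not a description of the repo.

```python
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

# Illustrative only: encode a language goal with CLIP's text encoder.
# The variant ("ViT-B/32") and use of the pooled feature are assumptions;
# the vision tower returned by clip.load() is unused, mirroring the note above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

tokens = clip.tokenize(["open the top drawer"]).to(device)  # (1, 77) token ids
with torch.no_grad():
    lang_embedding = model.encode_text(tokens)              # (1, 512) text feature
```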
- Manipulation Data for PerAct: The agent was trained with expert demonstrations: oracle agents in simulation and human demonstrations in the real world. Since the agent is used in few-shot settings with very limited data, it might exploit intended and unintended biases in the training demonstrations. Currently, these biases are limited to objects that appear on tabletops.
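
Because PerAct is a behavior cloning agent, training reduces to supervised learning on demonstrated keyframe actions, which are discretized (a target voxel for translation, plus binned rotation and gripper state) and fit with cross-entropy. The sketch below is a simplified, hypothetical training step that reuses the `VoxelPatchPolicy` stand-in from the earlier sketch; the batch keys and loss terms are assumptions, not the repository's code.

```python
import torch.nn.functional as F

def bc_training_step(policy, optimizer, batch):
    """One illustrative behavior-cloning step on expert keyframes (not the repo's code).

    Assumed batch contents:
      voxel_grid    (B, C, D, H, W)  voxelized RGB-D observation
      lang_tokens   (B, L, lang_dim) language-goal token embeddings
      trans_target  (B,)             flattened index of the expert's target voxel
    """
    trans_logits, _rot_grip_logits = policy(batch["voxel_grid"], batch["lang_tokens"])

    # Translation is treated as classification over discretized voxel positions;
    # rotation and gripper-state bins would get analogous cross-entropy terms.
    loss = F.cross_entropy(trans_logits.flatten(1), batch["trans_target"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```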
- Depends on a sampling-based motion planner.
- Hard to extend to dexterous and continuous manipulation tasks.
- Lacks memory to solve tasks with ordering and history-based sequencing.
- Exploits biases in training demonstrations.
- Needs good hand-eye calibration.
- Doesn't generalize to novel objects.
- Struggles with grounding complex spatial relationships.
- Does not predict task completion.
See Appendix L in the paper for an extended discussion.