Overview

Tuning on instruction-following data has been shown to enhance the capabilities and controllability of language models, but this idea remains less explored in robotics. In this work, we introduce KOSMOS-E, a Multimodal Large Language Model (MLLM) that leverages instruction-following robotic grasping data to enable precise and intricate robotic grasping maneuvers. To this end, we construct a large-scale instruction-following robotic grasping dataset, termed INSTRUCT-GRASP, covering two main aspects: (i) grasping a single object following descriptions of varying granularity, e.g., different angles and parts, and (ii) grasping a specific object in a multi-object environment following specific attributes, e.g., color and shape. Extensive experiments show the effectiveness of KOSMOS-E on robotic grasping tasks across a variety of environments.

Method

KOSMOS-E is a multimodal large language model with new robotic grasping capabilities: it understands multimodal input and follows diverse instructions to generate a numerical grasp pose prediction (grasp center point [x, y] and rotation angle θ), guiding the robot to grasp accurately in both single-object and multi-object scenes.
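Because the grasp pose is emitted as text by the language model, it has to be converted back into numbers before being sent to the robot. Below is a minimal sketch of that step; the `<grasp>[x, y, θ]`-style output template, the `parse_grasp_pose` helper, and the example values are illustrative assumptions rather than the exact format KOSMOS-E decodes.

```python
import re
from dataclasses import dataclass

@dataclass
class GraspPose:
    x: float      # grasp center x, in image pixels
    y: float      # grasp center y, in image pixels
    theta: float  # gripper rotation angle, in degrees

# Hypothetical output pattern: the real KOSMOS-E decoding format may differ.
_GRASP_RE = re.compile(r"\[\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\]")

def parse_grasp_pose(model_output: str) -> GraspPose:
    """Extract a numeric grasp pose (x, y, theta) from the model's text output."""
    match = _GRASP_RE.search(model_output)
    if match is None:
        raise ValueError(f"No grasp pose found in: {model_output!r}")
    x, y, theta = (float(g) for g in match.groups())
    return GraspPose(x=x, y=y, theta=theta)

# Example usage with an assumed model response:
pose = parse_grasp_pose("Grasp the mug by its handle: [182.5, 240.0, 35.0]")
print(pose)  # GraspPose(x=182.5, y=240.0, theta=35.0)
```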


Dataset: INSTRUCT-GRASP

We build the INSTRUCT-GRASP dataset on top of the Cornell Grasping Dataset. It comprises three components, Non-Instruction (Non), Single-Object (Single), and Multi-Object (Multi), covering 8 kinds of instructions. In total it contains 1.8 million grasping samples: 250k unique language-image non-instruction samples and 1.56 million instruction-following samples, of which 654k pertain to the single-object scene and 654k to the multi-object scene.


- Purpose: existing grasping datasets lack language instructions and focus only on visual information; INSTRUCT-GRASP adds instruction-following supervision.
- Total Size: Non-Instruction: 250k; Instruction-Following: 1.56M (654k for single-object, 654k for multi-object)
- Instruction Variety: Name, Shape, Color, Purpose, Position, Angle, Part, Strategy (an illustrative sample layout is sketched below)
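To make the data layout concrete, here is a minimal sketch of one non-instruction and one instruction-following sample. The field names, file paths, and instruction phrasings are illustrative assumptions; only the eight instruction types and the grasp label (x, y, θ) come from the description above.

```python
# Hypothetical sample records for INSTRUCT-GRASP; field names are assumptions.
INSTRUCTION_TYPES = [
    "name", "shape", "color", "purpose", "position", "angle", "part", "strategy",
]

non_instruction_sample = {
    "image": "cornell/pcd0100r.png",        # RGB image from the Cornell Grasping Dataset
    "instruction": None,                     # no language instruction
    "grasp": {"x": 182.5, "y": 240.0, "theta": 35.0},
}

instruction_sample = {
    "image": "multi_object/scene_0421.png",  # assumed file layout
    "scene": "multi",                        # "single" or "multi"
    "instruction_type": "color",             # one of INSTRUCTION_TYPES
    "instruction": "Grasp the red object.",  # illustrative phrasing
    "grasp": {"x": 96.0, "y": 310.0, "theta": -12.0},
}
```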

Architecture

KOSMOS-E's architecture.


Evaluation Results

1. Non-Instruction Grasping

We follow the five-fold cross-validation setup of previous works, partitioning the dataset into 5 folds and reporting grasp accuracy (%) under both the image-wise (IW) and object-wise (OW) splits.
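As a point of clarification, the image-wise split shuffles all images across folds, while the object-wise split keeps every image of the same physical object in the same fold, so OW tests generalization to unseen objects. Below is a minimal sketch of both splits, assuming each sample carries an object ID; the use of scikit-learn here is purely illustrative, not the paper's implementation.

```python
from sklearn.model_selection import KFold, GroupKFold

def image_wise_folds(samples, n_folds=5, seed=0):
    """IW split: images are distributed across folds at random."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return list(kf.split(samples))

def object_wise_folds(samples, object_ids, n_folds=5):
    """OW split: all images of one object stay in the same fold,
    so each test fold contains objects never seen during training."""
    gkf = GroupKFold(n_splits=n_folds)
    return list(gkf.split(samples, groups=object_ids))
```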


| Method | Modality | IW (%) | OW (%) |
|---|---|---|---|
| GR-ConvNet | RGB-D | 97.70 | 96.60 |
| GG-CNN2 | RGB-D | 84 | 82 |
| RT-Grasp (Numbers Only) | RGB+text | 58.44 ± 6.04 | 50.31 ± 14.34 |
| RT-Grasp (With Prompts) | RGB+text | 69.15 ± 11.00 | 67.44 ± 9.99 |
| KOSMOS-E | RGB+text | 85.19 ± 0.27 | 72.63 ± 4.91 |

2. Instruction-following Grasping

Our model was trained using a combination of non-instruction and instruction-following datasets. In contrast, four other baselines were each trained on a distinct dataset: non-instruction, single-object, multi-object, and a combination of single-object and multi-object datasets. We adopted image-wise grasp accuracy as our primary evaluation metric.
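For reference, Cornell-style evaluations typically accept a predicted grasp when its angle is within 30° of some ground-truth grasp and the two overlap sufficiently. Since KOSMOS-E predicts only a center and an angle, the sketch below uses a center-distance threshold instead of a rectangle IoU; both thresholds (`max_center_dist`, 30°) are illustrative assumptions, not the paper's exact criterion.

```python
import math

def grasp_is_correct(pred, gt_grasps, max_center_dist=30.0, max_angle_diff=30.0):
    """Hedged success check: the prediction matches some ground-truth grasp
    if its center is close enough and its angle is within max_angle_diff degrees.
    (Assumed thresholds; the paper's exact criterion may differ.)"""
    for gt in gt_grasps:
        dist = math.hypot(pred["x"] - gt["x"], pred["y"] - gt["y"])
        # Grasp angles are symmetric under a 180-degree rotation of the gripper.
        diff = abs(pred["theta"] - gt["theta"]) % 180.0
        angle_diff = min(diff, 180.0 - diff)
        if dist <= max_center_dist and angle_diff <= max_angle_diff:
            return True
    return False
```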


| Model | Single: angle | Single: part | Multi: name | Multi: color | Multi: shape | Multi: purpose | Multi: position | Multi: strategy |
|---|---|---|---|---|---|---|---|---|
| KOSMOS-E | 77.98 | 82.35 | 31.43 | 29.56 | 29.49 | 27.93 | 30.44 | 36.16 |
| Non | 79.16 | 76.80 | 0.42 | 4.80 | 1.48 | 0.42 | 7.34 | 2.47 |
| Single | 78.27 | 80.28 | 0.49 | 0.35 | 0.35 | 0.46 | 0.35 | 0.85 |
| Multi | 7.49 | 8.20 | 25.99 | 25.32 | 24.82 | 23.87 | 25.14 | 27.22 |
| Single+Multi | 78.02 | 80.92 | 30.23 | 30.12 | 28.46 | 27.23 | 29.69 | 33.58 |

Other Results

Impact of different training data formats


Different training strategies


Training Data Size


Different grasp representations


Comparison of different grasp representations


Instruction-following Grasping Examples

- Single-Object Scene
- Multi-Object Scene
- Eight Different Instructions
