Overview

Tuning on instruction-following data has been shown to enhance the capabilities and controllability of language models, but this idea remains less explored in robotics. In this work, we introduce KOSMOS-E, a Multimodal Large Language Model (MLLM) that leverages instruction-following robotic grasping data to enable precise and intricate robotic grasping maneuvers. To this end, we construct a large-scale instruction-following robotic grasping dataset, termed INSTRUCT-GRASP, covering two main aspects: (i) grasping a single object following descriptions of varying granularity, e.g., different angles and parts, and (ii) grasping a specific object in a multi-object environment following specific attributes, e.g., color and shape. Extensive experiments show the effectiveness of KOSMOS-E on robotic grasping tasks across a variety of environments.

Method

KOSMOS-E is a multimodal large language model with new robotic grasping capabilities: it understands multimodal input and follows diverse instructions to generate a numerical grasp pose prediction (grasp center point [x, y] and rotation angle θ), guiding the robot to grasp accurately in both single-object and multi-object scenes.
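Because the grasp pose is emitted as text by the language model, it has to be converted back into numbers before being sent to the robot. Below is a minimal sketch of that step; the `<grasp>[x, y, θ]`-style output template, the `parse_grasp_pose` helper, and the example values are illustrative assumptions rather than the exact format KOSMOS-E decodes.

```python
import re
from dataclasses import dataclass

@dataclass
class GraspPose:
    x: float      # grasp center x, in image pixels
    y: float      # grasp center y, in image pixels
    theta: float  # gripper rotation angle, in degrees

# Hypothetical output pattern: the real KOSMOS-E decoding format may differ.
_GRASP_RE = re.compile(r"\[\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\]")

def parse_grasp_pose(model_output: str) -> GraspPose:
    """Extract a numeric grasp pose (x, y, theta) from the model's text output."""
    match = _GRASP_RE.search(model_output)
    if match is None:
        raise ValueError(f"No grasp pose found in: {model_output!r}")
    x, y, theta = (float(g) for g in match.groups())
    return GraspPose(x=x, y=y, theta=theta)

# Example usage with an assumed model response:
pose = parse_grasp_pose("Grasp the mug by its handle: [182.5, 240.0, 35.0]")
print(pose)  # GraspPose(x=182.5, y=240.0, theta=35.0)
```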


Dataset: INSTRUCT-GRASP

We build the INSTRUCT-GRASP dataset on top of the Cornell Grasping Dataset. It comprises three components, Non-Instruction (Non), Single-Object (Single), and Multi-Object (Multi), covering 8 kinds of instructions. In total it contains 1.8 million grasping samples: 250k unique language-image non-instruction samples and 1.56 million instruction-following samples, of which 654k pertain to the single-object scene and 654k to the multi-object scene.


- Purpose: existing grasping datasets lack language instructions and focus only on visual information; INSTRUCT-GRASP adds instruction-following supervision.
- Total Size: Non-Instruction: 250k; Instruction-Following: 1.56M (654k for single-object, 654k for multi-object)
- Instruction Variety: Name, Shape, Color, Purpose, Position, Angle, Part, Strategy (an illustrative sample layout is sketched below)
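To make the data layout concrete, here is a minimal sketch of one non-instruction and one instruction-following sample. The field names, file paths, and instruction phrasings are illustrative assumptions; only the eight instruction types and the grasp label (x, y, θ) come from the description above.

```python
# Hypothetical sample records for INSTRUCT-GRASP; field names are assumptions.
INSTRUCTION_TYPES = [
    "name", "shape", "color", "purpose", "position", "angle", "part", "strategy",
]

non_instruction_sample = {
    "image": "cornell/pcd0100r.png",        # RGB image from the Cornell Grasping Dataset
    "instruction": None,                     # no language instruction
    "grasp": {"x": 182.5, "y": 240.0, "theta": 35.0},
}

instruction_sample = {
    "image": "multi_object/scene_0421.png",  # assumed file layout
    "scene": "multi",                        # "single" or "multi"
    "instruction_type": "color",             # one of INSTRUCTION_TYPES
    "instruction": "Grasp the red object.",  # illustrative phrasing
    "grasp": {"x": 96.0, "y": 310.0, "theta": -12.0},
}
```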

Architecture

KOSMOS-E's architecture.


Evaluation Results

1. Non-Instruction Grasping

We follow the five-fold cross-validation setup of previous works, partitioning the dataset into 5 folds and reporting grasp accuracy (%) under both the image-wise (IW) and object-wise (OW) splits.
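As a point of clarification, the image-wise split shuffles all images across folds, while the object-wise split keeps every image of the same physical object in the same fold, so OW tests generalization to unseen objects. Below is a minimal sketch of both splits, assuming each sample carries an object ID; the use of scikit-learn here is purely illustrative, not the paper's implementation.

```python
from sklearn.model_selection import KFold, GroupKFold

def image_wise_folds(samples, n_folds=5, seed=0):
    """IW split: images are distributed across folds at random."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return list(kf.split(samples))

def object_wise_folds(samples, object_ids, n_folds=5):
    """OW split: all images of one object stay in the same fold,
    so each test fold contains objects never seen during training."""
    gkf = GroupKFold(n_splits=n_folds)
    return list(gkf.split(samples, groups=object_ids))
```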


| Method | Modality | IW (%) | OW (%) |
|---|---|---|---|
| GR-ConvNet | RGB-D | 97.70 | 96.60 |
| GG-CNN2 | RGB-D | 84 | 82 |
| RT-Grasp (Numbers Only) | RGB+text | 58.44 ± 6.04 | 50.31 ± 14.34 |
| RT-Grasp (With Prompts) | RGB+text | 69.15 ± 11.00 | 67.44 ± 9.99 |
| KOSMOS-E | RGB+text | 85.19 ± 0.27 | 72.63 ± 4.91 |

2. Instruction-following Grasping

Our model was trained using a combination of non-instruction and instruction-following datasets. In contrast, four other baselines were each trained on a distinct dataset: non-instruction, single-object, multi-object, and a combination of single-object and multi-object datasets. We adopted image-wise grasp accuracy as our primary evaluation metric.
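For reference, Cornell-style evaluations typically accept a predicted grasp when its angle is within 30° of some ground-truth grasp and the two overlap sufficiently. Since KOSMOS-E predicts only a center and an angle, the sketch below uses a center-distance threshold instead of a rectangle IoU; both thresholds (`max_center_dist`, 30°) are illustrative assumptions, not the paper's exact criterion.

```python
import math

def grasp_is_correct(pred, gt_grasps, max_center_dist=30.0, max_angle_diff=30.0):
    """Hedged success check: the prediction matches some ground-truth grasp
    if its center is close enough and its angle is within max_angle_diff degrees.
    (Assumed thresholds; the paper's exact criterion may differ.)"""
    for gt in gt_grasps:
        dist = math.hypot(pred["x"] - gt["x"], pred["y"] - gt["y"])
        # Grasp angles are symmetric under a 180-degree rotation of the gripper.
        diff = abs(pred["theta"] - gt["theta"]) % 180.0
        angle_diff = min(diff, 180.0 - diff)
        if dist <= max_center_dist and angle_diff <= max_angle_diff:
            return True
    return False
```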


| Model | Single: angle | Single: part | Multi: name | Multi: color | Multi: shape | Multi: purpose | Multi: position | Multi: strategy |
|---|---|---|---|---|---|---|---|---|
| KOSMOS-E | 77.98 | 82.35 | 31.43 | 29.56 | 29.49 | 27.93 | 30.44 | 36.16 |
| Non | 79.16 | 76.80 | 0.42 | 4.80 | 1.48 | 0.42 | 7.34 | 2.47 |
| Single | 78.27 | 80.28 | 0.49 | 0.35 | 0.35 | 0.46 | 0.35 | 0.85 |
| Multi | 7.49 | 8.20 | 25.99 | 25.32 | 24.82 | 23.87 | 25.14 | 27.22 |
| Single+Multi | 78.02 | 80.92 | 30.23 | 30.12 | 28.46 | 27.23 | 29.69 | 33.58 |

Other Results

Impact of different training data formats


Different training strategies


Training Data Size


Different grasp representations


Comparison of different grasp representations


Instruction-following Grasping Examples

- Single-Object Scene
- Multi-Object Scene
- Eight Different Instructions
