Vision-Language Interpreter for Robot Task Planning

ICRA 2024

Abstract

Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from a language instruction and a scene observation, we can drive symbolic planners in a language-guided framework. We propose Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLMs and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the Problem Description Generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy.


Vision-Language Interpreter (ViLaIn)

We propose Vision-Language Interpreter (ViLaIn), a framework that performs task planning from linguistic instructions in an interpretable way. ViLaIn first generates a problem description from the linguistic instruction and a scene observation, and then drives a symbolic planner to find a valid plan based on the generated description. If planning fails, ViLaIn can refine the description using error message feedback from the planner.
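
The loop below is a minimal, hypothetical sketch of this generate-plan-refine process, not the actual ViLaIn implementation: the function names generate_pd and run_planner, their signatures, and the refinement budget are assumptions made for illustration.

from typing import Any, Callable, List, Optional, Tuple


def plan_with_refinement(
    instruction: str,
    observation: Any,  # e.g., an image of the scene
    generate_pd: Callable[[str, Any, Optional[str]], str],
    run_planner: Callable[[str], Tuple[Optional[List[str]], Optional[str]]],
    max_refinements: int = 3,
) -> Optional[List[str]]:
    """Generate a PDDL problem description (PD), run a symbolic planner on it,
    and refine the PD with the planner's error message if planning fails."""
    error_message: Optional[str] = None
    for _ in range(max_refinements + 1):
        # Generate (or regenerate) the PD from the instruction, the scene
        # observation, and any error message from the previous attempt.
        problem_description = generate_pd(instruction, observation, error_message)
        # Run the symbolic planner on the generated PD.
        found_plan, error_message = run_planner(problem_description)
        if found_plan is not None:
            return found_plan  # a valid, interpretable symbolic plan
    return None  # no valid plan found within the refinement budget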


Problem Description Generation (ProDG) Dataset

We propose the Problem Description Generation (ProDG) dataset to evaluate ViLaIn. The ProDG dataset covers three domains: Cooking, Blocksworld, and Hanoi. Each domain consists of 10 tasks, each comprising a linguistic instruction, a scene observation, and a problem description in PDDL.
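
As an illustration, a single task might look like the sketch below. The field names and the concrete Blocksworld problem are hypothetical and only show the shape of a task, not the actual dataset schema.

# Hypothetical example of one ProDG task (field names and the PDDL problem
# are illustrative, not the actual dataset format).
example_task = {
    "domain": "Blocksworld",
    "instruction": "Stack the red block on top of the blue block.",
    "observation": "scene.jpg",  # path to the corresponding scene image
    # Target output: a PDDL problem description that a symbolic planner can solve.
    "problem_pddl": """
(define (problem stack-red-on-blue)
  (:domain blocksworld)
  (:objects red blue)
  (:init (ontable red) (ontable blue) (clear red) (clear blue) (handempty))
  (:goal (on red blue)))
""",
}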





Citation

@inproceedings{shirai2024vilain,
  title={Vision-Language Interpreter for Robot Task Planning}, 
  author={Keisuke Shirai and Cristian C. Beltran-Hernandez and Masashi Hamaya and Atsushi Hashimoto and Shohei Tanaka and Kento Kawaharazuka and Kazutoshi Tanaka and Yoshitaka Ushiku and Shinsuke Mori},
  booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)}, 
  year={2024},
}

Acknowledgements

We would like to thank Hirotaka Kameko for his helpful comments. This work was supported by JSPS KAKENHI Grant Numbers 20H04210 and 21H04910, and JST Moonshot R&D Grant Number JPMJMS2236.