ARTFEED — Contemporary Art Intelligence

Ablation Study of Multimodal Human-Robot Interaction System

other · 2026-05-06

This paper presents a systematic ablation study of a multimodal human-robot interaction system, focused on three modules: the large language model that extracts actions, the visually grounded perception system, and the motion execution controller. The study compares three language models, five perception configurations, and three controllers, then runs a second-stage factorial study over the best candidates. The aim is to clarify which choices affect execution time and success rate, and to identify where engineering gains are likely in future revisions of the system.
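The two-stage design described above can be sketched as a short enumeration of experiment configurations. This is only an illustration under stated assumptions: the summary gives the counts (three, five, three) but not the module names, so every label below is a placeholder, and the choice of baseline and the two-per-module shortlist for the second stage are hypothetical.

```python
from itertools import product

# Placeholder labels: the summary reports only how many options were compared.
LLMS = ["llm_a", "llm_b", "llm_c"]
PERCEPTION = ["percep_1", "percep_2", "percep_3", "percep_4", "percep_5"]
CONTROLLERS = ["ctrl_a", "ctrl_b", "ctrl_c"]


def one_factor_at_a_time(baseline, factors):
    """Vary one module at a time while holding the others at the baseline."""
    configs = [dict(baseline)]
    for name, options in factors.items():
        for option in options:
            if option != baseline[name]:
                configs.append({**baseline, name: option})
    return configs


def factorial(factors):
    """Cross every option of every module with every other."""
    names = list(factors)
    combos = product(*(factors[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


factors = {"llm": LLMS, "perception": PERCEPTION, "controller": CONTROLLERS}

# Assumed baseline: the first option of each module.
baseline = {"llm": LLMS[0], "perception": PERCEPTION[0], "controller": CONTROLLERS[0]}

stage_one = one_factor_at_a_time(baseline, factors)
full_grid = factorial(factors)

# Hypothetical shortlist: two candidates kept per module for the second stage.
shortlist = {"llm": LLMS[:2], "perception": PERCEPTION[:2], "controller": CONTROLLERS[:2]}
stage_two = factorial(shortlist)

print(len(stage_one), len(full_grid), len(stage_two))
```

Under these assumptions, the one-module-at-a-time stage needs 9 configurations, against 45 for a full factorial over all options, which is one plausible reason to restrict the factorial stage to a shortlist.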

Key facts

  • The study extends a previous multimodal human-robot interaction system.
  • Three modules are ablated: large language model, perception system, and controller.
  • Three language models are compared.
  • Five perception configurations are compared.
  • Three controllers are compared.
  • A second-stage factorial study is conducted over the best candidates.
  • The analysis clarifies which choices affect execution time and success rate.
  • The goal is to identify where engineering gains are likely in future revisions.
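
The two metrics named above, success rate and execution time, could be tabulated per configuration along the following lines. The trial records are synthetic values invented purely to make the sketch runnable and are not results from the study. The record layout and the choice to average time over all trials, not only successful ones, are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean

# Synthetic records for illustration only: (configuration label, succeeded, seconds).
trials = [
    ("config_a", True, 12.0),
    ("config_a", False, 15.0),
    ("config_b", True, 9.0),
    ("config_b", True, 11.0),
]


def summarize(records):
    """Group trials by configuration and compute success rate and mean time."""
    grouped = defaultdict(list)
    for config, succeeded, seconds in records:
        grouped[config].append((succeeded, seconds))

    summary = {}
    for config, rows in grouped.items():
        summary[config] = {
            "success_rate": sum(ok for ok, _ in rows) / len(rows),
            "mean_time_s": mean([t for _, t in rows]),
        }
    return summary


summary = summarize(trials)
for config, stats in summary.items():
    print(config, stats)
```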

Entities

Institutions

  • arXiv

Sources