IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Method Overview

Overview of IMPACT. A toy bear, a coffee cup, and a tomato can sit on the table, and the objective is to reach the tomato can. We use SAM2 to segment the image and label the objects "1", "2", and "3" to assist GPT's visual reasoning. GPT also receives a language template prompt \( \ell \) filled in with object information from SAM2. GPT produces a cost for each of the three objects, and these costs are projected into a voxel grid \( C \) that gives the cost for the robot end-effector to enter each voxel. The costs are high for the coffee cup (GPT-assigned cost: 8) and the tabletop (a fixed cost of 10), while the target object receives a cost of -1. Finally, an off-the-shelf motion planner (RRT*) uses this cost map to guide the robot, which avoids the coffee cup but makes contact with the toy bear to successfully reach the tomato can.
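The projection of per-object costs into the voxel grid \( C \) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_cost_grid` and `path_cost`, the per-object point clouds, the grid layout, and the zero cost for free space are all assumptions. Only the tabletop cost of 10 and the target cost of -1 come from the overview above.

```python
import numpy as np

TABLE_COST = 10.0   # fixed tabletop cost, from the overview
TARGET_COST = -1.0  # cost assigned to the target object, from the overview


def build_cost_grid(object_points, object_costs, origin, voxel_size, shape):
    """Write each object's cost into the voxels its 3D points occupy.

    object_points: dict mapping object ID -> (N, 3) array of points (hypothetical input)
    object_costs:  dict mapping object ID -> scalar cost
    """
    grid = np.zeros(shape, dtype=float)  # assumption: free space costs 0
    origin = np.asarray(origin, dtype=float)
    for obj_id, points in object_points.items():
        pts = np.asarray(points, dtype=float)
        idx = np.floor((pts - origin) / voxel_size).astype(int)
        # Drop points that fall outside the grid bounds.
        inside = np.all((idx >= 0) & (idx < np.array(shape)), axis=1)
        idx = idx[inside]
        # Assumption: if two objects share a voxel, the later one overwrites.
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = object_costs[obj_id]
    return grid


def path_cost(grid, waypoints, origin, voxel_size):
    """Sum the voxel costs visited by a path (waypoints assumed inside the grid)."""
    pts = np.asarray(waypoints, dtype=float)
    idx = np.floor((pts - np.asarray(origin, dtype=float)) / voxel_size).astype(int)
    return float(grid[idx[:, 0], idx[:, 1], idx[:, 2]].sum())
```

A cost-aware planner can compare candidate paths with something like `path_cost`: a path that brushes a low-cost object accumulates less cost than one that passes through a high-cost object such as the coffee cup.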

Cost Map Visualization

Interactive 3D Cost Map

Real World Results

The target object is the tomato can.

Baseline: LAPP

Ours: IMPACT+RRT*

Simulation Results

The target object is the foam brick.

Baseline:

Ours: IMPACT+RRT*

Full Prompt

You are an advanced AI system designed to assist in robotic navigation within cluttered environments.
Your task is to evaluate the safety of various objects in a scene, considering not only the individual characteristics of each object but also their interactions and the physical principles that govern their stability and potential movement.
For each listed object, you will assign a safety score ranging from 0 to 10, where 0 indicates minimal risk of damage upon collision and 10 signifies a high risk of damage.
Your assessment should consider the object's material, size, context, and its relationship with surrounding objects, including any potential domino effects or other interaction-induced risks.

Please provide a safety score for each object, taking into account the following:
- Object Material and Structural Integrity: How fragile or sturdy the object is (e.g., glass vs. plastic vs. metal).
- Stability and Likelihood of Movement: Whether the object is firmly placed or precariously balanced. For instance, a stable stack of books alone might be low risk, but if there is a fragile object on top, the risk increases significantly.
- Potential Domino Effects or Chain Reactions: If collision with one object could cause it to roll, topple, or otherwise move into other objects, increasing the overall risk. For example, a ball might be low risk in isolation, but if it can roll and knock over a wine glass, the effective risk is higher.
- Proximity and Arrangement: How close the object is to other fragile or easily toppled items. Even if an object (like a sugar box) is normally sturdy, being positioned next to a fragile wine glass can raise its overall risk score if it could collide or push the glass.
- Any Other Relevant Physical Interactions: Any additional factors that might increase the risk of damage, such as height above the ground, shape of the surface, or presence of liquids.

Each item is an object labelled in white with its respective ID number. Adhere to the specified format for your response, listing each object followed by its corresponding safety score. Do not include any additional text or output.

Format Requirements:
- The JSON object must be a single string.
- Each key must be the object's ID number in parentheses (e.g., "1"), and each value must be the safety score (an integer between 0 and 10).
- Do not include any text other than the JSON object in that final line. Do not add something like "```json" or "```".
- Do not include newlines, extra punctuation, or object names in the JSON.
- Every key/value should be strictly "ID": score.
- No explanations or reasoning should appear in the final JSON—only the scores.

Input objects:
{object_list}
Your analysis should be comprehensive, considering the dynamic interactions between objects and the physical principles that may affect the outcome of a collision.
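
The format requirements in the prompt above make the model's reply straightforward to validate before it is used to build a cost map. The following is a hedged sketch of such a check; `parse_scores` is a hypothetical helper, not part of the released method, and it assumes the reply's final line is the JSON object with string IDs as keys.

```python
import json


def parse_scores(response_text, expected_ids):
    """Parse a reply such as {"1": 2, "2": 8, "3": 0} and validate it."""
    lines = response_text.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    # The prompt asks for the JSON object on the final line, so read only that.
    try:
        scores = json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"final line is not valid JSON: {exc}") from exc
    if not isinstance(scores, dict):
        raise ValueError("expected a JSON object mapping IDs to scores")
    if set(scores) != set(expected_ids):
        raise ValueError(f"expected IDs {sorted(expected_ids)}, got {sorted(scores)}")
    for obj_id, score in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"score for object {obj_id} is not an integer")
        if not 0 <= score <= 10:
            raise ValueError(f"score for object {obj_id} is outside 0..10")
    return scores
```

Rejecting a malformed reply early, for example one with a missing ID or a score of 11, lets the caller re-query the model rather than plan against an incomplete cost map.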
