Microsoft, in collaboration with a team of academic researchers, has developed a new benchmark called GroundedPlanBench to address a persistent problem in robotics: robots struggle to decide what to do and where to do it at the same time.

Most existing systems split these decisions into two stages: a vision-language model first writes a text plan, and a second model then converts that plan into concrete actions. This separation often introduces errors, even in simple tasks. When a robot is asked to throw away paper cups, for example, it may pick the wrong cup or perform steps it was never asked to do. Such errors become especially common in cluttered environments.

To address this problem, GroundedPlanBench was designed to test the ability of AI models to plan tasks with the precise location of each action. Instead of relying solely on text, each action is linked to a specific location in the image. Basic actions, such as grasping, placing, opening, and closing, are associated with objects or locations, forcing the system to tie its decisions to the physical world.
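To make the idea concrete, here is a minimal sketch of what a spatially grounded action step might look like. The field names, the primitive set, and the validation rule are all assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a grounded action step, in the spirit of
# GroundedPlanBench's format (field names are assumptions, not the real schema).
@dataclass
class GroundedStep:
    action: str            # one of the primitive actions, e.g. "grasp", "place"
    target: str            # object or location label, e.g. "paper_cup_2"
    point: tuple           # (x, y) pixel coordinate in the scene image

PRIMITIVES = {"grasp", "place", "open", "close"}

def validate(step: GroundedStep, width: int, height: int) -> bool:
    """A step counts as grounded only if its action is a known primitive
    and its point actually falls inside the image."""
    x, y = step.point
    return step.action in PRIMITIVES and 0 <= x < width and 0 <= y < height

step = GroundedStep("grasp", "paper_cup_2", (412, 288))
print(validate(step, 640, 480))  # → True; a point outside the image would fail
```

The point of the structure is exactly what the article describes: a plan step without a valid image coordinate is rejected, so the planner cannot emit free-floating text actions.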

The benchmark contains more than 1,000 tasks taken from real-life robot interactions, ranging from direct instructions, such as placing a spoon on a plate, to open-ended instructions, such as setting the table. This diversity is essential because robots often fail when instructions are ambiguous.

In one example, the system was asked to place four towels on a sofa, but it selected the same towel multiple times because the description was ambiguous. Even detailed phrases like "top left towel" were not precise enough for reliable execution. "Vague language leads to plans that cannot be executed," the researchers noted, highlighting a limitation of current systems.

To improve performance, the team developed a new training method called Video-to-Spatially Grounded Planning (V2GP), which learns from videos of robots performing tasks. The method detects when the robot interacts with objects, identifies those objects, and tracks their locations over time. The result is a structured plan that links each action to a specific location.
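The pipeline described above can be sketched roughly as follows. Everything here is an assumption about the shape of the process: the function names are invented, and the stubs stand in for the real perception models that detect interactions and localize objects.

```python
# Illustrative sketch of the V2GP idea: mine grounded plans from robot videos.
# The frame format and all function names are assumptions for illustration.

def detect_interaction(frame):
    """Stub for a perception model: returns (action, object_label)
    when the gripper interacts with an object in this frame, else None."""
    return frame.get("interaction")

def track_location(frame, obj):
    """Stub for an object tracker: returns the object's (x, y) in this frame."""
    return frame["locations"][obj]

def extract_plan(video):
    """Turn a sequence of annotated frames into an ordered, grounded plan."""
    plan = []
    for frame in video:
        hit = detect_interaction(frame)
        if hit:
            action, obj = hit
            plan.append({"action": action, "target": obj,
                         "point": track_location(frame, obj)})
    return plan

video = [
    {"interaction": ("grasp", "towel_1"), "locations": {"towel_1": (120, 300)}},
    {"interaction": None, "locations": {}},
    {"interaction": ("place", "sofa"), "locations": {"sofa": (400, 260)}},
]
print(extract_plan(video))  # two grounded steps: grasp towel_1, place on sofa
```

Run over many videos, a loop like this is how one could assemble the large collection of grounded plans the article describes next.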

Using this method, the team generated more than 40,000 grounded plans, ranging from single-step actions to sequences of up to 26 steps. Models trained on this data became better at choosing the right actions and associating them with the right objects, and made fewer repetitive errors such as acting on the same object multiple times.

However, challenges remain, especially with long, complex tasks and indirect instructions. "Models must reason over a long sequence of actions and maintain consistency across several steps," the researchers note.

Comparisons with traditional systems that separate planning from spatial grounding showed that they struggle with ambiguity, often matching multiple actions to the same object or location. Combining the two steps into a single process reduces this mismatch and keeps decisions about actions and locations coherently linked.
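A toy example (not from the paper) shows why the separated approach repeats objects: when each step is grounded independently, the most salient match wins every time. The scores and labels below are invented for illustration.

```python
# Toy illustration of separate vs. joint grounding. Independent grounding lets
# every step pick its own best match; a joint (here, greedy) pass forbids reuse.

def ground_separately(steps, scores):
    """Each step independently picks its highest-scoring object."""
    return [max(scores[s], key=scores[s].get) for s in steps]

def ground_jointly(steps, scores):
    """Greedy joint grounding: an object, once used, cannot be picked again."""
    used, out = set(), []
    for s in steps:
        best = max((o for o in scores[s] if o not in used),
                   key=lambda o: scores[s][o])
        used.add(best)
        out.append(best)
    return out

# Two "place towel" steps; towel_1 happens to score highest for both.
steps = ["place_1", "place_2"]
scores = {"place_1": {"towel_1": 0.9, "towel_2": 0.7},
          "place_2": {"towel_1": 0.8, "towel_2": 0.6}}
print(ground_separately(steps, scores))  # ['towel_1', 'towel_1'] — a duplicate
print(ground_jointly(steps, scores))     # ['towel_1', 'towel_2']
```

This is only a caricature of the failure mode; the actual unified models couple action and location decisions far more deeply than a greedy constraint, but the duplicate-assignment symptom is the same one the article reports.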

The team notes that future work may combine this approach with models that predict the outcomes of actions before they are executed, which could help robots avoid errors in real time.

The current results illustrate a clear trend in robotics: “Systems that understand both actions and locations are better able to operate effectively in real-world environments.”