The spatial world model technology will be presented at CES 2026, which will be held in Las Vegas from January 6-9, 2026. Fujitsu also plans to conduct technical demonstrations at its headquarters throughout fiscal year 2026.

Features of Spatial World Model technology

1. Construction of a spatial model of the environment using 3D scene graphs focused on the interactions between people, robots and objects

In physical environments, the spatial situation changes dynamically as the actors present in the space (i.e., people, robots, etc.) move and interact. Although technologies using camera-captured data to understand these spatial dynamics have been explored, significant differences in the field of view of each camera, and variations in appearance (such as distortions) between fixed and moving cameras, have hindered their real-time application.

For this reason, instead of relying on pixel-level integration, which is highly sensitive to differences in appearance, Fujitsu has developed a technology that uses cameras to evaluate space through 3D scene graphs; that is, hierarchical data structures that organize all objects in physical space as nodes within a graph. This approach minimizes the impact of field of view and distortion, enabling real-time understanding of complex and constantly changing real-world environments.
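To make the idea of a 3D scene graph concrete, the following is a minimal sketch of such a hierarchical structure: objects and actors become nodes under a root "space" node, and interactions are stored as labeled edges between them. The class and relation names here are hypothetical illustrations, not Fujitsu's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in the scene graph: a space, person, robot, or object."""
    name: str
    kind: str                      # e.g. "space", "person", "robot", "object"
    position: tuple = (0.0, 0.0, 0.0)
    children: list = field(default_factory=list)

@dataclass
class SceneGraph:
    """Hierarchy of nodes plus labeled interaction edges between them."""
    root: Node
    edges: list = field(default_factory=list)  # (src_name, dst_name, relation)

    def add_interaction(self, src: Node, dst: Node, relation: str) -> None:
        self.edges.append((src.name, dst.name, relation))

    def find(self, name: str, node: Node = None) -> Node:
        """Depth-first search for a node by name."""
        node = node or self.root
        if node.name == name:
            return node
        for child in node.children:
            found = self.find(name, child)
            if found:
                return found
        return None

# Build a tiny example: a room containing a worker, a robot, and a box.
root = Node("warehouse", "space")
person = Node("worker_1", "person", (2.0, 0.0, 1.0))
robot = Node("amr_1", "robot", (4.0, 0.0, 1.0))
box = Node("box_7", "object", (2.5, 0.0, 1.0))
root.children += [person, robot, box]

graph = SceneGraph(root)
graph.add_interaction(person, box, "reaching_for")
graph.add_interaction(robot, box, "approaching")

print(graph.find("box_7").kind)  # object
```

Because the graph stores symbolic nodes and relations rather than raw pixels, observations from cameras with very different viewpoints can be merged at the level of shared nodes, which is the property the article attributes to this approach.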

2. Prediction of future states and behaviors through modeling of interactions between people, robots and objects

For humans and robots to work together seamlessly, robots must be able to understand the intentions behind human actions and predict future behavior. World model technologies that allow robots to anticipate changes and act in their immediate environment are being extensively researched, but so far they have been limited to modeling only the near environment, failing to capture the dynamic changes that occur across an entire space.

The new method developed by Fujitsu accurately estimates behavioral intentions by interpreting causal relationships arising from various interactions between actors and objects within a space. By using this data to predict future actions, the technology helps avoid collisions and generate optimal cooperative action plans for multiple autonomous robots.
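A simplified sketch of how predicted future states can feed collision avoidance: extrapolate each actor's position a few steps ahead (here with naive linear motion as a stand-in for a learned world model, which is an assumption of this sketch, not the article's method) and flag the first step at which two trajectories come too close, so a robot could replan before that point.

```python
import math

def predict(pos, vel, steps):
    """Linearly extrapolate future 2D positions (stand-in for a learned model)."""
    return [(pos[0] + vel[0] * t, pos[1] + vel[1] * t) for t in range(1, steps + 1)]

def first_collision(path_a, path_b, radius=0.5):
    """Return the first future step at which two paths come within `radius`, else None."""
    for t, (a, b) in enumerate(zip(path_a, path_b), start=1):
        if math.dist(a, b) < radius:
            return t
    return None

# A person walking right and a robot moving left along the same corridor.
person_path = predict((0.0, 0.0), (1.0, 0.0), steps=5)
robot_path = predict((4.0, 0.0), (-1.0, 0.0), steps=5)

t = first_collision(person_path, robot_path)
print(t)  # 2 -> the robot has two steps in which to replan
```

In a real system the extrapolation step would be replaced by the model's intention-aware predictions, and the replanning would coordinate action plans across multiple robots rather than a single pair of paths.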

In tests on a publicly available academic benchmark, the technology improved the accuracy of behavioral intention estimation by up to three times [1].

Context

AI technology, which until now has been developed primarily in digital environments, is beginning to be applied to real-world scenarios. Physical AI is a branch of artificial intelligence in which AI is trained to understand the laws of physics and act autonomously, and it will play a key role in solving various real-world challenges, such as autonomous driving and smart factories. This approach is generating considerable interest as a potential way to help alleviate Japan's growing labor shortage and improve industrial productivity.

However, current applications of physical AI are largely limited to structured environments with defined routes, such as manufacturing plants or logistics warehouses. In homes and offices, where human movements are less predictable and the arrangement of objects changes frequently, AI struggles to assess spatial dynamics, rendering current solutions impractical. Furthermore, in environments requiring collaboration between large numbers of people and robots, cooperation remains complex, because AI systems cannot infer the intentions behind one another's movements.

This new technology is based on Fujitsu's Computer Vision technology, primarily used for analyzing pedestrian traffic in commercial facilities and detecting anomalous behavior for crime prevention, as well as its digital AI technology, including the Fujitsu Kozuchi AI Agent, which autonomously performs tasks alongside humans. It is part of the research efforts of the Spatial Robotics Research Center, which Fujitsu established in April 2025 to bolster its research aimed at creating a new society where humans and robots coexist.

Note

[1] JRDB-Social: Benchmark for estimating human behavior and intentions from images captured by cameras.