3D Vision and Mobile Robotics Research Group

Tangible Augmented Reality

Recently, 3D object detection and scene understanding have become increasingly popular research topics, as they have numerous applications ranging from robotics to augmented reality. Scene understanding offers great potential for Tangible Augmented Reality (TAR) applications. In TAR systems, the virtual objects are attached to real ones, and the real-world objects serve as input devices for user manipulation. This makes the interaction intuitive and such systems easy to use. While most TAR systems rely on real objects with artificial markers, a few are able to use any object with natural features. Even these systems, however, rely on the user to pair the objects manually, which makes setting up the scene time-consuming.

Our main research goal is to create automatic object pairing for TAR systems based on shape information. The idea is to pair real and virtual objects that are similar in shape, so that the visual and tangible experiences match as closely as possible. The pairing method uses only natural visual features; no artificial markers are required. Moreover, our algorithm is also able to take scene-level requirements into account. These requirements are usually unique to the TAR environment (such as the required number of certain virtual objects, or some objects being complementary or mutually exclusive).
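To make the notion of scene-level requirements concrete, the following is a minimal sketch of how such constraints could enter a pairing cost function as penalty terms. All names (`scene_cost`, the penalty weight of 10.0, the dictionary-based encoding) are illustrative assumptions, not the group's actual implementation:

```python
def scene_cost(assignment, similarity, required_counts, exclusive_pairs):
    """Cost of one candidate scene setup (lower is better).

    assignment:      dict real_object -> virtual category (or None if unpaired)
    similarity:      dict (real_object, category) -> shape similarity in [0, 1]
    required_counts: dict category -> minimum number of required instances
    exclusive_pairs: set of frozenset({cat_a, cat_b}) that must not co-occur
    """
    # Reward good shape matches (negative cost for high similarity).
    cost = -sum(similarity.get((r, c), 0.0)
                for r, c in assignment.items() if c is not None)
    placed = [c for c in assignment.values() if c is not None]
    # Penalize missing required instances (weight 10.0 is an arbitrary choice).
    for cat, need in required_counts.items():
        cost += 10.0 * max(0, need - placed.count(cat))
    # Penalize mutually exclusive categories appearing together.
    for pair in exclusive_pairs:
        a, b = tuple(pair)
        if a in placed and b in placed:
            cost += 10.0
    return cost
```

A genetic algorithm could then minimize this cost over candidate assignments, with the penalty weights controlling how strictly the scene-level requirements are enforced.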

Our current method first creates a 3D point cloud from monocular images via structure from motion (SfM) or from RGB-D images. A RANSAC-based algorithm then segments the point cloud into primitive shapes. From these primitives we construct a graph whose nodes are the primitives and whose edges represent the geometric relations between them. A graph node embedding process then produces feature vectors for the nodes, which are fed to a classifier. Finally, the classification scores form the cost function of a genetic algorithm, which determines the final setup of the scene.
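The RANSAC-based primitive segmentation step above can be sketched for the simplest primitive, a plane. This is a generic illustration with NumPy, not the group's implementation; the iteration count and distance threshold are arbitrary assumptions:

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.02, rng=None):
    """Fit a single plane to an (N, 3) point cloud with RANSAC.

    Returns the best (unit normal, point on plane) model and the
    indices of its inliers.
    """
    rng = np.random.default_rng(rng)
    best_inliers = np.array([], dtype=int)
    best_model = None
    for _ in range(n_iters):
        # Sample three points and derive the candidate plane.
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        # Inliers: points within `threshold` of the candidate plane.
        dist = np.abs((points - p0) @ normal)
        inliers = np.flatnonzero(dist < threshold)
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (normal, p0)
    return best_model, best_inliers
```

In a full pipeline, detected primitives would be removed from the cloud and the search repeated, after which the surviving primitives become graph nodes and their geometric relations (e.g. adjacency, relative orientation) become edges.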

The main disadvantage of this approach is that the classifier has to be trained for every virtual object category, which makes development unnecessarily tedious. Our current goal is therefore to replace the classification scores with shape similarity scores. While numerous shape description methods exist, deep convolutional neural networks (CNNs) usually give the best performance. We aim to create an algorithm that uses deep convolutional shape features to determine the final setup of the virtual scene. The planned algorithm first computes multi-scale convolutional features for both the scene and each virtual object. These features are then compared at every position, yielding a “goodness” map. From this map, the algorithm generates object proposals for every virtual object. Finally, a modified version of the aforementioned genetic algorithm determines the final setup.
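The feature-comparison step can be illustrated with a simple cosine-similarity formulation: given a scene feature map and an object descriptor per scale, score every spatial position and average the resulting maps across scales. This is a hedged sketch under assumed shapes (`(H, W, C)` scene features, `(C,)` object descriptors) and a nearest-neighbour resize; the actual comparison and resampling in the planned algorithm may differ:

```python
import numpy as np

def goodness_map(scene_feats, object_feat):
    """Cosine similarity between an object descriptor (C,) and every
    spatial position of a scene feature map (H, W, C)."""
    s = scene_feats / (np.linalg.norm(scene_feats, axis=-1, keepdims=True) + 1e-9)
    o = object_feat / (np.linalg.norm(object_feat) + 1e-9)
    return s @ o  # shape (H, W), values in [-1, 1]

def multi_scale_goodness(scene_pyramid, object_pyramid, out_hw):
    """Average per-scale goodness maps after nearest-neighbour
    resizing to a common (H, W) output grid."""
    maps = []
    for sf, of in zip(scene_pyramid, object_pyramid):
        g = goodness_map(sf, of)
        ys = np.arange(out_hw[0]) * g.shape[0] // out_hw[0]
        xs = np.arange(out_hw[1]) * g.shape[1] // out_hw[1]
        maps.append(g[np.ix_(ys, xs)])
    return np.mean(maps, axis=0)
```

Local maxima of the averaged map would then serve as object proposals, which the modified genetic algorithm scores when searching for the final scene setup.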