Markerless tracking


See also Markerless inside-out tracking, Markerless outside-in tracking, Positional tracking

Introduction

Markerless tracking is a method of positional tracking, the determination of the position and orientation of an object within its environment. Positional tracking is essential in virtual reality (VR) and augmented reality (AR) because it reveals the user's field of view and perspective, which allows the virtual environment to react accordingly or augmented reality content to be placed in alignment with real objects. For a complete motion tracking experience, the tracking system needs to measure movement in six degrees of freedom (three of translation and three of rotation). ^[1] ^[2]
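As an illustration of what six degrees of freedom means in practice, the following minimal sketch stores a pose as three translation values and three rotation angles and applies it to a point. The class name `Pose6DoF`, the field names, and the Z-Y-X (yaw, pitch, roll) rotation order are assumptions chosen for this example; tracking systems often use quaternions or rotation matrices directly.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose6DoF:
    """Hypothetical 6-DoF pose: 3 translation values plus 3 rotation angles (radians)."""
    x: float
    y: float
    z: float
    roll: float   # rotation about the x axis
    pitch: float  # rotation about the y axis
    yaw: float    # rotation about the z axis

    def rotation_matrix(self):
        # Assumed convention: R = Rz(yaw) * Ry(pitch) * Rx(roll).
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cy, sy = math.cos(self.yaw), math.sin(self.yaw)
        return [
            [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
            [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
            [-sp, cp * sr, cp * cr],
        ]

    def transform(self, point):
        """Rotate a 3D point, then translate it."""
        r = self.rotation_matrix()
        t = (self.x, self.y, self.z)
        return tuple(
            sum(r[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
        )


# A quarter turn about the z axis followed by a shift of (1, 2, 3).
pose = Pose6DoF(x=1.0, y=2.0, z=3.0, roll=0.0, pitch=0.0, yaw=math.pi / 2)
moved = pose.transform((1.0, 0.0, 0.0))
```

A tracker that reported only the three rotation angles could follow head turns but not the user walking through the room, which is why all six values are needed.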

While marker-based methods of motion tracking rely on specific optical markers, markerless positional tracking does not require them, which makes it more flexible. It also avoids the need for a prepared environment in which, for example, fiducial markers are placed beforehand. In contrast to marker-based tracking, a markerless approach allows the user to walk freely in a room or a new environment and still receive positional feedback, which widens the range of applications. ^[1] ^[2] ^[3] ^[4] ^[5]

Markerless tracking uses only what the sensors can observe in the environment to calculate the position and orientation of the camera. ^[1] The method depends on natural features instead of specific markers, and it can either use a model-based approach or apply image processing to detect features that provide the data needed to determine position and orientation. ^[5] ^[6]

For AR, a model-based approach might be used to determine the placement of virtual objects relative to real ones. The model of the real object can be encoded as a computer-aided design (CAD) model. Markerless augmented reality tracking would then continuously compare the image it receives with the known 3D model. ^[6]

While markerless tracking is expected to improve VR and AR applications, especially mobile VR and AR, current technological limitations still require a trade-off between precision and efficiency. According to Ziegler (2010), “On the one hand, the more information the application gathers and uses, the more precise is the tracking. On the other hand, the fewer information the calculations have to consider, the more efficient is the tracking. Efficiency is a huge issue for tracking on mobile devices. The available resources are very limited and the tracking cannot even use all of them, as the rest of the application needs processing power too.” ^[1]

Factors affecting markerless tracking

Besides the trade-off between precision and efficiency, the tracking of natural features also presents several challenges, such as dealing with large scenarios, objects with small parts, variable illumination conditions, and materials that have little texture or are reflective or transparent. ^[5] Indeed, Yudiantika et al. (2015) observed several factors that affected the success of object tracking in their AR application:

Shape and texture of the object: Tracking is easier when an object presents a unique shape and texture.
Background of the object: The background color of the object determines the contrast between the object and its environment. Tracking is facilitated when there is a greater contrast between the two.
Room lighting: The intensity of the lighting affects markerless tracking, since the camera needs to properly capture the specific features of the objects and environment.
Light reflection: Since lighting strongly affects tracking, light reflections such as those from glass barriers in a museum can interfere with it.
Type and position of the lights: According to the study, objects should be lit with focused light rather than incandescent light (bulbs) for better tracking. ^[7]

Model-based approach for markerless tracking

One of the first model-based systems was presented by Comport and colleagues in 2003 and sparked the interest of other researchers. In model-based tracking, models of the objects or environments to be tracked are used as references for the tracking system. The models in this kind of system are rendered from different points of view, and there are two basic approaches that use the resulting model images for tracking. ^[1]

Ziegler (2010) explains that one of the approaches “Extracts features from the model images and video-frames. It then compares the features found in a model image with the ones found in a frame. The comparison yields pairs of features which most likely show the same point in the world. These pairs are referred to as correspondences. The tracking system uses the correspondences to estimate the camera’s position and orientation (pose).” A measure of similarity, such as the number of correspondences, is used to evaluate whether the results need refinement by rendering the scene from other points of view. The system continues refining the results until they meet the threshold defined by the similarity measure. ^[1]

The other approach “measures the similarity between a model image and a video-frame. Using the result, the system approximates a camera’s pose. It renders an image with this pose, and repeats this process until the image and the frame are close to equal.” ^[1] Generally, the models consist of edge-features or line-features, with edges being favored since they are easy to find and robust to lighting changes. An advantage of using models is that the system usually becomes more robust and efficient. On the other hand, model-based tracking requires models. This means that, depending on the size of the environment to be tracked, the modelling process can be very time-consuming. ^[1]
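The render-and-compare loop of this second approach can be sketched with a deliberately tiny stand-in. In the example below, the "model" is a one-dimensional edge profile, the "pose" is a single integer pixel shift, and the similarity measure is a sum of squared differences. All of these, along with the function names `render`, `dissimilarity` and `refine_pose`, are assumptions made for illustration; a real system renders a 3D model under a full six-degree-of-freedom pose.

```python
def render(model, shift):
    """Hypothetical renderer: view a 1D edge model from a camera shifted by `shift` pixels."""
    last = len(model) - 1
    return [model[min(max(i - shift, 0), last)] for i in range(len(model))]


def dissimilarity(image_a, image_b):
    """Sum of squared differences; 0 means the two images are equal."""
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b))


def refine_pose(model, frame, shift=0, tolerance=0, max_iters=100):
    """Render with the current pose, compare with the frame, adjust, and repeat."""
    error = dissimilarity(render(model, shift), frame)
    for _ in range(max_iters):
        if error <= tolerance:
            break  # rendered image and frame are close to equal
        # Try the neighbouring poses and keep the one that matches the frame best.
        best = min((shift - 1, shift + 1),
                   key=lambda s: dissimilarity(render(model, s), frame))
        best_error = dissimilarity(render(model, best), frame)
        if best_error >= error:
            break  # no neighbouring pose improves the match
        shift, error = best, best_error
    return shift, error


model = [0] * 10 + [1] * 10   # a single step edge
frame = render(model, 4)      # simulated video frame, true shift is 4
estimated_shift, residual = refine_pose(model, frame)
```

The greedy neighbour search works here because a single edge gives an error that shrinks steadily toward the true shift. With richer models the error surface can have local minima, so practical systems need a good initial pose or a more careful optimiser.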

Image processing for markerless tracking

Image processing applied to markerless tracking uses natural features in the images received to calculate the camera’s pose. One of the first applications that used natural features for tracking purposes was presented by Park et al., in 1998. ^[1]

Ziegler (2010) describes this approach as first extracting features from the frames of a video stream and then finding correspondences between succeeding frames. The camera’s pose is calculated from these correspondences. Features that were not detected in previous frames are stored, and the system calculates their 3D coordinates in order to use them for future correspondence searches. If the system cannot establish a connection to previous frames, tracking fails. ^[1]

If tracking fails, it becomes impossible to determine the change in the camera’s pose between frames. One way to reduce this risk is to extract a large number of features, which increases the probability of having enough useful ones. However, this also increases the computational complexity. Another option is to select the most useful features, those with the highest information content, instead of extracting a large quantity of them. This is not an easy task, especially when the environment is unknown. ^[1]
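One simple way to favor useful features over sheer quantity is sketched below. It assumes that each candidate already carries a numeric score standing in for its information content, and it additionally enforces a minimum spacing so that the selected features are spread over the image. Both the score and the spacing rule are assumptions for this example; the source does not prescribe a particular selection criterion.

```python
def select_features(candidates, max_count, min_distance):
    """Greedily keep the highest-scoring candidates that are not too close together.

    candidates: list of (x, y, score) tuples; a higher score is assumed to mean
    a more informative feature.
    """
    selected = []
    for x, y, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if len(selected) >= max_count:
            break
        far_enough = all(
            (x - sx) ** 2 + (y - sy) ** 2 >= min_distance ** 2
            for sx, sy, _ in selected
        )
        if far_enough:
            selected.append((x, y, score))
    return selected


candidates = [(0, 0, 9.0), (1, 0, 8.0), (10, 10, 5.0), (20, 0, 7.0), (30, 30, 1.0)]
chosen = select_features(candidates, max_count=3, min_distance=5)
```

In this run the second-strongest candidate is dropped because it sits next to the strongest one, so the feature budget is spent on a more widely spread set.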

As a general overview, image processing for markerless tracking proceeds in four steps: the system first detects natural features in the current frame; second, it compares the features with previously observed ones and finds correspondences; third, it approximates the camera’s pose using the correspondences whose 3D positions are already known; finally, it calculates the 3D positions of the correspondences whose 3D positions are still unknown. ^[1]
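The four steps above can be arranged into a per-frame skeleton such as the following. This is only a structural sketch: `MapFeature`, `track_frame` and the four callbacks (`detect`, `match`, `estimate_pose`, `triangulate`) are hypothetical names, and the callbacks used in the usage lines are trivial stand-ins rather than real computer-vision routines.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MapFeature:
    """A previously observed feature; position_3d stays None until it is computed."""
    descriptor: str
    position_3d: Optional[Tuple[float, float, float]] = None


def track_frame(frame, feature_map, detect, match, estimate_pose, triangulate,
                min_known=4):
    # Step 1: detect natural features in the current frame.
    observations = detect(frame)
    # Step 2: find correspondences with previously observed features.
    pairs = match(observations, feature_map)
    known = [(obs, feat) for obs, feat in pairs if feat.position_3d is not None]
    unknown = [(obs, feat) for obs, feat in pairs if feat.position_3d is None]
    # Step 3: approximate the pose from correspondences with known 3D positions.
    if len(known) < min_known:
        return None  # not enough usable correspondences: tracking fails
    pose = estimate_pose(known)
    # Step 4: compute 3D positions for the remaining correspondences.
    for obs, feat in unknown:
        feat.position_3d = triangulate(obs, feat, pose)
    return pose


# Usage with stand-in callbacks; here a "frame" is just a string of descriptors.
def detect(frame):
    return list(frame)


def match(observations, feature_map):
    return [(o, f) for o in observations for f in feature_map if f.descriptor == o]


feature_map = [MapFeature(d, (float(i), 0.0, 1.0)) for i, d in enumerate("abcd")]
feature_map.append(MapFeature("e"))  # seen before, 3D position not yet known
pose = track_frame(
    "abcde", feature_map, detect, match,
    estimate_pose=lambda known: "pose from %d points" % len(known),
    triangulate=lambda obs, feat, pose: (4.0, 0.0, 1.0),
)
```

Passing the four stages in as callbacks keeps the control flow visible without committing to any particular detector, matcher, or pose solver.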

Feature detection

A feature is a point of interest in an image that is distinctive in terms of intensity. Markerless tracking systems automatically detect features for tracking purposes. Ideally, features should be re-observable from different points of view and under various lighting conditions. This property is called repeatability and is very important. Features that are unique and easy to distinguish from their environment and from each other make tracking easier to achieve. Furthermore, invariance is advantageous for the tracking system: it describes the independence of a feature from certain transformations such as rotation or translation. ^[1]

Inaccuracies during feature detection endanger the success of the tracking process. Therefore, feature detection generally favors accuracy over efficiency.
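To make the idea of a point that is distinctive in terms of intensity concrete, the sketch below uses the classic Moravec corner measure. This detector is chosen here only because it is short; the source does not name a specific detector. A pixel scores high only if its neighborhood changes when shifted in every direction, so flat regions and straight edges score zero. The function names and the threshold value are assumptions for this example.

```python
def moravec_score(image, x, y, radius=1):
    """Smallest sum of squared differences between a window and its 8 shifted copies."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]
    scores = []
    for dx, dy in shifts:
        ssd = 0
        for v in range(-radius, radius + 1):
            for u in range(-radius, radius + 1):
                diff = image[y + v + dy][x + u + dx] - image[y + v][x + u]
                ssd += diff * diff
        scores.append(ssd)
    return min(scores)


def detect_features(image, threshold, radius=1):
    """Return (x, y) positions whose corner score reaches the threshold."""
    height, width = len(image), len(image[0])
    margin = radius + 1  # keep shifted windows inside the image
    return [
        (x, y)
        for y in range(margin, height - margin)
        for x in range(margin, width - margin)
        if moravec_score(image, x, y, radius) >= threshold
    ]


# A 10x10 test image containing a bright 4x4 square on a dark background.
image = [[1 if 3 <= x <= 6 and 3 <= y <= 6 else 0 for x in range(10)]
         for y in range(10)]
corners = detect_features(image, threshold=2)
```

On this synthetic image only the four corners of the square pass the threshold. Pixels along the sides of the square are rejected because shifting the window along the edge leaves it unchanged.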

It is necessary to identify and describe features in order to compare them with each other. A description of a feature has to include its neighborhood, the context in which the feature is observable. The values that describe a single pixel, namely its color value and image coordinates, are not enough. ^[1]
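A very simple descriptor that includes the neighborhood is a normalized intensity patch, sketched below. The surrounding pixels are collected, their mean is subtracted, and the result is scaled to unit length, which makes the description insensitive to uniform brightness and contrast changes. This is an illustrative choice under the assumption of a grayscale image stored as a list of rows; practical systems typically use more elaborate descriptors.

```python
import math


def patch_descriptor(image, x, y, radius=1):
    """Describe the pixel at (x, y) by its mean-centered, unit-length neighborhood."""
    values = [
        image[y + v][x + u]
        for v in range(-radius, radius + 1)
        for u in range(-radius, radius + 1)
    ]
    mean = sum(values) / len(values)
    centered = [p - mean for p in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        return [0.0] * len(values)  # flat patch: nothing distinctive to describe
    return [c / norm for c in centered]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 10]]
# The same patch under a brightness and contrast change (2 * p + 5).
brighter = [[2 * p + 5 for p in row] for row in image]

original = patch_descriptor(image, 1, 1)
relit = patch_descriptor(brighter, 1, 1)
```

The two descriptors agree up to floating-point rounding even though every pixel value differs, whereas comparing the center pixel's raw value (5 versus 15) would suggest two different points.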

Correspondences

When two features in different images show the same point in the world, it is called a correspondence. Ziegler (2010) writes that “Markerless tracking applications try to recognize parts of the environment that they have observed before. The correspondence search is responsible for re-observing features. Markerless tracking methods search for correspondences between features that are present in the current frame and features that have been observed previously. A correspondence search can only succeed if at least a part of the current view has been captured before.” ^[1]
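A correspondence search can be sketched as nearest-neighbor matching between the descriptors of the current frame and those observed previously. The example below additionally rejects ambiguous matches with a ratio test, a commonly used heuristic that is an addition of this sketch and not something the source specifies. Descriptors are assumed to be equal-length tuples of numbers, and the name `find_correspondences` and the ratio of 0.8 are illustrative.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def find_correspondences(current, stored, max_ratio=0.8):
    """Pair each current descriptor with its nearest stored descriptor.

    A pair is kept only if the best match is clearly closer than the
    second-best one (ratio test on squared distances).
    Returns a list of (current_index, stored_index) pairs.
    """
    matches = []
    for i, descriptor in enumerate(current):
        ranked = sorted(
            (squared_distance(descriptor, s), j) for j, s in enumerate(stored)
        )
        if not ranked:
            continue  # nothing has been observed before
        best_distance, best_index = ranked[0]
        if len(ranked) > 1:
            second_distance = ranked[1][0]
            if best_distance >= (max_ratio ** 2) * second_distance:
                continue  # ambiguous: two stored features fit almost equally well
        matches.append((i, best_index))
    return matches


stored = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
current = [(0.5, 0.0), (9.0, 1.0), (5.0, 5.0)]
matches = find_correspondences(current, stored)
```

The third descriptor is equally far from all stored ones and is therefore left unmatched, which illustrates why a search can only succeed for parts of the view that have been captured before.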

It should be noted that the correspondences useful for tracking are those whose 3D position is already known. These are the input for the algorithm that calculates position and orientation. When at least four correspondences exist, the estimation of a camera’s pose is possible. ^[1]
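The source does not say which pose algorithm is used. Many solvers, however, look for the pose under which the known 3D points project as closely as possible onto their observed image positions. The sketch below shows only that error measure, not a solver. It assumes an ideal pinhole camera, and the focal length, principal point, and function names are made up for the example.

```python
import math


def project(point, rotation, translation, focal, cx, cy):
    """Pinhole projection of a 3D world point into pixel coordinates."""
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (focal * cam[0] / cam[2] + cx, focal * cam[1] / cam[2] + cy)


def reprojection_error(correspondences, rotation, translation,
                       focal=500.0, cx=320.0, cy=240.0):
    """RMS pixel distance between observed positions and projections of known 3D points.

    correspondences: list of ((u, v), (X, Y, Z)) pairs.
    """
    if len(correspondences) < 4:
        raise ValueError("at least four correspondences are required")
    total = 0.0
    for (u, v), point in correspondences:
        pu, pv = project(point, rotation, translation, focal, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(correspondences))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
points = [(0, 0, 2), (1, 0, 2), (0, 1, 2), (1, 1, 4)]
# Simulate observations taken from the true pose (identity rotation, zero translation).
observed = [(project(p, identity, (0, 0, 0), 500.0, 320.0, 240.0), p) for p in points]

error_true = reprojection_error(observed, identity, (0.0, 0.0, 0.0))
error_wrong = reprojection_error(observed, identity, (0.1, 0.0, 0.0))
```

The true pose reproduces the observations exactly, while a pose that is off by 0.1 units sideways leaves a clearly nonzero error. A pose estimator would adjust rotation and translation to drive this value down.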