Why do we need AI in VR and AR, and how did we do it?

“Why?”, or rather “What for?”, do we need AI in AR and VR apps?

AR and VR both involve direct and indirect interaction between equipment, software and a person. We recently wrote about the differences between AR and VR technologies in our blog (link to post). Because every person is slightly different while the devices are mass-produced, the interaction between the equipment and the user has to be adjusted individually. Contrary to appearances, this is not easy: classic algorithms have very limited adaptive abilities, not to mention the hardware! Of course, we do not have to adapt the thickness of the sponge mounted in VR glasses to the shape of the face :-). It is about interacting and working with a complex system at the software level.

It would be ideal if the algorithms could adapt to the user by themselves: if they could learn from observations and show a lot of tolerance. After all, each of us can recognise Kozakiewicz’s gesture 🙂 A tip for younger readers: look in Wikipedia for what it means, e.g. here. These are exactly the situations in which adaptation is required and the information is ambiguous, and in which AI successfully replaces classic algorithms.

The Enthusiasm Stage

While planning another project, we decided to use elements of artificial intelligence. As combining AI with VR and AR in a single project is still rather rare, we decided to share our solutions, observations, and results.

In the beginning, the task looked rather straightforward: recognising a user’s dynamic hand gestures. We were only interested in a single hand, and it did not matter whether it was the left or the right one. The key requirement was that recognition should happen with minimal delay. Our system would then automatically verify the intentions and the correctness of the actions performed by the user in the AR or VR world. From our point of view, this was a necessary element of training systems in which users manoeuvre in virtual space, interacting with, for example, construction or heavy mining machines. Initially, we focused on typical actions, i.e., grabbing something (a rod, switch or handle), turning it left or right, pressing a button, and so on. These are simple movements, but at the same time they are the most frequently performed manual interactions with hardware. Thus, the topic did not look too “daunting”; after all, many solutions have similar functions built in, although often in a heavily limited scope.

The Curiosity Stage

Because the system had to work with various AR, VR and other devices that offer hand interaction (from Microsoft Hololens to Leap Motion), we looked for a uniform solution. The ideal would be something like a Hardware Abstraction Layer, which would spare us from preparing separate solutions for each piece of hardware and each SDK. This is how we found the Mixed Reality Toolkit from Microsoft, in which data on the hand’s position (fingers, joints) is delivered in a unified manner, regardless of the equipment in use. Microsoft evidently learned its lesson from the days of drivers for MS-DOS and Windows 95, when developers had to create several versions of their software to work with various hardware configurations.

OK, it is true that some devices do not transmit a complete data set, e.g., due to hardware restrictions, but the data transferred even by the most “limited” devices turned out to be sufficient. The real problem turned out to be not a lack of data but an excess of it, as we describe below.

The MRTK (Mixed Reality Toolkit) transmits data as the position and rotation of every component of a single hand; there are 25 such components per hand (and you have two hands). It can be roughly assumed that they correspond to the joints, that is, the places where the fingers bend. Rotations are transferred as quaternions. The data are expressed in world space, so the MRTK delivers each position and rotation relative to the initial position in the virtual world’s coordinate system. More about this solution can be read here: https://docs.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/features/input/hand-tracking.

Back to School

Our analysis of gestures is local: we are not interested in the position of the hand in space, only in the rotation of the joints. Consequently, we focused on using rotation information only. However, there was one problem here: how do you convert global rotations to local ones when they are expressed as quaternions? A brief review of the available scientific and popular literature suggested that this should not be difficult, so we prepared a theoretical model and developed software that visualised the result of the transformation. Once we combined theory with practice, it turned out to be not so simple. At first nothing worked and no one knew why, and the hands, after their joints had been transformed from global to local rotations, looked like something from a cheap horror movie. Ultimately, however, mathematics started to work for us, and we found the correct transformations.
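To illustrate the kind of transformation involved, here is a minimal sketch in plain Python. It uses the standard relation between a child joint and its parent, local = inverse(parent_global) * child_global, and is not necessarily the exact transformation we ended up with. The (w, x, y, z) component order and all function names are assumptions made for this example; the MRTK and Unity have their own quaternion types.

```python
import math

# Quaternions are (w, x, y, z) tuples. This component order is an assumption
# of the sketch; Unity, for instance, stores them as (x, y, z, w).

def q_mul(a, b):
    """Hamilton product a * b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def q_conj(q):
    """Conjugate; for a unit quaternion this is also its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_normalize(q):
    """Scale q to unit length so that it represents a pure rotation."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def from_axis_angle(axis, degrees):
    """Unit quaternion for a rotation of `degrees` about the unit vector `axis`."""
    half = math.radians(degrees) / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def local_rotation(parent_global, child_global):
    """Rotation of the child joint expressed in its parent's frame.

    Assumes the convention child_global = parent_global * child_local.
    """
    parent = q_normalize(parent_global)
    child = q_normalize(child_global)
    return q_normalize(q_mul(q_conj(parent), child))
```

For example, if a knuckle is rotated 30 degrees about the z axis in world space and the next finger joint is bent a further 45 degrees about its own x axis, `local_rotation` recovers that 45 degree bend no matter how the whole hand is oriented.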

The stream of data flowing from the MRTK and transformed by our software forms what is called a time series, and this is what our artificial intelligence mechanism had to analyse. If the concept of a time series sounds too abstract, imagine the successive frames (images) of a movie in which a hand performs a movement: it is the same thing here, except that instead of pixels we use numeric data.
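The movie-frame analogy can be sketched as a sliding window over the incoming frames. The snippet below is only an illustration: the class name `FrameWindow` and the window length of 30 frames are hypothetical, and each frame is assumed to be a list of 25 quaternions (four numbers each), as described earlier.

```python
from collections import deque

JOINTS = 25   # joints per hand, as stated in the text
WINDOW = 30   # frames kept in the window; a hypothetical value

class FrameWindow:
    """Keeps the most recent frames of joint rotations as a time series."""

    def __init__(self, size=WINDOW):
        # A deque with maxlen silently drops the oldest frame when full.
        self.frames = deque(maxlen=size)

    def push(self, joint_quats):
        """Add one frame: a sequence of JOINTS quaternions (4 numbers each)."""
        if len(joint_quats) != JOINTS:
            raise ValueError("expected %d joints, got %d" % (JOINTS, len(joint_quats)))
        # Flatten 25 x 4 numbers into one row of 100 features.
        self.frames.append([c for q in joint_quats for c in q])

    def ready(self):
        """True once the window holds a full sequence."""
        return len(self.frames) == self.frames.maxlen

    def as_matrix(self):
        """Return the window as a list of rows, oldest frame first."""
        return [list(row) for row in self.frames]
```

Each row of the resulting matrix plays the role of one movie frame, and the whole matrix (here 30 rows of 100 numbers) is one sample of the time series.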

The Panic Stage

Reconnaissance of the battlefield (read: “scientific articles and the current state of knowledge”) indicated that … no one before us had even tried this! Seriously. Nobody. Everyone approaches gestures by analysing a video stream, sometimes from depth cameras. Furthermore, such solutions run mainly on computing clusters with advanced graphics cards (GPUs) and powerful CPUs. Meanwhile, we had to do it on mobile devices with quite limited resources. What’s more, even after discarding the location information, our data stream was still far too large for the limited resources of mobile AR and VR devices (goggles): 25 quaternions per frame. To be clear: a quaternion is a vector of four floating-point numbers, and the full set is delivered to our algorithm several dozen times per second. This can become a bottleneck even for an efficient computing system, not to mention VR and AR goggles.

The Research and Shortcuts Stage

Fuzzy Logic and Fuzzy Inference Systems are suitable for this class of time series analysis and have a long presence in science. Unfortunately, due to their computational complexity and implementation drawbacks, these systems are rarely found in industrial solutions. Over the last decade of development in Deep Learning, however, RNNs (Recurrent Neural Networks) have become more and more popular, particularly their special form, LSTM (Long Short-Term Memory), and more recently their successors, the so-called Transformers.

Initially, we planned to use just one complex, multi-layer LSTM network in our solution to detect gestures in one step.

Unfortunately, LSTM networks require significant computing resources, both at the training stage and when deployed, although still fewer than comparable Fuzzy Logic-based models. As you can read below, we had to introduce advanced data optimisation techniques, reduce the dimensionality of the data and, finally, change our approach to the problem. This was essential because moving the one-step LSTM network to the mobile platform (AR and VR goggles) caused unacceptable latency, and the resulting “playability” placed our solution among the worst VR solutions ever created.

The Euphoria Stage

Setting aside the complexity of the analysis and the time, resources and energy spent on finding an optimal solution, we can proudly write: yes, it works! And it works smoothly. A hybrid approach was necessary, though. First, the time series is analysed for static gestures by a convolutional neural network (CNN). This model is much faster and introduces minimal delay because it uses only a single data frame. A similar approach is used, for example, to recognise objects in popular AI models for image recognition, such as Inception or YOLO. When the convolutional model recognises a characteristic hand shape that could start a gesture sequence we are interested in, a second model, based on a simplified LSTM network, comes into play. For performance reasons, it works on a minimal set of data. Such a hybrid works well on AR and VR devices with limited computing resources, including Oculus Quest and Hololens 2. At the moment, these devices mostly use the CPU for neural network computation (prediction), as most mobile AI frameworks available on the ARM platform have no GPU support.
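The control flow of such a hybrid can be sketched as follows. This is a simplified illustration, not our production code: the class `HybridGestureDetector`, its parameters and the window length are hypothetical, and the two models are passed in as plain callables so that any single-frame classifier (the CNN) and any sequence classifier (the LSTM) can be plugged in.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]

class HybridGestureDetector:
    """Cheap per-frame gate followed by a costlier sequence model."""

    def __init__(
        self,
        is_start_pose: Callable[[Frame], bool],           # stands in for the CNN
        classify_sequence: Callable[[List[Frame]], str],  # stands in for the LSTM
        window: int = 20,                                 # hypothetical length
    ):
        self.is_start_pose = is_start_pose
        self.classify_sequence = classify_sequence
        self.window = window
        self.buffer: Optional[List[Frame]] = None  # None means "idle"

    def feed(self, frame: Frame) -> Optional[str]:
        """Process one frame; return a gesture label once a sequence is classified."""
        if self.buffer is None:
            # Idle: only the fast single-frame model runs.
            if not self.is_start_pose(frame):
                return None
            self.buffer = []
        self.buffer.append(frame)
        if len(self.buffer) < self.window:
            return None
        # Window complete: run the sequence model once and go back to idle.
        sequence, self.buffer = self.buffer, None
        return self.classify_sequence(sequence)
```

The point of this design is that the expensive sequence model runs only after the cheap gate has fired, so most frames cost a single fast prediction.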

Technical Hints

Both models, the convolutional one and the LSTM-based one, had to be trained.

For this purpose, we considered off-the-shelf frameworks for PCs, including Keras, PyTorch and Caffe. In the end we chose Keras, primarily because of its maturity, its many proven commercial applications and its support for mobile devices (including TensorFlow Lite and the possibility of converting a model to other formats). In addition, since its integration with TensorFlow, Keras seems to be the solution most seamlessly supported by NVIDIA CUDA, i.e., calculations using a GPU.

Moving the model from the training platform (PC) to the target device (ARM-based AR / VR goggles) is quite simple in theory, but in practice it is not so easy. In our case there were essentially only two options: exporting the model to TFLite (the dedicated TensorFlow format for mobile devices) or exporting it to the open ONNX format (Open Neural Network Exchange). The first approach can be chosen for platforms on which TensorFlow Lite is available, which unfortunately is not all of them; for example, there is no TFLite for Microsoft Hololens. On the other hand, the TensorFlow Lite library has the advantage of being written in low-level C++, so the AI computing core works efficiently even when the application itself is written in a scripting or interpreted language. However, this also means that binary, precompiled libraries dedicated to each platform are necessary. When exporting to ONNX and then importing the model on a mobile device, in most cases we can use a universal library, because it is written in C#, Java, JavaScript or Python and is available as source code. Unfortunately, this second solution is slower.

In addition, across the entire development chain you must bear in mind that there are many incompatibilities between individual versions of the libraries, both on the “training” side (PC) and on the “user” side (mobile devices). For example, training with TensorFlow 2.4.0 or newer did not allow us to export the model to TFLite (yes, TFLite!). With version 2.4.0 we could export to the ONNX format without a problem, but … without changing the default settings in this “advanced” ONNX feature version, the model cannot be loaded directly into the software for VR / AR goggles by virtually any library available now. This is because … and here we get to the incompatibilities between feature versions again. Whilst ONNX is advertised as a universal portable format…
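The two export paths can be sketched as below. Treat this as an outline under stated assumptions rather than a recipe: it assumes a trained Keras model, TensorFlow and the separate `tf2onnx` package installed on the training PC, and that your library versions are compatible with each other (which, as described above, is not a given). The helper names, the platform rule in `choose_export_format` and the default `opset=11` are hypothetical; the opset number is only a guess at the kind of default setting that may need changing for an older runtime on the goggles.

```python
def choose_export_format(platform: str) -> str:
    """Pick an export format; hypothetical rule based on the text above
    (no TFLite for Hololens, TFLite wherever it is available)."""
    return "onnx" if "hololens" in platform.lower() else "tflite"

def export_tflite(model, path: str) -> None:
    """Convert a trained Keras model to a TFLite flatbuffer file."""
    import tensorflow as tf  # imported lazily; needed only on the training PC

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_bytes = converter.convert()
    with open(path, "wb") as f:
        f.write(tflite_bytes)

def export_onnx(model, path: str, opset: int = 11) -> None:
    """Convert a trained Keras model to ONNX with an explicit opset.

    The opset value is an assumption; pick the highest one that the
    runtime on the target device can actually load.
    """
    import tf2onnx  # separate package, not part of TensorFlow itself

    tf2onnx.convert.from_keras(model, opset=opset, output_path=path)

def export_for(model, platform: str, basename: str) -> str:
    """Export the model in the format chosen for the given platform."""
    fmt = choose_export_format(platform)
    path = basename + "." + fmt
    if fmt == "tflite":
        export_tflite(model, path)
    else:
        export_onnx(model, path)
    return path
```

Pinning the conversion settings explicitly, instead of relying on the defaults of whichever library version happens to be installed, makes the version mismatches described above easier to track down.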

So, as you can see, we had to solve a big box of puzzles, and at this stage the whole project resembled escaping from an “escape room” more than classic, solid development work. Still, this sort of challenge drives our team!

The Podium Stage

Finally, we have a solution that allows us to train efficiently on the PC platform, using GPUs, and then transfer the model to mobile VR and AR devices where it still runs quickly. On the one hand, our models provide high gesture recognition accuracy (exceeding 90%). On the other hand, they work so quickly that the user does not even notice the complexity of the mechanisms behind such an advanced analysis of their gestures. The whole system works close to real time, with delays far below 100 ms.