
Why do we need AI in VR and AR, and how did we do it?

Published on: 3.11.2021

“Why”, or more precisely, “for what purpose?”

AR and VR are, respectively, indirect and direct forms of interaction between hardware, software, and humans. We have already written about the differences between AR and VR technologies on our blog (here). Because each of us is slightly different while devices are mass-produced, the question arises of how to individually tailor the interaction between hardware and user. Contrary to appearances, this is not so simple, since classic algorithms have very limited adaptive capabilities, not to mention the hardware itself. Of course, we are not referring to adjusting the thickness of the foam in VR goggles to the shape of the face :-). We mean interaction with, and operation of, a complex system at the software level. Ideally, algorithms would adapt themselves to the user or, like humans, learn by observation while tolerating a great deal of variation. After all, any of us can easily recognize the Kozakiewicz gesture 🙂 (a tip for younger readers: look up what this means on Wikipedia, e.g. here). In situations like these, where adaptation is required and the information is ambiguous, AI successfully replaces classic algorithms.

 

When planning our next project, we decided to incorporate elements of artificial intelligence. Since combining AI with both VR and AR in a single project is still rare, we are sharing our solutions, comments, and observations here.

 

Stage of Enthusiasm

The task we set for our team sounded quite prosaic: dynamic recognition of gestures performed by the user with one hand (we focus on a single hand, and it does not matter whether it is the left or the right), with minimal delay. This way, our system could automatically verify the user's intentions and the correctness of actions performed in the virtual world. From our point of view, this was an essential component of training systems in which users practice interacting with machines (construction, mining, or other equipment) in a virtual environment. Initially, we focused on typical operations: grabbing something (a rod, switch, or handle), rotating it left or right, pressing (a button), and similar simple yet most commonly performed manual interactions with equipment. The topic did not look too "threatening" – after all, many solutions have such features built in, albeit often in a heavily limited scope.

 

Stage of Curiosity

Since we additionally assumed that the system must work with various AR, VR, and other hand-interaction devices (ranging from Microsoft HoloLens to Leap Motion), the ideal would be something like a HAL (Hardware Abstraction Layer), so that we would not have to prepare a separate solution for each piece of hardware. Microsoft's MRTK (Mixed Reality Toolkit) came to our aid: it delivers data on hand position (fingers, joints) in a unified way, regardless of which hardware we are using. Microsoft learned its lesson from the MS-DOS and Windows 95 driver era, when developers were cursed with having to create multiple versions of their software just to support different hardware configurations.

OK – it is true that some devices do not transmit the full set of data, for example due to hardware limitations; nevertheless, the data transmitted even by the most “limited” devices turned out to be sufficient. The real problem, however, turned out to be not so much the lack of data as their excess, as you will see shortly.

MRTK transmits data as the position and rotation of all constituent parts of a single hand, 25 in total. Rotation data is transmitted using quaternions. Broadly speaking, they correspond to joints or the so-called “knuckles,” the places where the fingers bend. This data is transmitted in absolute form, meaning that position and rotation are defined relative to the initial position in the virtual world’s coordinate system. You can read more about this solution here: https://docs.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/features/input/hand-tracking
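
To make the data layout concrete, here is a minimal NumPy sketch (our own illustration, not MRTK code) of what a single "frame" of hand data looks like once it reaches the analysis pipeline; the variable names are ours.

```python
import numpy as np

# Illustrative sketch (not MRTK API): one "frame" of hand-tracking data as it
# reaches the analysis side. MRTK reports 25 joints per hand; each joint has a
# 3D position and an absolute rotation stored as a quaternion.
NUM_JOINTS = 25

positions = np.zeros((NUM_JOINTS, 3), dtype=np.float32)  # absolute positions (discarded later)
rotations = np.zeros((NUM_JOINTS, 4), dtype=np.float32)  # absolute rotations as quaternions

# For gesture analysis we keep only the rotations: 25 * 4 = 100 floats per frame.
frame = rotations.reshape(-1)
print(frame.shape)  # (100,)
```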

 

Stage of Return to School

Our gesture analysis is local in nature, so we are not interested in the hand’s spatial position. Consequently, we focused on using rotation information. However, one problem arose here: How to convert global rotations into local ones when they are recorded as quaternions? A brief review of available information in scientific and popular literature indicated it should not be difficult. So we prepared formulas (theory), developed software along with visualization (practice), and … combined theory with practice, which turned out not to be so simple: at first, it seemed that nothing worked and no one knew why, and the hands after transformation looked like something out of a bad horror movie. Ultimately, however, we managed to tame the mathematics.
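
For the curious, here is a minimal sketch of the conversion, assuming unit quaternions stored in (w, x, y, z) order and a known parent joint for every joint in the hand skeleton; the function names and conventions are ours (Unity/MRTK actually store quaternions as (x, y, z, w), so components have to be reordered accordingly).

```python
import numpy as np

def quat_conjugate(q):
    # q = (w, x, y, z); for unit quaternions the conjugate equals the inverse
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    # Hamilton product a * b, both quaternions as (w, x, y, z)
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def global_to_local(q_child_global, q_parent_global):
    # If q_child_global = q_parent_global * q_child_local,
    # then q_child_local = inverse(q_parent_global) * q_child_global.
    return quat_multiply(quat_conjugate(q_parent_global), q_child_global)
```

In our experience, most of the pain hides in conventions: multiplication order, component order, and coordinate handedness.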

The data stream coming from MRTK and transformed by our software creates what we call a time series, and this is exactly what is analyzed by our artificial intelligence mechanism. If the concept of a time series is abstract to you, imagine successive frames of a film showing a moving hand: it is exactly the same here, except instead of pixels we have numerical data.
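
As an illustration (the window length and frame rate below are made up for the example, not our production values), turning the stream into a time series amounts to slicing consecutive frames into overlapping windows:

```python
import numpy as np

def make_windows(frames, window_size=30, stride=1):
    """Slice a stream of per-frame feature vectors into overlapping windows.

    frames: array of shape (num_frames, features),
            e.g. features = 25 joints * 4 quaternion components = 100
    returns: array of shape (num_windows, window_size, features)
    """
    windows = [
        frames[start:start + window_size]
        for start in range(0, len(frames) - window_size + 1, stride)
    ]
    return np.stack(windows)

# Example: 5 seconds of data at 60 Hz, 100 features per frame
stream = np.random.rand(300, 100).astype(np.float32)
X = make_windows(stream, window_size=30)
print(X.shape)  # (271, 30, 100)
```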

 

Stage of Panic

Reconnaissance of the battlefield (i.e., scientific articles and the current state of knowledge) revealed that… no one had done this before us! Seriously. Absolutely no one. Everyone experimenting with gesture recognition uses video streams and, optionally, depth cameras. Moreover, they do it on compute clusters with advanced graphics cards (GPUs) and powerful processors. Meanwhile, we had to do it on mobile devices with fairly limited data. What is more, even after discarding position information, our data stream was still "huge" for the limited resources of mobile AR and VR devices: 25 quaternions (a quaternion, as the name suggests, consists of 4 floating-point values) delivered to the system several dozen times per second. This can choke even an efficient compute system, let alone a mobile-phone-class device.
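
A rough back-of-the-envelope calculation (assuming roughly 60 Hz hand tracking, an illustrative figure) shows what the models have to keep up with:

```python
# Illustrative numbers, not a measured benchmark
joints = 25
quaternion_components = 4          # x, y, z, w
frame_rate_hz = 60

features_per_frame = joints * quaternion_components       # 100 values per frame
features_per_second = features_per_frame * frame_rate_hz  # 6,000 values per second

# A one-second gesture is therefore a 60 x 100 matrix that a recurrent model
# would have to chew through step by step, at interactive latency, on a mobile CPU.
print(features_per_frame, features_per_second)
```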

 

Stage of Brainstorming

Fuzzy Logic and Fuzzy Inference Systems are suitable for time series analysis and have been present in science for quite some time; however, due to their computational complexity and implementation difficulties, they are rarely encountered in industrial solutions. Meanwhile, with the development of Deep Learning, Recurrent Neural Networks (RNNs) have become increasingly popular, especially their special form, Long Short-Term Memory (LSTM) networks, as well as the more recent Transformers (at the time of writing this text, the topic was so new that it had not yet received an equivalent term in Polish). Initially, we planned to apply a complex, multilayered LSTM network and solve the entire problem in one step.
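
For illustration, here is a Keras sketch of the kind of "one network to rule them all" stacked LSTM classifier we had in mind at this stage; the layer sizes, window length, and number of gesture classes are purely illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_GESTURES = 5   # illustrative: grab, rotate left, rotate right, press, none
WINDOW = 30        # frames per analysed window (illustrative)
FEATURES = 100     # 25 joints * 4 quaternion components

# The "solve everything in one step" stacked LSTM we planned at first.
model = models.Sequential([
    layers.Input(shape=(WINDOW, FEATURES)),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_GESTURES, activation="softmax"),
])
model.summary()
```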

Unfortunately, LSTM networks require substantial computational resources, not only during training but also during inference, although still far less than comparable fuzzy logic models. Nevertheless, advanced data optimization, dimensionality reduction, and ultimately a complete change of approach proved necessary, as you will read below: porting a "straight" trained network to a mobile platform resulted in unacceptable lag, and the "playability" of our solution placed us at the tail end of the worst VR experiences ever created by a developer.

 

Stage of Euphoria

Without going into lengthy considerations of how much time, resources, and effort we devoted to finding the optimal solution, we can proudly say: yes, it works! And it works smoothly. Nevertheless, a hybrid approach was necessary: the time series is first analyzed for static gestures by a convolutional network, a model that is significantly faster and introduces minimal latency because it uses only a single "frame" of data. A similar approach is used, for example, in popular image recognition models such as Inception or YOLO, which also rely on convolutional layers. When the convolutional model recognizes a characteristic hand configuration that may potentially start a sequence we are interested in, a second model based on LSTM comes into play; for performance reasons it operates on a very limited dataset. Such a hybrid works well on AR and VR devices (e.g., Oculus Quest and HoloLens 2), which have limited compute resources and rely primarily, or solely, on the CPU when running the networks, since current AI frameworks do not provide computation on the GPUs integrated into these ARM platforms.
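
A sketch of what such a hybrid can look like in Keras (all sizes and class counts are illustrative, not our production configuration): a light convolutional classifier that looks at a single frame of joint rotations to spot candidate hand configurations, and a compact LSTM that is run only on the short window following such a candidate.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_JOINTS, QUAT = 25, 4
NUM_POSES = 4        # illustrative static hand configurations ("trigger" poses)
NUM_GESTURES = 5     # illustrative dynamic gestures
SHORT_WINDOW = 15    # illustrative: only a short window is fed to the LSTM

# Stage 1: light convolutional classifier over a single frame
# (25 joints x 4 quaternion components), used to spot candidate poses.
pose_net = models.Sequential([
    layers.Input(shape=(NUM_JOINTS, QUAT)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(NUM_POSES, activation="softmax"),
])

# Stage 2: compact LSTM, run only when stage 1 reports a candidate,
# classifying a short window of subsequent frames into a dynamic gesture.
gesture_net = models.Sequential([
    layers.Input(shape=(SHORT_WINDOW, NUM_JOINTS * QUAT)),
    layers.LSTM(32),
    layers.Dense(NUM_GESTURES, activation="softmax"),
])
```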

 

Technical Tidbits

Both models, the convolutional one and the LSTM one, had to be trained. For this purpose, we planned to use existing PC frameworks such as Keras, PyTorch, or Caffe. Ultimately, we chose Keras due to its maturity, its substantial number of proven commercial applications, and its support for mobile devices (e.g., TensorFlow Lite and the ability to convert models to other formats). Moreover, Keras integrated with TensorFlow appears to be the solution most stably supported by NVIDIA CUDA, i.e., GPU-accelerated computation.
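
Training itself is then standard Keras fare; below is a minimal sketch with hypothetical data arrays (the real training set, of course, consists of labelled windows of recorded hand rotations, not random numbers):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical labelled data: windows of hand rotations and integer gesture labels.
X_train = np.random.rand(1000, 15, 100).astype(np.float32)
y_train = np.random.randint(0, 5, size=1000)

model = models.Sequential([
    layers.Input(shape=(15, 100)),
    layers.LSTM(32),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
```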

Transferring a trained model from the training platform (PC) to the target solution (an AR/VR device) is theoretically quite simple, but in practice it can be troublesome. In our case, we basically had only two options: exporting the trained model to TFLite (the dedicated TensorFlow Lite format) or to the open ONNX (Open Neural Network Exchange) format. The first approach works for platforms where TensorFlow Lite is available, which unfortunately is not all of them (it is not available for HoloLens, for example). On the other hand, the TensorFlow Lite library has the advantage of being written at a low level in C++, so even when the application is created in a scripting or interpreted language, the computational core runs directly on the processor. This also means that dedicated binary libraries are required for each platform. When exporting to ONNX and later importing on a mobile device, we can in most cases use a universal library, available as source code in C#, Java/JavaScript, or Python. Unfortunately, this second solution is decidedly slower, as is typical for interpreted languages.

Additionally, when using the entire development chain, one must be aware of the many incompatibilities between library versions, both on the "training" side (PC) and on the "consuming" side (mobile devices), as well as between the two. For example, training with TensorFlow 2.4.0 or newer does not allow the same code to export our model to the TFLite format (yes, TFLite!), because the Google developers responsible for the export mechanism apparently overslept and did not manage to "finish" the exporter and adapt it to the latest library versions. We can export to ONNX from version 2.4.0 without issues, but… with the default export settings, such an "advanced" ONNX model cannot be loaded into the software for VR/AR goggles by practically any library, because we again run into version incompatibility, this time at the level of "too new" features in a portable format that is, by the way, advertised as open and universal.

So, as you can see, we had quite a puzzle to solve, and at this stage the whole project resembled an Escape Room challenge more than classic, solid development. But we won't hide that such challenges are what drive our team.
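
For reference, here is a sketch of both export paths using the standard TFLiteConverter and the tf2onnx package; the exact versions and flags that work together depend on the library combination, which is precisely the puzzle described above. Pinning the ONNX opset to an older value is an example of the "non-default setting" needed so that mobile-side libraries can still load the model; the model below is only a stand-in for the trained gesture network.

```python
import tensorflow as tf
import tf2onnx

# Stand-in for the trained gesture model (shapes are illustrative).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(15, 100)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(5, activation="softmax"),
])

# Path 1: TensorFlow Lite, for platforms where the TFLite runtime exists.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
with open("gesture_net.tflite", "wb") as f:
    f.write(tflite_bytes)

# Path 2: ONNX, for platforms without TFLite (e.g. consumed via a C# runtime).
# Pinning the opset keeps the exported model loadable by older libraries.
spec = (tf.TensorSpec((None, 15, 100), tf.float32, name="input"),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=9)
with open("gesture_net.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```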

 

Podium Stage

Finally, we have a pathway and a solution that allow training on a PC platform, and we have models that, on the one hand, provide high gesture recognition accuracy (exceeding 90%) and, on the other, operate so quickly that the user is not even aware of the complexity of the mechanisms behind the advanced analysis of their gestures. The whole thing runs practically in real time, with latencies below 100 ms (and most often much lower).
