Current Implementation of Hands Overlay
XMReality AB1 develops and sells a remote guidance solution that allows users to under-stand and solve problems over vast distances quickly. The solution that XMreality offers, henceforth called XM-application, works on many platforms. The Windows version of this application was used as a starting point, and then modified and extending in this thesis. It can be used both in iOS, Android, Windows-desktop, and on the web. The flowchart for the app can be seen in Figure 2.2. The XM-application has three layers. To ease the development for cross-platform, they have a library written in C++ that executes most heavy calculations such as generating, merging and manipulating frames. The library also handles transfer of frames from one device to another. The library is the same regardless of platform. Connec-tions between users are established with the help of a server, using WebRTC2. All video and sound streams are sent directly between users using web sockets. With this comes a platform-specific front end client which is what the users see and interact with. For Windows, which is the focus of this thesis, this is written in C#. To link the front end client with the library, XM-Reality have implemented a platform-specific bridge layer which handles all communication between the two.
Hand tracking has been researched quite extensively. What technology used is often deter-mined by cost and how flexible hardware setup one needs. Some technologies require multi-ple cameras and used to be really expensive. The method used in this thesis is model-based hand tracking and is described below.
Model-based Hand Tracking
Rehg and Kanade  describe a model-based hand tracking system. In their work, they track the position, rotation, and orientation of many points of interest or states, such as the palm, fingers, or finger joints. They have divided the palm into seven such points of interest, each finger into four and the thumb into five. The points are also connected. The tip of the index finger is connected to the joint closest to it, for example. So a hand pose is represented by 28 states. They then show how this can be used to build a model of the hand. These states are estimated by extracting line and point features from unmarked images of hands. They do this from one or two viewpoints. When trying to find the points of the hand, they use information about the current state generated by previous frames to get a good starting point. Their approach then searches the image along a line close to the starting point, where the previous angle of the finger determines the angle of the line. So to avoid pixel processing, they try to reduce the size of the image search. Searching a smaller part of the image allows one to use a higher sampling rate, which in turn gives more accurate starting points. When having multiple views, partial observations can be used. Points of interest found in one view but missing in another because of it being blocked can still be incorporated, giving a more accurate representation.
Leap Motion Controller
A leap motion controller is a commercial product developed by Ultraleap 3 and is used for hand tracking. It consists of two image sensors and three infrared LEDs.
Figure 2.2: A leap motion controller
Similar to Rehg and Kanade  they use a model-based tracking system. Each finger, including the thumb, has four tracking points. It tracks the base of the finger, the joints, and the tip. The palm has seven points and the wrist five. These points, excluding some for the wrist, can be seen in Figure 2.3.
The controller is limited to a reach of about 80 cm due to decreasing intensity from the IR LED when increasing distance. Data is streamed from the controller to a computer via USB. The data that the controller streams are two grayscale images of the near-infrared light spectrum (one for each camera). This is what is used by the Leap Motion software to get the tracking points and model the hand.
Weichert et al. shows that the Leap Motion Controller has excellent accuracy with the difference between the desired 3D position and the measured position being below 0.2 mm with a static setup and under 2.5 mm when moving the reference point. As a baseline, the accuracy of a human hand is said to be around 0.4 mm.
This section explains the technologies used in the thesis.
C++ is a high-level, general-purpose programming language. It was first created as an ex-tension to C or C with classes. It has expanded steadily over time, and today it supports object-oriented, generic, and functional programming. It also allows for low-level memory manipulation. C++ is a compiled language and is known for being fast. C++ does not have any garbage collector, so one has to manage the memory used. The speed makes it suitable for heavy calculations in image processing.
C# is a strongly typed object-oriented programming language. Microsoft developed it and released it in 2001. It has a garbage collector that takes care of memory clean up and also sup-ports exception handling. Microsoft’s strong support for this language and the control that comes with that often makes this the language of choice when doing front-end in a Windows environment.
C++/CLI (Common Language Infrastructure) is a language specification created by Mi-crosoft. It is used for achieving interoperability between C++ and C# applications. With it, one can have both unmanaged classes in C++ style and managed classes in C# style in the same program. It is essential if one wants to link a C++ library with a front-end built with C#.
TCP and UDP
Both TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are protocols used for sending data over an IP network.
A TCP-connection is made with a three-way handshake, first initiating and then acknowl-edging the connection. After a request has been made each packet is numbered to ensure them being in order and the receiving end must confirm that each packet has been delivered. If the confirmation is missing (if the packet was lost or an error occurred) the packet will be resent. A congestion control is in place to not overflow the receiver if it can handle the num-ber of packets being sent. TCP is all about being reliable and because of that is a bit slower compared to UDP.
UDP does not have a connection. The packets are sent without any checks. They are unnumbered so they can get lost, be out of order, or sent multiple times. If an error occurs that packet is discarded. UDP is all about speed and is often used for streaming.
Unity 5 is a cross-platform game engine developed by Unity Technologies. It can be used to create 2D, 3D, Augmented reality, or Virtual reality games. It can also be used to create simulations, films/animations, or in engineering. The scripting API for Unity uses C#. Unity can ease the use of textures, rendering, and modeling a lot. One can also import and use many finished assets from various sources. One such example is the Software development kit that exists for leap motion in Unity 6. With it, one can get finished models, shaders, and example scripts that can be used as a stepping stone.
Image Format BGRA32
Image formats describe how pixel data is stored in memory. BGRA stands for blue, green, red, and alpha, where alpha is the transparency. Each of those values is stored in a byte, which together takes 32 bits to represent a pixel. In Unity, the programmer can specify what format to be used by textures. In this project, BGRA32 is used.
The following chapter includes details on how the prototype was developed. Firstly describ-ing how the images of the hands are generated. Secondly, how they are transferred and integrated into the XM-application. Finally, the evaluation of the prototype is described.
The prototype used the XM-application, explained in Section 2.2, as a starting point. The structure of the prototype can be seen in Figure 3.1. The client takes all the input from the user and is what the user sees. The library does all the heavy calculations and transfer data between users, while the bridge handles communication between the library and the client. The bridge was modified and integrated into the existing application. The client and library was not modified. Hand Modeler is a Unity-application made from scratch. It takes input from the leap motion controller, converts it to a 3D-model, converts that model to a byte array, and then finally send that byte array to the XM-application using a TCP-connection. Different approaches were investigated to find out in which part of the XM-application the frames that are sent from Hand Modeler should be received. If it was put into the client layer, one would not be able to use functions from the library and if it instead was put directly into the library it would be hard to know when hands overlay should be activated or deactivated. In the bridge layer, one can communicate both with the client layer and the library layer, which was essential when merging frames from Hand Modeler and the streamed video from user cameras. It was therefore put in the bridge.
Building the Hand Model
When creating a 3D model a leap motion controller and its software was used to do the hand tracking. The output was then used in Hand Modeler. Figure 3.2 describes the information flow. The leap motion controller feeds the leap motion software with data that is converted to points of interest representing a model of a hand, as described in Section 2.4. The leap motion SDK1 (Software Development Kit) comes with an API (Application Programming Interface) which can be used when making applications in Unity. It was used to convert the models made by the leap motion software to actual 3D models. The solution used the example project included in the Leap Motion SDK as a starting point and used the two scripts that came with it. The first script is the leap service provider, which handles the communication between the leap motion software and Hand Modeler. It checks if a controller is connected and extracts the data, if any is sent. The second script is for handling the hand models.
Hand Modeler renders the 3D representation of the hands in a scene. However, the models are not meant to be seen in the Hand Modeler application but instead meant to be transmitted to the XM-application and displayed there. To do this, we create a Texture2D2 which repre-sents an image. The width, breadth, and image format was set of this texture to 640, 480, and BGRA32, to match what is used in the XM-application. Pixels are then read from the scene and saved into that texture. The raw pixel data can then be extracted and stored in a byte array. This byte array has a size of 640 * 480 * 4, and are now ready to be transferred to the XM-application.
Transfer Image Data to XM-Client
The image data is sent using a TCP-connection. When working with an image, it is essential to know the format. Although somewhat slower than a UDP-connection, TCP enables one not to worry about packet loss and the programmer can assume that each frame has the same size. It is a trade-off between speed and reliability. In the worst case, package losses on the client side could cause segmentation errors that would crash the application.
The application is, on a separate thread, listening for new connections. Once a connection is established, the frames are streamed as long as the connection is active. The frame rate that these are streamed at can easily be modified and is limited by the available computing power. A problem that occurred was that the receiving side got the contents in the byte array representing the image in reversed order. This resulted in the image being mirrored. To maintain the pixel format a workaround for this was to first reverse the image vertically and then horizontally.
Integrating Hand Frames with XM-Application
In this project, the goal was to get a proof of concept and a demonstration. One could display the hand models in the client layer but would then not fully integrate into the XM-application. It would allow one to see how good the leap motion controller works but not compare it to the old solution. The old solution was used as a starting point to integrate into the XM-application as smoothly as possible. The way it worked is that it took frames from a camera, and applied an image segmentation algorithm. The output of this algorithm was a frame with the hand, and the rest of the image was masked with a solid background (black or white depending on which settings are used). This new frame is then sent to the library, where it is merged with the current video. The XM-library also sends it to the other user if in a call. The approach used here was to replace the image that previously came from a camera and fed to the image segmentation algorithm with the frames coming from Hand Modeler. The frames from Hand Modeler have a solid background and do not need to be manipulated before being sent to the library. If this is done correctly, the old pipeline can still work the way it did and continue to work on all platforms.
Evaluating the Prototype Based on FPS, CPU, and Memory Usage
The XM-application can run on both mobile devices and on desktop. Since it runs on mobile devices, power usage is of interest. It was therefore decided that the visual gains from having a frame rate of more than 30 was not worth the cost in power usage. As a baseline, the old solution has a frame rate of 25. This project tried to get the highest possible frame rate up to a maximum of 30. The frame rate is dependant on how fast images can be sent from Hand Modeler and then received in the XM-application. If the XM-application is unable to keep up, the frames will start stacking up, and the delay starts increasing. This makes it easy to notice and test. It was achieved, so a frame rate of 30 was used for testing.
Performance-wise, two things are interesting, delay and computational needs. Since the goal frame rate was achieved, delay was not a problem. Because of this only the computa-tional needs, CPU and memory usage, was recorded and evaluated. The use cases this thesis is affecting is in-call with hands overlay and is what was evaluated. Hands overlay was tested with or without hands present since this affects the leap motion controller’s power usage. This was compared to the old version with hands overlay. For the purpose of having a baseline, both the new prototype and the old solution was recorded while being idle. The measurements were done on a system with an Intel Core i7-6700HQ processor, 16 GB ram, a Geforce GTX 960m graphics card and running 64-bit Windows 10. The measurements were done using Windows Performance Monitor. Each experiment ran for at least 10 minutes with measurements taken every five seconds.
Table of contents :
1.3 Research Questions
2.1 Related work
2.2 Current Implementation of Hands Overlay
2.3 Hand Tracking
2.4 Leap Motion Controller
2.5 Used technologies
3.2 Building the Hand Model
3.3 Convert Model to Image Data
3.4 Transfer Image Data to XM-Client
3.5 Integrating Hand Frames with XM-Application
3.6 Evaluating the Prototype Based on FPS, CPU, and Memory Usage
4.1 Hand Modeler
5.3 Thesis in a Larger Context
6.1 Can hands overlay be done using a leap motion camera and software?
6.2 How are the frame rate and computational needs compared to the old solution?
6.3 How can it be integrated into the XM-client?
6.4 Future Work