Virtual augmentations make it possible to visually extend the perceived reality. As a rule, this extension is limited to inanimate objects or non-human avatars. However, depth cameras enable the capture of bodies and their movements. The goal of this Praktikum is to couple a depth camera (Microsoft Kinect) with a stereoscopic augmented reality application (based on the HTC Vive) so that the person captured by the camera appears as an image in the augmented reality. This opens up various application scenarios. For example, the captured person does not necessarily have to be in the same place, which makes it possible to create an AR chat room. Alternatively, we can use this approach to create a virtual hologram of ourselves. Both scenarios are to be investigated in the course of the Praktikum. The necessary equipment will be provided by us.
1. What is Repetitive Training?
Repetitive training (Fig. 1) is a special approach to training a neural network. In this approach, we divide our relatively large database into smaller subsets and then train a network on each of these subsets in sequence. During training, we employ the resulting network from the previous subset as the backbone for training on the following subset. With this approach, we aim to address the issue of premature learning saturation.
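To make the scheme concrete, the following is a minimal sketch of the idea in PyTorch-style Python. The subset split, the `train_one_stage` helper, and the checkpointing are our illustrative assumptions, not the exact procedure of the paper.

```python
# Minimal sketch of repetitive training (helper names are hypothetical).
import torch
from torch.utils.data import random_split

def repetitive_training(model, full_dataset, num_subsets, train_one_stage):
    """Split the large dataset into subsets and train on them one after another,
    reusing the weights learned on the previous subset as the starting point."""
    subset_len = len(full_dataset) // num_subsets
    lengths = [subset_len] * num_subsets
    lengths[-1] += len(full_dataset) - sum(lengths)  # absorb the remainder
    subsets = random_split(full_dataset, lengths)

    for i, subset in enumerate(subsets):
        # The network trained on subset i serves as the backbone for subset i+1:
        # we simply continue from the current weights instead of reinitializing.
        model = train_one_stage(model, subset)   # hypothetical training routine
        torch.save(model.state_dict(), f"stage_{i}.pt")
    return model
```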
2. What is Learning Saturation?
Learning saturation is a state of training in which the training loss does not decrease meaningfully. One reason for this is that the hidden units predominantly produce values close to the asymptotic ends of the activation function's range. In the loss curve, this state typically appears as undirected fluctuations (i.e., saturation) that follow an initial drop and a somewhat steady (and directed) decrease. Such behavior can have many causes, such as high initial weights, a small network size (i.e., underfitting), or an inappropriate learning rate. However, if these parameters are chosen carefully, (most) networks reach this state only in the final stages of their training, in which case such saturation is not a problem.
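As a small illustration of the "asymptotic ends" remark (our own example, not from the text above): for a sigmoid unit, the gradient vanishes once the pre-activation grows in magnitude, so backpropagation barely changes the weights anymore.

```python
# Illustration (ours): sigmoid outputs and gradients for growing pre-activations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.5, 2.0, 5.0, 10.0]:
    y = sigmoid(x)
    grad = y * (1.0 - y)          # derivative of the sigmoid
    print(f"x={x:5.1f}  output={y:.4f}  gradient={grad:.6f}")
# The gradient shrinks toward zero as the unit saturates,
# so weight updates through this unit stall.
```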
3. What is Premature Learning Saturation?
When relatively large data sets must be employed (for example, when using synthetic data as training sets), this learning saturation can occur prematurely. This means that the network enters the saturation state in the early stages/steps of training. This is problematic because it appears to cause the networks to learn merely the features of the initial portion of the training set and to almost ignore the rest of the data. For a more detailed explanation of these three terms, and for how repetitive training helped mitigate premature learning saturation in hand segmentation (on specific examples) when the majority of the training data are synthetic images, see [Dadgar and Brunnett, 2023].
4. Our Proposed Praktikum
Here, we propose a Praktikum, carried out in a small team, to investigate the influence of repetitive training on a broader and deeper scale. On the one hand, we attempt to broaden the scope of this training scheme by considering not only segmentation networks (as in the paper mentioned above), but also detection, pose estimation, and classification networks (preferably targeting hands, although other objects of interest are negotiable). On the other hand, we deepen the scope by considering not only synthetic images, but also real images, and by comparing the scheme against established learning rate schedulers, such as cosine schedulers.
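As a point of reference for that scheduler comparison, PyTorch already ships a cosine annealing schedule; a minimal sketch of wiring it up (model, optimizer, and step counts are placeholder choices of ours) could look as follows:

```python
# Sketch of a cosine learning-rate schedule as a comparison baseline.
import torch

model = torch.nn.Linear(10, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()                                 # decay lr along a cosine curve
```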
5. The Takeaways
There are several benefits for those of you who are interested in this project. First, you will gain invaluable insight and experience in the hot topic of convolutional neural networks, working with different networks, data sets (including those generated in our department), and our well-equipped facilities. Second, you will work in a small and friendly team, with each of you focusing on one specific aspect of this project in detail, thereby evaluating and enhancing your teamwork skills in a professional setting. Third, the entire project will help us sketch out a detailed picture of the pros and cons of this training scheme. Eventually, we may be able to formulate it as a theorem, with one (or more) publications along the way.
References
[Dadgar and Brunnett, 2023] Dadgar, A. and Brunnett, G. (2023). Hand Segmentation with Mask-RCNN using Mainly Synthetic Images as Training Sets and Repetitive Training Strategy. In VISAPP, Lisbon.
Figure 1: Repetitive Training
1 Finger Individuation
In most gestures, involuntary movements of the fingers, imposed by the voluntary movements of the other fingers, are unavoidable. In addition, a considerable number of virtual postures are anatomically impossible for real hands. The analysis of such involuntary, allowed, and forbidden movements, called finger individuation analysis [2, 3, 4], therefore has a great influence on the design of accurate hand posture estimation systems that return plausible postures. Finger individuation can be analyzed in three ways: dimensionality reduction analysis (DRA), digit independence analysis (DIA), or digit coupling analysis (DCA).
2 PoseDescriptor
In our analysis of the behavioral repertoire of the hand at the digit/finger level, we observed that every posture of every finger of a human hand can be characterized by a single, unique value. More precisely, the sum of the distances of the (movable) finger joints/nodes (or of the fingertip) to a locally fixed reference point on that hand (e.g., the wrist joint) takes a specific value for each posture of that finger. This unique value, which we call the PoseDescriptor, reduces the dimensionality of the fingers' configuration space from 16 to 5 (i.e., one degree of freedom for each finger) [1]. To make the study and use of the PoseDescriptor more efficient, we consider three motion patterns that we introduced in that paper.
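A minimal sketch of how such a descriptor could be computed from 3D joint positions follows; the choice of the wrist as reference point matches the description above, while the joint layout and the example coordinates are our own assumptions.

```python
# Sketch: PoseDescriptor of one finger as the sum of joint-to-wrist distances.
import numpy as np

def pose_descriptor(finger_joints, wrist):
    """finger_joints: (n_joints, 3) array of the movable joints of one finger;
    wrist: (3,) reference point. Returns a single scalar per finger posture."""
    return float(np.linalg.norm(finger_joints - wrist, axis=1).sum())

# Example: an extended vs. a flexed index finger (made-up coordinates, in cm).
wrist = np.zeros(3)
extended = np.array([[0, 8, 0], [0, 11, 0], [0, 13, 0], [0, 14.5, 0]])
flexed   = np.array([[0, 8, 0], [0, 10, 2], [0, 10, 4], [0,  9, 5]])
print(pose_descriptor(extended, wrist))  # larger value for the extended finger
print(pose_descriptor(flexed, wrist))    # smaller value for the flexed finger
```

As the example suggests, flexing a finger pulls its joints toward the hand and shrinks the sum, which is what lets a single scalar stand in for the finger's posture.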
3 Our Proposed Bachelor Thesis
In this context, we propose a bachelor thesis to study finger individuation and its effect on the flexion/extension of the other fingers using our PoseDescriptor. After collecting gesture data (and reusing existing data from the Internet), and after normalizing and preprocessing these data, the student will analyze and illustrate the influence of finger coupling and enslavement with several computer vision and optimization algorithms (similar to [3]). Ultimately, the goal of this thesis is to investigate the benefits that the PoseDescriptor can bring to the formulation of finger individuation and finger enslavement over the conventional joint-angle configuration.
References
[1] A. Dadgar and G. Brunnett. Using a 1D Pose-Descriptor on the Finger-Level to Reduce the Dimensions in Hand Posture Estimation. In ICPRAM, Lisbon, 2023.
[2] C. Hager-Ross and M. H. Schieber. Quantifying the independence of human finger movements: Comparisons of digits, hands, and movement frequencies. Journal of Neuroscience, 20(22):8542–8550, 2000.
[3] J. N. Ingram, K. P. Körding, I. S. Howard, and D. M. Wolpert. The statistics of natural hand movements. Experimental Brain Research, 188(2):223–236, 2008.
[4] M. Nakamura, C. Miyawaki, N. Matsushita, R. Yagi, and Y. Handa. Analysis of voluntary finger movements during hand tasks by a motion analyzer. Journal of Electromyography and Kinesiology, 8(5):295–303, 1998.
Figure 1: Two types of Fingers’ PoseDescriptors
Our hands play significant roles in our daily lives. Examples of such roles are 1) pointing to a person or an object, 2) conveying information about space, shapes, the number of objects, or the temporal characteristics of motions, 3) interacting relentlessly with objects in operating rooms, airplanes, laboratories, and factories, 4) carrying out unconscious gesticulation to express ideas, and 5) conducting conscious communication in sign language. Thus, a successful system design that encompasses the entire chain of automatically recognizing hand gestures could benefit many social sectors, and it would play a central role in many intelligent systems of our future world.
Despite being an immensely beneficial topic, the project calls for addressing many challenges, whose specifics must be redefined anew for each particular project design. However, within the vision-based class of technologies, and within the scope of the analysis-by-synthesis approach, there is a general set of modules that can address several significant obstacles. These modules include, but are not limited to, the following:
- Bottom-up modules such as tracking (Fig. a) and segmentation (Fig. b) to locate and extract human hands within 2D image scenes.
- Top-down modules such as spatio-temporal models to effectively relate hand postures to mathematical representations (Fig. c), and optimization techniques to efficiently search the high-dimensional search space (Fig. d).
- Classification frameworks for converting the estimated postures/gestures into semantically meaningful commands (Fig. e).
It goes without saying that the design of these modules can be viewed from several fascinating perspectives, such as computer vision, optimization, computer graphics, and machine learning. In that context, we welcome talented, creative, and hardworking students to complete their master's thesis focusing on one of the above modules from a particular perspective. Finally, we are open to considering new insights and visions from the students' side for approaching the modules above, subject to the novelty and feasibility of the proposals.
1 What is SaneNet?
SaneNet is a type of convolutional neural network trained mainly on synthetic data (Synthetically-based Trained Artificial Neural Network). The idea is to use synthetic data as a training set, without introducing new architectural elements, using existing networks, and thereby eliminating the burden of costly annotation of real data. However, creating photorealistic synthetic images can also be costly. To alleviate this problem, we exploited the invariancy concept of neural networks through specific rotations of the hand model around the Cartesian axes. This allowed us to use simplistic synthetic images in the training set instead of employing the conventional and expensive domain randomization techniques for creating realistic synthetic images. For a more detailed explanation of what the invariancy concept is and how we exploited it, see [Dadgar and Brunnett, 2020].
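To illustrate what rotating a hand model around the Cartesian axes could look like in code, here is a sketch under our own assumptions (a random point set stands in for the model's vertices; the paper's actual rendering pipeline is not reproduced here):

```python
# Sketch: generating rotated copies of a 3D hand-model point set
# around the Cartesian axes (illustrative; not the paper's pipeline).
import numpy as np
from scipy.spatial.transform import Rotation as R

hand_points = np.random.rand(100, 3)   # placeholder for hand-model vertices

rotated_sets = []
for axis in ("x", "y", "z"):
    for angle in range(0, 360, 30):    # example angular sampling
        rot = R.from_euler(axis, angle, degrees=True)
        rotated_sets.append(rot.apply(hand_points))
# Each rotated point set would then be rendered into a simple synthetic image.
```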
2 What are the achievements so far?
First, we addressed the hand detection problem. To do so, we trained the YOLO network on such images (generated with a single hand model, with shading but no shadowing, no texture, and a simple plain background) in combination with a few (100) real images. The results suggest that our simple and very inexpensive approach to generating synthetic training sets is successful for detecting hands in challenging scenarios [Dadgar and Brunnett, 2020]. Then, using a similar strategy and a new training scheme, we demonstrated the success of the approach in segmenting hands in real scenes, provided that the parameters of the networks and the specifications of our real data are chosen with meticulous care [Dadgar and Brunnett, 2023].
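The combination of the synthetic set with the few real images is described above only at a high level; a generic sketch of such a mixed training set (the tensor datasets below are placeholders of ours standing in for the rendered and annotated images) could look like this:

```python
# Sketch: combining a large synthetic set with a few real images
# into one shuffled training set (dataset contents are placeholders).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

synthetic_set = TensorDataset(torch.rand(1000, 3, 64, 64),
                              torch.zeros(1000, dtype=torch.long))
real_set      = TensorDataset(torch.rand(100, 3, 64, 64),
                              torch.ones(100, dtype=torch.long))

train_set = ConcatDataset([synthetic_set, real_set])
loader = DataLoader(train_set, batch_size=16, shuffle=True)
```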
3 What is the problem?
As the tasks became more difficult (e.g., moving from detection to segmentation), we faced a generalization problem. That is, for detection, the networks responded well in different scenarios and examples, whereas for segmentation, the networks performed well only for certain examples and test sets. Therefore, we investigated the feasibility of increasing the generalization of the segmentation. To do this, we generated a new set of synthetic images, this time aiming for maximum diversity in the data while only slightly increasing the complexity of our rendering engine [Uhlmann et al., 2023]. Specifically, we generated a large amount of diversity by showing a human and their hands in a multitude of poses and with varying backgrounds, while keeping the number of subjects, poses, scenes, and other costly graphical factors at a minimum (see Fig. 1).
4 What is next?
We are proposing two master projects, one focusing on hand segmentation and the other on pose estimation, to investigate the feasibility and success of our SaneNet approach in more general scenarios and test sets. We will provide our enhanced synthetic data for training the neural networks. Students may also consider creating a new set of synthetic images using a generative neural network (e.g., a GAN), provided they can propose a novel approach to forming similarly rotated images (e.g., one that helps exploit the invariancy concept).
References
[Dadgar and Brunnett, 2020] Dadgar, A. and Brunnett, G. (2020). SaneNet: Training a Fully Convolutional Neural Network Using Synthetic Data for Hand Detection. IEEE SAMI, pages 251–256.
[Dadgar and Brunnett, 2023] Dadgar, A. and Brunnett, G. (2023). Hand Segmentation with Mask-RCNN using Mainly Synthetic Images as Training Sets and Repetitive Training Strategy. In VISAPP, Lisbon.
[Uhlmann et al., 2023] Uhlmann, T., Dadgar, A., Weigand, F., and Brunnett, G. (2023). A Novel Framework for the Generation of Synthetic Datasets with Applications to Hand Detection and Segmentation. In CRC-Hybrid Society, Chemnitz.
Figure 1: Enhanced Synthetic Data