This article is based on “Deep Learning for Computer Vision: A Brief Review” by A. Voulodimos, N. Doulamis, A. Doulamis and E. Protopapadakis from the Department of Informatics of the Technological Educational Institute of Athens and the National Technical University of Athens.
Deep learning allows computational models composed of many processing layers to learn representations of data with multiple levels of abstraction, imitating how the brain receives and interprets multimodal information, and thus capturing complex data structures at large scale. In recent years, deep learning methods have been shown to outstrip previous state-of-the-art machine learning techniques in many areas, and image recognition is one of the most visible cases. The most important deep learning schemes used in computer vision problems are Convolutional Neural Networks (CNNs), Boltzmann machines, Deep Belief Networks and autoencoders. They are used in various image recognition tasks, such as object detection, face recognition, activity recognition and human pose estimation.
Needless to say, this coverage is by no means exhaustive. Among other things, long short-term memory (LSTM), a member of the recurrent neural network family, is not presented here despite its great importance as a deep learning scheme, because it is used mainly in problems such as language modelling, text classification, handwriting recognition, machine translation, and speech and music recognition, and less in vision problems. The methods presented below are useful in tasks related to image recognition.
Deep Learning methods
Convolutional Neural Networks
Convolutional Neural Networks are trained with gradient-based error backpropagation and achieve very good results in various pattern recognition tasks.
The CNN architecture uses three specific ideas: local receptive fields, shared weights and spatial subsampling. Thanks to the local receptive field, each unit in a convolutional layer receives input from a set of neighbouring units in the previous layer. In this way, neurons are able to extract elementary visual features, such as edges or corners. These features are then combined by successive convolutional layers to detect higher-order features. Furthermore, the idea that elementary feature detectors which are useful in one part of the image are likely to be useful across the entire image is realized through weight sharing: a set of units is constrained to have identical weights.
Generally, CNNs have been shown to far exceed traditional machine learning approaches in a wide range of image and pattern recognition tasks. Their remarkable results, combined with their relative ease of training, are the main reasons behind the huge increase in their popularity over the past few years.
Deep Belief and Deep Boltzmann machines
Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs) are deep learning models that belong to the Boltzmann family in the sense that they use Restricted Boltzmann Machines (RBMs) as their learning module. An RBM is a generative stochastic neural network. Deep Belief Networks have undirected connections in their top two layers, which form an RBM, and directed connections to the lower layers. Deep Boltzmann Machines have undirected connections between all layers of the network.
Denoising autoencoders use autoencoders as their main building block, in much the same way that Deep Belief Networks use Restricted Boltzmann Machines as a component.
Comparing these models, we can say that in the current literature CNNs have generally performed better than DBNs on benchmark image recognition datasets. In cases where the input data are non-visual, DBNs often outperform other models, but the difficulty of accurately estimating joint probabilities, as well as the computational cost of training, are their disadvantages. An important advantage of CNNs is feature learning: the hand-crafted features that are necessary for other types of networks are omitted, because CNN features are learned automatically. On the other hand, CNNs rely on the availability of labelled training data, whereas Deep Belief Networks and Boltzmann machines do not have this limitation and can work in an unsupervised manner. As for autoencoders, one of their disadvantages is that they can become ineffective if errors occur in the first layers; such errors may cause the network to learn to reproduce merely the average of the training data. One of the strengths of CNNs is that they can be invariant to transformations such as translation, scaling and rotation. Invariance to translation, rotation and scale is one of the most important assets of CNNs, especially in image recognition problems such as object detection, because it allows abstracting an object’s identity, so that the network can effectively recognize the object even when the actual pixel values in the image vary significantly.
Problems related to image recognition
Object detection is the process of detecting instances of semantic objects of a particular class (such as people, planes or birds) in digital images and videos. A common approach in object detection frameworks involves generating a large set of candidate windows, which are then scored by a classifier.
The vast majority of works on object detection with deep learning use a variant of CNN. However, there is a relatively small number of attempts to detect objects using other deep models. One of them is a coarse object localization method based on a saliency mechanism, combined with a DBN, for detecting objects in remote sensing images.
Face recognition is one of the hottest applications of computer vision, and one that also enjoys great commercial interest. Various face recognition systems based on hand-crafted feature extraction have been proposed. In such cases, a feature extractor extracts features from an aligned face to obtain a low-dimensional representation, on the basis of which a classifier makes predictions. CNNs caused a change in the face recognition field thanks to their feature learning and transformation invariance properties.
What’s more, Google’s FaceNet and Facebook’s DeepFace are based on CNNs. DeepFace models the face in 3D and aligns it to appear as a frontal face. The normalized input is then fed into a single convolution–pooling–convolution filtering stage, followed by three locally connected layers and two fully connected layers used to make the final predictions. Although DeepFace achieves great results, its representation is not easy to interpret, because the faces of the same person are not necessarily clustered during the training process. The FaceNet training process, on the other hand, explicitly learns to cluster the face representations of the same person. What’s more, a CNN is also the core of OpenFace, a face recognition tool with comparable (albeit slightly lower) accuracy; however, it is open source and well suited to mobile computing, due to its smaller size and fast execution time.
Action and activity recognition
Human activity recognition is a research problem that has received a lot of attention from researchers, and many works based on deep learning techniques have been proposed for it. Deep learning has been used to comprehensively detect and recognize events in video sequences: first, saliency maps were used to detect and localize events, and then deep learning was applied to the refined features to identify the most important frames corresponding to the underlying event. Some scientists have successfully applied a CNN-based approach to recognizing beach volleyball activities, similar to approaches for event classification from large-scale video datasets. In some studies, a CNN model based on smartphone sensor data is used for activity recognition. Researchers have also incorporated radius and margin constraints as a regularization term into a deep CNN model, which effectively improves the CNN’s generalization performance for activity classification.
Guided by the adaptability of the models and the availability of a variety of sensors, an increasingly popular strategy for human activity recognition is combining multimodal features and/or data. In one study, researchers combined appearance and motion features to recognize group activities in crowded scenes collected from the web; in order to combine the different modalities, the authors applied multitask deep learning. Others have explored combinations of heterogeneous features for complex event recognition, viewing the problem as two different tasks, the first of which is estimating the most informative features for event recognition. There is also a series of works combining more than one type of model, as well as several data modalities. Engineers have proposed a multimodal, multi-stream deep learning framework that solves the problem of egocentric action recognition using both video and sensor data, employing a dual CNN architecture together with long short-term memory. Finally, a DBN has been used to recognize activity from input video sequences that also contain depth information.
Estimating a human pose
The purpose of human pose estimation is to determine the positions of human joints from images, image sequences, depth images or skeleton data provided by motion capture equipment. Human pose estimation is a very difficult task due to the wide variety of human silhouettes and appearances, difficult lighting and cluttered backgrounds. Before the era of deep learning, pose estimation was based on detecting body parts, for example through pictorial structures.
Moving on to deep learning methods for human pose estimation, we can group them into holistic and part-based methods, depending on how the input images are processed. Holistic processing methods tend to accomplish their task in a global fashion, without explicitly defining a model for each individual part and its spatial relationships. DeepPose is a holistic model that formulates human pose estimation as a joint regression problem and does not explicitly define a graphical model or part detector. Nevertheless, holistic methods suffer from inaccuracy in the high-precision regime, due to the difficulty of learning to directly regress complex pose vectors from images.
Part-based processing methods, on the other hand, focus on detecting individual parts of the human body, and then use a graphical model to incorporate spatial information. Some researchers, instead of training a network on the entire image, use local patches and background patches to train a CNN to learn conditional probabilities of part presence and spatial relationships. Others use an approach involving multiple smaller CNNs that perform independent binary classification of body parts, followed by a higher-level weak spatial model that removes strong outliers and enforces global pose consistency.
The rise of deep learning in recent years is to a large extent the result of the advances it has enabled in the field of computer vision. The three key categories of deep learning for image recognition presented here are CNNs, the “Boltzmann family” (including DBNs and DBMs), and autoencoders. They have been used to achieve significant performance in various tasks related to image understanding, such as object detection, face recognition, action recognition, human pose estimation, image retrieval and semantic segmentation. However, each category has distinct advantages and disadvantages. CNNs have the unique ability to learn features, i.e., to derive features automatically from a given dataset. They are also invariant to transformations, which is a great advantage for some image recognition applications. On the other hand, they rely heavily on the existence of labelled data, as opposed to DBNs/DBMs, which can work in an unsupervised manner. Of the models examined, both CNNs and DBNs/DBMs are computationally demanding to train.