Automatic Metadata Extraction

Recognizing concepts in multimedia content means extracting human-understandable meaning from an image or video using audio-visual features. Here, this covers recognizing the type of place (religious sites, monuments, restaurants, malls, etc.), various objects (vehicles, weapons, electronic devices, etc.), scenes and events (landscapes, protests, NSFW content, etc.), genre (sport, wildlife documentary, ritual, etc.), and faces (celebrities, criminals, etc.). These concepts are widely used in areas including augmented-reality entertainment, tourism applications, automated image and video tagging in large-scale archives, smart robots, security systems, and online marketplaces. For instance, some social networks try to prevent the spread of inappropriate and violent content in cyberspace by extracting information such as the location, events, and objects it contains.

Today, given the high speed of content production, we face huge volumes of data that demand fast, accurate processing. Traditional learning algorithms classify images using hand-engineered features. These algorithms work well on data with simple features, but as content grows more complex and varied they cannot be generalized to real-world data, and both their speed and accuracy degrade.

With the advent of deep neural networks, a new approach to machine learning has been introduced. One prominent property of these networks is their high accuracy compared with traditional machine-learning methods. By extracting high-level features automatically, deep networks can generalize to large and varied data: the trained model keeps its accuracy on real-world data and performs as intended.
These networks also allow the creation of end-to-end models that receive a raw image as input and directly output labels. The Intelligent Concepts Detection System is one of the first products to employ deep neural networks to extract the content of visual data, through its object, landmark, scene, genre, and face recognition services. The system can compete with the leading products in visual-data classification, working in a sufficiently intelligent and automated way. It lets users select from the available filters and find the desired content among many thousands of videos in a fully automated and highly error-free manner. The main focus of the system is on visual features.

Because the system consists of several different services, the steps for preparing each service differ, but some phases are common to all:

  • Collecting data for the training step.
  • Selecting a network type and structure that fit the data.
  • Training.
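The three common phases can be sketched in miniature. The snippet below is only an illustration, using NumPy and a single softmax layer in place of a real deep network and real image data; all names, shapes, and hyperparameters are assumptions, not the product's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase 1: "collect" labeled training data. Each row stands in for the
# feature vector of one image; each label is a concept id (synthetic here).
n_classes, n_features, n_samples = 3, 8, 300
X = rng.normal(size=(n_samples, n_features))
true_W = rng.normal(size=(n_features, n_classes))
y = np.argmax(X @ true_W, axis=1)

# Phase 2: choose a model structure that fits the data. Here the simplest
# possible choice: one linear layer followed by a softmax.
W = np.zeros((n_features, n_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Phase 3: the training step, gradient descent on cross-entropy loss.
one_hot = np.eye(n_classes)[y]
for _ in range(200):
    probs = softmax(X @ W)
    grad = X.T @ (probs - one_hot) / n_samples
    W -= 0.5 * grad

accuracy = (np.argmax(X @ W, axis=1) == y).mean()
```

A real service would replace the linear layer with a deep convolutional network and the synthetic rows with decoded video frames, but the three-phase shape of the workflow stays the same.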

Features

The archive of the Islamic Republic of Iran Broadcasting contains millions of images and videos collected over the years. Because many of the videos are old, the collection mixes content of varying quality: analog, digital, and black-and-white. As a result, retrieving a particular video from this volume of data by hand is almost impossible. The archive holds valuable historical, political, artistic, social, and entertainment content, but without proper access to that content, the resource goes to waste. Based on these requirements, a precise definition of the product was drawn up, according to which the final system was designed. As mentioned in the previous sections, the product consists of five main services, each handling the classification of one subject.

All operations run in parallel on the GPU, so multiple services can receive a batch of inputs at the same time and return their results in less time. This makes it possible to process even long, high-quality videos without difficulty. Because the product is based on deep neural networks, a specialized network design, the right structure, and varied training data can bring the accuracy of the final model close to that of the human brain (setting aside human error). The accuracy of the presented models ranges from 80% to 99%, depending on the training dataset. The system delivers this accuracy and speed on a single GPU (a 1070, 150 W), which also keeps it economical to use; each network is optimized enough that it could even be provided on a smartphone.

Because the product is fully automated, no physical presence in the archive is needed: it can be deployed remotely simply by specifying the system's inputs. It is also completely isolated and offline, with no connection to the global network, which eliminates security concerns.
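The idea of several services consuming the same batch of inputs in parallel can be sketched as follows. This is a hedged illustration only: the service names mirror the five services named above, but the `classify` stubs and the thread-pool dispatch are assumptions standing in for real GPU-backed models.

```python
from concurrent.futures import ThreadPoolExecutor

def make_service(name):
    # Stand-in for a GPU-backed model: tags each frame in the batch
    # with the service name and the frame it came from.
    def classify(batch):
        return [(name, frame) for frame in batch]
    return classify

services = {n: make_service(n)
            for n in ("object", "landmark", "scene", "genre", "face")}

def run_all(frames, batch_size=4):
    """Feed the frames to every service, batch by batch, in parallel."""
    results = {name: [] for name in services}
    batches = [frames[i:i + batch_size]
               for i in range(0, len(frames), batch_size)]
    with ThreadPoolExecutor() as pool:
        for batch in batches:
            # All five services receive the same batch at the same time.
            futures = {name: pool.submit(svc, batch)
                       for name, svc in services.items()}
            for name, fut in futures.items():
                results[name].extend(fut.result())
    return results

out = run_all(list(range(10)))
```

In the real system the per-batch work would be GPU inference rather than a Python function, but the dispatch pattern — one batch fanned out to all services at once — is the same.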
The graphical interface of the product is designed to be very simple, so that non-specialists can work with it easily. Its features include selecting the service type, adding filters, and specifying the output type.
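A filter-based search of the kind the interface exposes might look like the sketch below. The field names (`service`, `label`, `video`) and the sample records are purely illustrative assumptions, not the product's real schema.

```python
def search(records, service, filters):
    """Return records from one service that match every given filter."""
    return [r for r in records
            if r["service"] == service
            and all(r.get(k) == v for k, v in filters.items())]

# Hypothetical per-frame recognition results stored for an archive.
archive = [
    {"service": "scene", "label": "landscape", "video": "v1"},
    {"service": "scene", "label": "protest",   "video": "v2"},
    {"service": "face",  "label": "celebrity", "video": "v2"},
]

# "Select the service type, add filters": scenes labeled as protests.
hits = search(archive, service="scene", filters={"label": "protest"})
```

The output type selected in the interface would then decide how `hits` is rendered, e.g. as a report or an annotated clip list.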