BigRio used X-ray digitization process to create DICOM images from archived films while meeting stringent NARA quality standards…

A rapidly growing niche analytics firm has developed cutting-edge models helping retailers within their industry make improved product recommendations to customers. These recommendations not only match appropriate products to…

eBay Enterprise Marketing Solutions (EEMS) contracted with BigR.io to build fundamental components of a new Customer Engagement Engine (CEE) for their Commerce Marketing Platform. The CEE, a demand generation marketing solution, creates…

Deep Learning: Image and Video Recognition

Written by Bruce Ho

BigR.io’s Chief Big Data Scientist

Abstract

This paper illustrates the advancements in implementing Deep Neural Networks for automatic feature extraction in image and video for applications including facial recognition, programmatic video highlights, and image segmentation and object classification. Given the limitations of human abilities in earlier extraction methods, these networks exponentially increase accuracy, output, and available feature selection options for further analysis. BigR.io specializes in the following industry use cases:

  • Image Recognition

  • Video Highlights

  • Anomaly Detection

 

ABOUT BIGR.IO

BigR.io is a technology consulting firm empowering data to drive analytics for revenue growth and operational efficiencies. Our teams deliver software solutions, data science strategies, enterprise infrastructure, and management consulting to the world’s largest companies. We are an elite group with MIT roots, shining when tasked with complex missions: assembling mounds of data from a variety of sources, building high-volume, highly-available systems, and orchestrating analytics to transform technology into perceivable business value. With extensive domain knowledge, BigR.io has teams of architects and engineers that deliver best-in-class solutions across a variety of verticals. This diverse industry exposure and our constant run-in with the cutting edge empowers us with invaluable tools, tricks, and techniques. We bring knowledge and horsepower that consistently delivers innovative, cost-conscious, and extensible results to complex software and data challenges. Learn more at www.bigr.io.

 

OVERVIEW

Over the past few years, Deep Neural Networks (DNNs) have reached, and on some benchmarks surpassed, human parity in recognizing and interpreting images. These DNNs are typically Convolutional Neural Networks (CNNs), which automatically extract features from an input image using convolution filters. Backpropagation then lets these filters learn their kernel values, starting from random values and ending up with elemental features that best represent the class of images being trained (for instance, nose, eye, and jaw shapes for face images). Image recognition is also where the highly coveted idea of transfer learning got its early foothold. Pre-trained models based on certain categories of images can be repurposed for various classification applications using only a small dataset. Since data preparation and labeling is one of the most challenging steps in supervised learning, the impact of this concept on accelerating the process cannot be overstated. Models and datasets published by some of the biggest players in the field (Google, Microsoft, etc.) now serve as a strong starting point for building robust application-specific models, even for businesses with only modest means for development.
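As a minimal sketch of what a single convolution filter computes, the following slides a vertical-edge kernel over a small toy image. The function name conv2d_valid, the image, and the hand-set kernel are illustrative assumptions; in a trained CNN the kernel values would be learned through backpropagation rather than set by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x6 image: dark on the left half, bright on the right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Hand-set kernel that responds to dark-to-bright transitions from left to right.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d_valid(image, kernel)
print(response)
```

The response is largest where the window straddles the edge and zero over the flat regions, which is the sense in which a filter "detects" a feature regardless of where it appears.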

 

INDUSTRY USE CASES

Similar to the adoption of best practices in big data and data science across several industry verticals, image and video recognition solutions affect business outcomes across diverse government agencies and businesses. In this paper, we specifically examine use cases in the security and professional sports segments, but these solutions illustrate applications across all areas of video content creation, consumption, and monitoring.

 

IMAGE INSIGHTS

FCN8s

 

Image recognition can go beyond classification tasks for an entire image. In dense prediction, we are asking the neural network to detect the semantic context of any given pixel in a document or image. CNNs work by first finding image features that resemble certain filter functions, then propagating such features up to a top-level representation as a translation-invariant descriptor (e.g., detection of a nose, regardless of its position within the image). By combining both coarse- and fine-grained features at different scales, we obtain both the semantic context and location information of any one pixel. This opens the door for pixel-level semantic segmentation (aka dense prediction). Recent work on Fully Convolutional Networks (FCNs) leverages this capability to extract the semantic context of a digitized document. One could, for example, use FCNs to detect whether a particular pixel belongs to a title, section header, figure caption, image, or long paragraph. A mobile user could then easily re-layout or restyle an electronic document using the extracted semantic context. FCNs have also been successfully applied to segment parts of an image, as well as full documents, with remarkable accuracy.

How does a system pick potential customers from an image of a crowd, a soccer team, or a room full of event attendees? Given a close-up face shot, is this person happy to be here, in the target age group, or giving a positive response to the last sales message? Being able to answer these audience measurement questions for marketing is one of the hot areas in need of a deep learning solution. Many classic approaches to facial feature extraction and classification, Support Vector Machines, for example, have been devoted to this long-standing problem. Deep learning research in facial identification is relatively new but already outperforming older techniques by a wide margin. This development, and many other impressive improvements achieved by deep learning, are generally attributed to the automatic feature extraction function of neural networks and the incremental accuracy boost that deep learning techniques achieve when given a huge training dataset.

In many applications, a high-quality, close-up facial shot is not always available. Picking faces out of an ordinary action photo may be the first step before applying any facial feature analysis. For this, the region-based CNNs (R-CNNs) excel in both speed and accuracy. The R-CNN approach proposes a number of bounding boxes in the original photo using what is called Selective Search. In this method, initial object boundaries are set using a graph-based pixel similarity approach. Neighboring boxes with high pixel similarity metrics are then merged to further reduce the object count. Finally, each boxed object can be classified based on a pre-trained image recognition model.

 

In other efforts, researchers have extended facial analysis to emotion detection. Classically, this simply involved image labeling, where the subject exhibits a range of facial expressions and a group of volunteers marks each as happy, sad, angry, etc., typically covering up to eight emotions. More recent work also incorporates dynamic facial movements, for example, capturing the complete sequence of facial movements for a smile or frown. A more generalizable model can be developed using linear scoring along the valence-arousal graph. A prediction of valence and arousal scores on future subjects can then be interpreted using a wider range of emotion states instead of the initial selection of about eight.

 

Valence-arousal plot

Reference: G. Paltoglou and M. Thelwall, “Seeing Stars of Valence and Arousal in Blog Posts,” IEEE Transactions on Affective Computing, vol. 4, no. 1, Jan.-Mar. 2013.

Points on the valence arousal plot can be translated to commonly understood emotions.
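One simple way to perform such a translation, sketched here under stated assumptions, is to label a predicted point with the nearest named emotion on the plane. The anchor coordinates below are illustrative placeholders on a [-1, 1] scale, not values taken from the cited paper, and the names ANCHORS and nearest_emotion are hypothetical.

```python
import math

# Hypothetical (valence, arousal) anchor points; a real system would calibrate these.
ANCHORS = {
    "happy": (0.8, 0.5),
    "angry": (-0.6, 0.7),
    "sad": (-0.7, -0.4),
    "calm": (0.6, -0.6),
}

def nearest_emotion(valence, arousal):
    """Return the anchor label closest (Euclidean distance) to the predicted point."""
    point = (valence, arousal)
    return min(ANCHORS, key=lambda name: math.dist(ANCHORS[name], point))

print(nearest_emotion(0.7, 0.4))
print(nearest_emotion(-0.5, -0.5))
```

Because the model predicts continuous scores, the set of anchors can be extended with more emotion states without retraining the scoring model.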

 

VIDEO HIGHLIGHTS

There are numerous highlights in every major sporting event. Manual real-time extraction of these highlights by fully attentive labelers is error-prone, requires significant manpower, is very expensive, and doesn’t scale well. Furthermore, while the most recent games may benefit from manual labeling, there are years of archived footage that remain unprocessed. Human observers who are instructed to look only for specific events also overlook most off-stats highlights, such as a ball boy slipping while chasing a tennis ball or a Major League-caliber splitter thrown in a Little League game.

Today, we can automate programmatic video highlights using video recognition techniques. While CNNs are applied to static image features, Recurrent Neural Networks (RNNs) can classify video segments using the optical flow between image frames. This technique can readily be trained not only to extract official stat events, but also to extract any interesting player motion not explicitly logged and indexed, for example, an alley-oop in basketball. Because these extraction tasks are automated, studios can come up with new ideas at any time to build upon an existing menu of highlights.

Going beyond sporting events, any kind of motion picture, video ad, or short-form video opens itself up for potential indexing and repurposing. For example, a DC Comics fan may want the ability to easily find all instances of girl superhero encounters within the DC universe. This task requires automatic video highlight extraction, which is the key to reviving and monetizing unlimited archive contents that would otherwise remain buried and forgotten.

 

Image: Durant eyeing Rihanna after hitting a 3-pointer (she was cheering for LeBron).

 

ANOMALY DETECTION

Independent Component Analysis (ICA) is one approach to sparse feature extraction for anomaly detection, with many proposed variants. An ICA-based deep sparse feature extraction strategy combined with a non-parametric Bayesian approach can automatically determine the optimal dimension for the latent feature vector, removing the heavy labor in parameter tuning that a full deep learning approach would entail. The reported accuracy improvement exceeds 10% over previous results. Variants of Restricted Boltzmann Machines (RBMs) are another major direction of research for deep-sparse representation. While much progress has been made on the theoretical front, the experimental results thus far lag behind the best ICA models.

Reference: Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.

The graph on the right is a sparse vector representation of the image on the left. The vector dimensions, called training bases, are laid out along the x-axis, with the bars representing the coefficients for the bases needed to represent the image. A normal sample (top) can be represented as a sparse linear combination of the training bases, while an anomalous sample (bottom) requires a large number of bases.
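The intuition can be sketched numerically. The example below is not the method of the cited paper; it swaps in a plain greedy orthogonal matching pursuit, and the random orthonormal bases, the function name bases_needed, and the tolerance are all assumptions made for illustration. It simply counts how many bases are needed to reconstruct a sample.

```python
import numpy as np

def bases_needed(sample, bases, tol=1e-6):
    """Greedy orthogonal matching pursuit: count bases used to reconstruct the sample."""
    residual = sample.astype(float).copy()
    chosen = []
    while np.linalg.norm(residual) > tol and len(chosen) < bases.shape[1]:
        correlations = np.abs(bases.T @ residual)
        if chosen:
            correlations[chosen] = 0.0  # never pick the same base twice
        chosen.append(int(np.argmax(correlations)))
        # Refit coefficients on all chosen bases, then update the residual.
        coeffs, *_ = np.linalg.lstsq(bases[:, chosen], sample, rcond=None)
        residual = sample - bases[:, chosen] @ coeffs
    return len(chosen)

rng = np.random.default_rng(0)
# Columns are 16 orthonormal "training bases" (a stand-in for learned bases).
bases, _ = np.linalg.qr(rng.normal(size=(16, 16)))

normal_sample = 2.0 * bases[:, 3] + 1.0 * bases[:, 7]   # built from two bases
anomalous_sample = bases @ np.ones(16)                   # spread across every base

print(bases_needed(normal_sample, bases))
print(bases_needed(anomalous_sample, bases))
```

A sample that needs many bases (or leaves a large residual under a fixed budget of bases) would be flagged as a candidate anomaly; where to set that threshold is application-specific.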

 

CONCLUSION

Recent advancements in image and video recognition pave the way for many business applications that would have been unimaginably hard or expensive to implement before. BigR.io excels at the application of deep learning to images and electronic documents for use cases ranging from facial recognition, to programmatic video highlights, to image segmentation and object classification.

Advanced Recommendation Engines

Written by Bruce Ho

BigR.io’s Chief Data Scientist

IMPACT OF RECOMMENDATION ENGINES ON ECOMMERCE

When a shopper comes to an eCommerce site, how does she find the items she wants? Search is generally the default option when the site knows nothing about the shopper. Somewhere along the journey though, the shopper starts to feel that the site should know her preferences better than pulling up a few thousand randomly sorted options and dumping them on her. Certainly the corner store used to treat her with a more personal touch.

The key to making customers feel at home is getting to know their tastes, as the corner store used to, and suggesting to them the newly arrived silver set that matches their placemats. This less charming but more consistent replacement for the store owner is called the recommendation engine. With Amazon’s rise to preeminence, recommendation engine technology has gained some aura of mystique, often called a killer app by industry watchers. The company’s continued rapid growth hasn’t dampened that enthusiasm. The Netflix $1M challenge only added more fanfare.

The Netflix Challenge

Movies are a good example of where the recommendation engine makes or breaks the business. Viewer tastes and movie features vary widely, and the title list for movies is effectively infinite from the viewer’s perspective. Search by title is of little help; you cannot ask viewers to look for what they don’t know. Making the right recommendation is a critical factor in determining whether the user comes back again. At Netflix, 75% of the movies watched came from their recommendation engine.

Netflix shook the geek world in 2006 with their announcement of a $1M prize to find the best movie recommendation algorithm. The winning team emerged almost three years later and generously published their approach, which is an ensemble of many algorithms including exotic neural network designs. There are many lessons to be learned from Netflix’s investment, with one of the biggest being that Netflix never did deploy the winning algorithm into production. Reason number 1: the extra engineering complexity is not worth the benefit. This outcome very much highlights the importance of a system that’s flexible and scalable alongside the quality of the recommendation. More on this at the end of this paper.

THE CLASSIC RECOMMENDATION ENGINE DESIGN

Recommendation engine design is one area where common machine learning techniques such as linear regression or SVM (support vector machines) perform poorly due to extreme data sparsity. Netflix data from 2015 shows:

  • 74 million subscribers
  • 13,300 titles
  • Total cells (user-item pairs): about 984 billion (74 million × 13,300)
  • Actual collected ratings: 5 billion, or roughly 0.5% of all cells

As a result, designers resorted to two other approaches which deal with sparsity well: content similarity and collaborative filtering. Each, however, has its own challenges in real world applications.

CONTENT SIMILARITY

Content similarity takes the intrinsic characteristics of users or items and estimates their affinity using similarity measures such as Euclidean distance, cosine similarity, or Pearson’s product-moment correlation coefficient. Someone who likes one comedy movie may like another. Mothers of teens may enjoy a film like “Mom’s Night Out”.
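The three measures named above can be sketched in a few lines. The feature vectors here are hypothetical genre affinities (comedy, romance, horror) invented for illustration, as are the function and variable names.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

# Hypothetical affinities for (comedy, romance, horror).
comedy_fan = np.array([5.0, 4.0, 1.0])
movie_a = np.array([4.0, 3.0, 0.0])   # light romantic comedy
movie_b = np.array([1.0, 1.0, 5.0])   # horror film

for name, movie in [("movie_a", movie_a), ("movie_b", movie_b)]:
    print(name,
          euclidean_distance(comedy_fan, movie),
          cosine_similarity(comedy_fan, movie),
          pearson_correlation(comedy_fan, movie))
```

Note that Pearson correlation ignores a uniform offset between the two vectors, which can matter when some users rate everything higher or lower than others.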

Content-based strategies require gathering and classifying product and user information that might not be available or easy to collect. The process of collecting features for all the products in the inventory is laborious and often manual. The algorithm is typically brand-specific, and therefore hard to commoditize.

On the user side, the user profile and preferences are often even harder to come by. Users generally won’t volunteer their demographics and interests unless they see the return value, which isn’t clear in the beginning. Although this strategy can work by relying on well-classified product content alone, at the very least the user must offer up some indication of their personal tastes in order to apply similarity matching.

Even when abundant features are collected, a content-based approach has the downside that it cannot find hidden patterns. An article that contains “Operation Desert Storm” is probably of interest to readers of “Middle East foreign policy”, but the association would be overlooked unless identical keyword tokens appear in both.

COLLABORATIVE FILTERING

Collaborative filtering simply takes user rating data for each product without doing any intrinsic classification. The basis for matching is that a user who shares interest in some of the same products as another user may also like her other favorite products.

For example, if bike riders often buy sunglasses, the association may indicate interest in sunglasses by another biker who hasn’t bought sunglasses, even though there is no direct match between bike and sunglasses features.

Figure 1. The input ratings matrix A is re-factored into a user vector and product vector, both expressed in terms of hidden factors with k dimensions.

The input data of the rating matrix can be formidable in size; consider the Netflix data set example given above. However, a machine learning technique known as Matrix Factorization both reduces the computational complexity and discovers hidden rating similarity. Here, both user preferences and item features are expressed in terms of latent factors, vectors with only a fraction of the dimensions of the original ratings matrix. With the latent factors determined, the recommendation score for any unrated user-item pair (missing from the original ratings matrix) can be derived using the dot product of the respective latent factor vectors. For example, the movie Troy would have high values in the latent factors “Action” and “Brad Pitt” (these are not likely latent factors, and are used here only for illustration). Someone who enjoys “Action” and “Brad Pitt” would also have high values for these two factors. This results in a high dot product for this user-movie combination and, therefore, a high recommendation score.
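A minimal sketch of the idea follows, assuming a tiny hand-made ratings matrix and plain stochastic gradient descent on the observed cells only. The matrix, the hyperparameters, and the names factorize and rmse are illustrative assumptions, not a production recipe.

```python
import numpy as np

def factorize(ratings, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user and item latent factors from the observed (non-NaN) ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(ratings[u, i])]
    for _ in range(epochs):
        for u, i in observed:
            err = ratings[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

def rmse(ratings, U, V):
    """Root-mean-square error over the observed cells only."""
    mask = ~np.isnan(ratings)
    pred = U @ V.T
    return float(np.sqrt(np.mean((ratings[mask] - pred[mask]) ** 2)))

# Toy matrix: rows are users, columns are items, NaN marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

U, V = factorize(R)
scores = U @ V.T   # dot products fill in every cell, including the missing ones
print(rmse(R, U, V))
print(scores[0, 2])  # predicted score for user 0 on the item they never rated
```

The score for any missing cell is just the dot product of that user's row of U with that item's row of V, which is what makes scoring cheap once the factors are learned.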

This technique works well with rather sparse data. As the Netflix dataset shows, even a ratings matrix with only a small fraction of its cells populated yields surprisingly useful results. However, this method does have trouble with a brand-new user who has no recorded ratings history. This is known as the cold start problem. Only generic recommendations can be made in that case, until ratings build up over time.

For this reason, content similarity is sometimes used to supplement collaborative filtering to overcome the cold start problem, at the expense of doubling the setup cost.

As well as the combination of content similarity and collaborative filtering works, there are notable areas where it falls short. For example, sometimes features interact. Someone can rate both comedy and romance very low, yet enjoy the subcategory of romantic comedy. Considering the possible pairs of all the features, a direct inclusion of all combinations would quickly overwhelm the system, and likely for a very small gain in return.

Another area for improvement is freshness of ratings. How can the engine assign more weight to newer ratings and yet not completely discard older ones? A single ratings matrix has no provision for this enhancement, yet shopper taste does shift over time. It is naturally desirable for the recommendations to keep up.

In time, the bright minds in the field started to dream of a better way, one which:

  • Accepts both user product ratings and content features
  • Can be computed with only linear complexity
  • Takes account of feature interactions in an efficient manner
  • Includes timeliness as one of the influencers

By implication, this method would behave well with sparse data, yet mitigate the cold start problem. If the algorithm is based on linear regression, that would be icing on the cake in terms of computational efficiency, and would also open the door to including additional characteristics that might indicate a user’s preferences.

ADVANCED TECHNIQUES – FACTORIZATION MACHINE

Factorization Machine was invented by German researcher Steffen Rendle in 2010. This method lays out both user and product as unique features, among any other general features (such as freshness of ratings), at the discretion of the data scientist. The key innovation is where Rendle introduces cross-feature terms (equivalent to quadratic polynomial minus self-interacting terms in standard linear regression), but then applies the latent factor technique to reduce computation complexity. The cross-feature terms, on the one hand, express the user item pairs inherent in collaborative filtering, and on the other, capture any potential cross feature interactions.

Figure 2. Factorization Machine accepts features which are a superset of content similarity and collaborative filtering. It includes both user item pairs and general content characteristics, while optimizing computation efficiency using the latent factor method of reduction.
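For reference, the second-order Factorization Machine model is commonly written as below. The symbols are the usual ones and are introduced here as assumptions, since the text above does not define them: x is the feature vector with n entries, w_0 is a global bias, w_i are per-feature weights, and v_i is the k-dimensional latent factor vector for feature i.

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad
\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}
```

The double sum is the cross-feature term: every pair of distinct features interacts, but the pair's weight is the dot product of two latent vectors rather than a free parameter of its own. Evaluated literally, that double sum costs O(k n^2).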

Through clever formula reduction, Rendle brought the complexity of evaluating the cross-feature terms back down to O(n) in the number of features. No one ever got a job at Google or Amazon by proposing an O(n^2) solution! This reduction is critical in judging whether a solution is sound in practice.
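The reduction rests on an algebraic identity: the sum over all distinct feature pairs equals half of the squared sum of the latent-weighted features minus the sum of their squares, computed per latent dimension. The sketch below checks that identity numerically; the function names, the random test data, and the sizes n and k are assumptions for illustration.

```python
import numpy as np

def pairwise_naive(x, V):
    """Cross-feature term computed pair by pair: O(k * n^2)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return total

def pairwise_linear(x, V):
    """Same term via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]: O(k * n)."""
    sum_then_square = (V.T @ x) ** 2          # shape (k,)
    square_then_sum = (V ** 2).T @ (x ** 2)   # shape (k,)
    return 0.5 * float(np.sum(sum_then_square - square_then_sum))

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine score using the linear-time cross term."""
    return w0 + float(w @ x) + pairwise_linear(x, V)

rng = np.random.default_rng(42)
n, k = 6, 3
x = rng.normal(size=n)        # feature vector
w0 = 0.5                      # global bias
w = rng.normal(size=n)        # per-feature weights
V = rng.normal(size=(n, k))   # one k-dimensional latent vector per feature

print(pairwise_naive(x, V), pairwise_linear(x, V))
print(fm_predict(x, w0, w, V))
```

With sparse inputs, such as one-hot user and item indicators, only the non-zero features contribute, so the practical cost is proportional to the number of non-zero entries.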

Now that the model and its gradients can be computed in linear time, the direct implication is that SGD (Stochastic Gradient Descent) and its parallel variants can be applied to achieve linear scalability. This leads to the system architectural considerations discussed in the next section.

Factorization machine is one of the many specialty approaches BigR.io’s expertise can offer. It is an illustration of our philosophy to help our clients design the highest performing and most advanced solutions in any given situation, while balancing budget and other operational concerns.

DEPLOYMENT SYSTEM ARCHITECTURE

What did Netflix learn from its $1M investment in the ultimate recommendation algorithm? According to Netflix: “Coming up with a software architecture that handles large volumes of existing data, is responsive to user interactions, and makes it easy to experiment with new recommendation approaches is not a trivial task.” Clearly, had production viability been included in the competition criteria, they might have made better use of the prize money. Case in point: 100 million ratings were published for the competition, but the actual production load was 5 billion.

The engineering disciplines necessary to bring advanced algorithms to deployment cannot be overlooked. The algorithm should be adaptable to a scalable implementation (such as Apache Spark), and not simply dropped into a faster box in its original “data science” form.

From a system architecture perspective, model training and on-demand predictions for online users have very different requirements in data volume, computational load, and response time, and should always be launched in separate subsystems.

Often, it is also useful to introduce a third caching layer, in which the system computes the next set of predictions in response to user events. While users are digesting the first prediction or consuming the recommended item, the system pre-computes the likely follow-up and caches the results awaiting future user requests. The opportunities for performance gains are numerous and reward diligent investigation.
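The caching layer described above can be sketched as a thin wrapper around the prediction subsystem. Everything here is hypothetical: the class name PrefetchingRecommender, the toy catalog standing in for a real model, and the choice to treat the top recommendation as the likely follow-up. A real system would also run the pre-computation asynchronously; it is done inline here to keep the sketch short.

```python
class PrefetchingRecommender:
    """Serve recommendations and pre-compute the likely follow-up request."""

    def __init__(self, predict):
        self._predict = predict   # callable: (user, item) -> list of recommended items
        self._cache = {}
        self.hits = 0

    def recommend(self, user, current_item):
        key = (user, current_item)
        if key in self._cache:
            self.hits += 1
            recs = self._cache.pop(key)
        else:
            recs = self._predict(user, current_item)
        # Assume the user will most likely move on to the top recommendation,
        # and cache the predictions for that follow-up request.
        if recs:
            follow_up = (user, recs[0])
            if follow_up not in self._cache:
                self._cache[follow_up] = self._predict(*follow_up)
        return recs

# Toy stand-in for the real prediction subsystem.
CATALOG = {
    "bike": ["helmet", "sunglasses"],
    "helmet": ["gloves", "bike"],
    "sunglasses": ["hat", "bike"],
}

def toy_predict(user, item):
    return CATALOG.get(item, [])

engine = PrefetchingRecommender(toy_predict)
first = engine.recommend("alice", "bike")     # computed on demand
second = engine.recommend("alice", "helmet")  # served from the cache
print(first, second, engine.hits)
```

Keeping the cache behind the same interface as the prediction subsystem means the training and serving subsystems do not need to know it exists.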

CONCLUSION

Recommendation technology has come a long way since the dawn of eCommerce. Advances in matching algorithms have been nothing short of stunning. Factorization Machine not only mitigates the cold start barrier, it also incorporates feature interactions and rating freshness, important functions that were missing from previous generation technology. At the same time, it fully embraces parallel processing and therefore is well suited to Big Data platforms.

BigR.io is uniquely qualified to not only fine tune the recommendation algorithm for your business model, but to deliver a complete solution which meets the scalability and operational requirements for a high-volume, fast-response production system.