Deep Learning: Image and Video Recognition

Written by Bruce Ho, BigR.io’s Chief Big Data Scientist


This paper illustrates advances in implementing Deep Neural Networks for automatic feature extraction from images and video, with applications including facial recognition, programmatic video highlights, image segmentation, and object classification. Compared with earlier extraction methods, which were constrained by human abilities, these networks substantially increase accuracy, output, and the feature selection options available for further analysis. BigR.io specializes in the following industry use cases:

  • Image Recognition

  • Video Highlights

  • Anomaly Detection


ABOUT BIGR.IO

BigR.io is a technology consulting firm empowering data to drive analytics for revenue growth and operational efficiencies. Our teams deliver software solutions, data science strategies, enterprise infrastructure, and management consulting to the world’s largest companies. We are an elite group with MIT roots, shining when tasked with complex missions: assembling mounds of data from a variety of sources, building high-volume, highly-available systems, and orchestrating analytics to transform technology into perceivable business value. With extensive domain knowledge, BigR.io has teams of architects and engineers that deliver best-in-class solutions across a variety of verticals. This diverse industry exposure and our constant run-in with the cutting edge empower us with invaluable tools, tricks, and techniques. We bring knowledge and horsepower that consistently deliver innovative, cost-conscious, and extensible results to complex software and data challenges. Learn more at



Over the past few years, Deep Neural Networks (DNNs) have surpassed human-level performance in recognizing and interpreting images. These DNNs use Convolutional Neural Networks (CNNs) to automatically extract features from an input image by means of convolution filters. Backpropagation then lets these filters learn their kernel weights, starting with random values and ending up with elemental features that best represent the class of images being trained (for instance, nose, eye, and jaw shapes for face images). Image recognition is also where the highly coveted idea of transfer learning got its early foothold. Pre-trained models based on certain categories of images can be repurposed for various classification applications using only a small dataset. Since data preparation and labeling are among the most challenging steps in supervised learning, the impact of this concept on accelerating the process cannot be overstated. Published models and datasets from some of the biggest players in the field (Google, Microsoft, etc.) now serve as a strong starting point for building robust application-specific models, even for businesses with only modest means for development.
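To make the idea of a convolution filter concrete, the following is a minimal sketch in plain Python. The 4x4 image patch and the hand-picked vertical-edge kernel are assumptions chosen for illustration; in a trained CNN the kernel weights would be learned through backpropagation rather than set by hand. As in most CNN libraries, the operation shown is strictly a cross-correlation (the kernel is not flipped).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image window under the kernel.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical grayscale patch with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked vertical-edge filter (an assumption; a CNN would learn its own).
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
```

The filter responds strongly only where the window straddles the edge, which is the sense in which a filter "detects" a feature regardless of where it appears in the image.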



Similar to the adoption of best practices in big data and data science across several industry verticals, image and video recognition solutions affect business outcomes across diverse government agencies and businesses. In this paper, we specifically examine use cases in the security and professional sports segments, but these solutions illustrate applications across all areas of video content creation, consumption, and monitoring.





Image recognition can go beyond classification tasks for an entire image. In dense prediction, we ask the neural network to detect the semantic context of any given pixel in a document or image. CNNs work by first finding image features that resemble certain filter functions, then propagating such features up to a top-level representation as a translation-invariant descriptor (e.g., detection of a nose, regardless of its position within the image). By combining both coarse- and fine-grained features at different scales, we obtain both the semantic context and the location information of any one pixel. This opens the door for pixel-level semantic segmentation (also known as dense prediction). Recent work on Fully Convolutional Networks (FCNs) leverages this capability to extract the semantic context of a digitized document. One could, for example, use FCNs to detect whether a particular pixel belongs to a title, section header, figure caption, image, or long paragraph. A mobile user could then easily re-layout or restyle an electronic document using the extracted semantic context. FCNs have also been successfully applied to segment parts of an image, as well as full documents, with remarkable accuracy.

How does a system pick potential customers from an image of a crowd, a soccer team, or a room full of event attendees? Given a close-up face shot, is this person happy to be here, in the target age group, or giving a positive response to the last sales message? Being able to answer these audience measurement questions for marketing is one of the hot areas in need of a deep learning solution. Many classic approaches to facial feature extraction and classification, such as Support Vector Machines, have been applied to this long-standing problem. Deep learning research in facial identification is relatively new but already outperforms older techniques by a wide margin. This development, like many other impressive improvements achieved by deep learning, is generally attributed to the automatic feature extraction function of neural networks and the incremental accuracy boost that deep learning techniques achieve when given a huge training dataset.

In many applications, a high-quality, close-up facial shot is not always available. Picking faces out of an ordinary action photo may be the first step before applying any facial feature analysis. For this task, region-based CNNs (R-CNNs) excel in both speed and accuracy. The R-CNN approach proposes a number of bounding boxes in the original photo using what is called Selective Search. In this method, initial object boundaries are set using a graph-based pixel similarity approach. Neighboring boxes with high pixel similarity metrics are then merged to further reduce the object count. Finally, each boxed object can be classified using a pre-trained image recognition model.
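The merging step described above can be sketched as a greedy grouping loop. This is a heavily simplified stand-in for Selective Search, not the published algorithm: the real method combines color, texture, size, and fill similarities, whereas this sketch assumes each region is a hypothetical `(box, mean_intensity, pixel_count)` tuple and merges touching boxes whose mean intensities differ by at most an arbitrary `max_diff`.

```python
def boxes_touch(a, b):
    """True when two (x1, y1, x2, y2) boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge(r1, r2):
    """Union the two boxes; average intensity weighted by pixel count."""
    (b1, m1, n1), (b2, m2, n2) = r1, r2
    box = (min(b1[0], b2[0]), min(b1[1], b2[1]),
           max(b1[2], b2[2]), max(b1[3], b2[3]))
    return (box, (m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2)


def group_regions(regions, max_diff=10.0):
    """Repeatedly merge the most similar pair of neighboring regions."""
    regions = list(regions)
    while True:
        best = None  # (intensity difference, index i, index j)
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if not boxes_touch(regions[i][0], regions[j][0]):
                    continue
                diff = abs(regions[i][1] - regions[j][1])
                if diff <= max_diff and (best is None or diff < best[0]):
                    best = (diff, i, j)
        if best is None:
            return regions  # no similar neighbors left to merge
        _, i, j = best
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)


# Hypothetical initial regions: (box, mean_intensity, pixel_count).
initial = [
    ((0, 0, 4, 4), 200.0, 16),      # bright patch
    ((4, 0, 8, 4), 204.0, 16),      # adjacent, similar brightness
    ((20, 20, 24, 24), 50.0, 16),   # isolated dark patch
    ((8, 0, 12, 4), 120.0, 16),     # adjacent to the second, but dissimilar
]
proposals = group_regions(initial)
```

Each surviving box in `proposals` would then be cropped and passed to a pre-trained classifier, as the paragraph above describes.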


In other efforts, researchers have extended facial analysis to emotion detection. Classically, this simply involved image labeling: the subject exhibits a range of facial expressions, and a group of volunteers marks each as happy, sad, angry, etc., typically using up to eight emotions. More recent work also incorporates dynamic facial movements, for example, capturing the complete sequence of facial movements for a smile or frown. A more generalizable model can be developed using linear scoring along the valence-arousal graph. A prediction of valence and arousal scores for future subjects can then be interpreted using a wider range of emotion states instead of the initial selection of about eight.


Figure: Valence-arousal plot

Reference: G. Paltoglou and M. Thelwall, “Seeing Stars of Valence and Arousal in Blog Posts,” IEEE Transactions on Affective Computing, vol. 4, no. 1, Jan.-Mar. 2013.

Points on the valence-arousal plot can be translated to commonly understood emotions.
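One simple way to perform that translation is a nearest-neighbor lookup against labeled anchor points. The sketch below assumes valence and arousal are each scaled to [-1, 1]; the anchor coordinates in `EMOTION_ANCHORS` are illustrative guesses, not values taken from the cited reference.

```python
import math

# Hypothetical (valence, arousal) anchors, each in [-1, 1]; illustrative only.
EMOTION_ANCHORS = {
    "happy": (0.8, 0.5),
    "excited": (0.6, 0.9),
    "calm": (0.6, -0.6),
    "sad": (-0.7, -0.5),
    "angry": (-0.7, 0.7),
    "bored": (-0.3, -0.8),
}


def nearest_emotion(valence, arousal):
    """Return the label whose anchor is closest to the predicted point."""
    point = (valence, arousal)
    return min(
        EMOTION_ANCHORS,
        key=lambda name: math.dist(EMOTION_ANCHORS[name], point),
    )
```

Because the model predicts continuous scores, new emotion labels can be added later simply by placing more anchors on the plot, without retraining.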



There are numerous highlights in every major sporting event. Manual real-time extraction of these highlights by fully attentive labelers is error-prone, labor-intensive, very expensive, and hard to scale. Furthermore, while the most recent games may benefit from manual labeling, years of archived footage remain unprocessed. Most off-stats highlights, such as a ball boy slipping while chasing a tennis ball or a Major League-quality splitter thrown in a Little League game, are overlooked by human observers who are instructed to look only for specific events.

Today, we can automate programmatic video highlights using video recognition techniques. In addition to applying CNNs to static image features, we can use Recurrent Neural Networks (RNNs) to classify video segments using optical flow between image frames. This technique can be trained not only to extract official stat events, but also to extract any interesting player motion not explicitly logged and indexed, for example, an alley-oop in basketball. Because these extraction tasks are automated, studios can come up with new ideas at any time to build upon an existing menu of highlights.

Going beyond sporting events, any kind of motion picture, video ad, or short-form video opens itself up for potential indexing and repurposing. For example, a DC Comics fan may want to easily find all scenes featuring female superhero encounters within the DC universe. This task requires automatic video highlight extraction, which is the key to reviving and monetizing vast archives of content that would otherwise remain buried and forgotten.


Image: Durant eyeing Rihanna after hitting a 3-pointer (she was cheering for LeBron).



Independent Component Analysis (ICA) is one such approach with many proposed variants. An ICA-based deep sparse feature extraction strategy combined with a non-parametric Bayesian approach can automatically determine the optimal dimension for the latent feature vector, removing the heavy labor of parameter tuning that a full deep learning approach would entail. The reported accuracy improvement exceeds 10% over previous results. Variants of Restricted Boltzmann Machines (RBMs) are another major direction of research for deep sparse representation. While much progress has been made on the theoretical front, the experimental results thus far lag behind the best ICA models.

Reference: Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.

The graph on the right is a sparse vector representation of the image on the left. The vector dimensions, called training bases, are laid out along the x-axis, with the bars representing the coefficients for the bases needed to represent the image. A normal sample (top) can be represented as a sparse linear combination of the training bases, while an anomalous sample (bottom) requires a large number of base elements.
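The intuition behind this figure can be reproduced with a toy example. The sketch below uses plain matching pursuit over a small orthonormal set of bases; this is a swapped-in, simplified technique chosen for readability, not the sparse reconstruction cost formulation of the cited paper, and the bases and samples are made-up values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sparse_code(sample, bases, tol=1e-9):
    """Matching pursuit: greedily pick the basis most correlated with the residual."""
    residual = list(sample)
    coeffs = [0.0] * len(bases)
    for _ in range(len(bases)):
        scores = [dot(residual, b) for b in bases]
        k = max(range(len(bases)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) <= tol:
            break  # residual is fully explained
        coeffs[k] += scores[k]
        residual = [r - scores[k] * x for r, x in zip(residual, bases[k])]
    return coeffs


def active_bases(coeffs, tol=1e-9):
    """Number of bases with a non-negligible coefficient."""
    return sum(1 for c in coeffs if abs(c) > tol)


# Hypothetical orthonormal training bases (rows of a scaled Hadamard matrix).
BASES = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]

normal_sample = [2.0, 2.0, 1.0, 1.0]      # equals 3*BASES[0] + 1*BASES[1]
anomalous_sample = [1.0, 0.0, 0.0, 0.0]   # needs every basis to reconstruct

normal_count = active_bases(sparse_code(normal_sample, BASES))
anomalous_count = active_bases(sparse_code(anomalous_sample, BASES))
```

A detector built on this idea would flag a sample as anomalous when the number of active bases (or a related reconstruction cost) exceeds a threshold tuned on normal data.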



Recent advancements in image and video recognition pave the way for many business applications that would have been unimaginably hard or expensive to implement before. BigR.io excels at the application of deep learning to images and electronic documents for use cases ranging from facial recognition, to programmatic video highlights, to image segmentation and object classification.

Advanced Recommendation Engines

Written by Bruce Ho, BigR.io’s Chief Data Scientist


When a shopper comes to an eCommerce site, how does she find the items she wants? Search is generally the default option when the site knows nothing about the shopper. Somewhere along the journey though, the shopper starts to feel that the site should know her preferences better than pulling up a few thousand randomly sorted options and dumping them on her. Certainly the corner store used to treat her with a more personal touch.

The key to making customers feel at home is getting to know their tastes, as the corner store used to, and suggesting to them the newly arrived silver set that matches their placemats. This less charming but more consistent replacement for the store owner is called the recommendation engine. With Amazon’s rise to preeminence, recommendation engine technology has gained an aura of mystique and is often called a killer app by industry watchers. The company’s continued rapid growth hasn’t dampened that enthusiasm. The Netflix $1M challenge only added more fanfare.

The Netflix Challenge

Movies are a good example of where the recommendation engine makes or breaks the business. Viewer tastes and movie features vary widely, and the title list for movies is effectively infinite from the viewer’s perspective. Search by title is equally fruitless; you cannot ask viewers to look for what they don’t know. Making the right recommendation is a critical factor in determining whether the user comes back again. At Netflix, 75% of the movies watched came from their recommendation engine.

Netflix shook the geek world in 2006 with their announcement of a $1M prize to find the best movie recommendation algorithm. The winning team emerged almost three years later and generously published their approach, which is an ensemble of many algorithms including exotic neural network designs. There are many lessons to be learned from Netflix’s investment, with one of the biggest being that Netflix never did deploy the winning algorithm into production. Reason number 1: the extra engineering complexity is not worth the benefit. This outcome very much highlights the importance of a system that’s flexible and scalable alongside the quality of the recommendation. More on this at the end of this paper.


Recommendation engine design is one area where common machine learning techniques such as linear regression or Support Vector Machines (SVMs) perform poorly due to extreme data sparsity. Netflix data from 2015 shows:

  • 74 million subscribers
  • 13,300 titles
  • Total cells (user-item pairs): about 984 billion
  • Actual collected ratings: 5 billion, or roughly 0.5% of all cells

As a result, designers resorted to two other approaches that deal with sparsity well: content similarity and collaborative filtering. Each, however, has its own challenges in real-world applications.


Content similarity takes the intrinsic characteristics of users or items and estimates their affinity using similarity measures such as Euclidean distance, cosine similarity, or Pearson’s product-moment correlation coefficient. Someone who likes one comedy movie may like another. Mothers of teens may enjoy a film like “Mom’s Night Out”.
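The three similarity measures named above can each be written in a few lines. The following sketch uses only the Python standard library and assumes items are already encoded as equal-length numeric feature vectors; how those features are chosen and scaled is left open.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between the vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity(
        [a - mean_u for a in u],
        [b - mean_v for b in v],
    )
```

Cosine similarity and Pearson correlation ignore vector magnitude, which is often preferable when some items simply have more recorded features than others. Both are undefined for all-zero (or, for Pearson, constant) vectors, so production code would need to guard against division by zero.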

Content-based strategies require gathering and classifying product and user information that might not be available or easy to collect. The process of collecting features for all the products in the inventory is laborious and often manual. The algorithm is typically brand-specific, and therefore hard to commoditize.

On the user side, the user profile and preferences are often even harder to come by. Users generally won’t volunteer their demographics and interests unless they see the return value, which isn’t clear in the beginning. Although this strategy can work by relying on well-classified product content alone, at the very least the user must offer up some indication of their personal tastes in order to apply similarity matching.

Even when abundant features are collected, a content-based approach has the downside that it cannot find hidden patterns. An article that contains “Operation Desert Storm” is probably of interest to readers of “Middle East foreign policy”, but the association would be overlooked unless identical keyword tokens appear in both.


Collaborative filtering simply takes user rating data for each product without doing any intrinsic classification. The basis for matching is that a user who shares interest in some of the same products as another user may also like her other favorite products.

For example, if bike riders often buy sunglasses, the association may indicate interest in sunglasses by another biker who hasn’t bought sunglasses, even though there is no direct match between bike and sunglasses features.

Figure 1. The input ratings matrix A is factored into user vectors and product vectors, both expressed in terms of hidden factors with k dimensions.

The input ratings matrix can be formidable in size; consider the Netflix data set described above. However, a machine learning technique known as Matrix Factorization both reduces the computational complexity and discovers hidden rating similarity. Here, both user preferences and item features are expressed in terms of latent factors, vectors with only a fraction of the dimensions of the original ratings matrix. With the latent factors determined, the recommendation score for any unrated user-item pair (one missing from the original ratings matrix) can be derived from the dot product of the respective latent factor vectors. For example, the movie Troy would have high values in the latent factors “Action” and “Brad Pitt” (these are not likely latent factors, and are used here only for illustration). Someone who enjoys “Action” and “Brad Pitt” would also have high values for these two factors. This results in a high dot product for this user-movie combination and, therefore, a high recommendation score.
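A minimal sketch of matrix factorization trained with stochastic gradient descent is shown below. The tiny ratings list, the number of latent factors, and all hyperparameters (`k`, `steps`, `lr`, `reg`) are assumptions chosen for illustration, not tuned values from any production system.

```python
import math
import random


def factorize(ratings, n_users, n_items, k=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, user, item):
    """Recommendation score: dot product of the latent factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[user], Q[item]))


def rmse(P, Q, ratings):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed ratings: (user index, item index, rating 1-5).
RATINGS = [
    (0, 0, 5), (0, 1, 4),
    (1, 0, 4), (1, 2, 1),
    (2, 1, 1), (2, 2, 5),
]

P0, Q0 = factorize(RATINGS, n_users=3, n_items=3, steps=0)  # untrained baseline
P, Q = factorize(RATINGS, n_users=3, n_items=3)

# Score for a pair that is missing from the ratings list (user 0, item 2).
unseen_score = predict(P, Q, 0, 2)
```

Only the observed ratings are visited during training, which is why the approach copes with a matrix that is mostly empty.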

This technique works well with rather sparse data. As the Netflix dataset shows, even a ratings matrix in which only a small fraction of cells is populated yields surprisingly useful results. However, this method does have trouble with a brand new user who has no recorded ratings history. This is known as the cold start problem. Only generic recommendations can be made in that case, until ratings build up over time.

For this reason, content similarity is sometimes used to supplement collaborative filtering to overcome the cold start problem, at the expense of doubling the setup cost.

Although the combination of content similarity and collaborative filtering works well, there are notable areas where it falls short. For example, sometimes features interact. Someone can rate both comedy and romance very low, yet enjoy the subcategory of romantic comedy. Considering all possible pairs of features, a direct inclusion of every combination would quickly overwhelm the system, and likely for very little gain in return.

Another area for improvement is freshness of ratings. How can the engine assign more weight to newer ratings and yet not completely discard older ones? A single ratings matrix has no provision for this enhancement, yet shopper taste does shift over time. It is naturally desirable for the recommendations to keep up.
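One common way to favor newer ratings without discarding older ones is an exponential decay weight. The sketch below is an illustrative assumption, not a method described in this paper: the 180-day half-life and the function names are hypothetical.

```python
def decay_weight(age_days, half_life_days=180.0):
    """Weight halves every `half_life_days`; it never reaches zero."""
    return 0.5 ** (age_days / half_life_days)


def weighted_mean_rating(ratings):
    """ratings: list of (rating, age_days) pairs for one user-item category."""
    weights = [decay_weight(age) for _, age in ratings]
    total = sum(w * r for w, (r, _) in zip(weights, ratings))
    return total / sum(weights)


# A recent 5-star rating outweighs a 1-star rating from 180 days ago.
recent_heavy = weighted_mean_rating([(5.0, 0), (1.0, 180)])
```

Here the unweighted mean would be 3.0, while the decayed mean leans toward the more recent rating. A weight like this could also be supplied to a model as an additional feature rather than applied as a fixed formula.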

In time, the bright minds in the field started to dream of a better way, one which:

  • Accepts both user product rating and content features
  • Can be computed with only linear complexity
  • Takes account of feature interactions in an efficient manner
  • Includes timeliness as one of the influencers

By implication, this method would behave well with sparse data, yet mitigate the cold start problem. If the algorithm were based on linear regression, that would be icing on the cake in terms of computational efficiency, and it would open the door to additional characteristics that might indicate a user’s preferences.


The Factorization Machine was introduced by German researcher Steffen Rendle in 2010. This method lays out both user and product as unique features, alongside any other general features (such as freshness of ratings) chosen at the discretion of the data scientist. The key innovation is that Rendle introduces cross-feature terms (equivalent to the quadratic polynomial terms of a standard regression, minus the self-interaction terms), but then applies the latent factor technique to reduce computational complexity. The cross-feature terms, on the one hand, express the user-item pairs inherent in collaborative filtering and, on the other, capture any potential cross-feature interactions.

Figure 2. Factorization Machine accepts features which are a superset of content similarity and collaborative filtering. It includes both user item pairs and general content characteristics, while optimizing computation efficiency using the latent factor method of reduction.

Through clever formula reduction, Rendle brought the complexity back down to O(n), that is, linear in the number of features. No one ever got a job at Google or Amazon by proposing an O(n^2) solution! This reduction is critical in judging whether a solution is sound in practice.
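The reduction rests on an algebraic identity: the sum over all feature pairs i < j of <v_i, v_j> x_i x_j equals one half of the sum, over each latent factor f, of (sum_i v_if x_i)^2 minus sum_i (v_if x_i)^2. The sketch below computes the pairwise term both ways so the equivalence can be checked. The feature vector and factor values are made up for illustration, and with k latent factors the reduced form costs O(k·n), which is linear in n for fixed k.

```python
def fm_pairwise_naive(x, V):
    """Quadratic form: sum over i < j of <V[i], V[j]> * x[i] * x[j]."""
    n, k = len(x), len(V[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(V[i][f] * V[j][f] for f in range(k))
            total += inner * x[i] * x[j]
    return total


def fm_pairwise_linear(x, V):
    """Reduced form: 0.5 * sum_f [(sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2]."""
    n, k = len(x), len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        total += 0.5 * (s * s - s_sq)
    return total


def fm_predict(x, w0, w, V):
    """Full model: bias + linear terms + factorized pairwise interactions."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    return w0 + linear + fm_pairwise_linear(x, V)


# Hypothetical input: one-hot user, one-hot item, plus a rating-freshness feature.
x = [1.0, 0.0, 0.0, 1.0, 0.5]
V = [[0.1, 0.3], [0.2, -0.1], [0.4, 0.2], [-0.3, 0.5], [0.6, 0.1]]
w0, w = 3.0, [0.2, 0.0, 0.1, -0.4, 0.3]

score = fm_predict(x, w0, w, V)
```

Because most entries of `x` are zero in practice (one-hot encodings), an implementation can also skip zero features entirely, making the cost proportional to the number of non-zero features.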

Now that the model is linear in each of its parameters, the direct implication is that optimization techniques such as SGD (Stochastic Gradient Descent), which parallelize well, can be applied to achieve linear scalability. This leads to the system architectural considerations discussed in the next section.

The Factorization Machine is one of the many specialty approaches that BigR.io’s expertise can offer. It is an illustration of our philosophy to help our clients design the highest performing and most advanced solutions in any given situation, while balancing budget and other operational concerns.


What did Netflix learn from its $1M investment in the ultimate recommendation algorithm? According to Netflix: “Coming up with a software architecture that handles large volumes of existing data, is responsive to user interactions, and makes it easy to experiment with new recommendation approaches is not a trivial task.” Clearly, had production viability been included in the competition criteria, Netflix might have made better use of the prize money. Case in point: 100 million ratings were published for the competition, but the actual production load was 5 billion.

The engineering disciplines necessary to bring advanced algorithms to deployment cannot be overlooked. The algorithm should be adaptable to a scalable implementation (such as Apache Spark), and not simply dropped into a faster box in its original “data science” form.

From a system architecture perspective, model training and on-demand predictions for online users have very different requirements in data volume, computational load, and response time, and should always be launched in separate subsystems.

Often, it is also useful to introduce a third subsystem, a caching layer, in which the system precomputes the next set of predictions in response to user events. While users are digesting the first prediction or consuming the recommended item, the system pre-computes the likely follow-up and caches the results to await future user requests. The opportunities for performance gains are numerous and await diligent discovery.
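The caching idea can be sketched as a thin wrapper around the prediction service. Everything here is hypothetical: the class name, the `predict_fn` and `likely_next_fn` callables, and the context labels are placeholders, and the prefetch runs synchronously for simplicity where a real system would do it in the background.

```python
class PrefetchingRecommender:
    """Serves predictions and pre-computes likely follow-up requests."""

    def __init__(self, predict_fn, likely_next_fn):
        self._predict = predict_fn          # expensive on-demand prediction
        self._likely_next = likely_next_fn  # guesses the user's next contexts
        self._cache = {}
        self.cache_hits = 0

    def recommend(self, user_id, context):
        key = (user_id, context)
        if key in self._cache:
            self.cache_hits += 1
            result = self._cache[key]
        else:
            result = self._predict(user_id, context)
        # Pre-compute the likely follow-ups while the user digests this result.
        for next_context in self._likely_next(context):
            next_key = (user_id, next_context)
            if next_key not in self._cache:
                self._cache[next_key] = self._predict(user_id, next_context)
        return result


# Hypothetical stand-ins for the real prediction subsystem.
calls = []


def slow_predict(user_id, context):
    calls.append((user_id, context))
    return f"recommendations for {user_id} in {context}"


FOLLOW_UPS = {"home": ["comedy"], "comedy": ["romantic-comedy"]}


def likely_next(context):
    return FOLLOW_UPS.get(context, [])


engine = PrefetchingRecommender(slow_predict, likely_next)
first = engine.recommend("u1", "home")      # computed on demand
second = engine.recommend("u1", "comedy")   # served from the cache
```

A production version would also need cache expiry and size limits, since stale or unbounded caches undermine the freshness and scalability goals discussed above.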


Recommendation technology has come a long way since the dawn of eCommerce. Advances in matching algorithms have been nothing short of stunning. The Factorization Machine not only mitigates the cold start problem, it also incorporates feature interactions and rating freshness, important functions that were missing from previous-generation technology. At the same time, it fully embraces parallel processing and is therefore well suited to Big Data platforms. BigR.io is uniquely qualified not only to fine-tune the recommendation algorithm for your business model, but also to deliver a complete solution that meets the scalability and operational requirements of a high-volume, fast-response production system.

An Adaptive and Fine-Grained Experience with Big Data & Data Science

Written by Bruce Ho, BigR.io’s Chief Data Scientist


Your customers tell you a lot about themselves through their digital interactions and social media assertions. With all the internal and external data you have, you can get to know each one of them as well as the corner store owner. Uncover their preferences, forestall their defection, and predict their demands by deciphering their digital behavior with advanced analytics and Big Data technology. In this paper, we take you through the expertise BigR.io delivers to complete your 360-degree customer view.

The capabilities explored are:

  • Fine-grained Personalization
  • Social Media Sentiment
  • Next Best Action
  • Precision Recommendation
  • Detect Buyer Readiness
  • Market Segmentation
  • Real-time Customer Monitoring



Are your targeted prospects, even after making significant marketing investments, still just hanging around the shopping cart and not pulling the trigger? What is the last piece of “assurance” that will get them over the fence? What if you can read into their minds and find out what’s holding them up? What is that missing carrot that will boost your revenue?

Experienced marketers will tell you that the customer journey starts long before reaching the shopping cart. You can begin to win the prospects’ favor from their first curiosity about your product, throughout the nurturing process, until they reach the final stages of comparison shopping and conversion. To effectively channel these customers down the sales funnel, you want to remain in tune with their state of readiness, and be equipped to offer the right incentives at the right time.

But, “What is the right incentive?”, and “How do you determine the right timing?”, you ask. Isn’t every customer different? You must not only read minds, but millions of minds, on a moment’s notice. How can anyone but the IBMs of the world hope to attain this kind of market intelligence?

Fortunately, today’s playing field has been leveled; all businesses can and must embrace this capability now. With the emergence of Big Data technologies and recent advancements in analytics, offering a fine-grained customer experience across all channels is no longer just the domain of giant multinationals. Cloud infrastructure removes the economic barrier to utilizing virtually unlimited compute and storage capacity. And specialty consulting firms like BigR.io deliver the quantitative expertise that makes analytics initiatives low-risk, must-have propositions.

In this white paper, we discuss the importance of customer analytics in terms of business impact, and present how BigR.io delivers customer acquisition through personalization.


Modern commerce is predominantly conducted via digital channels. This allows suppliers to collect unprecedented amounts of data on their customers. Advanced analytics techniques turn this data into insights that help businesses serve individual customers better and win more sales.

Customer 360-Degree View

A 360-degree view yields a complete profile of the customer by incorporating all available information, whether from internal repositories, email, voice records, or social networks. Such a detailed understanding translates into a superior customer experience, improved campaign effectiveness, boosted sales, and better retention.

Customer 360 View

Customer Journey – track each customer’s digital trail and decipher mood relative to the experience. These interactions contain the clues on how best to engage prospects and when.

Examples of customer journey include:

  • Opening a new account at a retail bank
  • Shopping for new music online
  • Monthly billing for an online service

Customer Acquisition – customer profiles combined with nurturing collateral are the carrots that win the hearts of new clients. The 360-degree view is the basis for tactical engagements through all touchpoints, from first exposure to final conversion and continued engagement.

Customer Churn – Conversely, every existing customer retained is worth a new customer earned. Existing clients who exhibit signs of dissatisfaction or inclinations towards switching can be detected through real-time analytics and remedial efforts can be made in time to retain loyalty.



  • Contact list through referrals, social media, blogs, LinkedIn, events, website
  • Enrich each prospect by cross-referencing to CRM account and other available information
  • Segmentation for email / ad campaign

Engaged – first response of any sort

  • Ad targeting
  • Email reminders
  • Webinar / live event invitations


  • Promotional offers
  • Assigned to sales rep
  • Call center active list
  • High search keyword, CPM targeting


  • Account management / customer care
  • Purchase history
  • Satisfaction survey


  • Recommendation upsell / cross sell
  • Product release announcements


  • Loyalty program
  • Monitoring signs of dissatisfaction
  • Frequent re-engagement
  • Encourage viral advocacy

Fine-Grained Personalization

The ultimate customer care is when the buyer is automatically presented with the most probable match to their needs or wants. Personal preferences are ingrained in the breadcrumbs left along the entire digital journey. BigR.io can help you collect, analyze, and serve them up for each and every instance of customer interaction. Implementation of personal profiles as active records means dealing with the challenges of rapid access and sheer volume; these are the very missions of Big Data engineering. By embracing the new generation of computing infrastructure, the forward-looking enterprise can incorporate this high degree of personalization into daily operations.

Blend CRM Data with Social Media Sentiments

This is where the Navy meets the pirate; the orthodox meets the avant garde. Businesses shouldn’t pretend they understand all there is to know about their customers, regardless of the extent of their CRM database. Consumers are attracted to coolness, and what’s cool is in the Facebook likes, the midnight Tweets, and the Instagram posts.

Deciphering social media sentiment is a fascinating and daunting challenge. All the factors that motivated the Big Data movement come together under one roof. The data is unformatted, sketchy, nearly impossible to authenticate and validate, and often short-lived in relevance, yet likely to be highly honest. Once the author’s identity is mapped to the user ID in the CRM, the marketer gains insight into a person’s habits, range of motion, purchasing patterns, and strongly held opinions that would not otherwise be revealed in a targeted survey. As a whole, social media data also points out market trends and helps companies adapt to changing customer tastes. This effort is one of the most important steps in completing the 360-degree view of your customers.

Next Best Action

Customer insights become gold if the business can plot the Next Best Action (NBA): the course of action that best matches the customer’s individual needs. Adaptive, data-driven decisions are far superior to gut feelings and intuition when it comes to consistent results. Predictive analysis finds the combination of timing and marketing levers that elicits the best response from the buyer. Call center operators can work from a list of most likely customer concerns before picking up. Visiting reps can arrive at meetings prepared with the most appealing agenda. Customer support agents can proactively offer assistance or promotional material before frustration builds.

The model itself can be retrained at appropriately set intervals to remain adaptive. Through Predictive Model Markup Language (PMML), personalized NBAs can integrate directly into operational systems such as marketing automation systems, call center software, and consultation scheduling systems, or anywhere else automated or manual processes take place.


Today’s marketing activity has moved way beyond the back office CRM. Marketing automation systems, ad targeting, email delivery systems, A/B and multivariate testing tools, consumer surveys, crowdsourcing, and social media monitoring are all necessary parts of a complete arsenal to achieve marketing success. The escalation in tools and data also means the enterprise must utilize cutting-edge analytics to orient its campaigns and drive personalized marketing efforts in an ever-more competitive landscape.

Precision Recommendations with Factorization Machines

A recommendation engine is the most obvious of eCommerce strategies for any business trying to boost revenue from its existing client base. Collaborative Filtering is the most commonly recognized technique for finding likely customer-product matches, given only sparse purchasing history data. A more recent innovation, the Factorization Machine, can exceed collaborative filtering in recommendation precision, to the point where a recommendation score can be set for each individual customer-product pair. In addition, this technique incorporates cross-feature influences that may be too important to overlook. For example, a viewer may actually dislike romance and comedy movies, but be fascinated by romantic comedies.

Detection of the Prospect Funnel State – Buyer Readiness

The sales funnel is an important marketing concept used to direct sales efforts and organize campaigns. While marketing automation platforms offer prospect scoring, the scoring rules are typically set manually. Advanced analytics techniques such as the Hidden Markov Model can score each prospect automatically based on digital behavior. Digitally intelligent enterprises can leapfrog the competition by aggressively adopting data-driven innovations and harnessing these important hidden indicators.
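As a rough illustration of how a Hidden Markov Model could infer funnel state from digital behavior, the sketch below runs Viterbi decoding over a short sequence of observed visits. The state names, observation names, and every probability are assumptions invented for this example; in practice they would be estimated from tracking data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical funnel model (all numbers are illustrative assumptions).
states = ["unaware", "interested", "active"]
start_p = {"unaware": 0.8, "interested": 0.15, "active": 0.05}
trans_p = {
    "unaware": {"unaware": 0.7, "interested": 0.25, "active": 0.05},
    "interested": {"unaware": 0.1, "interested": 0.6, "active": 0.3},
    "active": {"unaware": 0.05, "interested": 0.15, "active": 0.8},
}
emit_p = {
    "unaware": {"no_visit": 0.8, "short_visit": 0.15, "deep_browse": 0.05},
    "interested": {"no_visit": 0.3, "short_visit": 0.5, "deep_browse": 0.2},
    "active": {"no_visit": 0.1, "short_visit": 0.3, "deep_browse": 0.6},
}

behavior = ["no_visit", "short_visit", "deep_browse", "deep_browse"]
funnel_path = viterbi(behavior, states, start_p, trans_p, emit_p)
```

For long behavior histories, log-probabilities would be the safer choice, since repeated multiplication of small numbers underflows.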

Advanced Classification for Market Segmentation

While fine-grained personalization is well within the capability of today’s app serving platforms, from a management perspective it is of practical necessity to conduct campaigns at a cluster level. Analytically sophisticated organizations can make guided decisions in choosing amongst a host of clustering techniques from simple (k-means) to advanced (latent classification) to handle data at the extreme scales of volume and complexity. The mathematically derived classifications reduce dependency on gut feelings in favor of consistent returns.

Streaming Analytics for Real-Time Customer Monitoring

There are a number of use cases where a timely response from an organization is vital to success. A dissatisfied customer can exhibit familiar disapproving behavior patterns right before defecting to competitors. Other digital signatures are indicative of high receptivity to promotional efforts. The concept can be extended to fraudulent transactions. The urgency of response in these cases is self-evident. Streaming data analytics brings real-time pattern detection within reach of every progressive organization.

Data Management Platform

Effective marketing campaigns are essential to business expansion and bringing in sales. BigR.io can help you design and build your Data Management Platform (DMP) to help drive your marketing initiatives. A DMP incorporates data on retail transactions, catalogs, social media, and online advertising. Our DMPs support hypothesis-driven, high-impact analyses, and help your organization move from simply collecting data to surfacing actionable insights and delivering business results.


Let’s use a fictional company called Ace Shopper (“Ace”) and a fictional customer named Hank to illustrate the above concepts in action. Ace is a specialty consumer electronics superstore that actively profiles its buyers with customer 360-degree view analytics. Its records have customer Hank on file as a busy executive with three teens, two of whom are musically gifted. Over the past three years, Hank’s family has spent over $10K per year on musical instruments and accessories in the store.

Ace’s marketing department categorizes its customer population based on the most distinguishing features, which are found to be indicative of their spending patterns. Hank falls in the group of highly educated parents with college-bound children. His purchase records show a preference for high-end electronics and an incidental interest in laptops.

When a new electronic keyboard comes to the market, Ace’s marketing department runs its recommendation engine and finds Hank to be a top 10% match. Ace proactively includes Hank in its email campaign on seasonal product releases. The email addresses Hank by name and thanks him for his purchases over the past year.

Soon after the email campaign, the customer tracker application lists Hank as a customer whose interest level has moved from “unaware” to “interested” due to his short visit to Ace’s eStore. The marketing platform does not fire additional emails yet, but starts retargeting Hank with musical instrument ads.

A few months later, as summer break nears, the tracking application picks up frequent visits by Hank to the eStore, and each time the click trail dives deeper into product details. Click tracking software also picks up signs of comparison shopping. Hank’s readiness status is now upgraded to “active”. With this status, the marketing platform now authorizes promotions valued at up to 5% of Hank’s annual purchases.

At the same time, one of Hank’s Facebook influencers posts a complaint about a failed circuit board on an electric guitar, to which Hank actively responds. Apparently, this issue worries Hank sufficiently that he opens a chat session with Ace’s sales agent. Before responding, the agent sees Hank’s readiness status, purchase history, issue list, and suggested promotional offers prepared by the customer analytics engine. The agent greets Hank, assures him of the quality of the new product line, and offers a free two-year extended warranty, plus an invitation to attend an in-store live performance by a famed local artist. Within a week, Hank becomes a happy repeat customer.


Customer 360-degree view means a complete understanding of a client’s needs and a personalized customer care presence. This level of intelligence and responsiveness can only be achieved with a highly integrated system that can readily access all data sources and continually refreshes the status of the customer.

Big Data technology and advanced analytics bring fine-grained personalization within reach of all forward-looking companies. It’s no longer a matter of economics but a matter of will on the part of business owners. BigR.io, with its team of high-caliber consultants, can help architect a platform, comb through your data, and implement a custom analytic solution that matches your unique marketing practice.

Leveraging Big Data, Advanced Machine Learning, and Complex Event Processing Technologies

Written by Bruce Ho, BigR.io’s Chief Data Scientist


The global Internet of Things (IoT) market will grow to $1.7 trillion in 2020 from $656 billion in 2014, according to IDC Insights Research. IoT is forecast to generate a staggering 500 zettabytes of data per year by 2019 (up from 134.5 ZB per year in 2014), coming from 50 billion connected devices, according to a report from Cisco. Massive challenges arise in managing this data and making it useful. This is due not just to the sheer volume of data being generated, but also to the inherent complexity of the data. Fortunately, great open source applications and frameworks such as Spark and Hadoop have emerged to address these challenges. Similarly, advances in Neural Networks, Deep Learning, and Complex Event Processing help drive ever-more sophisticated analyses. BigR.io can help you take on these new challenges at any stage of the adoption lifecycle: strategy/planning phase, infrastructure design and implementation, or production operations and data science.



Potential applications of IoT range from health maintenance and remote diagnoses at an individual level to grandiose world-changing scenarios like smart semi-automated factories, buildings, homes, and cities. IoT systems generate serious amounts of data. For example:

  • A Boeing 787 aircraft generates 40TB per hour of flight
  • A Rio Tinto mining operation can generate up to 2.4TB of data per minute

Data is ingested with or without schema, in textual, audio, video, imagery and binary forms, sometimes multi-lingual and often encrypted, but almost always with real-time velocity. While the initial technology challenge in harnessing IoT is an infrastructural upgrade to address the data storage, integration, and analytic requirements, the end goal is to generate meaningful business insights from the ocean of data that can translate to strategic business advantages.

The first step is making sure your infrastructure is ready for the influx of this increased data volume. It is imperative to productionize Hadoop and reap the benefits of technologies such as Spark, Hive, and Mahout. BigR.io has specialists who can evaluate your current systems and provide any architectural direction necessary to update your infrastructure to embrace “Big Data”, while leveraging your existing investments. Once the environment is fully implemented, BigR.io will then help you capitalize on your investment with Machine Learning experts who can help you start mining data, surfacing insights, and automating notifications and autonomous actions based on those insights.

The branch of machine learning most central to IoT is automated rule generation; BigR.io uses the term Complex Event Processing (CEP). These rules represent causal relations between the observed events (e.g. noise, vibration) and the phenomenon to be detected (e.g. a worn washer). Human experts can be employed to create user-defined rules within reasonable limits of complexity. In sensor data terms, that limit is the first millimeter in the journey to Mars. The raw events themselves rarely tell a clear story. Reliable and identifiable signs of trouble generally consist of a combination of low-level events masked by irregular temporal patterns. The individual events that make up a valid signal can exhibit temporal behaviors over impossibly wide ranges, from sub-second to months or longer, each further confounded by anomalies such as sporadic occurrence or outliers. Only machine learning techniques can both overcome the challenge of collecting, preparing, and fusing the massive data into useful feature sets and extract the event patterns that can be induced as readable rules for predicting a future recurrence of a suspect phenomenon.

Embracing IoT

As in any nascent field of endeavor, there are multiple candidate approaches inspired by techniques proven in related past experiences, each with its promises and handicaps. While abundant rule-based classifiers are reported in the literature and have gone through extended efforts of improvement, they were generally applied to classes of problems that are narrower in scope, of an offline nature, and lacking explicit temporal attributes. At BigR.io, we reach beyond these more established classification approaches in favor of innovations that deal more effectively with the greater levels of volume and complexity typically found in the IoT context. As is usually the case in machine learning, we find that better final results are obtained by using an ensemble of models that are optimally combined using proven techniques like Super Learner.


For problems of this complexity, Neural Networks are a natural fit. In statistical terms, a neural network implements regression or classification by applying nonlinear transformations to linear combinations of raw input features. Because a network typically has 3 or 4 layers and a potentially high number of nodes per layer, it is generally untenable to interpret the intermediate model representations, even when good prediction results are achieved, and the computational load requires a dedicated engineering effort.

Neural Networks have many key characteristics which make them an attractive and typically the default option for very complex modeling such as that found in IoT applications. Sensor data is voluminous with complex patterns (especially temporal patterns); both fall under the strengths of neural networks. The variety of data representations makes feature engineering difficult for IoT, but neural networks automate feature engineering. Neural Networks also excel in cross-modality learning, matching the multiple modalities found in IoT.

Adding Deep Learning to Neural Network architectures takes the sophistication and accuracy of machine-generated insights to the next level and is BigR.io’s preferred method. Deep Learning Neural Networks differ from “normal” neural networks by adding more hidden layers, and they can be trained in both an unsupervised and a supervised manner (although we suggest employing unsupervised learning tasks as often as feasible).
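The statistical description above, a nonlinear transformation applied to linear combinations of inputs, can be sketched in a few lines. The two-input, two-hidden-node network below uses made-up weights and hypothetical sensor features; it shows only the forward pass, not training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: a linear combination per node, then a nonlinear activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical weights (assumptions for illustration, not trained values).
W1 = [[0.5, -0.5], [1.0, 1.0]]   # hidden layer: 2 nodes, 2 inputs each
B1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]               # output layer: 1 node
B2 = [0.0]

def forward(x):
    """Classification score in (0, 1) for inputs such as [noise, vibration]."""
    hidden = dense(x, W1, B1, math.tanh)
    return dense(hidden, W2, B2, sigmoid)[0]

baseline = forward([0.0, 0.0])
score = forward([0.8, 0.3])
```

Even in this tiny network, the hidden values have no direct physical meaning, which hints at why larger networks are hard for subject matter experts to interpret.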

There are numerous additional strengths in the Deep Learning approach:

  • Full expressiveness from non-linear transformations
  • Robustness to unintended feature correlations
  • Allows extraction of learned features
  • Can stop training anytime and reap the rewards
  • Results improve with more data
  • World class pattern recognition capabilities

Because of its richness in expressiveness, Deep Learning can be counted on to tackle IoT rule extraction from a modeling perspective. However, the complexity and risks associated with the implementation should be weighed carefully. Consider some well-known challenges:

  • Slow to train – high iterations and many hyperparameters translate to significant computing time
  • Black box paradigm – subject matter experts cannot make sense of the connections to improve results
  • Overfitting is a common problem that requires attention
  • Still requires preprocessing steps to handle dirty data problems such as missing values
  • Practitioners generally resort to special hardware to achieve desired performance

BigR.io’s team of highly trained specialists is well equipped to take on these implementation challenges. We select from a host of available platforms including Apache Spark, Nvidia CUDA, or HP Distributed Mesh Computing. Often, having the necessary intuitions derived from experience can speed up the completion of training by a factor of 10.

In certain cases, the performance cost associated with Neural Networks, especially with Deep Learning, motivates other approaches. One alternative BigR.io often champions is the use of a specialty CEP engine optimized for flowing sensor data.


In this approach, we look at the IoT rule extraction challenge not as a generalized machine learning problem, but rather one which is characterized by some unique aspects:

  • Voluminous and flowing data
  • The input is one or more event traces
  • Temporal patterns play a prominent role alongside event types and attributes
  • The problem is decomposable into time-window, sequence, and conjunctive relationships
  • Event sequences and their time relationships form large grains of composite events
  • The conjunction of the composite events yields describable rules for predicting the suspect phenomenon

This CEP engine represents a practical tradeoff between expressiveness and performance. Where a comparable IoT study may require days to process, this specialized engine may complete its task in under an hour. Parallelization based on in-memory technologies such as Apache Spark may soon lead to real-time or near real-time IoT analysis. Unlike the case of a Neural Network, a subject matter expert can make sense of the results from this engine, and may be able to manually optimize the rule through iterations.
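The decomposition into time windows, sequences, and conjunctions can be illustrated with a toy rule evaluator. This is a sketch under stated assumptions, not the engine described here: the event names `temp_rise` and `pressure_drop`, the gap limits, and the window size are all hypothetical.

```python
def find_sequences(events, first, second, max_gap):
    """Timestamps at which `second` follows `first` within `max_gap` seconds.

    events: list of (timestamp, event_type) tuples sorted by timestamp.
    """
    hits = []
    last_first = None
    for ts, kind in events:
        if kind == first:
            last_first = ts
        elif kind == second and last_first is not None and ts - last_first <= max_gap:
            hits.append(ts)
            last_first = None  # consume the match so it is not reused
    return hits

def rule_fires(events, window):
    """Conjunctive rule: both composite events occur within `window` seconds."""
    a = find_sequences(events, "noise", "vibration", max_gap=5)
    b = find_sequences(events, "temp_rise", "pressure_drop", max_gap=30)
    return any(abs(ta - tb) <= window for ta in a for tb in b)

trace = [
    (0, "noise"), (3, "vibration"),             # composite A completes at t=3
    (40, "temp_rise"), (55, "pressure_drop"),   # composite B completes at t=55
    (200, "noise"), (220, "vibration"),         # gap too long: no composite
]

fires_wide = rule_fires(trace, window=60)
fires_narrow = rule_fires(trace, window=30)
```

Because the rule is expressed as named sequences and a window, a subject matter expert can read it and adjust the thresholds by hand.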

These two approaches are complementary in a number of ways. For example, a prominent derived feature involving an obscure non-linear combination of raw events may be extracted from the Neural Network study and fed into the CEP engine, vastly improving the quality of prediction. The CEP engine might drive the initial effort of a study, extracting most of the low-hanging-fruit rules. This leaves Neural Networks to detect the remaining rules after pruning either the sample data or event types from the first phase. In some cases, the two techniques can simply be used for cross-validation when inconsistent results are obtained.


Running more than one modeling approach is more the norm than the exception in today’s machine learning best practices. Recent work has demonstrated that an ensemble of algorithms can outperform a single algorithm. The stacking algorithm and combined weak classifiers are two examples of formal research where the ensemble approach produces better results.

In this context, the two-model approach can lead to a final result in several ways, such as the feature hand-off, staged rule extraction, and cross-validation described above.

A Super Learner is a loss-based supervised learning method that finds the optimal combination of a collection of prediction algorithms. It is generally applicable to any project with either diverse models or a single model which leverages different feature sets and modeling parameters. Such provisions can mean significant improvements in terms of reduced false alarms or increased accuracy of detection. Depending on the context of the application, one or both of such improvements can have a strong impact on the perceived success of the project.
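The core idea of finding an optimal combination under a loss function can be sketched for the two-model case. This is a deliberate simplification: a full Super Learner uses V-fold cross-validated predictions and can combine many candidate algorithms, whereas the sketch below assumes held-out predictions are already available and simply grid-searches one convex weight. All data values are made up.

```python
def mse(preds, targets):
    """Mean squared error, used here as the loss to minimize."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def best_convex_weight(preds_a, preds_b, targets, steps=100):
    """Grid-search alpha in [0, 1] minimizing the loss of alpha*a + (1-alpha)*b."""
    best_alpha, best_loss = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        blended = [alpha * a + (1 - alpha) * b for a, b in zip(preds_a, preds_b)]
        loss = mse(blended, targets)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss

# Hypothetical held-out predictions from two models with opposite biases.
targets = [1.0, 2.0, 3.0, 4.0]
model_a = [2.0, 3.0, 4.0, 5.0]   # consistently one unit too high
model_b = [0.0, 1.0, 2.0, 3.0]   # consistently one unit too low

alpha, loss = best_convex_weight(model_a, model_b, targets)
```

In this contrived case the blend removes both biases entirely; on real data the gain is usually smaller, but the combination should do no worse on the held-out set than the better of the two models, since weights of 0 and 1 are among the candidates.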

Repeatable Approaches to Big Data Challenges for Optimal Decision Making


A number of architectural patterns are identified and applied to a case study involving ingest, storage, and analysis of a number of disparate data feeds. Each of these patterns is explored to determine the target problem space for the pattern and pros and cons of the pattern. The purpose is to facilitate and optimize future Big Data architecture decision making.

The patterns explored are:

  • Lambda
  • Data Lake
  • Metadata Transform
  • Data Lineage
  • Feedback
  • Cross-Referencing


Modern business problems require ever-increasing amounts of data, and ever-increasing variety in the data that they ingest. Aphorisms such as the “three V’s” have evolved to describe some of the high-level challenges that “Big Data” solutions are intended to solve. An introductory article on the subject may conclude with a recommendation to consider a high-level technology stack such as Hadoop and its associated ecosystem.

While this sort of recommendation may be a good starting point, the business will inevitably find that there are complex data architecture challenges both with designing the new “Big Data” stack as well as with integrating it with existing transactional and warehousing technologies.

This paper will examine a number of architectural patterns that can help solve common challenges within this space. These patterns do not rely on specific technology choices, though examples are given where they may help clarify the pattern, and are intended to act as templates that can be applied to actual scenarios that a data architect may encounter.

The following case study will be used throughout this paper as context and motivation for application of these patterns:

Alpha Trading, Inc. (ATI) is planning to launch a new quantitative fund. Their fund will be based on a proprietary trading strategy that combines real-time market feed data with sentiment data gleaned from social media and blogs. They expect that the specific blogs and social media channels that will be most influential, and therefore most relevant, may change over time. ATI’s other funds are run by pen, paper, and phone, so for this new fund they are building their data processing infrastructure as a greenfield project.


Diagram 1: ATI Architecture Before Patterns

The first challenge that ATI faces is the timely processing of their real-time (per-tick) market feed data. While the most recent ticks are the most important, their strategy relies on a continual analysis of not just the most recent ticks, but of all historical ticks in their system. They accumulate approximately 5GB of tick data per day. Performing a batch analysis (e.g. with Hadoop) will take them an hour. This batch process gives them very good accuracy – great for predicting the past, but problematic for executing near-real-time trades. Conversely, a streaming solution (e.g. Storm, Druid, Spark) can only accommodate the most recent data, and often uses approximating algorithms to keep up with the data flow. This loss of accuracy may generate false trading signals within ATI’s algorithm.

In order to combat this, the Lambda Pattern will be applied. Characteristics of this pattern are:

  • The data stream is fed by the ingest system to both the batch and streaming analytics systems.
  • The batch analytics system runs continually to update intermediate views that summarize all data up to the last cycle time — one hour in this example. These views are considered to be very accurate, but stale.
  • The streaming analytics system combines the most recent intermediate view with the data that has streamed in since the last batch cycle (up to one hour) to produce the final view.


Diagram 2: Lambda Architecture

While a small amount of accuracy is lost over the most recent data, this pattern provides a good compromise when recent data is important, but calculations must also take into account a larger historical data set. Thought must be given to the intermediate views in order to fit them naturally into the aggregated analysis with the streaming data.
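A minimal sketch of how the two layers fit together is shown below, assuming a hypothetical running mean price as the quantity of interest. The function names and tick values are illustrative only. The intermediate view stores a count and a total rather than the mean itself, which is what lets the streaming side merge in recent ticks without revisiting history.

```python
def batch_view(ticks):
    """Batch layer: exact summary of all ticks up to the last batch cycle."""
    return {"count": len(ticks), "total": sum(price for _, price in ticks)}

def final_view(view, recent_ticks):
    """Streaming layer: merge the stale batch view with ticks seen since that cycle."""
    count = view["count"] + len(recent_ticks)
    total = view["total"] + sum(price for _, price in recent_ticks)
    return {"count": count, "mean_price": total / count}

# Hypothetical (timestamp, price) ticks.
historical = [(1, 100.0), (2, 102.0), (3, 104.0)]   # covered by the last batch run
recent = [(4, 110.0)]                               # arrived after the batch run

view = batch_view(historical)
result = final_view(view, recent)
```

Statistics that cannot be merged this way (a median, for example) need a different intermediate representation or an approximation, which is the design consideration the paragraph above points to.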

With this pattern applied, ATI can utilize the full backlog of historical tick data; their updated architecture is as follows:

Diagram 3: ATI Architecture with Lambda

The Lambda Pattern described here is a subset and simplification of the Lambda Architecture described in Marz/Warren. For more detailed considerations and examples of applying specific technologies, this book is recommended.


ATI suspects that sentiment data analyzed from a number of blog and social media feeds will be important to their trading strategy. However, they aren’t sure which specific blogs and feeds will be immediately useful, and they may change the active set of feeds over time. In order to determine the active set, they will want to analyze the feeds’ historical content. Not knowing which feeds might turn out to be useful, they have elected to ingest as many as they can find.

Diagram 4: Data Lake

They quickly realize that this mass ingest causes them difficulties in two areas:

  • Their production trading server is built with very robust (and therefore relatively expensive) hardware, and disk space is at a premium. It can handle those feeds that are being actively used, but all the speculative feeds consume copious amounts of storage space.
  • Each feed has its own semantics; most are semi-structured or unstructured, and all are different. Each requires a normalization process (e.g. an ETL workflow) before it can be brought into the structured storage on the trading server. These normalization processes are labor-intensive to build, and become a bottleneck to adding new feeds.

These challenges can be addressed using a Data Lake Pattern. In this pattern, all potentially useful data sources are brought into a landing area that is designed to be cost-effective for general storage. Technologies such as HDFS serve this purpose well. The landing area serves as a platform for initial exploration of the data, but notably does not incur the overhead of conditioning the data to fit the primary data warehouse or other analytics platform. This conditioning is conducted only after a data source has been identified as being of immediate use for the mainline analytics.

Data Lakes provide a means for capturing and exploring potentially useful data without incurring the storage costs of transactional systems or the conditioning effort necessary to bring speculative sources into those transactional systems.

Often all data may be brought into the Data Lake as an initial landing platform. However, this extra latency may result in potentially useful data becoming stale if it is time sensitive, as with ATI’s per-tick market data feed. In this situation, it makes sense to create a second pathway for this data directly into the streaming or transactional system. It is often good practice to also retain that data in the Data Lake, both as a complete archive and in case that data stream is removed from the transactional analysis in the future.

Incorporating the Data Lake pattern into the ATI architecture results in the following:

Diagram 5: ATI Architecture with Data Lake


By this time, ATI has a number of data feeds incorporated into their analysis, but these feeds carry different formats, structures, and semantics. Even discounting the modeling and analysis of unstructured blog data, there are differences between well-structured tick data feeds. For example, consider the following two feeds showing stock prices from NASDAQ and the Tokyo Stock Exchange:

The diagram above reveals a number of formatting and semantic conflicts that may affect data analysis. Further, consider that the ordering of these fields in each file is different:

NASDAQ: 01/11/2010,10:00:00.930,210.81,100,Q,@F,00,155401,,N,,,

TSE: 10/01/2008,09:00:13.772,,0,172.0,7000,,11,,

Typically, these normalization problems are solved with a fair amount of manual analysis of source and target formats implemented via scripting languages or ETL platforms. This becomes one of the most labor-intensive (and therefore expensive and slow) steps within the data analysis lifecycle. Specific concerns include:

  • Combination of knowledge needed: in order to perform this normalization, a developer must have or acquire, in addition to development skills, knowledge of the domain (e.g. trading data), specific knowledge of the source data format, and specific knowledge of the target data format.
  • Fragility: any change (or intermittent errors or dirtiness!) in either the source or target data can break the normalization, requiring a complete rework.
  • Redundancy: many sub-patterns are implemented repeatedly for each instance – this is low-value work (re-implementing very similar logic) and duplicates the labor for each instance.

Intuitively, the planning and analysis for this sort of work is done at the metadata level (i.e. working with a schema and data definition) while frequently validating definitions against actual sample data. Identified conflicts in representation are then manually coded into the transformation (the “T” in an ETL process, or the bulk of most scripts).

Instead, the Metadata Transform Pattern proposes defining simple transformative building blocks. These blocks are defined in terms of metadata – for example: “perform a currency conversion between USD and JPY.” Each block definition has attached runtime code – a subroutine in the ETL/script – but at data integration time, they are defined and manipulated solely within the metadata domain.

Diagram 6: Metadata Domain Transform

This approach allows a number of benefits at the cost of additional infrastructure complexity:

  • Separation of expertise: Developers can code the blocks without specific knowledge of source or target data systems, while data owners/stewards on both the source and target side can define their particular formats without considering transformation logic.
  • Code generation: Defining transformations in terms of abstract building blocks provides opportunities for code generation infrastructure that can automate the creation of complex transformation logic by assembling these predefined blocks.
  • Robustness: These characteristics serve to increase the robustness of any transform. As long as the metadata definitions are kept current, transformations will also be maintained. The response time to changes in metadata definitions is greatly reduced.
  • Documentation: This metadata mapping serves as intuitive documentation of the logical functionality of the underlying code.
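The building-block idea can be sketched as a small registry of named transforms plus a purely declarative mapping per feed. Everything below is hypothetical: the feed names, column positions, date conventions, and the fixed exchange rate are assumptions for illustration, not a description of the actual NASDAQ or TSE formats.

```python
from datetime import datetime

JPY_PER_USD = 100.0  # made-up fixed rate, for illustration only

# Building blocks: small reusable transforms addressed by name.
BLOCKS = {
    "date_mdy": lambda s: datetime.strptime(s, "%m/%d/%Y").date().isoformat(),
    "date_dmy": lambda s: datetime.strptime(s, "%d/%m/%Y").date().isoformat(),
    "to_float": float,
    "jpy_to_usd": lambda s: float(s) / JPY_PER_USD,
}

# Pure metadata: target field -> (source column index, block names to apply).
FEED_SPECS = {
    "feed_a": {"date": (0, ["date_mdy"]), "price_usd": (2, ["to_float"])},
    "feed_b": {"date": (0, ["date_dmy"]), "price_usd": (4, ["jpy_to_usd"])},
}

def normalize(feed, row):
    """Apply the blocks named in the feed's metadata to one raw row."""
    record = {}
    for field, (col, block_names) in FEED_SPECS[feed].items():
        value = row[col]
        for name in block_names:
            value = BLOCKS[name](value)
        record[field] = value
    return record

a = normalize("feed_a", "01/11/2010,10:00:00.930,210.81,100".split(","))
b = normalize("feed_b", "10/01/2008,09:00:13.772,,0,17200,7000".split(","))
```

Onboarding a new feed in this arrangement means adding an entry to `FEED_SPECS`; the runtime code in `BLOCKS` and `normalize` stays untouched, which is the separation of expertise described in the list above.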

Applying the Metadata Transform to the ATI architecture streamlines the normalization concerns between the market data feeds illustrated above and additionally plays a significant role within the Data Lake. Given the extreme variety that is expected among Data Lake sources, normalization issues will arise whenever a new source is brought into the mainline analysis. Further, some preliminary normalization may be necessary simply to explore the Data Lake to identify currently useful data. Incorporating the Metadata Transform pattern into the ATI architecture results in the following:

Diagram 7: ATI Architecture with Metadata Transform


Not all of ATI’s trades succeed as expected. These are carefully analyzed to determine whether the cause is simple bad luck, or an error in the strategy, the implementation of the strategy, or the data infrastructure. During this analysis process, not only will the strategy’s logic be examined, but also its assumptions: the data fed into that strategy logic. This data may come directly from the source (via the normalization/ETL process), or may be taken from intermediate computations. In both cases, it is essential to understand exactly where each input to the strategy logic came from – what data source supplied the raw inputs.

The Data Lineage pattern is an application of metadata to all data items to track any “upstream” source data that contributed to that data’s current value. Every data field and every transformative system (including both normalization/ETL processes as well as any analysis systems that have produced an output) has a globally unique identifier associated with it as metadata. In addition, the data field will carry a list of its contributing data and systems. For example, consider the following diagram:

Diagram 8: The Data Lineage pattern

Note that the choice is left open whether each data item’s metadata contains a complete system history back to original source data, or whether it contains only its direct ancestors. In the latter case, storage and network overhead is reduced at the cost of additional complexity when a complete lineage needs to be computed.

This pattern may be implemented in a separate metadata documentation store, which reduces the impact on the mainline data processing systems; however, this runs the risk of divergence between documented metadata and actual data unless extremely strict development processes are adhered to. Alternatively, a data structure that includes this metadata may be utilized at “runtime” in order to guarantee accurate lineage. For example, the following JSON structure contains this metadata while still retaining all original feed data:

  {
    "data": {
      "field1": "value1",
      "field2": "value2"
    },
    "metadata": {
      "document_id": "DEF456",
      "parent_document_id": ["ABC123", "EFG789"]
    }
  }

In this JSON structure the decision has been made to track lineage at the document level, but the same principle may be applied on an individual field level. In the latter case, it is generally worth tracking both the document lineage and the specific field(s) that sourced the field in question.
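When each document records only its direct ancestors, a complete lineage has to be computed by walking the parent links. The sketch below assumes an in-memory dictionary keyed by `document_id`; the store layout and the `RAW001`/`RAW002` identifiers are hypothetical additions, while `DEF456`, `ABC123`, and `EFG789` follow the example structure.

```python
def full_lineage(doc_id, store):
    """All upstream document ids for doc_id, given direct-parent metadata only."""
    seen = set()
    stack = list(store[doc_id]["metadata"]["parent_document_id"])
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        # Raw source documents may be missing from the store or have no parents.
        meta = store.get(parent, {}).get("metadata", {})
        stack.extend(meta.get("parent_document_id", []))
    return seen

def doc(doc_id, parents):
    """Helper that builds a document in the shape of the JSON structure."""
    return {"data": {}, "metadata": {"document_id": doc_id, "parent_document_id": parents}}

store = {
    "DEF456": doc("DEF456", ["ABC123", "EFG789"]),
    "ABC123": doc("ABC123", ["RAW001"]),
    "EFG789": doc("EFG789", ["RAW001", "RAW002"]),
}

lineage = full_lineage("DEF456", store)
shared = full_lineage("ABC123", store) & full_lineage("EFG789", store)
```

The intersection in `shared` shows two apparently separate inputs tracing back to the same raw document, which is the kind of overlap a lineage check can surface.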

In the case of ATI, all systems that consume and produce data will be required to provide this metadata, and with no additional components or pathways, the logical architecture diagram will not need to be altered.


Diagram 9: Feedback Pattern

Frequently, data is not analyzed in one monolithic step. Intermediate views and results are necessary, in fact the Lambda Pattern depends on this, and the Lineage Pattern is designed to add accountability and transparency to these intermediate data sets. While these could be discarded or treated as special cases, additional value can be obtained by feeding these data sets back into the ingest system (e.g. for storage in the Data Lake). This gives the overall architecture a symmetry that ensures equal treatment of internally ­generated data. Furthermore, these intermediate data sets become available to those doing discovery and exploration within the Data Lake and may become valuable components to new analyses beyond their original intent. As higher order intermediate data sets are introduced into the Data Lake, its role as data marketplace is enhanced increasing the value of that resource as well.

In addition to incremental storage and bandwidth costs, the Feedback Pattern increases the risk of data consanguinity, in which multiple, apparently different data fields are all derivatives of the same original data item. Judicious application of the Lineage Pattern may help to alleviate this risk.
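One way lineage metadata can expose data consanguinity is to walk each data set's parent links back to its original sources and check whether two data sets share any of them. The sketch below assumes lineage is available as a simple mapping from a document id to its parent ids; the function names and the sample ids (feed_A, daily_avg, and so on) are invented for illustration.

```python
def root_sources(doc_id, parents_of):
    """Return the set of parentless documents that doc_id derives from."""
    roots, seen, stack = set(), set(), [doc_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        parents = parents_of.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)  # no parents, so this is an original source
    return roots


def shared_roots(doc_a, doc_b, parents_of):
    """Return the original sources that two documents have in common."""
    return root_sources(doc_a, parents_of) & root_sources(doc_b, parents_of)


# Hypothetical lineage: two raw feeds and three derived data sets.
parents_of = {
    "feed_A": [],
    "feed_B": [],
    "daily_avg": ["feed_A"],
    "volatility": ["daily_avg"],
    "momentum": ["feed_A", "feed_B"],
}

overlap = shared_roots("volatility", "momentum", parents_of)
```

Here volatility and momentum look like different fields, but both trace back to feed_A, so agreement between them should not be read as independent confirmation.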

ATI will capture some of their intermediate results in the Data Lake, creating a new pathway in their data architecture.

Diagram 10: ATI Architecture with Feedback


By this point, the ATI data architecture is fairly robust in terms of its internal data transformations and analyses. However, it is still dependent on the validity of the source data. While it is expected that validation rules will be implemented either as a part of ETL processes or as an additional step (e.g. via a commercial data quality solution), ATI has data from a large number of sources and has an opportunity to leverage any conceptual overlaps in these data sources to validate the incoming data.

The same conceptual data may be available from multiple sources. For example, the opening price of SPY shares on 6/26/15 is likely to be available from numerous market data feeds, and should hold an identical value across all feeds (after normalization). If these values are ever detected to diverge, then that fact becomes a flag to indicate that there is a problem either with one of the data sources or with ingest and conditioning logic.
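A minimal sketch of this divergence check follows, assuming the values have already been normalized to a common unit and keyed by source. The function name, the tolerance, the feed names, and the prices are all hypothetical; the prices are not actual SPY quotes. Using the median as the reference value is a design choice made for this sketch, not something prescribed by the pattern.

```python
from statistics import median


def cross_reference(values_by_source, tolerance=0.0):
    """Return the sources whose value deviates from the median by more
    than the given tolerance. An empty list means the sources agree."""
    reference = median(values_by_source.values())
    return sorted(
        source
        for source, value in values_by_source.items()
        if abs(value - reference) > tolerance
    )


# Hypothetical opening prices for one symbol and date from three feeds.
opening_prices = {"feed_a": 210.35, "feed_b": 210.35, "feed_c": 201.35}
suspects = cross_reference(opening_prices, tolerance=0.005)
```

With three or more sources the median isolates a single outlier. With only two diverging sources both are flagged, which matches the text: the divergence signals a problem but does not say which side is at fault.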

To take advantage of cross-referencing validation, the semantic concepts that will serve as common reference points must first be identified. This may imply a metadata modeling approach such as a Master Data Management solution, but that is beyond the scope of this paper.

As with the Feedback Pattern, the Cross-Referencing Pattern benefits from the inclusion of the Lineage Pattern. When relying on agreement between multiple data sources as to the value of a particular field, it is important that the data being cross-referenced derive (directly or indirectly) from independent sources that do not carry correlation created by internal modeling.

ATI will utilize a semantic dictionary as a part of the Metadata Transform Pattern described above. This dictionary, along with lineage data, will be utilized by a validation step introduced into the conditioning processes in the data architecture. Adding this cross-referencing validation reveals the final-state architecture:

Diagram 11: ATI Architecture Validation


This paper has examined a number of patterns that can be applied to data architectures. These patterns should be viewed as templates for specific problem spaces within the overall data architecture, and can (and often should) be modified to fit the needs of specific projects. They do not require the use of any particular commercial or open source technologies, though some common choices may seem like natural fits for many implementations of a specific pattern.