Posts

NLP evolved to be an important way to track and categorize viewership in the age of cookie-less ad targeting. While users resist being identified by a single user ID, they are much less sensitive to and even welcome the chance for advertisers to personalize media content based on discovered preferences. This personalization comes from improvements made upon the original LDA algorithm and incorporate word2vec concepts.

The classic LDA algorithm developed at Columbia University raised industry-wide interest in computerized understanding of documents. It incidentally also launched variational inference as a major research direction in Bayesian modeling. The ability of LDA to process massive amounts of documents, extract their main theme based on a manageable set of topics and compute with relative high efficiency (compared to the more traditional Monte Carlo methods which sometimes run for months) made LDA the de facto standard in document classification.

However, the original LDA approach left the door open on certain desirable properties. It is, at the end, fundamentally just a word counting technique. Consider these two statements:

“His next idea will be the breakthrough the industry has been waiting for.”

“He is praying that his next idea will be the breakthrough the industry has been waiting for.”

After removal of common stop words, these two semantically opposite sentences have almost identical word count features. It would be unreasonable to expect a classifier to tell them apart if that’s all you provide it as inputs.

The latest advances in the field improve upon the original algorithm on several fronts. Many of them incorporate the word2vec concept where an embedded vector is used to represent each word in a way that reflects its semantic meaning. E.g. king – man + woman = queen

Autoencoder variational inference (AVITM) speeds up inference on new documents that are not part of the training set. It’s variant prodLDA uses product of experts to achieve higher topic coherence. Topic-based classification can potentially perform better as a result.

Doc2vec – generates semantically meaningful vectors to represent a paragraph or entire document in a word order preserving manner.

LDA2vec – derives embedded vectors for the entire document in the same semantic space as the word vectors.

Both Doc2vec and LDA2vec provide document vectors ideal for classification applications.

All these new techniques achieve scalability using either GPU or parallel computing. Although research results demonstrate a significant improvement in topic coherence, many investigators now choose to deemphasize topic distribution as the means of document interpretation. Instead, the unique numerical representation of the individual documents became the primary concern when it comes to classification accuracy. The derived topics are often treated as simply intermediate factors, not unlike the filtered partial image features in a convolutional neural network.

With all this talk of the bright future of Artificial Intelligence (AI), it’s no surprise that almost every industry is looking into how they will reap the benefits from the forthcoming (dare I say already existing?) AI technologies. For some, AI will merely enhance the technologies already being used. For others, AI is becoming a crucial component to keeping the industry alive. Healthcare is one such industry.

The Problem: Diminishing Labor Force

Part of the need for AI-based Healthcare stems from the concern that one-third of nurses are baby boomers, who will retire by 2030, taking their knowledge with them. This drastic shortage in healthcare workers poses the imminent need for replacements and, while the enrollment numbers in nursing school stay stable, the demand for experienced workers will continue to increase. This need for additional clinical support is one area where AI comes into play. In fact, these emerging technologies will not only help serve as a multiplier force for experienced nurses, but for doctors and clinical staff support as well.

Healthcare-AI Automation Applications to the Rescue

One of the most notable solutions for this shortage will be automating processes for determining whether or not a patient actually needs to visit a doctor in-person. Doctors’ offices are currently inundated with appointments and patients who’s lower-level questions and concerns could be addressed without a face-to-face consultation via mobile applications. Usually in the from of chatbots, these AI-powered applications can provide basic healthcare support by “bringing the doctor to the patient” and alleviating the need for the patient to leave the comfort of their home, let alone scheduling an appointment to go in-office and visit a doctor (saving time and resources for all parties involved).

Should a patient need to see a doctor,  these applications also contain schedulers capable of determining appointment type, length, urgency, and available dates/times, foregoing the need for constant human-based clinical support and interaction. With these AI schedulers also comes AI-based Physician’s Assistants that provide additional in-office support like scheduling follow-up appointments, taking comprehensive notes for doctors, ordering specific prescriptions and lab testing, providing drug interaction information for current prescriptions, etc. And this is just one high-level AI-based Healthcare solution (albeit with many components).

With these advancements, Healthcare stands to gain significant ground with the help of domain-specific AI capabilities that were historically powered by humans. As a result, the next generation of healthcare has already begun, and it’s being revolutionized by AI.

Sometimes I get to thinking that Alexa isn’t really my friend. I mean sure, she’s always polite enough (well, usually, but it’s normal for friends to fight, right?). But she sure seems chummy with that pickle-head down the hall too. I just don’t see how she can connect with us both — we’re totally different!

So that’s the state of the art of conversational AI: a common shared agent that represents an organization. A spokesman. I guess she’s doing her job, but she’s not really representing me or M. Pickle, and she can’t connect with either of us as well as she might if she didn’t have to cater to both of us at the same time. I’m exaggerating a little bit – there are some personalization techniques (*cough* crude hacks *cough*) in place to help provide a custom experience:

  • There is a marketplace of skills. Recently, I can even ask her to install one for me.
  • I have a user profile. She knows my name and zip code.
  • Through her marketplace, she can access my account and run my purchase through a recommendation engine (the better to sell you with, my dear!)
  • I changed her name to “Echo” because who has time for a third syllable? (If only I were hamming this up for the post; sadly, a true story)
  • And if I may digress to my other good friend Siri, she speaks British to me now because duh.

It’s a start but, if we’re honest, none of these change the agent’s personality or capabilities to fit with all of my quirks, moods, and ever-changing context and situation. Ok, then. What’s on my wishlist?

  • I want my own agent with its own understanding of me, able to communicate and serve as an extension of myself.
  • I want it to learn everything about how I speak. That I occasionally slip into a Western accent and say “ruf” instead of “roof”. That I throw around a lot of software dev jargon; Python is neither a trip to the zoo nor dinner (well, once, and it wasn’t bad. A little chewy.) That Pickle Head means my colleague S… nevermind. You get the idea.
  • I want my agent to extract necessary information from me in a way that fits my mood and situation. Am I running late for a life-changing meeting on a busy street uphill in a snowstorm? Maybe I’m just goofing around at home on a Saturday.
  • I want my agent to learn from me. It doesn’t have to know how to do everything on this list out of the box – that would be pretty creepy – but as it gets to know me it should be able to pick up on my cues, not to mention direct instructions.

Great, sign me up! So how do I get one? The key is to embrace training (as opposed to coding, crafting, and other manual activities). As long as there is a human in the loop, it is simply impossible to scale an agent platform to this level of personalization. There would be a separate and ongoing development project for every single end user… great job security for developers, but it would have to sell an awful lot of stuff.

To embrace training, we need to dissect what goes into training. Let’s over-simplify the “brain” of a conversational AI for a moment: we have NLU (natural language understanding), DM (dialogue management), and NLG (natural language generation). Want an automatically-produced agent? You have to automate all three of these components.

  • NLU – As of this writing, this is the most advanced component of the three. Today’s products often do incorporate at least some training automation, and that’s been a primary enabler that leads to the assistants that we have now. Improvements will need to include individualized NLU models that continually learn from each user, and the addition of (custom, rapid) language models that can expand upon the normal and ubiquitous day-to-day vocabulary to include trade-specific, hobby-specific, or even made-up terms. Yes, I want Alexa to speak my daughter’s imaginary language with her.
  • DM – Sorry developers, if we make plugin skills ala Mobile Apps 2.0 then we aren’t going to get anywhere. Dialogues are just too complex, and rules and logic are just too brittle. This cannot be a programming exercise. Agents must learn to establish goals and reason about using conversation to achieve those goals in an automated fashion.
  • NLG – Sorry marketing folks, there isn’t brilliant copy for you to write. The agent needs the flexibility to communicate to the user in the most effective way, and it can’t do that if it’s shackled by canned phrases that “reflect the brand”.

In my experience, most current offerings are focusing on the NLU component – and that’s awesome! But to realize the potential of MicroAgents (yeah, that’s right. MicroAgents. You heard it here first) we need to automate the entire agent, which is easier said than done. But that’s not to say that it’s not going to happen anytime soon – in fact, it might happen sooner than you think.  

Echo, I’m done writing. Post this sucker.

Doh!


 

In the 2011 Jeopardy! face-off between IBM’s Watson and Jeopardy! champions Ken Jennings and Brad Rutter, Jennings acknowledged his brutal takedown by Watson during the last double jeopardy in stating “I for one welcome our new computer overlords.” This display of computer “intelligence” sparked mass amounts of conversation amongst myriad groups of people, many of whom became concerned at what they perceived as Watson’s ability to think like a human. But, as BigR.io’s Director of Business Development Andy Horvitz points out in his blog “Watson’s Reckoning,” even the Artificial Intelligence technology with which Watson was produced is now obsolete.

The thing is, while Watson was once considered to be the cutting-edge technology of Artificial Intelligence, Artificial Intelligence itself isn’t even cutting-edge anymore. Now, before you start lecturing me about how AI is cutting-edge, let me explain.

Defining Artificial Intelligence

You see, as Bernard Marr points out, Artificial Intelligence is the overarching term for machines having the ability to carry out human tasks. In this regard, modern AI as we know it has already been around for decades – since the 1950s at least (especially thanks to the influence of Alan Turing). Moreso, some form of the concept of artificial intelligence dates back to ancient Greece when philosophers started describing human thought processes as a symbolic system. It’s not a new concept, and it’s a goal that scientists have been working towards for as long as there have been machines.

The problem is that the term “artificial intelligence” has become a colloquial term applied when a machine mimics “cognitive” functions that humans associate with other human minds, such as “learning” and “problem solving.” But the thing is, AI isn’t necessarily synonymous with “human thought capable machines.” Any machine that can complete a task in a similar way that a human might can be considered AI. And in that regard, AI really isn’t cutting-edge.

What is cutting-edge are the modern approaches to Machine Learning, which have become the cusp of “human-like” AI technology (like Deep Learning, but that’s for another blog).

Though many people (scientists and common folk alike) use the terms AI and Machine Learning interchangeably, Machine Learning actually has the narrower focus of using the core ideas of AI to help solve real-world problems. For example, while Watson can perform the seemingly human task of critically processing and answering questions (AI), it lacks the ability to use these answers in a way that’s pragmatic to solve real-world problems, like synthesizing queried information to find a cure for cancer (Machine Learning).

Additionally, as I’m sure you already know, Machine Learning is based upon the premise that these machines train themselves with data rather than by being programmed, which is not necessarily a requirement of Artificial Intelligence overall.

https://xkcd.com/1838/

Why Know the Difference?

So why is it important to know the distinction between Artificial Intelligence and Machine Learning? Well, in many ways, it’s not as important now as it might be in the future. Since the two terms are used so interchangeably and Machine Learning is seen as the technology driving AI, hardly anyone would correct you if were you to use them incorrectly. But, as technology is progressing ever faster, it’s good practice to know some distinction between these terms for your personal and professional gains.

Artificial Intelligence, while a hot topic, is not yet widespread – but it might be someday. For now, when you want to inquire about AI for your business (or personal use), you probably mean Machine Learning instead. By the way, did you know we can help you with that? Find out more here.

We’re seeing and doing all sorts of interesting work in the Image domain. Recent blog posts, white papers, and roundtables capture some of this work, such as image segmentation and classification to video highlights. But an Image area of broad interest that, to this point, we’ve but scratched the surface of is Video-based Anomaly Detection. It’s a challenging data science problem, in part due to the velocity of data streams and missing data, but has wide-ranging solution applicability.

In-store monitoring of customer movements and behavior.

Motion sensing, the antecedent to Video-based Anomaly Detection, isn’t new and there are a multitude of commercial solutions in that area. Anomaly Detection is something different and it opens the door to new, more advanced applications and more robust deployments. Part of the distinction between the two stems from “sensing” what’s usual behavior and what’s different.

Anomaly Detection

Walkers in the park look “normal”. The bicyclist is the anomaly. 

Anomaly detection requires the ability to understand a motion “baseline” and to trigger notifications based on deviations from that baseline. Having this ability offers the opportunity to deploy AI-monitored cameras in many more real-world situations across a wide range of security use cases, smart city monitoring, and more, wherein movements and behaviors can be tracked and measured with higher accuracy and at a much larger scale than ever before.

With 500 million video cameras in the world tracking these movements, a new approach is required to deal with this mountain of data. For this reason, Deep Learning and advances in edge computing are enabling a paradigm shift from video recording and human watchers toward AI monitoring. Many systems will have humans “in the loop,” with people being alerted to anomalies. But others won’t. For example, in the near future, smart cities will automatically respond to heavy traffic conditions with adjustments to the timing of stoplights, and they’ll do so routinely without human intervention.

Human in the Loop

Human in the loop.

As on many AI fronts, this is an exciting time and the opportunities are numerous. Stay tuned for more from BigR.io, and let’s talk about your ideas on Video-based Anomaly Detection or AI more broadly.

A few months back, Treasury Secretary Steve Mnuchin said that AI wasn’t on his radar as a concern for taking over the American labor force and went on to say that such a concern might be warranted in “50 to 100 more years.” If you’re reading this, odds are you also think this is a naive, ill-informed view.

An array of experts, including Mnuchin’s former employer, Goldman Sachs, disagree with this viewpoint. As PwC states, 38% of US jobs will be gone by 2030. On the surface, that’s terrifying, and not terribly far into the future. It’s also a reasonable, thoughtful view, and a future reality for which we should prepare.

Naysayers maintain that the same was said of the industrial and technological revolutions and pessimistic views of the future labor market were proved wrong. This is true. Those predicting doom in those times were dead wrong. In both cases, technological advances drove massive economic growth and created huge numbers of new jobs.

Is this time different?

It is. Markedly so.

The industrial revolution delegated our labor to machines. Technology has tackled the mundane and repetitive, connected our world, and, more, has substantially enhanced individual productivity. These innovations replaced our muscle and boosted the output of our minds. They didn’t perform human-level functions. The coming wave of AI will.

Truckers, taxi and delivery drivers, they are the obvious, low-hanging fruit, ripe for AI replacement. But the job losses will be much wider, cutting deeply into retail and customer service, impacting professional services like accounting, legal, and much more. AI won’t just take jobs. Its impacts on all industries will create new opportunities for software engineers and data scientists. The rate of job creation, however, will lag far behind that of job erosion.

But it’s not all bad! AI is a massive economic catalyst. The economy will grow and goods will be affordable. We’re going to have to adjust to a fundamental disconnect between labor and economic output. This won’t be easy. The equitable distribution of the fruits of this paradigm shift will dominate the social and political conversation of the next 5-15 years. And if I’m right more than wrong in this post, basic income will happen (if only after much kicking and screaming by many). We’ll be able to afford it. Not just that — most will enjoy a better standard of living than today while also working less.

I might be wrong. The experts might be wrong. You might think I’m crazy (let’s discuss in the comments). But independent of specific outcomes, I hope we can agree that we’re on the precipice of another technological revolution and these are exciting times!

Deep Learning: Image and Video Recognition

Written by Bruce Ho

BigR.io’s Chief Big Data Scientist

Abstract

This paper illustrates the advancements in implementing Deep Neural Networks for automatic feature extraction in image and video for applications including facial recognition, programmatic video highlights, and image segmentation and object classification. Given the limitations of human abilities in earlier extraction methods, these networks exponentially increase accuracy, output, and available feature selection options for further analysis. BigR.io specializes in the following industry use cases:

  • Image Recognition

  • Video Highlights

  • Anomaly Detection

 

ABOUT BIGR.IO

BigR.io is a technology consulting firm empowering data to drive analytics for revenue growth and operational efficiencies. Our teams deliver software solutions, data science strategies, enterprise infrastructure, and management consulting to the world’s largest companies. We are an elite group with MIT roots, shining when tasked with complex missions: assembling mounds of data from a variety of sources, building high-volume, highly-available systems, and orchestrating analytics to transform technology into perceivable business value. With extensive domain knowledge, BigR.io has teams of architects and engineers that deliver best-in-class solutions across a variety of verticals. This diverse industry exposure and our constant run-in with the cutting edge empowers us with invaluable tools, tricks, and techniques. We bring knowledge and horsepower that consistently delivers innovative, cost-conscious, and extensible results to complex software and data challenges. Learn more at www.bigr.io.

 

OVERVIEW

Over the past few years, Deep Neural Network (DNN) capabilities have surpassed human parity in recognizing and interpreting images. These DNNs use Convolutional Neural Networks (CNNs) to automatically extract features from an input image with the use of convolution filters. Backpropagation then facilitates the learning by these filters of their kernel functions, starting with random values and ending up with elemental features that best represent the class of images being trained (for instance, nose, eye, and jaw shapes for face images). Image recognition is also where the highly coveted idea of transfer learning got its early foothold. Pre-trained models based on certain categories of images can be repurposed for various classification applications using only a small dataset. Since data preparation and labeling is one of the most challenging steps when carrying out supervised learning, the impact this concept has on accelerating this process cannot be overstated. Published models and datasets by some of the biggest players in the field (Google, Microsoft, etc.) now serve as a strong starting point to build robust application-specific models for businesses with only modest means for development.

 

INDUSTRY USE CASES

Similar to the adoption of best practices in big data and data science across several industry verticals, image video recognition solutions affect business outcomes across diverse government agencies and businesses. In this paper, we specifically examine use cases in the security and professional sports segments, but these solutions illustrate applications across all areas of video content creation, consumption, and monitoring.

 

IMAGE INSIGHTS

FCN8s

 

Image recognition can go beyond classification tasks for an entire image. In dense prediction, we are asking the neural network to detect the semantic context of any given pixel in a document or image. CNNs work by first finding image features that resemble certain filter functions, then floating such features to a top-level representation as a translation-invariant descriptor (e.g., detection of a nose, regardless of its position within the image). By combining both coarse- and fine-grained features at different scales, we obtain both the semantic context and location information of any one pixel. This opens the door for pixel-level semantic segmentation (aka dense prediction). Recent work on Fully Convolutional Networks (FCNs) leverages this capability to extract semantic context of a digitized document. One could, for example, detect whether a particular pixel is a title, section header, figure caption, an image, or part of a long paragraph using FCNs. A mobile user could then easily re-layout or restyle an electronic document using the extracted semantic context. FCNs have also been successfully applied to segment parts of an image, as well as full documents, with remarkable accuracy. How does this system pick potential customers from an image of a crowd, a soccer team, or a room full of event attendees? Given a close-up face shot, is this person happy to be here, in the target age group, or giving a positive response to the last sales message? Being able to answer these audience measurement questions for marketing is one of the hot areas in need of a deep learning solution. Many classic approaches to facial feature extraction and classification, Support Vector Machines, for example, have been devoted to this long-standing problem. Deep learning research in facial identification is relatively new but already outperforming older techniques by a wide margin. This development, and many other impressive improvements achieved by deep learning, are generally attributed to the automatic feature extraction function of neural networks and the incremental accuracy boost that deep learning techniques achieve when given a huge training dataset. In many applications, a high-quality, close-up facial shot is not always available. Picking faces out of an ordinary action photo may be the first step before applying any facial feature analysis. For this, the region-based CNNs (R-CNNs) excel in both speed and accuracy. The R-CNN approach proposes a number of bounding boxes in the original photo using what is called Selective Search. In this method, initial object boundaries are set using a graphical pixel similarity approach. Neighboring boxes with high pixel similarity metrics are then merged to further reduce the object count. Finally, each boxed object can be classified based on a pre-trained image recognition model.
FCN8s

 

In other efforts, researchers have extended facial analysis to emotion detection. Classically, this simply involved image labeling where the subject exhibits a range of facial expressions and a group of volunteers would mark each as happy, sad, angry, etc. — typically up to eight emotions. More recent work also incorporates dynamic facial movements, for example, capturing the complete sequence of facial movements for a smile or frown. A more generalizable model can be developed using linear scoring along the valence- arousal graph. A prediction of valence and arousal scores on future subjects can then be interpreted using a wider range of emotion states instead of the initial selection of about eight.

 

valance arousal plot

Reference: G Paltoglout, M Thelwall, Seeing Stars of Valence and Arousal in Blog Posts. Issue No. 01 Jan-Mar 2013 Vol. 4, IEEE Transactions on Affective Computing.

Points on the valence arousal plot can be translated to commonly understood emotions.

 

VIDEO HIGHLIGHTS

There are numerous highlights in every major sporting event. Manual real-time extraction of these highlights by fully attentive labelers is error-prone, requires significant manpower, is very expensive, and doesn’t scale well. Furthermore, while the most recent games may benefit from manual labeling, there are years of archived footage that remain unprocessed. Most off-stats highlights are overlooked by human observers who are instructed to look for only specific events, for example, looking for a ball boy slipping while chasing a tennis ball or a Major League splitter in a Little League game.

Today, we can automate programmatic video highlights using video recognition techniques. In addition to applying CNNs to static image features, Recurrent Neural Networks (RNNs) are able to classify video segments using optical flow between image frames. This technique is easily trained not only to extract official stat events, but also to extract any interesting player motion not explicitly logged and indexed — for example, an alley-oop in basketball. Due to the automated nature of these extraction tasks, studios can come up with new ideas at any time to build upon an existing menu of highlights.

Going beyond sporting events, any kind of motion picture, video ad, or short-form video opens itself up for potential indexing and repurposing. For example, a DC Comics fan may want the ability to easily find all instances of girl superhero encounters within the DC universe. This task requires automatic video highlight extraction, which is the key to reviving and monetizing unlimited archive contents that would otherwise remain buried and forgotten.

 

Image: Durant eyeing Rihanna after hitting a 3-pointer (she was cheering for LeBron).

 

ANOMALY DETECTION

Independent Component Analysis (ICA) is one such approach with many proposed variants. An ICA-based deep sparse feature extraction strategy combined with a non-parametric Bayesian approach can automatically determine the most optimal dimension for the latent feature vector, removing the heavy labor in parameter tuning that a full deep learning approach would entail. The reported accuracy improvement exceeds 10% over previous results. Variants of Restricted Boltzmann Machines (RBMs) are another major direction of research for deep-sparse representation. While much progress has been made on the theoretical front, the experimental results thus far lag behind the best ICA models. Reference: Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456

The graph on the right is a sparse vector representation of the image on the left. The vector dimensions, called training bases, are laid out along the x-axis, with the bars representing the coefficients for the bases needed to represent the image. A normal sample (top) can be represented as a sparse linear combination of the training bases, while an anomalous sample (bottom) requires a large number of base elements.

 

CONCLUSION

Recent advancements in image and video recognition pave the way for many business applications that would have been unimaginably hard or expensive to implement before. BigR.io excels at the application of deep learning to images and electronic documents for use cases ranging from facial recognition, to programmatic video highlights, to image segmentation and object classification.

For many years, and with rapidly accelerating levels of targeting sophistication, marketers have been tailoring their messaging to our tastes. Leveraging our data and capitalizing upon our shopping behaviors, they have successfully delivered finely-tuned, personalized messaging.

Consumers are curating their media ever more by the day. We’re buying smaller cable bundles, cutting cords, and buying OTT services a la carte. At the same time, we’re watching more and more short-form video. Video media is tilting toward snack-size bites and, of course, on demand.

Cable has been in decline for years and the effects are now hitting ESPN, once the mainstay of a cable package. Even live sports programming, long considered must see and even bulletproof by media executives, has seen declining viewership.

 

So what’s to be done?

To thrive, and perhaps merely to survive, content owners must adapt. Leagues and networks have come a long way toward embracing a “TV Everywhere” distribution model despite the obnoxious gates at every turn. But that’s not enough and the sports leagues know it.

While there are many reasons for declining viewership and low engagement among younger audiences, length of games and broadcasts are a significant factor. The leagues recognize that games are too long. The NBA has made some changes that will speed up the action and the NFL is also considering shortening games to avoid losing viewership. MLB has long been tinkering in the same vein. These changes are small, incremental, and of little consequence to the declining number of viewers.

Most sporting events are characterized by long stretches of calm, less interesting play that is occasionally accented by higher intensity action. Consider for a moment how much actual action there is in a typical football or baseball game. Intuitively, most sports fans know that the bulk of the three-hour event is consumed by time between plays and pitches. Still, it’s shocking to see the numbers from the Wall Street Journal, which point out that there are only 11 minutes of action in a typical football game and a mere 18 minutes in a typical baseball game.

 

A transformational opportunity

There is so much more they can do. Recent advances in neural network technology have enabled an array of features to be extracted from streaming video. The applications are broad and the impacts significant. In this sports media context, the opportunity is nothing short of transformational.

Computers can now be trained to programmatically classify the action in the underlying video. With intelligence around what happens where in the game video, the productization opportunities are endless. Fans could catch all of the action, or whatever plays and players are most important to them, in just a few minutes. With a large indexed database of sports media content, the leagues could present near unlimited content personalization to fans.

Want to see David Ortiz’s last ten home runs? Done.

Want to see Tom Brady’s last ten TD passes? You’re welcome.

Robust features like these will drive engagement and revenue. With this level of control, fans are more likely to subscribe to premium offerings, offering predictable recurring revenue that will outpace advertising in the long run.

Computer-driven, personalized content is going to happen. It’s going to be amazing, and we are one step closer to getting there.

Scientists have been working on the puzzle of human vision for many decades. Convolutional Neural Network (CNN or convnet)-based Deep Learning reached a new landmark for image recognition when Microsoft announced it had beat the human benchmark in 2015. Five days later, Google one-upped Microsoft with a 0.04% improvement.

Figure 1. In a typical convnet model, the forward pass reduces the raw pixels into a vector representation of visual features. In its condensed form, the features can be effectively classified using fully connected layers.

source: https://docs.google.com/presentation/d/10XodYojlW-1iurpUsMoAZknQMS36p7lVIfFZ-Z7V_aY/edit#slide=id.g18c28bf1e3_0_201

 

Data Scientists don’t sleep. The competition immediately moved to the next battlefield of object segmentation and classification for embedded image content. The ability to pick out objects inside a crowded image is a precursor to fantastic capabilities, like image captioning, where the model describes a complex image in full sentences. The initial effort to translate full-image recognition to object classification involved different means of localization to efficiently derive bounding boxes around candidate objects. Each bounding box is then processed with a CNN to classify the single object inside the box. A direct pixel-level dense prediction without preprocessing was, for a long time, a highly sought-after prize.

 

Figure 2. Use bounding box to classify embedded objects in an image

source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html

In 2016, a UC Berkeley group, led by E. Shelhamer, achieved this goal using a technique called Fully Convolutional Neural Network. Instead of using convnet to extract visual features followed by fully connected layers to classify the input image, the fully connected layers are converted to additional layers of convnet. Whereas the fully connected layers completely lose all information on the original pixel locations, the cells in the final layer of a convnet are path-connected to the original pixels through a construct called receptive fields.

Figure 3. During the forward pass, a convnet reduces raw pixel information to condensed visual features which can then be effectively classified using fully connected neural network layers. In this sense, the feature vectors contain the semantic information derived from looking at the image as a whole.

source: https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

 

Figure 4. In dense prediction, we want to both leverage the semantic information contained in the final layers of the convnet and assign the semantic meaning back to the pixels that generated the semantic information. The upsampling step, also known as the backward pass, maps the feature representations back onto the original pixels positions.

source: https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

 

The upsampling step is something of great interest. In a sense, it deconvolutes the dense representation back to its original resolution and the deconvolution filters can be learned through Stochastic Gradient Descent, just like any forward pass learning process. A good visual demonstration of deconvolution can be found here. The most practical way to implement this deconvolution step is through bilinear interpolation, as discussed later.

The best dense prediction goes beyond just upsampling the last and coarsest convnet layer. By fusing results from shallower layers, the result becomes much more finely detailed. Using a skip architecture as shown in Figure 4, the model is able to make accurate local predictions that respect global structure. The fusion operation is based on concatenating vectors from two layers and perform a 1 x 1 convolution to reduce the vector dimension back down again.

 

Figure 5. Fuse upsampling results from shallower layers push the prediction limits to a finer scale.

 source: https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

LABELED DATA

As is often the case when working with Deep Learning, collecting high-quality training data is a real challenge. In the image recognition field, we are blessed with open source data from PASCAL VOC Project. The 2011 dataset provides 11,530 images with 20 classes. Each image is pre-segmented with pixel-level precision by academic researchers. Examples of segmented images can be found here.

 

OPEN SOURCE MODELS

Computer vision enthusiasts also benefit hugely from open source projects which implement almost every exciting new development in the deep learning field. The author’s group posted a Caffe implementation of FCNN. For keras implementations, you will find no fewer than 9 FCN projects on GitHub. After trying out a few, we focused on the Aurora FCNproject, which started running with very little modifications. The authors provided rather detailed instruction on environment setup and downloading of datasets. We chose the AstrousFCN_Resnet50_16s model out of the six included in the project. The training took 4 weeks on a two Nvidia 1080 card cluster, which was surprising but perhaps understandable given the huge number of layers. The overall model architecture can be visualized by either a JSON tree or with PNG graphics, although both are too long to fit on one page. The figure below shows just one tiny chunk of the overall model architecture.

Figure 6. Top portion of the FCN model. The portion shown is less than one-tenth of the total.

source: https://github.com/aurora95/Keras-FCN/blob/master/Models/AtrousFCN_Resnet50_16s/model.png

It is important to point out that the authors of the paper and code both leveraged established image recognition models, generally the winning entries of the ImageNet competition, such as the VGG nets, ResNet, AlexNet, and the GoogLeNet. Imaging is the one area where transfer learning applies readily. Researchers without the near infinite resources found at Google and Microsoft can still leverage their training results and retrain high-quality models by adding only small new datasets or make minor modifications. In this case, the proven classification architectures named above are modified by stripping away the fully connected layers at the end and replaced with fully convolutional and upsampling layers.

RESNET (RESIDUAL NETWORK)

In particular, the open source code we experimented with is based on Resnet from Microsoft. Resnet has the distinction of being the deepest network ever presented on ImageNet, with 152 layers. In order to make such a deep network converge, the submitting group had to tackle a well-known problem where error rate tends to rise rather than drop after a certain depth. They discovered that by adding skip (aka highway) connections, the overall network converges much better. The explanation lies with the relative ease in training intermediates to minimize residuals rather the originally intended mapping (thus the name Residual Network). The figure below illustrates the use skip connections used in the original ResNet paper, which are found in the open source FCN model derived from ResNet.

Figure 7a. Resnet uses multiple skip connections to improve the overall error rate of a very deep network

source: https://arxiv.org/abs/1512.03385

 

Figure 7b. Middle portion of the Aurora model displaying skip connections, which is a characteristic of ResNet.

source: https://github.com/aurora95/Keras-FCN/blob/master/Models/AtrousFCN_Resnet50_16s/model.png

The exact intuition behind Residual Network is less than obvious. There is plenty good discussion in this Quora blog.

BILINEAR UPSAMPLING

As alluded to in Figure 4, at the end stage the resolution of the tensor must be brought back to original dimension using an upsampling step. The original paper stated that a simple bilinear interpolation is fast and effective. And this is the approach taken in the Aurora project, as illustrated below.

Figure 8. Only a single upsampling stage was implemented in the open source code.

source https://github.com/aurora95/Keras-FCN/blob/master/Models/AtrousFCN_Resnet50_16s/model.png

Although the paper authors pointed out the improvement achieved by use of skips and fusions in the upsampling stage, it is not implemented by the Aurora FCN project. The diagram for the end stage illustrates that only a single up sampling layer is used. This may leave room for further improvement in error rate.

The code simply makes a TensorFlow call to implement this upsampling stage:

X = tf.image.resize_bilinear(X, new_shape)

 

ERROR RATE

The metrics used to measure segmentation accuracy is intersection over union (IOU). The IOU measured over 21 randomly selected test images are:

[ 0.90853866  0.75403876  0.35943439  0.63641792  0.46839113  0.55811771

0.76582419  0.70945356  0.74176198  0.23796475  0.50426148  0.34436233

0.5800221   0.59974548  0.67946723  0.79982366  0.46768033  0.58926592

0.33912701  0.71760929  0.54273803]

These have a mean of 0.585907. This mean is very close to the number published in the original paper. The pixel level classification accuracy is very high at 0.903266, meaning when a pixel is classified as certain object type, it is correct about 90% of the time.

 

CONCLUSION

The ability to identify image pixels as members of a particular object without a pre-processing step of bounding box detection is a major step forward for deep image recognition. The techniques demonstrated by Shelhamer’s paper achieves this goal by combining coarse-level semantic identification with pixel-level location information. This technique leverages transfer learning based on pre-trained image recognition models that were winning entries in the ImageNet competition. Various open source project replicated the results. Certain implementations require extraordinarily long training time.