As I outline in the Machine Learning Field Guide, the concept of Machine Learning arose from an interest in having machines learn from data. The industry has seen cycles of stagnation and resurgence in machine learning/AI research since as early as the 1950s. During the 1980s, we saw the emergence of the Multi-layer Perceptron and its backpropagation training mechanism, both fundamental to today’s highly sophisticated Deep Learning architectures capable of image recognition and behavior analysis. However, to reach its zenith, this field depended on advancements in data proliferation and acquisition that wouldn’t materialize for many more decades. As promising as the initial results were, early attempts at industrial application of artificial intelligence as a whole fizzled.

Though the practice of Machine Learning only ascended to prominence recently, much of its mathematical foundation dates back centuries. Thomas Bayes, father of the Bayesian method on which we base contemporary statistical inference, wrote his famous equation in the 1700s. Shortly after, in the early 1800s, immortalized academics like Legendre and Gauss developed early forms of the statistical regression models we use today. Statistical analysis as a discipline remained an academic curiosity from this time until the commoditization of low-cost computing in the 1990s and the onslaught of social media and sensor data in the 2000s.

What does this mean for Machine Learning today? Enterprises are sitting on data goldmines and collecting more at a staggering rate with ever greater complexity. Today’s Machine Learning is about mining this treasure trove, extracting actionable business insights, predicting future events, and prescribing next best actions, all in laser-sharp pursuit of business goals. In the rush to harvest these goldmines, Machine Learning is entering its golden age, buoyed by Big Data technology, Cloud infrastructure, and abundant access to open source software. Intense competition in the annual ImageNet contest between global leaders like Microsoft, Google, and Tencent rapidly propels machine learning/image recognition technology forward, and source code for all winning entries is made available to the public free of charge. Most contestants on the Kaggle machine learning site share their work in the same spirit as well. In addition to this source code, excellent free machine learning tutorials compete for mindshare on Coursera, edX, and YouTube. Hardware suppliers such as Nvidia and Intel further the cause by continuing to push the boundary for denser packaging of high-performance GPUs to speed up Neural Networks. Thanks to these abundant resources, any aspiring entrepreneur or lone-wolf researcher has access to petabytes of storage, utility-style massively parallel computing, open source data, and software libraries. As of 2015, this access has led to computer image recognition capabilities that outperform human image recognition abilities.

With recent stunning successes in Deep Learning research, the floodgates open for industrial applications of all kinds. Practitioners enjoy a wide array of options when targeting specific problems. While Neural Networks clearly lead in the high-complexity and high-data-volume end of the problem space, classical machine learning still achieves higher prediction and classification quality for low sample count applications, not to mention the drastic cost savings in computing time and hardware. Research suggests that the crossover occurs at around one hundred thousand to one million samples. Just a short time ago, numbers like these would have scared away any level-headed project manager. Nowadays, data scientists are asking for more data and are getting it expediently and conveniently. A good Data Lake and data pipeline are necessary precursors to any machine learning practice. Mature data enterprises emphasize the close collaboration of data engineering (infrastructure) teams with data science teams. “Features” are the lingua franca of their interactions, not “files,” “primary keys,” or “provisions”.

Furthermore, execution environments should be equipped with continuous and visual monitoring capabilities, as any long-running Neural Network training session (days to weeks) involves frequent mid-course adjustment based on feedback of evolving model parameters. Whether the most common Linear Regression or the deepest Convolutional Neural Network, the challenge of any machine learning experimentation is wading through the maze of configurational parameters and picking out a winning combination. After selecting the candidate models, a competent data scientist navigates a series of decisions from starting point, to learning rate, to sample size, to regularization setting, as well as constant examination of convergence on parallel training runs and various runtime tuning, all in an attempt to get the most accurate model in the shortest amount of time.
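
To make the "maze of configurational parameters" concrete, here is a minimal sketch of a grid search over learning rate and regularization strength for a one-variable Linear Regression trained by gradient descent. Everything in it is an assumption made for illustration: the synthetic data, the candidate values, the fixed epoch count, and the helper names (make_data, train, mse, grid_search). A real project would more likely lean on a library and on smarter search strategies.

```python
import itertools
import random


def make_data(n, seed=0):
    """Synthetic data (an assumption): y = 3x + 2 plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def train(xs, ys, learning_rate, l2, epochs=200, w0=0.0, b0=0.0):
    """Batch gradient descent on mean squared error with an L2 penalty on w.

    w0 and b0 are the starting point; l2 is the regularization setting.
    """
    w, b = w0, b0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train_set, valid_set, learning_rates, l2_values):
    """Train once per combination and keep the lowest validation error."""
    results = []
    for lr, l2 in itertools.product(learning_rates, l2_values):
        w, b = train(*train_set, learning_rate=lr, l2=l2)
        results.append((mse(*valid_set, w, b), lr, l2))
    return min(results)


train_set = make_data(200, seed=1)
valid_set = make_data(100, seed=2)
best_error, best_lr, best_l2 = grid_search(
    train_set, valid_set, learning_rates=[0.001, 0.01, 0.1], l2_values=[0.0, 0.01, 0.1]
)
print(f"best validation MSE {best_error:.4f} at learning_rate={best_lr}, l2={best_l2}")
```

Even this toy has nine training runs for two parameters with three values each; adding sample size and starting point to the grid multiplies the count again, which is why monitoring parallel runs matters.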

As I state in my recent e-book “Machine Learning Field Guide,” Machine Learning is smarter than ever and improving rapidly. This predictive juggernaut is coming fast and furious and will transform any business in its path. For the moment, it’s still black magic in the hands of the high priests of statistics. As an organization with a mission to deliver its benefits to clients, BigR.io trained an internal team of practitioners, organized an external board of AI advisors, and packaged a Solutions Playbook as a practice guide. We have harnessed best practices, specialty algorithms, experiential guidelines, and training tutorials, all in an effort to streamline delivery and concentrate most of our engagement efforts on areas that require specific customizations.

To find out more, check out the Machine Learning Field Guide, by Chief Data Scientist Bruce Ho.

To most in the know, Watson has long been considered more hype and marketing than technical reality. Given that it is presented as infinitely capable, bleeding-edge technology, you might think the well-known Watson brand would be delivering explosive growth to IBM.

Reality is far different. IBM’s stock is down in a roaring market. The company is, in effect, laying off thousands of workers by ending its work-from-home policy. More than $60M has perhaps been wasted by MD Anderson on a failed Watson project. All of this is happening against the backdrop of a rapidly expanding market for Machine Learning solutions.

But why? I saw Watson dominate on Jeopardy.

And dominate it did, soundly beating Ken Jennings and Brad Rutter. So think for a moment about what Watson was built to do. Watson, as was proven then, is a strong Q&A engine. It does a fine job in this realm and was truly state of the art…in 2011. In this rapidly expanding corner of the tech universe, that’s an eternity ago. The world has changed exponentially, and Watson hasn’t kept pace.

So what’s wrong with Watson?

  • It’s not the all-encompassing answer to all businesses. It offers some core competencies in Natural Language and other domains, but Watson, like any Machine Learning tech, and perhaps more than most, requires a high degree of customization to do anything useful. As such, it’s a brand around which Big Blue sells services. Expensive services.
  • The tech is now old. The bleeding edge of Machine Learning is Deep Learning, leveraging architectures Watson isn’t built to support.
  • The best talent is going elsewhere. With the next generation of tech leaders competing for talent, IBM is now outgunned.
  • …and much more discussed here.

The Machine Learning market is strong and growing. IBM has been lapped by Google, Facebook, and other big name companies, and these leaders are open sourcing much of their work.

Will Watson survive? Time will tell.

Chopping down a tree is a lot like taking on a business project (e.g., for me that could mean delivering a large software application, creating a new artificial intelligence platform, or building a company). I am on vacation this week and decided to do some yardwork today. The first project I took on was chopping down a tree. I came at it with an axe and a thin pair of gloves – definitely not the right equipment, but sometimes you just have to dive in with whatever tools you have in front of you. I threw on a pair of shades to cover the minimal safety precautions and started with a fury. I swung with all my might for a “sprint” and then had to slow down or I would have burnt out. I then started to notice the similarities to business and began reflecting:

  • First step is to identify and get clarity on what you are trying to accomplish. Some people think when it comes to starting a company or creating a product that that’s the hard part. The reality is that while the selection process does play a role in the ultimate success, it is the perseverance, perspiration, and positioning that are core.
  • Planning is a luxury that is vitally important, but needs to be balanced with diving in and getting the job done. I am a huge proponent of planning and being organized, but often times it is instinct that prevails to spark growth. If I had planned better, I would have had the right gloves at the very least. However, if I did a hardware store run, I would have lost inspiration and that tree would still be standing.
  • Assess, don’t obsess about, risk. I have chopped down a tree before, so the risk of chopping down a tree was precalculated, but I did have to size up this particular tree. Speed and spontaneity can get buried by overanalyzing (analysis paralysis).
  • It helps to change your approach and come at it from different angles. It’s easy to fall into a rut. Shift gears if you feel diminishing returns. Take a step back and come back with a fresh perspective (a new stance). Don’t go too far adrift, as initial efforts can, and should, be built upon.
  • Brute force works for a bit, but letting the weight of the axe do most of the work is the long game. The analogy here is that you don’t need to go it alone. That method just doesn’t scale. Pull together the right team and rely on them to add to the effort (ideally autonomously).
  • Keep on doing and going. OK, I admit it, that is a quote from Henry Ford, but I have adopted it. Keep chipping away at whatever you are working on and you will eventually get there.
  • Find a certain angle that works and focus there. When you home in on one area, you will make strides. Ride the momentum and make a significant dent when you have it in front of you. Never ease up when you feel you are making progress.
  • In the end, it was a coordinated effort coming at it from both sides with all my might that won. Teamwork is the name of the game.

I am sore, I have a big blister on my hand, but I did it. I thought about borrowing a chainsaw or hiring someone to do it, but I knew I had it. At some point, you just know that you got it. Own it when you do!

“Alexa, how are you different than Siri?” “I’m more of a home-body”

I’m away from my desk, so I guess I can’t ask Alexa. No problem, I’ve got an iPhone in my pocket.

“Hey Siri, what’s the status of my Amazon order?” “I wish I could, but Amazon hasn’t set that up with me yet.” Doh!

IPAs (intelligent personal assistants*) are in their infancy, but they are a next major step in human-computer interaction. With the expected concurrent growth of IoT and connected devices, IPAs will be everywhere soon. Consider that it is easier to fit a small mic and speaker into a device than a screen and keyboard, and often easier to interact with such via voice outside of the desktop environment.

However, as the highly-contrived (after all I actually am at my desk typing this, and Alexa is giving me dirty looks) scenario above illustrates, IPAs have different capabilities, and different strengths and weaknesses. While Alexa and Siri both want to be my concierge, I’m more likely to talk to Watson when I want to discuss cancer treatments or need to pwn Jeopardy. When I’m hungry after midnight, it’s TacoBot to the rescue.

As a user, I already interact with more than one IPA, and over time this number is only going to grow. I want to use the IPA that is both best and most convenient for my immediate need. I have no interest in being restricted to a single IPA vendor’s ecosystem; likewise I don’t want to have to juggle endpoints and IPAs for every little task. And Taco Bell wants to craft their own brand and persona into TacoBot instead of subsuming it into one of the platform IPAs or chasing every third-party platform in a replay of the mobile app days.

What I really need is for the assorted IPAs in my life to work together on my behalf as a team. When I’m out and about, I want Siri to go ask Alexa when my order will arrive. Neither IPA alone can meet my criteria: report on order status while I’m away from my home, but Siri [mobile] and Alexa [connected to Amazon ordering] can achieve this collaboratively. Consider some of the aspects of complex, non-trivial tasks:

  • Mobility and location
  • Interactions with multiple, cross-vendor external systems
  • Asynchronous: real-world actions may take time to occur and aren’t containable within a one-time “conversation” with a current-state IPA
  • Deep understanding of both complicated domains and of my highly-personalized circumstances and context

So how do we herd these cats? One challenge is the mechanics of IPA-to-IPA communication. Will they speak the same language? How will each understand what another is good at? What if the other is knowledgeable about an area completely outside of the first IPA’s knowledge area?

APIs are the first, easiest option. They generally require explicit programming, but the interfaces are highly efficient and well-defined structurally. This is both a strength and a weakness, as well-defined structure imparts a rigidity and implication of understanding on both “client” and “server”. The Semantic Web was one attempt to address understanding gaps without explicit programming on both sides.

Another option is the use of human language. IPAs are rapidly learning to become better at this defining skill, and if they can communicate with people then why not use natural language capability with each other? Human language can be very expressive, if limited in information rate (good luck speaking more bits/s than a broadband connection), but efficiency and accuracy are concerns, at least with the current state of technology. One argument is that an IPA that does not fully understand a user’s language may better serve the user by simply relaying these words to another more suitable IPA instead of attempting to parse that poorly-understood language into an appropriate API call.

Of course, this is not an either/or decision and both may be utilized to better effect.

Language Interface for Conversational AIs

As this team of IPAs becomes more collaborative, another issue emerges that any manager will appreciate: how best to coordinate so that these IPAs function as a team rather than an inefficient collection of individuals.

  • One low-friction model is command-and-control. Alexa (or Siri, or Cortana, or Google, or… ) is the boss, makes dispatch decisions, and delegates to other IPAs.
  • Agile methodologies may provide inspiration for more collaborative processes. Goals are jointly broken down and estimated in terms of confidence, capability, etc. by the team of IPAs, and individual subtasks agreed upon and committed to by a voting system.
  • Because computation is cheap and generally fast in human time, a Darwinian approach may also work. Individual IPAs can proceed in competition and the best, or fastest, result wins. Previous wins, within a given context, will add a statistical advantage for the winning IPA in future tasks.
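
As a rough illustration of the Darwinian model in the last bullet, the sketch below lets every assistant attempt a request and picks the answer with the best score, where the score is the assistant's own confidence multiplied by a smoothed share of its previous wins in the same context. All of it is hypothetical: the DarwinianDispatcher class, the two toy assistants, the confidence values, and the scoring rule are assumptions chosen for illustration, not any vendor's API. For simplicity the assistants run one after another and the "best" result wins; a "fastest wins" variant would need concurrent execution.

```python
from collections import defaultdict


class DarwinianDispatcher:
    """Hypothetical coordinator: every assistant competes on every request."""

    def __init__(self, assistants):
        # name -> callable(request) returning (answer, confidence) or None
        self.assistants = assistants
        # context -> assistant name -> number of previous wins
        self.wins = defaultdict(lambda: defaultdict(int))

    def ask(self, context, request):
        total_wins = sum(self.wins[context].values())
        candidates = []
        for name, assistant in self.assistants.items():
            result = assistant(request)
            if result is None:  # this assistant cannot help with the request
                continue
            answer, confidence = result
            # Smoothed share of past wins in this context: the statistical
            # advantage earned by previous winners (add-one smoothing).
            prior = (self.wins[context][name] + 1) / (total_wins + len(self.assistants))
            candidates.append((confidence * prior, name, answer))
        if not candidates:
            return None
        _, winner, answer = max(candidates)
        self.wins[context][winner] += 1
        return winner, answer


def home_assistant(request):
    """Toy assistant that only knows about orders (hypothetical)."""
    if "order" in request:
        return ("Your order arrives Thursday.", 0.9)
    return None


def phone_assistant(request):
    """Toy assistant that always offers a low-confidence fallback (hypothetical)."""
    return ("Here is what I found on the web.", 0.4)


dispatcher = DarwinianDispatcher(
    {"home_assistant": home_assistant, "phone_assistant": phone_assistant}
)
order_result = dispatcher.ask("shopping", "where is my order?")
weather_result = dispatcher.ask("shopping", "what is the weather like?")
print(order_result)
print(weather_result)
```

The command-and-control and voting models could reuse the same assistant interface; only the selection step inside ask would change.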

As IPAs become more and more entwined in our daily lives and embedded into the devices that surround us, we will learn to utilize them as a collaborative community rather than as individual devices. Unique personas become a “customer service” skill, but IPAs with whom we do not always wish to communicate directly still have value to provide. This collective intelligence is one of the directions in which we can expect to see significant advances.

* Also delicious, delicious beers. Mmmmmm beer…

While working with eBay Enterprise on a large multi-company effort to create a marketing platform, I saw first-hand the struggle around effective project management. I played Scrum Master for our development team and immersed myself in the eBay Enterprise team, including keeping a desk at their facility. Actually, our entire team kept desks there. We didn’t come in every day, as we were more productive working from home or in our offices that had private rooms, creating an environment with limited distractions. Collaborating with team members in person was a weekly, if not twice-weekly, occurrence. We were at eBay every Tuesday (all day), regardless of scheduled meetings. The eBay architects and managers we interfaced with knew we would be there every Tuesday and would schedule meetings accordingly. “Water cooler talk” (actually more like foosball and lunch talk) was also pretty strong on Tuesdays. We had 2-week sprints and scheduled the demo/review, retrospective, and planning meetings all on the last Friday of each sprint, also held on site. We would also frequently come in between scheduled days on site, as needed.

I am going into detail on how we as a team were amalgamated into eBay’s development methodology and infrastructure. We were a tightly integrated extension of their team. It worked very well. When the Scaled Agile Framework (SAFe) was deployed, it wreaked havoc. First off, it was done mid-project, so it was disruptive. The real issue was that the project was already off track. We were 1 of 5 other teams on the project. There was another large Fortune 500 company on as a partner that doubled as a contractor, building a complementary product that was intended to be married to the eBay engine and sold in separate domains (the other company was mostly interested in credit card and banking customers).

We were the “A Team” that set the quality standards. We had a higher velocity than the other teams, but points were normalized, which I thought was a huge mistake. It was done for the sake of making it easier to report points and track progress for upper management. The bigger issue was the fact that many of the other teams were “Agile-in-Name-Only”, an all too common phenomenon we run across. If you are not continuously cranking out quality, tested, ready-to-deploy code every sprint (with exceptions for Sprint 0 for setup and for Hardening, Innovation, and Planning (HIP) sprints), then you are not Agile. Continuous is the operative word and that should transcend development, testing, DevOps, and management.

SAFe has benefits and should be leveraged to create your most effective project teams. For us, the main boost to productivity that it brought was the concept of a Product Owner, typically an architect who serves as the main translator of product requirements from Product Managers. Product Owners are responsible for making sure there are enough stories written out to keep the development team coding. They are also responsible for doing code reviews each sprint and accepting stories.

Developers are still the heart of Agile efforts and must be shielded from typical company distractions like excessive meetings and other activities that are not coding. The fable of the Chicken and the Pig is a great way of thinking about team dynamics. Chickens (product managers, project managers, CTOs, VPs of Engineering, etc.) are not core to the development of code. They are influential and very important, but they are not as “committed” and must respect the developers’ time. We have added in a 3rd character, The Wolf, whose role is to make sure the whole operation is embracing Agile and not being affected by outside factors (e.g., upper management budgeting and scheduling in Waterfall ways). The Wolves BigR.io has on staff are adept at identifying the people, policies, and processes that might impact velocity and diplomatically removing them. Wolves need not worry about corporate politics or any other roadblock to delivering good quality production-ready code that meets or exceeds spec and covers all non-functional requirements.

The principles of Agile are available in a multitude of locations, but be sure to read through the Twelve Principles of Agile Software first as well as the history behind the Agile Manifesto. Also note that Agile is not a “religion” and while structure is key, adaptability to what will make your team thrive is paramount. Beyond the fundamentals like continuous development and testing, it really is all about the rhythm of the team. Ease of collaboration, team camaraderie and energy can’t be underestimated. Here are a handful of components in our approach that I believe are mandatory:

  • Team lunches – help build the team dynamic and foster more cohesion. Ideally at least once a week and ideally in-person, but virtual is better than not doing it at all.
  • True retrospectives – give everyone an equal voice and effectuate the changes (within reason).
  • Modern environment – the right tools, equipment, and infrastructure.
    1. QA automation and continuous testing
    2. Continuous integration e.g., Jenkins
  • Planning Poker – have all Pigs in a room (ideally the last day of the prior sprint), prioritize the stories that will make it into the next sprint, and then estimate each story having every Pig vote. Our preference is to use the Fibonacci sequence for estimating relative size of stories and throw out the outliers – 1 on the high side and 1 on the low.
  • Tasking stories – the sooner this is done the better, but in our experience this is better done after everyone has had some time to think them through. We usually make sure all tasking is done by the end of the 1st Tuesday of each sprint.
  • “Control-Freedom Balance” – we make sure each team member burns down their stories as they progress. Keeping a pulse on progress is key for highlighting problem areas and making sure the project stays on track. As tech lead or Scrum Master, you really need to “measure so you can manage” and strike the delicate balance of just enough controls with developer freedom.
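
The Planning Poker step above can be sketched in a few lines. The text only says that votes use the Fibonacci sequence and that one high and one low outlier are thrown out; how the remaining votes become a single estimate is not stated, so this sketch assumes the remaining votes are averaged and snapped to the nearest Fibonacci value (ties going to the larger value). The function name and the point scale are also assumptions.

```python
# Story point scale (an assumed subset of the Fibonacci sequence).
FIBONACCI_POINTS = (1, 2, 3, 5, 8, 13, 21)


def estimate_story(votes):
    """Combine one round of Planning Poker votes into a single estimate.

    Drops one vote on the high side and one on the low side, then (as an
    assumption of this sketch) averages the rest and snaps the mean to the
    nearest value on the scale, preferring the larger value on a tie.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    for vote in votes:
        if vote not in FIBONACCI_POINTS:
            raise ValueError(f"{vote} is not on the point scale")
    ordered = sorted(votes)
    # Only trim when something would be left over afterwards.
    trimmed = ordered[1:-1] if len(ordered) > 2 else ordered
    mean = sum(trimmed) / len(trimmed)
    return min(FIBONACCI_POINTS, key=lambda point: (abs(point - mean), -point))


# Five Pigs vote; the 1 and the 13 are discarded, leaving 3, 5, 5.
print(estimate_story([1, 3, 5, 5, 13]))
```

In practice a wide spread is usually a cue for discussion and a re-vote rather than something to average away, so a team might flag such rounds instead of computing a number.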

BigR.io’s adaptation of Agile is a key element of our team’s DNA. Our engineers embrace and employ our methodology to consistently deliver for our customers. Team member roles and responsibilities are well known at the onset of every project and everyone maintains strict adherence to the established methodology (which by design is adaptable) throughout the project. I will leave by saying that Agile is not hype, it really works, especially if you embrace the core fundamentals/principles, but adapt the structure to your team and company.

For more information on our engagement with eBay please review our case study.

Recently, a customer asked us to help transition a set of data flows from an overwhelmed RDBMS to a “Big Data” system. These data flows had a batch dynamic, and there was some comfort with Pig Latin in-house, so this made for an ideal target platform for the production data flows (with architectural flexibility for Spark and other technologies for new functionality, but here I digress).

One wrinkle from a vanilla Hadoop deployment: they wanted schema enforcement soup-to-nuts. A first instinct might be that this is simply a logical data warehouse – and perhaps it is. One hears so often these days about Hadoop and Data Lakes and Schema-On-Read as the new shiny that it is easy to forget that Schema-On-Write also has a time and a place, and as with most architectural decisions, there are tradeoffs – right (bad pun… intended?) times for each.

Schema-On-Read works well when:

  • Different valid views can be projected on a given data set. This data set may not be well understood, or may be applicable across a number of varied use cases.
  • Flexibility outweighs performance.
  • The variety “V” is a dominant characteristic. Not all data will fit neatly into a given schema, and not all will actually be used; save the effort until it is known to be useful.

Schema-On-Write may be a better choice when:

  • Productionizing established flows using well-understood data.
  • Working with data that is more time-sensitive at use than it is at ingest. Fast interactive queries fall into this category, and traditional data warehousing reports do as well.
  • Data quality is critical – schema enforcement and other validation prevent “bad” data from being written, removing this burden from the data consumer.
  • Governance constraints require metadata to be tightly controlled.
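
The difference between the two approaches comes down to where validation happens. The sketch below contrasts them with an in-memory list standing in for storage; the ORDER_SCHEMA fields, the function names, and the choice to silently skip non-conforming records at read time are all assumptions made for illustration, not a description of how any particular Hadoop component behaves.

```python
import json

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def validate(record, schema):
    """Return the record if it matches the schema, otherwise raise ValueError."""
    missing = schema.keys() - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, expected_type in schema.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} is not {expected_type.__name__}")
    return record


def write_with_schema(store, raw_line, schema):
    """Schema-on-write: parse and validate before anything is stored."""
    store.append(validate(json.loads(raw_line), schema))


def write_raw(store, raw_line):
    """Schema-on-read: store the raw line untouched; ingest is cheap."""
    store.append(raw_line)


def read_with_schema(store, schema):
    """Schema-on-read: project a schema at query time, skipping bad records."""
    for raw_line in store:
        try:
            yield validate(json.loads(raw_line), schema)
        except ValueError:  # also covers malformed JSON (JSONDecodeError)
            continue


good = '{"order_id": 1, "amount": 19.99, "currency": "USD"}'
bad = '{"order_id": 2, "amount": "oops", "currency": "USD"}'

warehouse, lake = [], []

write_with_schema(warehouse, good, ORDER_SCHEMA)
try:
    write_with_schema(warehouse, bad, ORDER_SCHEMA)
except ValueError as error:
    print("rejected at write time:", error)

write_raw(lake, good)
write_raw(lake, bad)
readable = list(read_with_schema(lake, ORDER_SCHEMA))
print(len(warehouse), len(lake), len(readable))
```

With schema-on-write the consumer never sees the bad record, at the cost of rejecting it at ingest. With schema-on-read both lines are kept, and every reader pays the validation cost and decides what to do with records that do not fit.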

It’s a very exciting time to be in the data world, with new and groundbreaking technologies released seemingly every day. There is every temptation to pick up today’s new shiny, find an excuse to throw it into production, and call it an architecture. Of course, a more deliberate approach is required for long-term success – but that doesn’t mean that there isn’t a time and place to incorporate the newest technologies!

In this post, we take a look at the different phases of data architecture development: Plan, PoC, Prototype, Pilot, and Production. Formalizing this lifecycle, and the principles behind it, ensures that we deliver low-risk business value… and still get to play with the new shiny.

Phases of data architecture development

Plan

Before a single line of code is written, a single distribution downloaded, or the first line or box drawn on a whiteboard, we need to define and understand a data strategy and use that to derive business objectives. The best way to accomplish this? Start by locking business and technical stakeholders together in a room (it helps to be in the room with them). Success is defined by business value, and we need to combine strategic and tactical business goals with real-world technical and organizational constraints. Considerations such as platform scalability, data governance, and data dynamics are important – but all are in support of the actual business uses for that data.

This is not limited to new “green field” architectures – unless a business is a brand new startup still in the garage, there is data and there is a (perhaps organic, default) data architecture. This architecture can be assessed for points of friction, and then adjusted per business objectives.

PoC

As the business objectives are solidified, the architect will assemble likely combinations of technologies both well-known and, yes, shiny. All such candidate architectures have tradeoffs and unknowns – while the core technologies may be well-understood, it’s a given that the exact application of those technologies to specific, unique business objectives is, well, unique. Don’t believe anybody who says they have a one-size-fits-all solution! While some layers of data architecture are becoming common, if not standard, in 2016 modern data architecture is still very much about gluing together disparate components in specific ways.

To this end, certain riskier possibilities will be identified to apply approaches and technologies to a given business objective. Often a proof of concept (PoC) will be developed to validate the feasibility of these possibilities. This phase should be considered experimental, will often utilize representative “toy” problems, and failure is considered a useful (and not uncommon) outcome. It goes without saying that a PoC is not intended to be a production-quality system.

Prototype

Once areas of technical risk have been addressed with appropriate PoCs and an overall candidate architecture selected, the overall architecture should be tested against more representative use cases. Given the “glue” nature of data architectures, there is plenty of room for the unknown in the overall system even when the individual components are well-understood. A prototype may use manufactured, manageable data sets, but the data and the system should reflect realistic end-to-end business objectives. The prototype is also not intended to be production quality.

Pilot

When a prototype has demonstrated systemic feasibility, it is time to implement a pilot. A pilot is a full-quality production implementation of the architecture, limited in scope to a narrow (but complete) business objective. Strategically, the pilot should be a high-win project that is capable of providing real and visible value, even as a standalone system. Most organizations will use the pilot as a means to earn buy-in from all stakeholders to move into full production, which typically impacts the entire organization.

Production

After an architecture has gone into full production, it should continue to be monitored and re-evaluated in an iterative process. Where is the architecture really performing well, and where are the weaker points? What new business objectives arise, and is any new functionality required to support them? Have any new technologies been released that may have impact on “weaker” points of the architecture? What’s different about the business today than when the architecture was originally planned?

The Big Data landscape is largely dominated by powerful free open source technologies. Different configurations and applications of these technologies seemingly consume the majority of mindshare, and it can be easy to lose sight of commercial offerings that can provide relevant business value.

Some of the areas in which commercial vendors offer particular value:

  • Managed Architectures – Assembling, operating, and maintaining an architecture from stock project distributions can be daunting. It can make sense for many organizations that need fine-grained control and have a sufficient investment in the appropriate technical staff; however, many in this situation still find stack distributions to be a source of lift for architecture build-out, ongoing maintenance, and version compatibility. For smaller organizations, managed offerings can provide a great reduction in on-site effort required to create and operate a stack, with the tradeoff of a reduced level of granular control – version lag, for example.
  • Specialty Query Engines – Hive, Impala, and other open source technologies represent a powerful query capability across both structured and semi-structured data sources. Technologies such as Apache Drill extend this reach even farther, and for focused needs such as search, Apache Solr is a powerful solution. However, the commercial space abounds with query engines that are focused on niche/industry-specific use cases and libraries, and with those that seek to provide even greater performance in specific situations.
  • Integrations – Vendors are often able to provide a library of custom data ingest and egress connectors to specific products, technologies, and data sets. Common examples include ERP systems, embedded device platforms, and proprietary niche data systems.
  • Collaboration and Process – Enterprise-level programs involve many contributors in many roles working on parallel projects over years on the calendar. Just as the software engineering trade has developed processes (Agile) and technologies (source control, defect tracking, project planning), data science and analytics and (meta)data management programs will suffer if they lack similar maturity of process. There is a rich vendor space in this area, and many “enterprise editions” of open source packages attack this need as well.
  • Metadata Management – While technologies such as Hive/HCatalog and Avro/Parquet are widely used to manage schema metadata, metadata management is not comprehensively addressed purely within the open source community. Vendors have value to add in areas such as security management, governance, provenance, discovery, and lineage.
  • Wrangling and Conditioning – Data comes dirty, it comes fast or it comes with different schema, format, and semantics, and the lines separating different irregularities are very blurry. This record is different from the last one: is it dirty? Did the model change? Is this even part of the same logical data set? Open source ETL GUI tools and dataflow languages provide a tried-and-true means for creating fairly static logic pipelines, but commercial solutions can bring innovative approaches for applying machine learning and crowdsourcing techniques to data validation and preprocessing use cases.
  • Reporting and Visualization – Last but not least, the commercial marketplace is rich with products that will consume data, big and small, and help make the information contained within that data accessible, bridging the gap from technology back to insights providing business value.

So when planning your next data architecture, consider whether your business constraints compel you to plan for a purely FOSS solution, or whether commercial technologies may have a place within your architecture. Open source software is a foundational core, but in these and other areas, commercial technologies also have a lot to offer.

A recent WSJ article echoes an FTC report released last Wednesday warning of the possible consequences of bias in Big Data applications. The article identifies a number of valid concerns around privacy, equal opportunity, and accuracy. It also rightly hints at possible positive consequences.

For example, it cites cases where people judged poor credit risks by conventional means may receive loans as a result of big-data techniques. That is good news for those people, and time will tell whether the lenders identified an underserved viable market, or whether bias simply caused them to make a poor investment. All models, including traditional analyses, will have error, and ultimately we want to reduce both false positives and false negatives.

So we know that our models will have error, and the theme of this article is that a significant part of that error comes in the form of bias. It is a poor assumption that an analysis of any single data set (social media is a popular case) represents the whole population. Do people of all ages, nationalities, races, and income levels use social media in the same proportions in which they appear in the general population? Probably not.

So how can we make this work to our advantage?

  1. Consider bias as a first-cut classification. A common application of big data techniques is to classify large numbers of people into specific, targeted subgroups. We get our first coarse-grained categorization for free.
  2. Use the bias to select additional complementary data sets. If you understand the bias in your current data set, then you can strategically select additional data sets that give the best bang-for-the-buck in an effort to broadly analyze the general population. Calibrate your aggregate model by combining complementary data sets.
  3. Monitor production models. As the article observes, blind trust in correlations can be dangerous. Still, correlations can represent opportunities to be exploited. The key to safe utilization without a solid understanding of root cause is to assume those opportunities are temporary. Monitor their performance, and blow whistles as soon as the results begin to deviate from expectations.
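The monitoring advice in the third point can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the class name, the `expected_rate` baseline, the tolerance, and the window size are all hypothetical placeholders.

```python
from collections import deque


class CorrelationMonitor:
    """Track a rolling hit rate and flag drift from an expected baseline.

    All parameters are illustrative assumptions, not recommended values.
    """

    def __init__(self, expected_rate, tolerance=0.10, window=100):
        self.expected_rate = expected_rate
        self.tolerance = tolerance
        # A bounded deque keeps only the most recent `window` outcomes.
        self.outcomes = deque(maxlen=window)

    def record(self, hit):
        """Record one outcome (True if the model's prediction held up)."""
        self.outcomes.append(1 if hit else 0)

    def observed_rate(self):
        """Hit rate over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Return True once the window is full and the rate has drifted."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return abs(self.observed_rate() - self.expected_rate) > self.tolerance
```

In use, each scored prediction would be passed to `record()` once its real-world outcome is known, and `should_alert()` would be polled to decide when to blow the whistle.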

George Box had it right: all models are wrong, but some are useful!
Figure: Example of complementary data sets (hypothetical, and for illustration only!).
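As a rough sketch of how a known bias might guide the choice of a complementary data set, the snippet below scores each candidate by how closely the combined group mix would match the general population. The group names, counts, and data set names are invented for illustration, and the total variation distance is only one of several reasonable ways to measure the gap.

```python
def shares(counts):
    """Convert per-group record counts into proportions."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def coverage_gap(counts, population_share):
    """Total variation distance between a data set's mix and the population's."""
    mix = shares(counts)
    return 0.5 * sum(abs(mix.get(g, 0.0) - p) for g, p in population_share.items())


def combine(a, b):
    """Pool the per-group counts of two data sets."""
    return {g: a.get(g, 0) + b.get(g, 0) for g in set(a) | set(b)}


def best_complement(current, candidates, population_share):
    """Name of the candidate that brings the pooled mix closest to the population."""
    return min(
        candidates,
        key=lambda name: coverage_gap(combine(current, candidates[name]), population_share),
    )


# Hypothetical figures: a social media sample that skews young.
population_share = {"under_30": 0.25, "30_to_60": 0.50, "over_60": 0.25}
social_media = {"under_30": 600, "30_to_60": 350, "over_60": 50}
candidates = {
    "loyalty_cards": {"under_30": 100, "30_to_60": 600, "over_60": 300},
    "mobile_app": {"under_30": 700, "30_to_60": 280, "over_60": 20},
}

choice = best_complement(social_media, candidates, population_share)
```

With these made-up numbers, the loyalty card data offsets the skew while the mobile app data reinforces it, so the former is selected.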

When designing an enterprise architecture for business intelligence, advanced analytics, and other data-centric applications, it is often useful to capture major data flows. This may require some research into use cases and tooling and even a bit of hard thinking, but it’s a straightforward exercise. What isn’t so straightforward is capturing the state of metadata that accompanies these data flows. Understanding metadata is critical to obtaining full value and control from your data architecture, but it doesn’t lend itself as well to a typical data flow view.

Common data concerns that stem from gaps or inconsistencies within the metadata domain include:

  • Lineage – What source data contributed to a given view or data set, and how has that data been altered to get there?
  • Impact Analysis – The converse of lineage; if a data source were to be altered, which consuming applications would be affected?
  • Access & Compliance – Who is able to view or alter data? Who has exercised that ability on a given data set? Have their larger-scale access patterns put them in danger of any compliance issues?
  • Quality – How complete and accurate is a particular view on a data set?
  • Implementation – How long and how much effort is required to add a new data source or consumer?
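Lineage and impact analysis are two traversals of the same dependency graph, in opposite directions. The sketch below illustrates that relationship with a hypothetical in-memory graph; the class and the data set names are assumptions made for the example and do not refer to any particular metadata tool.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph of data flows from sources to consumers."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> direct consumers
        self._upstream = defaultdict(set)    # consumer -> direct sources

    def add_flow(self, source, consumer):
        """Register that `consumer` is derived (in part) from `source`."""
        self._downstream[source].add(consumer)
        self._upstream[consumer].add(source)

    @staticmethod
    def _reach(start, edges):
        """All nodes reachable from `start` by following `edges` transitively."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def lineage(self, dataset):
        """Everything upstream that contributed to `dataset`."""
        return self._reach(dataset, self._upstream)

    def impact(self, dataset):
        """Everything downstream that would be affected if `dataset` changed."""
        return self._reach(dataset, self._downstream)


# Hypothetical flows for illustration.
graph = LineageGraph()
graph.add_flow("crm", "customer_view")
graph.add_flow("erp", "customer_view")
graph.add_flow("customer_view", "churn_report")
graph.add_flow("erp", "finance_dashboard")
```

Here `graph.lineage("churn_report")` walks back to both systems of record, while `graph.impact("erp")` lists every view and report that an ERP change could touch.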

In order to structure and assess these concerns, a metadata lifecycle model (MDLM) can be a useful architectural tool. The MDLM can be used as a framework to evaluate candidate architectures.

The Metadata Lifecycle Model

There are three user-driven events that segment different stages in the metadata lifecycle. Addressing all lifecycle stages for each data concern is a best practice; stages that are skipped or are inconsistent are primary contributors to the concerns outlined above.

The identified lifecycle stages are gated in time by the occurrence of these events:

  • Data Source Import – Incorporating a new data source involves not only ingesting the raw data, but also understanding the nature of that data. Note that this event is defined here as making the data available for use by applications; simple ingest-and-store (as per a Data Lake architecture) does not meet the definition.
    • Generate – Metadata describing the data source must be generated (this may be an automated or manual process depending upon the nature of the metadata and the available tooling). Data import is an ideal time to capture metadata, as research and documentation are more likely to be available at this time; furthermore, there may be additional time to invest, as there may not yet be immediate demand for that data source.
    • Persist – This generated metadata must be captured into a documentation system that will support later discovery and usage of this metadata.
  • Consumer Design – When a consuming view or application is designed, it operates in a specific metadata context. While some types of metadata may be defined by a public or private standard, this is often not the case (or the standard may not be strict enough to definitively define every aspect of the metadata: the non-standard standard is standard).
    • Generate – The assumptions within the metadata domain are declared for the consumer. This ensures that the consumer’s operating environment is well-understood by the designer, and provides an opportunity below to address differences in metadata from the source(s).
    • Note that within the framework presented here, this is a declarative step; differences from source metadata are addressed below in the Discover and Mediate stages.
    • Persist – The consumer may itself become the source to another “downstream” consumer in the future, and therefore its metadata should be similarly persisted.
    • Discover – As sources are identified to feed into the consumer, their metadata documentation must be discovered within the persistence tool.
    • Mediate – With Source and Consumer metadata in hand, differences must be identified and a reconciliation strategy defined. Specific actions will depend on the type of metadata, but examples include ETL design or access log configuration.
    • A well-designed metadata management solution will not just persist and index metadata, but will also help catalog a library of mediation strategies, allowing previous work to be reused.
  • Consumer Runtime – When the consumer is run over actual data by an end user, the strategies from the metadata domain described above must be applied efficiently to the runtime data domain.
    • Apply – Mediation strategies are applied to the data stream/query. Examples include execution of an ETL job and appending to a lineage chart.
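The events and stages above can be written down as a small data structure, which makes it possible to check a candidate architecture for skipped stages. The following is one possible encoding, offered as a sketch; the `missing_stages` helper and the shape of its input are assumptions made for illustration.

```python
from enum import Enum


class Event(Enum):
    DATA_SOURCE_IMPORT = "Data Source Import"
    CONSUMER_DESIGN = "Consumer Design"
    CONSUMER_RUNTIME = "Consumer Runtime"


# Stages gated by each event, in the order described in the text.
LIFECYCLE = {
    Event.DATA_SOURCE_IMPORT: ["Generate", "Persist"],
    Event.CONSUMER_DESIGN: ["Generate", "Persist", "Discover", "Mediate"],
    Event.CONSUMER_RUNTIME: ["Apply"],
}


def missing_stages(covered):
    """Return the stages a candidate architecture does not address.

    `covered` maps an Event to the set of stage names the candidate handles
    for one metadata type (hypothetical input format).
    """
    gaps = {}
    for event, stages in LIFECYCLE.items():
        missing = [s for s in stages if s not in covered.get(event, set())]
        if missing:
            gaps[event] = missing
    return gaps


# Hypothetical candidate that documents metadata but never reconciles it.
candidate = {
    Event.DATA_SOURCE_IMPORT: {"Generate", "Persist"},
    Event.CONSUMER_DESIGN: {"Generate", "Persist"},
    Event.CONSUMER_RUNTIME: {"Apply"},
}
gaps = missing_stages(candidate)
```

Running the check once per metadata type gives a simple gap matrix for comparing architectures.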

Types of Metadata

The MDLM applies to metadata in general, but it is most useful to identify specific types of metadata that pertain to particular goals, and examine individual treatment of these types throughout the lifecycle. The following metadata types serve to address the concerns outlined earlier in this document:

  • Context – Schema, format, and semantics
  • Lineage – Evolution of data: starting with the System of Record (SOR), identify all “touches” in the logic chain applied to the data
  • Security – Specific access rules for given data. This may additionally include a record of previous accesses (i.e., an audit log).

Note that this is not a comprehensive list, and that the MDLM can be generally applied to other types of metadata.

To conclude, we will apply the MDLM to these metadata types. Consider a data warehousing architecture as a motivating backdrop.

Context

  • Generate (Source) – Identify schema and column formatting; correlate specific entities and attributes with Master Data definitions and business glossary terms.
  • Persist (Source) – Store this information in a metadata documentation system.
  • Generate (Consumer) – Identify specific semantic concepts required for the consumer, and select a target schema/format in which to work with those concepts. This may be identification or creation of an operational standard.
  • Persist (Consumer) – Store this information in a metadata documentation system.
  • Discover – Select data sources, and locate context documentation.
  • Mediate – Locate or design ETL (ELT, etc.) logic to transform data between source(s) and consumer.
  • Apply – Execute the ETL job.
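To make the Mediate and Apply steps for context metadata more tangible, here is a toy sketch in which both sides declare their columns against shared glossary terms and the mediation step derives a transformation plan. The column names, glossary terms, and the simplification that every format mismatch is a date format difference are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical persisted context metadata: source columns mapped to glossary terms.
SOURCE_CONTEXT = {
    "cust_nm": {"term": "customer_name", "format": "str"},
    "signup_dt": {"term": "signup_date", "format": "%m/%d/%Y"},
}

# Hypothetical consumer context: required glossary terms and target formats.
CONSUMER_CONTEXT = {
    "customer_name": {"format": "str"},
    "signup_date": {"format": "%Y-%m-%d"},
}


def mediate(source_ctx, consumer_ctx):
    """Build a per-column plan of (source column, target column, converter)."""
    plan = []
    for column, meta in source_ctx.items():
        term = meta["term"]
        if term not in consumer_ctx:
            continue  # the consumer does not need this concept
        src_fmt, dst_fmt = meta["format"], consumer_ctx[term]["format"]
        if src_fmt == dst_fmt:
            convert = lambda value: value
        else:
            # Simplifying assumption: any mismatch is a date format difference.
            convert = lambda value, s=src_fmt, d=dst_fmt: (
                datetime.strptime(value, s).strftime(d)
            )
        plan.append((column, term, convert))
    return plan


def apply_plan(plan, row):
    """Apply step: run the mediation plan over one source record."""
    return {target: convert(row[source]) for source, target, convert in plan}


plan = mediate(SOURCE_CONTEXT, CONSUMER_CONTEXT)
result = apply_plan(plan, {"cust_nm": "Ada", "signup_dt": "03/15/2016"})
```

The plan is computed once at design time and reused at runtime, which mirrors the split between the Consumer Design and Consumer Runtime events.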

Lineage

  • Generate (Source) – Document the SOR. If this is a relative SOR that sources data “upstream”, an interface may be available to provide richer lineage.
  • Persist (Source) – If this is a new SOR, add it to a central directory of data sources and processes.
  • Generate (Consumer) – Specify a designation for the new consumer.
  • Persist (Consumer) – Add this designation to the central directory of data sources and processes.
  • Discover – Locate data source SOR designations.
  • Mediate – Design an append operation to the lineage chart.
  • Apply – Execute the append operation to the lineage chart.
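The split between designing the append operation (Mediate) and executing it (Apply) might look like the following. This is only a sketch: the lineage chart is modeled as a plain list, and the entry fields and `run_id` parameter are hypothetical.

```python
def design_append(consumer, sources, transformation):
    """Mediate: return an operation that appends one entry to a lineage chart."""

    def append(chart, run_id):
        # Apply: record which sources fed the consumer on this run.
        chart.append(
            {
                "run_id": run_id,
                "consumer": consumer,
                "sources": sorted(sources),
                "transformation": transformation,
            }
        )
        return chart

    return append


# Designed once at consumer design time (names are illustrative).
append_op = design_append("churn_report", {"crm", "erp"}, "join on customer_id")

# Executed on every consumer run.
chart = []
append_op(chart, "run-001")
append_op(chart, "run-002")
```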

Security

  • Generate (Source) – Define access rules for the data source.
  • Persist (Source) – Store this information in a metadata store or security enforcement system.
  • Generate (Consumer) – Identify required access to data sources; specify downstream access restrictions.
  • Persist (Consumer) – Store this information in a metadata store or security enforcement system.
  • Discover – Identify available access to the required data sources.
  • Mediate – Select or create a security principal that supports all required access on the data source.
  • Apply – Utilize this principal against the data source.
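The Mediate step for security amounts to finding a principal whose grants cover everything the consumer requires. A minimal sketch follows; the principal names and the (source, action) grant model are hypothetical, and the preference for the principal with the fewest excess grants is an added least-privilege assumption rather than something the model prescribes.

```python
def select_principal(principals, required):
    """Pick a principal whose grants cover every required (source, action) pair.

    Among qualifying principals, prefer the one with the fewest extra grants.
    Returns None when no existing principal qualifies, signaling that a new
    one must be created.
    """
    candidates = [
        (len(grants - required), name)
        for name, grants in principals.items()
        if required <= grants  # subset test: all required access is granted
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical persisted security metadata.
principals = {
    "etl_service": {("crm", "read"), ("erp", "read"), ("erp", "write")},
    "report_reader": {("crm", "read"), ("erp", "read")},
    "crm_only": {("crm", "read")},
}

chosen = select_principal(principals, {("crm", "read"), ("erp", "read")})
```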

In conclusion, the MDLM can provide a useful framework for assessing metadata gaps in a data architecture. If these steps are well covered, then risks of a flawed architecture are greatly reduced, and the overall effort can proceed to address specific tooling solutions.