The Big Data landscape is largely dominated by powerful free open source technologies. Different configurations and applications of these technologies seemingly consume the majority of mindshare, and it can be easy to lose sight of commercial offerings that can provide relevant business value.

Some of the areas in which commercial vendors offer particular value:

  • Managed Architectures – Assembling, operating, and maintaining an architecture from stock project distributions can be daunting. Doing so can make sense for organizations that need fine-grained control and have invested sufficiently in the appropriate technical staff. However, many in this situation still find stack distributions to be a source of lift for architecture build-out, ongoing maintenance, and version compatibility. For smaller organizations, managed offerings can greatly reduce the on-site effort required to create and operate a stack, with the tradeoff of reduced granular control (version lag, for example).
  • Specialty Query Engines – Hive, Impala, and other open source technologies represent a powerful query capability across both structured and semi-structured data sources. Technologies such as Apache Drill extend this reach even further, and for focused needs such as search, Apache Solr is a powerful solution. However, the commercial space abounds with query engines that focus on niche or industry-specific use cases and libraries, and with those that seek to provide even greater performance in specific situations.
  • Integrations – Vendors are often able to provide a library of custom data ingest and egress connectors to specific products, technologies, and data sets. Common examples include ERP systems, embedded device platforms, and proprietary niche data systems.
  • Collaboration and Process – Enterprise-level programs involve many contributors in many roles working on parallel projects over a span of years. Just as the software engineering trade has developed processes (Agile) and technologies (source control, defect tracking, project planning), data science, analytics, and (meta)data management programs will suffer if they lack similar maturity of process. There is a rich vendor space in this area, and many “enterprise editions” of open source packages address this need as well.
  • Metadata Management – While technologies such as Hive/HCatalog and Avro/Parquet are widely used to manage schema metadata, metadata management is not comprehensively addressed purely within the open source community. Vendors have value to add in areas such as security management, governance, provenance, discovery, and lineage.
  • Wrangling and Conditioning – Data comes dirty, it comes fast, or it comes with differing schemas, formats, and semantics, and the lines separating these irregularities are very blurry. This record is different from the last one: is it dirty? Did the model change? Is this even part of the same logical data set? Open source ETL GUI tools and dataflow languages provide a tried-and-true means for creating fairly static logic pipelines, but commercial solutions can bring innovative approaches that apply machine learning and crowdsourcing techniques to data validation and preprocessing use cases.
  • Reporting and Visualization – Last but not least, the commercial marketplace is rich with products that will consume data, big and small, and help make the information contained within that data accessible, bridging the gap from technology back to insights providing business value.

So when planning your next data architecture, consider whether your business constraints compel you to plan for a purely FOSS solution, or whether commercial technologies may have a place within your architecture. Open source software is a foundational core, but in these and other areas, commercial technologies also have a lot to offer.

A recent WSJ article echoes an FTC report released last Wednesday warning of the possible consequences of bias in Big Data applications. The article identifies a number of valid concerns around privacy, equal opportunity, and accuracy. It also rightly hints at possible positive consequences as well.

For example, the article cites cases where people judged poor credit risks by conventional means may receive loans as a result of big-data techniques. That is good news for those people, and time will tell whether the lenders identified an underserved but viable market, or whether bias simply caused them to make a poor investment. All models, including traditional analyses, will have error, and ultimately we want to reduce both false positives and false negatives.

So we know that our models will have error, and the theme of this article is that a significant part of that error comes in the form of bias. It is a poor assumption that an analysis of any single data set (social media is a popular case) represents the whole population. Do people of all ages, nationalities, races, and income levels appear on social media in the same proportions as in the general population? Probably not.
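
The effect of that mismatch can be sketched with a small reweighting (post-stratification) calculation. Everything in it is an assumption made for illustration: the age groups, the sample and population shares, the per-group rates, and the function name are invented, and none of the numbers come from the article or the FTC report.

```python
def poststratified_mean(sample_share, population_share, group_mean):
    """Compare a naive estimate with one reweighted to population proportions."""
    groups = population_share.keys()
    # Naive: each group counts in proportion to its presence in the sample.
    naive = sum(sample_share[g] * group_mean[g] for g in groups)
    # Adjusted: each group counts in proportion to its presence in the population.
    adjusted = sum(population_share[g] * group_mean[g] for g in groups)
    return naive, adjusted


# Hypothetical shares of each age group in a social media sample vs. the population.
sample_share = {"18-29": 0.50, "30-49": 0.35, "50+": 0.15}
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
# Hypothetical rate of some behavior of interest observed within each group.
group_mean = {"18-29": 0.80, "30-49": 0.60, "50+": 0.40}

naive, adjusted = poststratified_mean(sample_share, population_share, group_mean)
print(f"naive estimate: {naive:.2f}, population-weighted estimate: {adjusted:.2f}")
```

With these made-up inputs the naive estimate overstates the population-weighted one, simply because the sample over-represents the youngest group.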

So how can we make this work to our advantage?

  1. Consider bias as a first-cut classification. A common application of big data techniques is to classify large numbers of people into specific, targeted subgroups. We get our first coarse-grained categorization for free.
  2. Use the bias to select additional complementary data sets. If you understand the bias in your current data set, then you can strategically select additional data sets that give the best bang-for-the-buck in an effort to broadly analyze the general population. Calibrate your aggregate model by combining complementary data sets.
  3. Monitor production models. As the article observes, blind trust in correlations can be dangerous. Still, correlations can represent opportunities to be exploited. The key to safe utilization without a solid understanding of root cause is to assume those opportunities are temporary. Monitor their performance, and blow whistles as soon as the results begin to deviate from expectations.
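
The monitoring idea in the last item can be sketched as a rolling check of observed results against an expected rate. This is a minimal illustration under stated assumptions, not a production design: the `ModelMonitor` class, the expected rate, the tolerance, and the window size are all hypothetical choices.

```python
from collections import deque


class ModelMonitor:
    """Track recent outcomes and flag deviation from an expected success rate."""

    def __init__(self, expected_rate, tolerance, window):
        self.expected_rate = expected_rate
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, success):
        """Record one outcome; return True when the whistle should be blown."""
        self.outcomes.append(1 if success else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough observations yet to judge
        observed = sum(self.outcomes) / len(self.outcomes)
        return abs(observed - self.expected_rate) > self.tolerance


# Hypothetical stream: the correlation holds at first, then starts to fail.
monitor = ModelMonitor(expected_rate=0.8, tolerance=0.15, window=10)
outcomes = [True] * 8 + [False] * 4
alerts = [monitor.record(o) for o in outcomes]
first_alert = alerts.index(True) if True in alerts else None
print("first alert at observation index:", first_alert)
```

A fixed tolerance is the simplest possible rule; a real deployment would likely choose the window and threshold from the variability it is prepared to accept.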

George Box had it right: all models are wrong, but some are useful!

Figure: Example of complementary data sets (hypothetical, and for illustration only!).

When designing an enterprise architecture for business intelligence, advanced analytics, and other data-centric applications, it is often useful to capture major data flows. This may require some research into use cases and tooling and even a bit of hard thinking, but it’s a straightforward exercise. What isn’t so straightforward is capturing the state of metadata that accompanies these data flows. Understanding metadata is critical to obtaining full value and control from your data architecture, but it doesn’t lend itself as well to a typical data flow view.

Common data concerns that stem from gaps or inconsistencies within the metadata domain include:

  • Lineage – What source data contributed to a given view or data set, and how has that data been altered to get there?
  • Impact Analysis – The converse of lineage; if a data source were to be altered, which consuming applications would be affected?
  • Access & Compliance – Who is able to view or alter data? Who has exercised that ability on a given data set? Have their larger-scale access patterns put them in danger of any compliance issues?
  • Quality – How complete and accurate is a particular view on a data set?
  • Implementation – How long and how much effort is required to add a new data source or consumer?
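
Lineage and impact analysis are two directions of the same question, which a small graph sketch can make concrete. The `LineageGraph` class and the data set names below are hypothetical, and the sketch assumes that data flows have already been recorded as simple source-to-consumer edges.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph of data flows between data sets (illustrative only)."""

    def __init__(self):
        self.downstream = defaultdict(set)  # source -> direct consumers
        self.upstream = defaultdict(set)    # consumer -> direct sources

    def add_flow(self, source, consumer):
        self.downstream[source].add(consumer)
        self.upstream[consumer].add(source)

    def _reach(self, start, edges):
        # Depth-first walk collecting every node reachable from start.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def lineage(self, data_set):
        """Everything that contributed to data_set."""
        return self._reach(data_set, self.upstream)

    def impact(self, data_set):
        """Everything affected if data_set were altered."""
        return self._reach(data_set, self.downstream)


graph = LineageGraph()
graph.add_flow("crm", "warehouse")
graph.add_flow("erp", "warehouse")
graph.add_flow("warehouse", "sales_mart")
graph.add_flow("sales_mart", "dashboard")

print(sorted(graph.lineage("dashboard")))
print(sorted(graph.impact("crm")))
```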

In order to structure and assess these concerns, a metadata lifecycle model (MDLM) can be a useful architectural tool. The MDLM can be used as a framework to evaluate candidate architectures.

The Metadata Lifecycle Model

There are three user-driven events that segment different stages in the metadata lifecycle. Addressing all lifecycle stages for each data concern is a best practice; stages that are skipped or are inconsistent are primary contributors to the concerns outlined above.

The identified lifecycle stages are gated in time by the occurrence of these events:

  • Data Source Import – Incorporating a new data source involves not only ingesting the raw data, but also understanding the nature of that data. Note that this event is defined here as making the data available for use by applications; simple ingest-and-store (as per a Data Lake architecture) does not meet the definition.
    • Generate – Metadata describing the data source must be generated (this may be an automated or manual process depending upon the nature of the metadata and the available tooling). Data import is an ideal time to capture metadata, as research and documentation are more likely to be available then; there may also be additional time to invest, as there may not yet be immediate demand for that data source.
    • Persist – This generated metadata must be captured into a documentation system that will support later discovery and usage of this metadata.
  • Consumer Design – When a consuming view or application is designed, it operates in a specific metadata context. While some types of metadata may be defined by a public or private standard, this is often not the case, or the standard may not be strict enough to pin down every aspect of the metadata (the non-standard standard is standard).
    • Generate – The assumptions within the metadata domain are declared for the consumer. This ensures that the consumer’s operating environment is well-understood by the designer, and provides an opportunity below to address differences in metadata from the source(s).
    • Note that within the framework presented here, this is a declarative step; differences from the source metadata are addressed below in the Discover and Mediate stages.
    • Persist – The consumer may itself become the source to another “downstream” consumer in the future, and therefore its metadata should be similarly persisted.
    • Discover – As sources are identified to feed into the consumer, their metadata documentation must be discovered within the persistence tool.
    • Mediate – With source and consumer metadata in hand, differences must be identified and a reconciliation strategy defined. Specific actions will depend on the type of metadata, but examples include ETL design or access log configuration.
    • A well-designed metadata management solution will not just persist and index metadata, but will also help catalog a library of mediation strategies, allowing previous work to be reused.
  • Consumer Runtime – When the consumer is run over actual data by an end user, the strategies from the metadata domain described above must be applied efficiently to the runtime data domain.
    • Apply – Mediation strategies are applied to the data stream/query. Examples include execution of an ETL job and appending to a lineage chart.
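
The stages above can be walked through with a toy in-memory catalog. This is a sketch under assumptions rather than a description of any real metadata product: `MetadataCatalog`, `build_unit_conversion`, the data set names, and the single "unit" attribute are all hypothetical stand-ins.

```python
class MetadataCatalog:
    """In-memory stand-in for a metadata documentation system (hypothetical)."""

    def __init__(self):
        self._metadata = {}    # data set name -> metadata dict
        self._mediations = {}  # (source, consumer) -> reusable strategy

    def persist(self, name, metadata):
        # Persist: record generated metadata for a source or a consumer.
        self._metadata[name] = dict(metadata)

    def discover(self, name):
        # Discover: look up previously persisted metadata.
        return self._metadata[name]

    def mediate(self, source, consumer, build_strategy):
        # Mediate: reuse a cataloged strategy, or build one and catalog it.
        key = (source, consumer)
        if key not in self._mediations:
            self._mediations[key] = build_strategy(
                self.discover(source), self.discover(consumer)
            )
        return self._mediations[key]


def build_unit_conversion(source_md, consumer_md):
    # Hypothetical strategy: reconcile a difference in units of measure.
    factors = {("kg", "g"): 1000, ("g", "g"): 1, ("kg", "kg"): 1}
    factor = factors[(source_md["unit"], consumer_md["unit"])]
    return lambda value: value * factor


catalog = MetadataCatalog()
catalog.persist("shipping_db", {"unit": "kg"})      # Data Source Import
catalog.persist("weight_report", {"unit": "g"})     # Consumer Design
convert = catalog.mediate("shipping_db", "weight_report", build_unit_conversion)
converted = [convert(v) for v in [2, 5]]            # Consumer Runtime: Apply
print(converted)
```

Calling `mediate` a second time for the same pair returns the cataloged strategy instead of rebuilding it, which is the reuse the list describes.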

Types of Metadata

The MDLM applies to metadata in general, but it is most useful to identify specific types of metadata that pertain to particular goals, and examine individual treatment of these types throughout the lifecycle. The following metadata types serve to address the concerns outlined earlier in this document:

  • Context – Schema, format, and semantics
  • Lineage – Evolution of data: starting with the System of Record (SOR), identify all “touches” in the logic chain applied to the data
  • Security – Specific access rules for given data. This may additionally include a record of previous accesses (i.e., an audit log).

Note that this is not a comprehensive list, and that the MDLM can be generally applied to other types of metadata.

To conclude, we will apply the MDLM to these metadata types. Consider a data warehousing architecture as a motivating backdrop.

Context

  • Generate (Source) – Identify schema and column formatting; correlate specific entities and attributes with Master Data definitions and business glossary terms.
  • Persist (Source) – Store this information in a metadata documentation system.
  • Generate (Consumer) – Identify specific semantic concepts required for the consumer, and select a target schema/format in which to work with those concepts. This may be identification or creation of an operational standard.
  • Persist (Consumer) – Store this information in a metadata documentation system.
  • Discover – Select data sources, and locate context documentation.
  • Mediate – Locate or design ETL (ELT, etc.) logic to transform data between source(s) and consumer.
  • Apply – Execute the ETL job.
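
For context metadata, the Mediate and Apply steps can be sketched as deriving a column mapping from shared business glossary terms and then running it over rows. The column names, glossary terms, and both function names are hypothetical; a real ETL design would also handle type and format conversion, which is omitted here.

```python
def design_mapping(source_context, consumer_context):
    """Mediate: pair source columns with consumer columns via shared glossary terms."""
    term_to_source = {term: col for col, term in source_context.items()}
    missing = [t for t in consumer_context.values() if t not in term_to_source]
    if missing:
        raise ValueError(f"no source column for glossary terms: {missing}")
    # Result maps each needed source column to its consumer column name.
    return {term_to_source[term]: col for col, term in consumer_context.items()}


def apply_mapping(rows, mapping):
    """Apply: rename and project each source row into the consumer schema."""
    return [{target: row[src] for src, target in mapping.items()} for row in rows]


# Persisted context metadata (hypothetical): column name -> glossary term.
source_context = {
    "cust_nm": "Customer Name",
    "ord_amt": "Order Amount",
    "src_sys": "Source System",
}
consumer_context = {"customer": "Customer Name", "amount": "Order Amount"}

mapping = design_mapping(source_context, consumer_context)
rows = [{"cust_nm": "Acme", "ord_amt": 120, "src_sys": "erp"}]
result = apply_mapping(rows, mapping)
print(result)
```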

Lineage

  • Generate (Source) – Document the SOR. If this is a relative SOR that sources data “upstream”, an interface may be available to provide richer lineage.
  • Persist (Source) – If this is a new SOR, add it to a central directory of data sources and processes.
  • Generate (Consumer) – Specify a designation for the new consumer.
  • Persist (Consumer) – Add this designation to the central directory of data sources and processes.
  • Discover – Locate data source SOR designations.
  • Mediate – Design an append operation to the lineage chart.
  • Apply – Execute the append operation to the lineage chart.

Security

  • Generate (Source) – Define access rules for the data source.
  • Persist (Source) – Store this information in a metadata store or security enforcement system.
  • Generate (Consumer) – Identify required access to data sources; specify downstream access restrictions.
  • Persist (Consumer) – Store this information in a metadata store or security enforcement system.
  • Discover – Identify available access to the required data sources.
  • Mediate – Select or create a security principal that supports all required access on the data source.
  • Apply – Utilize this principal against the data source.
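
The Mediate step for security can be sketched as a search over known principals. Preferring the principal with the fewest grants is an assumption added here (a least-privilege choice), not something the model prescribes, and the principal names, grants, and function name are hypothetical.

```python
def select_principal(principals, required):
    """Mediate: pick a principal whose grants cover all required access.

    Assumption: among covering principals, prefer the one with the fewest
    grants; ties are broken by name so the choice is deterministic.
    """
    candidates = [
        (name, grants) for name, grants in principals.items() if required <= grants
    ]
    if not candidates:
        return None  # no existing principal fits; one would have to be created
    return min(candidates, key=lambda c: (len(c[1]), c[0]))[0]


# Persisted security metadata (hypothetical): principal -> set of (data source, access).
principals = {
    "etl_admin": {("orders", "read"), ("orders", "write"), ("customers", "read")},
    "report_reader": {("orders", "read"), ("customers", "read")},
    "orders_only": {("orders", "read")},
}
required = {("orders", "read"), ("customers", "read")}

chosen = select_principal(principals, required)
print(chosen)
```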

In conclusion, the MDLM can provide a useful framework for assessing metadata gaps in a data architecture. If these steps are well covered, then the risk of a flawed architecture is greatly reduced, and the overall effort can proceed to address specific tooling solutions.
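
As a rough illustration of that assessment, the lifecycle stages can be treated as a checklist per metadata type. The coverage values below describe an imaginary candidate architecture and the `find_gaps` helper is hypothetical; only the stage and metadata type names come from the model above.

```python
STAGES = [
    "Generate (Source)", "Persist (Source)",
    "Generate (Consumer)", "Persist (Consumer)",
    "Discover", "Mediate", "Apply",
]


def find_gaps(coverage):
    """Return, per metadata type, the lifecycle stages left unaddressed."""
    gaps = {}
    for metadata_type, covered in coverage.items():
        missing = [stage for stage in STAGES if stage not in covered]
        if missing:
            gaps[metadata_type] = missing
    return gaps


# Hypothetical assessment of one candidate architecture.
coverage = {
    "Context": set(STAGES),
    "Lineage": {"Generate (Source)", "Persist (Source)", "Discover"},
    "Security": set(STAGES) - {"Mediate"},
}

gaps = find_gaps(coverage)
for metadata_type, missing in gaps.items():
    print(f"{metadata_type}: missing {', '.join(missing)}")
```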

There is a nuance about Big Data analysis. It’s really about small data. While this may seem confusing and counter to the whole Big Data “movement”, small data is the product of Big Data analysis. This is not a new concept, nor is it unfamiliar to people who have been doing data analysis for any length of time. The overall working space is larger, but the answers lie somewhere in the “small”.

In the old days of traditional data analysis, we began with databases filled with customer information, product information, transactions, telemetry data, etc. Even then, there was too much data available to efficiently analyze. Systems, networks, and software didn’t have the performance or capacity to address the scale. As an industry we addressed the shortcomings by creating smaller data sets.

These smaller data sets were still fairly substantial, and we quickly discovered other shortcomings, the most glaring being the mismatch between the data and the working context. If I worked in accounts payable, I had to look at a large amount of unrelated data in order to do my job. Again the industry responded by creating smaller, contextually relevant data sets. Big to small to smaller still.

You may recognize this as the migration from production databases to Data Warehouses to Data Marts. More often than not, the data for the warehouses and the marts was chosen on arbitrary or experimental parameters, resulting in a great deal of trial and error. All too often, the data was chosen to support an output or a conclusion we wanted to see, as opposed to discovering something new, interesting, or anomalous. We weren’t getting the perspectives we needed, or that were possible, because the capacity reductions weren’t based on computational fact.

Enter Big Data with all its volumes, velocities, and varieties, and the problem remains or perhaps worsens. We have addressed the shortcomings of the infrastructure and can store and process huge amounts of additional data, but we also had to introduce new technologies specifically to help us manage Big Data. If we think this is challenging now, just wait a year or two. The emergence and inevitability of ubiquitous machine data is just around the corner. Don’t be scared, be prepared!

Despite outward appearances, this is a wonderful thing. Today and in the future we will have more data than we can imagine, and we’ll have the means to capture and manage it. What is more necessary than ever is the ability to analyze the right data in a timely enough fashion to make decisions and take actions. We will still shrink the data sets into “fighting trim”, but we can do so computationally. We process the Big Data and turn it into small data so it’s easier to comprehend. It’s more precise and, because it was derived from a much larger starting point, more contextually relevant.