Recently, a customer asked us to help transition a set of data flows from an overwhelmed RDBMS to a “Big Data” system. These data flows had a batch dynamic, and there was some comfort with Pig Latin in-house, so this made for an ideal target platform for the production data flows (with architectural flexibility for Spark and other technologies for new functionality, but here I digress).

One wrinkle from a vanilla Hadoop deployment: they wanted schema enforcement soup-to-nuts. A first instinct might be that this is simply a logical data warehouse – and perhaps it is. So often these days one hears about Hadoop and Data Lakes and Schema-On-Read as the new shiny that it is easy to forget that Schema-On-Write also has a time and a place, and as with most architectural decisions, there are tradeoffs – right (bad pun… intended?) times for each.

Schema-On-Read works well when:

  • Different valid views can be projected on a given data set. This data set may be not well-understood, or applicable across a number of varied use cases.
  • Flexibility outweighs performance.
  • The variety “V” is a dominant characteristic. Not all data will fit neatly into a given schema, and not all will actually be used; save the effort until it is known to be useful.

Schema-On-Write may be a better choice when:

  • Productionizing established flows using well-understood data.
  • Working with data that is more time-sensitive at use than it is at ingest. Fast interactive queries fall into this category, and traditional data warehousing reports do as well.
  • Data quality is critical – schema enforcement and other validation prevents “bad” data from being written, removing this burden from the data consumer.
  • Governance constraints require metadata to be tightly controlled.

It’s a very exciting time to be in the data world, with new and groundbreaking technologies released seemingly every day. There is every temptation to pick up today’s new shiny, find an excuse to throw it into production, and call it an architecture. Of course, a more deliberate approach is required for long-term success – but that doesn’t mean that there isn’t a time and place to incorporate the newest technologies!

In this post, we take a look at the different phases of data architecture development: Plan, PoC, Prototype, Pilot, and Production. Formalizing this lifecycle, and the principles behind it, ensure that we deliver low-risk business value… and still get to play with the new shiny.

Phases of data architecture development

Plan

Before a single line of code is written, a single distribution downloaded, or the first line or box drawn on a whiteboard, we need to define and understand a data strategy and use that to derive business objectives. The best way to accomplish this? Start by locking business and technical stakeholders together in a room (it helps to be in the room with them). Success is defined by business value, and we need to combine strategic and tactical business goals with real-world technical and organizational constraints. Considerations such as platform scalability, data governance, and data dynamics are important – but all are in support of the actual business uses for that data.

This is not limited to new “green field” architectures – unless a business is a brand new startup still in the garage, there is data and there is a (perhaps organic, default) data architecture. This architecture can be assessed for points of friction, and then adjusted per business objectives.

PoC

As the business objectives are solidified, the architect will assemble likely combinations of technologies both well-known and, yes, shiny. All such candidate architectures have tradeoffs and unknowns – while the core technologies may be well-understood, it’s a given that the exact application of those technologies to specific, unique business objectives are, well, unique. Don’t believe anybody who says they have a one-size-fits-all solution! While some layers of data architecture are becoming common, if not standard, in 2016 modern data architecture is still very much about gluing together disparate components in specific ways.

To this end, certain riskier possibilities will be identified to apply approaches and technologies to a given business objective. Often a proof of concept (PoC) will be developed to validate the feasibility of these possibilities. This phase should be considered experimental, will often utilize representative “toy” problems, and failure is considered a useful (and not uncommon) outcome. It goes without saying that a PoC is not intended to be a production-quality system.

Prototype

Once areas of technical risk have been addressed with appropriate PoCs and an overall candidate architecture selected, the overall architecture should be tested against more representative use cases. Given the “glue” nature of data architectures, there is plenty of room for the unknown in the overall system even when the individual components are well-understood. A prototype may use manufactured, manageable data sets, but the data and the system should reflect realistic end-to-end business objectives. The prototype is also not intended to be production quality.

Pilot

When a prototype has demonstrated systemic feasibility, it is time to implement a pilot. A pilot is a full-quality production implementation of the architecture, limited in scope to a narrow (but complete) business objective. The Pilot should strategically be a high-win project, that is capable of providing real and visible value, even as a standalone system. Most organizations will use the pilot as means to earn buy-in from all stakeholders to move into full production, which typically impacts the entire organization.

Production

After an architecture has gone into full production, it should continue to be monitored and re-evaluated in an iterative process. Where is the architecture really performing well, and where are the weaker points? What new business objectives arise, and is any new functionality required to support them? Have any new technologies been released that may have impact on “weaker” points of the architecture? What’s different about the business today than when the architecture was originally planned?

 

The Big Data landscape is largely dominated by powerful free open source technologies. Different configurations and applications of these technologies seemingly consume the majority of mindshare, and it can be easy to lose sight of commercial offerings that can provide relevant business value.

Some of the areas in which commercial vendors offer particular value:

  • Managed Architectures – Assembling, operating, and maintaining an architecture from stock project distributions can be daunting. It can make sense for many organizations that need fine-grained control and have a sufficient investment in the appropriate technical staff, however many in this situation still find stack distributions to be a source of lift for both architecture build-out and ongoing maintenance and version compatibility. For smaller organizations, managed offerings can provide a great reduction in on-site effort required to create and operate a stack with the tradeoff of a reduced level of granular control – version lag, for example.
  • Speciality Query Engines – Hive, Impala, and other open source technologies represent a powerful query capability across both structured and semi-structured data sources. Technologies such as Apache Drill extend this reach even farther, and for focused needs such as search, Apache Solr is a powerful solution. However, the commercial space abounds with query engines that are focused on niche/industry-specific use cases and libraries, and with those that seek to provide even greater performance in specific situations.
  • Integrations – Vendors are often able to provide a library of custom data ingest and egress connectors to specific products, technologies, and data sets. Common examples include ERP systems, embedded device platforms, and proprietary niche data systems.
  • Collaboration and Process – Enterprise-level programs involve many contributors in many roles working on parallel projects over years on the calendar. Just as the software engineering trade has developed processes (Agile) and technologies (source control, defect tracking, project planning), data science and analytics and (meta)data management programs will suffer if they lack similar maturity of process. There is a rich vendor space in this area, and many “enterprise editions” of open source packages attack this need as well.
  • Metadata Management – While technologies such as Hive/HCatalog and Avro/Parquet are widely used to manage schema metadata, metadata management is not comprehensively addressed purely within the open source community. Vendors have value to add in areas such as security management, governance, provenance, discovery, and lineage.
  • Wrangling and Conditioning – Data comes dirty, it comes fast or it comes with different schema, format, and semantics, and the lines separating different irregularities are very blurry. This record is different from the last one: is it dirty? Did the model change? Is this even part of the same logical data set? Open source ETL GUI tools and dataflow languages provide a tried-and-true means for creating fairly static logic pipelines, but commercial solutions can bring innovative approaches for applying machine learning and crowdsourcing techniques to data validation and preprocessing use cases.
  • Reporting and Visualization – Last but not least, the commercial marketplace is rich with products that will consume data, big and small, and help make the information contained within that data accessible, bridging the gap from technology back to insights providing business value.

So when planning your next data architecture, consider whether your business constraints compel you to plan for a purely FOSS solution, or whether commercial technologies may have a place within your architecture. Open source software is a foundational core, but in these and other areas, commercial technologies also have a lot to offer.

A recent WSJ article echoes an FTC report released last Wednesday warning of the possible consequences of bias in Big Data applications. The article identifies a number of valid concerns around privacy, equal opportunity, and accuracy. It also rightly hints at possible positive consequences as well.

For example, they quote cases where people judged poor credit risks by conventional means may receive loans as a result of big-data techniques. Good news for those people, and time will tell whether the lenders identified an underserved viable market, or whether bias simply caused them to make a poor investment. All models, including traditional analyses, will have error and ultimately we want to reduce both false positives and false negatives.

So we know that our models will have error, and the theme of this article is that a significant part of that error comes in the form of bias. It is a poor assumption that an analysis of any single data set – social media is a popular case – represents the whole population. Do people of all ages, nationalities, races, income levels, use social media in the same proportion as the general population? Probably not.

So how can we make this work to advantage?

  1. Consider bias as a first-cut classification. A common application of big data techniques is to classify large numbers of people into specific, targeted subgroups. We get our first course-grained categorization for free.
  2. Use the bias to select additional complementary data sets. If you understand the bias in your current data set, then you can strategically select additional data sets that give the best bang-for-the-buck in an effort to broadly analyze the general population. Calibrate your aggregate model by combining complementary data sets.
  3. Monitor production models. As the article observes, blind trust in correlations can be dangerous. Still, correlations can represent opportunities to be exploited. The key to safe utilization without a solid understanding of root cause is to assume those opportunities are temporary. Monitor their performance, and blow whistles as soon as the results begin to deviate from expectations.

George Box had it right: all models are wrong, but some are useful!
Hypothetical Example of complementary data sets
Example of complementary data sets (hypothetical, and for illustration only!).

When designing an enterprise architecture for business intelligence, advanced analytics, and other data­centric applications, it is often useful to capture major data flows. This may require some research into use cases and tooling and even a bit of hard thinking, but it’s a straightforward exercise. What isn’t so straightforward is capturing the state of metadata that accompanies these data flows. Understanding metadata is critical to obtaining full value and control from your data architecture, but it doesn’t lend itself as well to a typical data flow view.

Common data concerns that stem from gaps or inconsistencies within the metadata domain include:

  • Lineage – What source data contributed to a given view or data set, and how has that data been altered to get there?
  • Impact Analysis – The converse of lineage; if a data source were to be altered, which consuming applications would be affected?
  • Access & Compliance – Who is able to view or alter data? Who has exercised that ability on a given data set? Have their larger­scale access patterns put them in danger of any compliance issues?
  • Quality – How complete and accurate is a particular view on a data set?
  • Implementation – How long and how much effort is required to add a new data source or consumer?

In order to structure and assess these concerns, a metadata lifecycle model (MDLM) can be a useful architectural tool. The MDLM can be used as a framework to evaluate candidate architectures.

The Metadata Lifecycle Model

There are three user­driven events that segment different stages in the metadata lifecycle. Addressing all lifecycle stages for each data concern is a best practice; stages that are skipped or are inconsistent are primary contributors to the concerns outlined above.

The Metadata Lifecycle Model

The identified lifecycle stages are gated in time by the occurrence of these events:

  • Data Source Import – Incorporating a new data source involves not only ingesting the raw data, but also understanding the nature of that data. Note that this event is defined here as making the data available for use by applications; simple ingest­and­store (as per a Data Lake architecture) does not meet the definition.
    • Generate – Metadata describing the data source must be generated (this may be an automated or manual process depending upon the nature of the metadata and the available tooling). Data import is an ideal time to capture metadata as research and documentation are more likely to be available at this time; further there exists the possibility of additional time to invest, as there may not immediate demand for that data source.
    • Persist – This generated metadata must be captured into a documentation system that will support later discovery and usage of this metadata.
  • Consumer Design – When a consuming view or application is designed, it operates in a specific metadata context. While some types of metadata may be defined by a public or private standard, this is often not the case (or the standard may not be strict enough to definitively define every aspect of the metadata: the non­standard standard is standard.)
    • Generate – The assumptions within the metadata domain are declared for the consumer. This ensures that the consumer’s operating environment is well-understood by the designer, and provides an opportunity below to address differences in metadata from the source(s).
    • Note that within the framework presented here, this a declarative addressed below in the Discover and Mediate stages.
    • Persist ­- The consumer may itself become the source to another “downstream” consumer in the future, and therefore its metadata should be similarly persisted.
    • Discover – ­ As sources are identified to feed into the consumer, their metadata documentation must be discovered within the persistence tool.
    • Mediate – ­ With Source and Consumer metadata in­hand, differences must be identified and a reconciliation strategy defined. Specific actions will depend on the type of metadata, but examples include ETL design or access log configuration.
    • A well­designed metadata management solution will not just persist and index metadata, but will also help catalog a library of mediation strategies, allowing previous work to be reused.
  • Consumer Runtime ­ When the consumer is run over actual data by an end user, the strategies from the metadata domain described above must be applied efficiently to the runtime data domain.
    • Apply – ­ Mediation strategies are applied to the data stream/query. Examples include execution of an ETL job and appending to a lineage chart.

Types of Metadata

The MDLM applies to metadata in general, but it is most useful to identify specific types of metadata that pertain to particular goals, and examine individual treatment of these types throughout the lifecycle. The following metadata types serve to address the concerns outlined earlier in this document:

  • Context – ­ Schema, format, and semantics
  • Lineage – ­ Evolution of data: starting with the System of Record (SOR), identify all “touches” in the logic chain applied to the data
  • Security – ­ Specific access rules for given data. This may additionally include a record of previous accesses (i.e. an audit log).

Note that this not a comprehensive list, and that the MDLM can be generally applied to other types of metadata.

To conclude, we will apply the MDLM to these metadata types. Consider a data warehousing architecture as a motivating backdrop.

Context

  • Generate (Source) – ­ Identify schema and column formatting; correlate specific entities and attributes with Master Data definitions and business glossary terms.
  • Persist (Source) – ­ Store this information in a metadata documentation system.
  • Generate (Consumer) – Identify specific semantic concepts required for the consumer, and select a target schema/format in which to work with those concepts. This may be identification or creation of an operational standard.
  • Persist (Consumer) – ­ Store this information in a metadata documentation system.
  • Discover – Select data sources, and locate context documentation.
  • Mediate – Locate or design ETL (ELT, etc.) logic to transform data between source(s) and consumer.
  • Apply – Execute the ETL job.

Lineage

  • Generate (Source) – Document the SOR. If this is a relative SOR that sources data “upstream”, an interface may be available to provide richer lineage.
  • Persist (Source) – If this is a new SOR, add it to a central directory of data sources and processes.
  • Generate (Consumer) – Specify a designation for the new consumer.
  • Persist (Consumer) – Add this designation to the central directory of data sources and processes.
  • Discover – Locate data source SOR designations.
  • Mediate – Design an append operation to the lineage chart.
  • Apply – Execute the append operation to the lineage chart.

Security

  • Generate (Source) – Define access rules for the data source.
  • Persist (Source) – Store this information in a metadata store or security enforcement system.
  • Generate (Consumer) – Identify required access to data sources; specify downstream access restrictions.
  • Persist (Consumer) – Store this information in a metadata store or security enforcement system.
  • Discover – Identify available access to the required data sources.
  • Mediate – Select or create a security principal that supports all required access on the data source.
  • Apply – Utilize this principle against the data source.

In conclusion, the MDLM can provide a useful framework for assessing metadata gaps in a data architecture. If these steps are well­covered, then risks of a flawed architecture are greatly reduced, and the overall effort can proceed to address specific tooling solutions.

There is a nuance about Big Data analysis. It’s really about small data. While this may seem confusing and counter to the whole Big Data “movement”, small data is the product of Big Data analysis. This is not a new concept, nor is it unfamiliar to people who have been doing data analysis for any length of time. The overall working space is larger, but the answers lie somewhere in the “small”.

In the old days of traditional data analysis, we began with databases filled with customer information, product information, transactions, telemetry data, etc. Even then, there was too much data available to efficiently analyze. Systems, networks, and software didn’t have the performance or capacity to address the scale. As an industry we addressed the shortcomings by creating smaller data sets.

These smaller data sets were still fairly substantive and we quickly discovered other shortcomings, the most glaring was the mismatch between the data and the working context. If I worked in accounts payable, I had to look at a large amount of unrelated data in order to do my job. Again the industry responded by creating smaller, contextually relevant data sets. Big to small to smaller still.

You may recognize this as the migration from production databases to Data Warehouses to Data Marts. More often than not, the data for the warehouses and the marts were chosen on arbitrary or experimental parameters resulting in a great deal of trial and error. All too often, the data was chosen to support an output or a conclusion we wanted to see as opposed to discovering something new, interesting or anomalous. We weren’t getting the perspectives we needed or were possible because the capacity reductions weren’t based on computational fact.

Enter Big Data with all its volumes, velocities, and varieties and the problem remains or perhaps worsens. We have addressed the shortcomings of the infrastructure and can store and process huge amounts of additional data, but we also had to introduce new technologies specifically to help us manage Big Data. If we think this is challenging now, just wait a year or two. The emergence and inevitability of ubiquitous machine data is just around the corner. Don’t be scared, be prepared!

Despite the outward appearances, this is a wonderful thing. Today and in the future we will have more data than we can imagine and we’ll have the means to capture and manage it. What is more necessary than ever, is the ability to analyze the right data in a timely enough fashion to make decisions and take actions. We will still shrink the data sets into “fighting trim”, but we can do so computationally. We process the Big Data and turn it into small data so it’s easier to comprehend. It’s more precise and because it was derived from a much larger starting point, it’s more contextually relevant.