The Metadata Lifecycle
When designing an enterprise architecture for business intelligence, advanced analytics, and other data-centric applications, it is often useful to capture major data flows. This may require some research into use cases and tooling and even a bit of hard thinking, but it’s a straightforward exercise. What isn’t so straightforward is capturing the state of the metadata that accompanies these data flows. Understanding metadata is critical to obtaining full value and control from your data architecture, but it doesn’t lend itself as well to a typical data flow view.
Common data concerns that stem from gaps or inconsistencies within the metadata domain include:
- Lineage – What source data contributed to a given view or data set, and how has that data been altered to get there?
- Impact Analysis – The converse of lineage; if a data source were to be altered, which consuming applications would be affected?
- Access & Compliance – Who is able to view or alter data? Who has exercised that ability on a given data set? Have their larger-scale access patterns put them in danger of any compliance issues?
- Quality – How complete and accurate is a particular view on a data set?
- Implementation – How long and how much effort is required to add a new data source or consumer?
In order to structure and assess these concerns, a metadata lifecycle model (MDLM) can be a useful architectural tool. The MDLM can be used as a framework to evaluate candidate architectures.
The Metadata Lifecycle Model
There are three user-driven events that separate the stages of the metadata lifecycle. Addressing all lifecycle stages for each data concern is a best practice; stages that are skipped or inconsistent are primary contributors to the concerns outlined above.
The identified lifecycle stages are gated in time by the occurrence of these events:
- Data Source Import – Incorporating a new data source involves not only ingesting the raw data, but also understanding the nature of that data. Note that this event is defined here as making the data available for use by applications; simple ingest-and-store (as per a Data Lake architecture) does not meet the definition.
  - Generate – Metadata describing the data source must be generated (this may be an automated or manual process, depending upon the nature of the metadata and the available tooling). Data import is an ideal time to capture metadata, as research and documentation are more likely to be available at this time; further, there may be additional time to invest, as there may not yet be immediate demand for that data source.
  - Persist – This generated metadata must be captured in a documentation system that will support later discovery and usage of this metadata.
- Consumer Design – When a consuming view or application is designed, it operates in a specific metadata context. While some types of metadata may be defined by a public or private standard, this is often not the case (or the standard may not be strict enough to define every aspect of the metadata: the non-standard standard is standard).
  - Generate – The assumptions within the metadata domain are declared for the consumer. This ensures that the consumer’s operating environment is well understood by the designer, and provides an opportunity, in the stages below, to address differences in metadata from the source(s).
    - Note that within the framework presented here, this is a declarative step; differences from source metadata are addressed below in the Discover and Mediate stages.
  - Persist – The consumer may itself become the source for another “downstream” consumer in the future, and therefore its metadata should be similarly persisted.
  - Discover – As sources are identified to feed into the consumer, their metadata documentation must be discovered within the persistence tool.
  - Mediate – With source and consumer metadata in hand, differences must be identified and a reconciliation strategy defined. Specific actions will depend on the type of metadata, but examples include ETL design or access log configuration.
    - A well-designed metadata management solution will not just persist and index metadata, but will also help catalog a library of mediation strategies, allowing previous work to be reused.
- Consumer Runtime – When the consumer is run over actual data by an end user, the strategies from the metadata domain described above must be applied efficiently to the runtime data domain.
  - Apply – Mediation strategies are applied to the data stream/query. Examples include executing an ETL job and appending to a lineage chart.
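The events and stages above can be sketched as a small data structure, which makes it easy to check a candidate architecture for skipped stages. This is only an illustrative sketch: the names `Event`, `Stage`, `LIFECYCLE`, and `missing_stages` are hypothetical and not part of any tool described here.

```python
from enum import Enum


class Event(Enum):
    DATA_SOURCE_IMPORT = "Data Source Import"
    CONSUMER_DESIGN = "Consumer Design"
    CONSUMER_RUNTIME = "Consumer Runtime"


class Stage(Enum):
    GENERATE = "Generate"
    PERSIST = "Persist"
    DISCOVER = "Discover"
    MEDIATE = "Mediate"
    APPLY = "Apply"


# Stages gated by each event, in the order they are listed in the model.
LIFECYCLE = {
    Event.DATA_SOURCE_IMPORT: [Stage.GENERATE, Stage.PERSIST],
    Event.CONSUMER_DESIGN: [
        Stage.GENERATE,
        Stage.PERSIST,
        Stage.DISCOVER,
        Stage.MEDIATE,
    ],
    Event.CONSUMER_RUNTIME: [Stage.APPLY],
}


def missing_stages(covered):
    """Return the (event, stage) pairs that a candidate architecture skips.

    `covered` is a set of (Event, Stage) pairs the architecture addresses.
    """
    return [
        (event, stage)
        for event, stages in LIFECYCLE.items()
        for stage in stages
        if (event, stage) not in covered
    ]


# Example: an architecture that persists source metadata but never mediates.
covered = {
    (Event.DATA_SOURCE_IMPORT, Stage.GENERATE),
    (Event.DATA_SOURCE_IMPORT, Stage.PERSIST),
    (Event.CONSUMER_RUNTIME, Stage.APPLY),
}
gaps = missing_stages(covered)
```

Running the same check once per data concern (lineage, access, quality, and so on) gives a simple coverage matrix for comparing architectures.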
Types of Metadata
The MDLM applies to metadata in general, but it is most useful to identify specific types of metadata that pertain to particular goals, and examine individual treatment of these types throughout the lifecycle. The following metadata types serve to address the concerns outlined earlier in this document:
- Context – Schema, format, and semantics
- Lineage – Evolution of data: starting with the System of Record (SOR), identify all “touches” in the logic chain applied to the data
- Security – Specific access rules for given data. This may additionally include a record of previous accesses (i.e. an audit log).
Note that this is not a comprehensive list, and that the MDLM can be generally applied to other types of metadata.
To conclude, we will apply the MDLM to these metadata types. Consider a data warehousing architecture as a motivating backdrop.
Context
- Generate (Source) – Identify schema and column formatting; correlate specific entities and attributes with Master Data definitions and business glossary terms.
- Persist (Source) – Store this information in a metadata documentation system.
- Generate (Consumer) – Identify specific semantic concepts required for the consumer, and select a target schema/format in which to work with those concepts. This may involve identifying or creating an operational standard.
- Persist (Consumer) – Store this information in a metadata documentation system.
- Discover – Select data sources, and locate context documentation.
- Mediate – Locate or design ETL (ELT, etc.) logic to transform data between source(s) and consumer.
- Apply – Execute the ETL job.
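As a rough illustration of the Mediate and Apply stages for context metadata, the sketch below matches source and consumer columns through a shared business glossary term and then renames columns at runtime. The `Column` type, the column names, and the glossary terms are all hypothetical; a real ETL design would also handle type and format conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str           # physical column name
    glossary_term: str  # business glossary term the column is correlated with


def mediate(source_cols, consumer_cols):
    """Mediate: map source column names to consumer column names by term."""
    by_term = {c.glossary_term: c.name for c in source_cols}
    mapping = {}
    for col in consumer_cols:
        if col.glossary_term not in by_term:
            raise KeyError(f"no source column for term {col.glossary_term!r}")
        mapping[by_term[col.glossary_term]] = col.name
    return mapping


def apply_mapping(mapping, rows):
    """Apply: keep only mapped columns and rename them for the consumer."""
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]


# Persisted context for a source and a consumer (hypothetical values).
source = [
    Column("cust_id", "Customer Identifier"),
    Column("dob", "Date of Birth"),
]
consumer = [Column("customer_id", "Customer Identifier")]

mapping = mediate(source, consumer)
result = apply_mapping(mapping, [{"cust_id": 7, "dob": "1990-01-01"}])
```

Because `mediate` only reads persisted metadata, it can run at design time, while `apply_mapping` is the only part that touches runtime data.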
Lineage
- Generate (Source) – Document the SOR. If this is a relative SOR that sources data “upstream”, an interface may be available to provide richer lineage.
- Persist (Source) – If this is a new SOR, add it to a central directory of data sources and processes.
- Generate (Consumer) – Specify a designation for the new consumer.
- Persist (Consumer) – Add this designation to the central directory of data sources and processes.
- Discover – Locate data source SOR designations.
- Mediate – Design an append operation to the lineage chart.
- Apply – Execute the append operation to the lineage chart.
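The lineage steps above can be sketched with a central directory plus a chart that records, for each designation, its direct upstream sources. This is a minimal in-memory sketch under assumed names (`LineageChart`, `crm_sor`, and so on are hypothetical); it also shows how impact analysis falls out as the converse traversal.

```python
class LineageChart:
    """Central directory of sources/processes plus their upstream links."""

    def __init__(self):
        self.directory = set()  # persisted designations
        self.parents = {}       # designation -> set of direct upstream sources

    def register(self, designation):
        """Persist: add a source or consumer designation to the directory."""
        self.directory.add(designation)
        self.parents.setdefault(designation, set())

    def append(self, consumer, source):
        """Apply: record that `consumer` reads from `source`."""
        for name in (consumer, source):
            if name not in self.directory:
                raise KeyError(f"{name!r} is not in the central directory")
        self.parents[consumer].add(source)

    def upstream(self, designation):
        """Lineage: everything that contributes to `designation`."""
        seen, stack = set(), [designation]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def downstream(self, designation):
        """Impact analysis: everything that depends on `designation`."""
        return {n for n in self.directory if designation in self.upstream(n)}


chart = LineageChart()
for name in ("crm_sor", "warehouse.customers", "sales_dashboard"):
    chart.register(name)
chart.append("warehouse.customers", "crm_sor")
chart.append("sales_dashboard", "warehouse.customers")

lineage = chart.upstream("sales_dashboard")
impact = chart.downstream("crm_sor")
```

Requiring both ends of a link to be registered first mirrors the Discover stage: a consumer cannot append lineage for a source whose designation was never persisted.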
Security
- Generate (Source) – Define access rules for the data source.
- Persist (Source) – Store this information in a metadata store or security enforcement system.
- Generate (Consumer) – Identify required access to data sources; specify downstream access restrictions.
- Persist (Consumer) – Store this information in a metadata store or security enforcement system.
- Discover – Identify available access to the required data sources.
- Mediate – Select or create a security principal that supports all required access on the data source.
- Apply – Use this principal against the data source.
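For the security case, the Mediate stage amounts to finding a principal whose grants cover everything the consumer requires. The sketch below assumes access rules are persisted as a mapping from principal to a set of `(source, action)` grants; that representation, the principal names, and the least-privilege tie-break are assumptions made for illustration.

```python
def select_principal(access_rules, required):
    """Mediate: pick a principal whose grants cover all required access.

    access_rules: dict mapping principal name -> set of (source, action).
    required: set of (source, action) pairs the consumer needs.
    Returns None if no existing principal suffices (one must be created).
    """
    candidates = [
        principal
        for principal, grants in access_rules.items()
        if required <= grants
    ]
    if not candidates:
        return None
    # Assumption: prefer the principal with the fewest grants (least
    # privilege), breaking ties by name for a deterministic result.
    return min(candidates, key=lambda p: (len(access_rules[p]), p))


# Discovered access rules (hypothetical principals and grants).
rules = {
    "etl_service": {("crm", "read"), ("warehouse", "write")},
    "admin": {
        ("crm", "read"),
        ("crm", "write"),
        ("warehouse", "read"),
        ("warehouse", "write"),
    },
    "analyst": {("warehouse", "read")},
}

required = {("crm", "read"), ("warehouse", "write")}
chosen = select_principal(rules, required)
```

A `None` result corresponds to the "create a security principal" branch of the Mediate step.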
In conclusion, the MDLM can provide a useful framework for assessing metadata gaps in a data architecture. If these steps are well covered, the risk of a flawed architecture is greatly reduced, and the overall effort can proceed to address specific tooling solutions.