Schema-on-what?
Recently, a customer asked us to help transition a set of data flows from an overwhelmed RDBMS to a “Big Data” system. These data flows were batch-oriented, and there was some comfort with Pig Latin in-house, so Hadoop made an ideal target platform for the production data flows (with architectural flexibility for Spark and other technologies for new functionality, but I digress).
One wrinkle relative to a vanilla Hadoop deployment: they wanted schema enforcement soup-to-nuts. A first instinct might be that this is simply a logical data warehouse – and perhaps it is. One hears so much these days about Hadoop and Data Lakes and Schema-On-Read as the new shiny that it is easy to forget that Schema-On-Write also has a time and a place; as with most architectural decisions, there are tradeoffs – right (bad pun… intended?) times for each.
Schema-On-Read works well when:
- Different valid views can be projected onto a given data set, as in the sketch after this list. The data set may not be well understood, or it may apply across a number of varied use cases.
- Flexibility outweighs performance.
- The variety “V” is a dominant characteristic. Not all data will fit neatly into a given schema, and not all of it will actually be used; save the modeling effort until the data is known to be useful.
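
To make that concrete, here is a minimal Pig Latin sketch of schema-on-read (the path and field names are hypothetical): the same raw, tab-delimited file is projected two different ways at LOAD time, and the schema lives in the script rather than in the stored data.

```pig
-- Schema-on-read: the schema lives in the script, not in the stored file.
-- The path and field names below are illustrative.

-- View 1: a clickstream-style projection of the raw file.
clicks = LOAD '/data/raw/events' USING PigStorage('\t')
         AS (event_time:chararray, user_id:long, url:chararray);

-- View 2: the same bytes, read with a narrower schema for an audit use case.
audit = LOAD '/data/raw/events' USING PigStorage('\t')
        AS (event_time:chararray, user_id:long);

-- Downstream logic works against whichever projection suits the use case.
by_user = GROUP clicks BY user_id;
counts  = FOREACH by_user GENERATE group AS user_id, COUNT(clicks) AS n_events;
DUMP counts;
```
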
Schema-On-Write may be a better choice when:
- Productionizing established flows using well-understood data.
- Working with data that is more time-sensitive at use than it is at ingest. Fast interactive queries fall into this category, and traditional data warehousing reports do as well.
- Data quality is critical – schema enforcement and other validation prevent “bad” data from being written (sketched below), removing this burden from the data consumer.
- Governance constraints require metadata to be tightly controlled.
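
On the schema-on-write side, a common pattern on this stack is to route writes through an HCatalog-managed table so the Hive metastore owns the schema. The sketch below assumes such a table already exists (the table name `warehouse.events`, the path, and the fields are hypothetical) and that Pig is launched with HCatalog support (e.g. `pig -useHCatalog`); HCatStorer rejects a STORE whose schema does not match the table definition, which is the write-time enforcement described above.

```pig
-- Schema-on-write: the Hive metastore table owns the schema, and the write
-- is validated against it. Table, path, and field names are illustrative.
-- Run with HCatalog support, e.g.: pig -useHCatalog

raw = LOAD '/incoming/events' USING PigStorage('\t')
      AS (event_time:chararray, user_id:long, url:chararray);

-- Light validation before the write: drop rows missing required fields.
clean = FILTER raw BY event_time IS NOT NULL AND user_id IS NOT NULL;

-- HCatStorer checks this relation's schema against the table's definition,
-- so incompatible data is rejected at write time rather than discovered
-- later by each consumer.
STORE clean INTO 'warehouse.events'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
```
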