Commercial Components For Big Data Architectures

The Big Data landscape is largely dominated by powerful free open source technologies. Different configurations and applications of these technologies seemingly consume the majority of mindshare, and it can be easy to lose sight of commercial offerings that can provide relevant business value.

Some of the areas in which commercial vendors offer particular value:

  • Managed Architectures – Assembling, operating, and maintaining an architecture from stock project distributions can be daunting. It can make sense for many organizations that need fine-grained control and have a sufficient investment in the appropriate technical staff, however many in this situation still find stack distributions to be a source of lift for both architecture build-out and ongoing maintenance and version compatibility. For smaller organizations, managed offerings can provide a great reduction in on-site effort required to create and operate a stack with the tradeoff of a reduced level of granular control – version lag, for example.
  • Speciality Query Engines – Hive, Impala, and other open source technologies represent a powerful query capability across both structured and semi-structured data sources. Technologies such as Apache Drill extend this reach even farther, and for focused needs such as search, Apache Solr is a powerful solution. However, the commercial space abounds with query engines that are focused on niche/industry-specific use cases and libraries, and with those that seek to provide even greater performance in specific situations.
  • Integrations – Vendors are often able to provide a library of custom data ingest and egress connectors to specific products, technologies, and data sets. Common examples include ERP systems, embedded device platforms, and proprietary niche data systems.
  • Collaboration and Process – Enterprise-level programs involve many contributors in many roles working on parallel projects over years on the calendar. Just as the software engineering trade has developed processes (Agile) and technologies (source control, defect tracking, project planning), data science and analytics and (meta)data management programs will suffer if they lack similar maturity of process. There is a rich vendor space in this area, and many “enterprise editions” of open source packages attack this need as well.
  • Metadata Management – While technologies such as Hive/HCatalog and Avro/Parquet are widely used to manage schema metadata, metadata management is not comprehensively addressed purely within the open source community. Vendors have value to add in areas such as security management, governance, provenance, discovery, and lineage.
  • Wrangling and Conditioning – Data comes dirty, it comes fast or it comes with different schema, format, and semantics, and the lines separating different irregularities are very blurry. This record is different from the last one: is it dirty? Did the model change? Is this even part of the same logical data set? Open source ETL GUI tools and dataflow languages provide a tried-and-true means for creating fairly static logic pipelines, but commercial solutions can bring innovative approaches for applying machine learning and crowdsourcing techniques to data validation and preprocessing use cases.
  • Reporting and Visualization – Last but not least, the commercial marketplace is rich with products that will consume data, big and small, and help make the information contained within that data accessible, bridging the gap from technology back to insights providing business value.

So when planning your next data architecture, consider whether your business constraints compel you to plan for a purely FOSS solution, or whether commercial technologies may have a place within your architecture. Open source software is a foundational core, but in these and other areas, commercial technologies also have a lot to offer.