ARCHITECTURE AND POC OF MANAGED DATA HUB

BigRio LLC
Harvard Square,
One Mifflin Place, Suite 400,
Cambridge, MA 02138
(617) 500-5093 | [email protected]
www.BigR.io


BY: GREG HARMANN

PROBLEM

A rapidly growing niche analytics firm has developed cutting-edge models that help retailers within their industry make improved product recommendations to customers. These recommendations not only match appropriate products to those customers, but also help configure the product to the customers’ specific, personalized needs. This requires no direct input from customers, and the recommendations are inferred by analyzing sales and return history, specific product attributes, and different models across manufacturers.

As the data pools grew, this analytics firm began to face performance, reliability, and management challenges related to its data infrastructure. The data was stored in an overflowing Relational Database Management System (RDBMS). The RDBMS-based data warehouse had limited complementary tooling, and attempts to keep cobbling together workarounds had hit the point of diminishing returns. Regular ingest and occasional correction of retailers’ sales datasets took hours, sometimes failing silently. Data scientists spent considerable time working around data access and query limitations, often slicing data into much smaller segments than they would prefer. Valuable time was wasted strategizing the best way to massage specific data elements out of the system.

OBJECTIVES

BigRio was engaged to help design a modernized, next-generation data infrastructure. The goal was to alleviate these problems and open up future growth of both data volume and analytical approaches and tooling.

Specific objectives of this new architecture included:

Reduction of dataset ingest time to minutes.
Ability to replay and correct past data ingests and reflect the changes in “downstream” analyses.
Interactive analysis and query across broader cross-sections of data.
Support for a variety of analysis and reporting tools, serving the needs of both business users and data scientists.
Strict schema enforcement.
Managed infrastructure with bounded costs.

METHODS

This initiative was addressed in three phases:

DISCOVERY & DESIGN

The BigRio team worked with stakeholders across divisions to understand specific use cases, constraints, and pain points. During this phase, candidate architectures were developed and presented with a description of relevant tradeoffs, and stakeholders were guided through a collaborative decision-making process culminating in a single future-state architecture. BigRio was also able to make quick-fix recommendations to help reduce the severity of some specific pain points within the existing data warehouse.

POC

The future-state architecture was implemented by the BigRio team, then tested and benchmarked against a subset of production data in order to validate both functionality and performance. Specific implementation risks were identified, and micro-strategies were prototyped to assess their viability in addressing these risks.

PRODUCTION ROLLOUT

The production rollout of the POC architecture was undertaken by the client’s internal engineering group with initial guidance and advice from BigRio. This helped the client to develop internal expertise and self-sufficiency for ongoing development and troubleshooting.

RESULTS

The implemented Amazon Web Services (AWS) architecture has the following features:

Data at rest is stored in S3 using a tiered bucketing topology. Raw ingest data is cataloged in a “data lake” storage area. Original stored data and downstream analytics are structured in external tables managed by Hive/HCatalog.
Processing of data is conducted on ephemeral Elastic MapReduce (EMR) clusters accessing S3 via EMRFS; HDFS is utilized in-job where appropriate, but compute and storage are effectively decoupled.
Business users are able to query data via Hive, Presto, and Spark SQL; visualize data using Tableau and other BI tools; and use Apache Pig, Spark, and integrated Scala and Java logic.
All data writes are immutable but partitioned on a per-ingest-job basis, allowing a last-written strategy to support logical updates; a caching layer helps address speed and concurrent-access concerns.
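
The last-written strategy for immutable, per-ingest partitions can be illustrated with a minimal Python sketch. This is not the client's implementation: the record fields, the use of a sortable timestamp string as the ingest job identifier, and the function name resolve_latest are all assumptions made for illustration. It shows only the general idea that a corrective ingest writes a new partition and the most recent write for a given key wins in the logical view.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SaleRecord:
    # Hypothetical fields, chosen only for illustration.
    retailer: str
    order_id: str
    quantity: int


# Each ingest job writes one immutable partition, identified here by a
# sortable job id (assumed to be a UTC timestamp string). Partitions are
# never modified in place.
Partition = Tuple[str, List[SaleRecord]]


def resolve_latest(
    partitions: Iterable[Partition],
) -> Dict[Tuple[str, str], SaleRecord]:
    """Build the logical view: for each (retailer, order_id) key, the
    record from the most recently written partition wins."""
    view: Dict[Tuple[str, str], SaleRecord] = {}
    # Process partitions in job-id order so later jobs overwrite earlier ones.
    for _job_id, records in sorted(partitions, key=lambda p: p[0]):
        for rec in records:
            view[(rec.retailer, rec.order_id)] = rec
    return view


partitions = [
    ("2016-01-01T00:00:00Z",
     [SaleRecord("acme", "A1", 2), SaleRecord("acme", "A2", 1)]),
    # A later corrective ingest supersedes A1 without touching the
    # first partition.
    ("2016-01-02T00:00:00Z", [SaleRecord("acme", "A1", 3)]),
]
latest = resolve_latest(partitions)
```

Because earlier partitions are left intact, a past ingest can be replayed or corrected by writing a new partition, and downstream analyses pick up the change the next time the logical view is resolved.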

New AWS Architecture Paved the Way For Long-Term Growth While Eliminating Immediate Pains:

Queries that used to take hours can now be executed in minutes or seconds.
Analyses that previously needed to be time-sliced into increments as small as one day of data can now be run over all time.
All objectives were met, enabling our client to refine their analytical models and focus on their core business instead of stumbling over infrastructure challenges.
