THE SITUATION

A rapidly growing niche analytics firm has developed cutting-edge models that help retailers within their industry make better product recommendations to customers. These recommendations not only match appropriate products to customers, but also help configure each product to the customer’s specific needs. This requires no direct input from customers; the configuration is inferred by analyzing sales and return history, specific product attributes, and differences across models and manufacturers.

As their data pools grew, they began to have performance, reliability, and management challenges with the data infrastructure. The data was stored in an overflowing RDBMS-based data warehouse, supplemented by limited complementary tooling that had been cobbled together over time. Workarounds had hit the point of diminishing returns. Regular ingest (and occasional correction) of retailers’ sales datasets took hours, sometimes failing silently. Data scientists spent considerable time working around data access and query limitations, often slicing data into much smaller segments than they would prefer and wasting valuable time strategizing the best way to massage specific data elements out of the system.

OBJECTIVES

BigR.io was engaged to help modernize and design the next generation data infrastructure, alleviating these problems and opening up future growth of both data volume and analytical approaches and tooling.

Specific objectives of this new architecture included:

  • Reduction of data set ingest time to minutes.
  • Ability to replay and correct past data ingests and reflect changes in “downstream” analyses.
  • Interactive analysis and query across broader cross-sections of data.
  • Support for a variety of analysis and reporting tools that meet the needs of both business users and data scientists.
  • Strict schema enforcement.
  • Managed infrastructure with bounded costs.

This initiative was addressed in three phases:

  • Discovery & Design – The BigR.io team worked with stakeholders across divisions to understand specific use cases, constraints, and pain points. During this phase, candidate architectures were developed and presented with a description of relevant tradeoffs, and stakeholders were guided through a collaborative decision-making process culminating in a single future-state architecture. BigR.io was also able to make quick-fix recommendations to help reduce the severity of some specific pain points within the existing data warehouse.
  • POC – The future state architecture was implemented by the BigR.io team, and tested and benchmarked against a subset of production data in order to validate both functionality and performance. Specific implementation risks were identified and micro-strategies prototyped to assess viability in addressing these risks.
  • Production Rollout – The final productionization of the POC was undertaken by the client’s internal engineering group with initial guidance and advice from BigR.io. This helped the client develop internal expertise and self-sufficiency for ongoing development and troubleshooting.

THE RESULTS

The architecture that was selected is based on Amazon Web Services (AWS) and has the following characteristics:

  • Data at rest is stored in S3 using a tiered bucketing topology. Raw ingest data is cataloged in a “data lake” storage area, and stored original data and downstream analytics are structured in external tables managed by Hive/HCatalog.
  • Processing of data is conducted on ephemeral Elastic MapReduce (EMR) clusters accessing S3 via EMRFS; HDFS is utilized in-job where appropriate, but compute and storage are effectively decoupled.
  • Business users are able to query data via Hive, Presto, and Spark SQL and to visualize using Tableau and other BI tools; developers and data scientists are able to use Apache Pig (allowing use of existing logic), Spark, and integrated Scala and Java logic.
  • All data writes are immutable but partitioned on a per-ingest-job basis, allowing a last-written strategy to support logical updates; a caching layer helps address speed and concurrent access concerns.
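
The last-written strategy in the final bullet can be illustrated with a minimal Python sketch. It is not the client's implementation: the SaleRecord schema, the integer ingest job ids, and the resolve_last_written helper are all assumptions made for illustration, and in the architecture described the equivalent logic would run in Hive, Presto, or Spark over S3 partitions rather than in memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SaleRecord:
    """One immutable row written by a single ingest job (hypothetical schema)."""
    retailer_id: str
    order_id: str
    amount: float
    ingest_job: int  # assumed to increase with each new ingest job


def resolve_last_written(partitions: Dict[int, List[SaleRecord]]) -> List[SaleRecord]:
    """Collapse per-ingest partitions into one logical view.

    Partitions are never modified. A correction is written as a new
    partition, and for each (retailer_id, order_id) the row from the
    highest ingest job id wins.
    """
    latest: Dict[Tuple[str, str], SaleRecord] = {}
    for job_id in sorted(partitions):  # oldest job first, newest last
        for row in partitions[job_id]:
            latest[(row.retailer_id, row.order_id)] = row
    return sorted(latest.values(), key=lambda r: (r.retailer_id, r.order_id))


partitions = {
    1: [SaleRecord("r1", "o1", 10.0, 1), SaleRecord("r1", "o2", 25.0, 1)],
    2: [SaleRecord("r1", "o2", 20.0, 2)],  # corrected re-ingest of order o2
}
view = resolve_last_written(partitions)
```

Because the original partition is left untouched, a faulty correction can be undone by dropping the newer partition and resolving again.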

This new architecture resulted in “a huge win” and paved the way for long-term growth while eliminating the immediate pains:

  • Queries that used to take hours (or that could not be run at all) can now be executed in minutes or seconds.
  • Analyses that previously needed to be time-sliced into increments as small as one day of data can now be run over all time.
  • All objectives were met, enabling our client to focus on their core business, refining the analytical models that are spearheading their industry instead of stumbling over infrastructure challenges.

Secure HIPAA-Compliant Collaboration

Written by Greg Harman

Managing Partner of BigR.io

Abstract

BigR.io architected and implemented a distributed, secure collaboration and document sharing platform that allows parents to exchange documents and conduct collaborative discussions with their children’s healthcare and educational providers, even across provider organizations. This platform utilizes a distributed document storage system with multi-layered security that allows this sensitive information to be administered without a centralized point of access to PHI, relieving the accompanying responsibilities and risks of storing and utilizing this data. The platform is designed to support HIPAA and other privacy regulations. An anonymized data extraction pipeline supports advanced analytics without compromising user privacy and security.

ABOUT BIGR.IO

BigR.io is a technology consulting firm empowering data to drive innovation and advanced analytics. We specialize in cutting-edge Big Data and custom software strategy, analysis, architecture, and implementation solutions. We are an elite group with MIT roots, shining when tasked with complex missions. Whether it’s assembling mounds of data from a variety of sources, surfacing intelligence with machine learning, or building high-volume, highly-available systems, we consistently deliver.

With extensive domain knowledge, BigR.io has teams of architects and engineers that deliver best-in-class solutions across a variety of verticals. This diverse industry exposure and our constant contact with the cutting edge equip us with invaluable tools, tricks, and techniques. We bring knowledge and horsepower that consistently deliver innovative, cost-conscious, and extensible results to complex software and data challenges. Learn more at www.bigr.io.

THE SITUATION

Despite privacy rules and standards, such as the Health Insurance Portability and Accountability Act (HIPAA) and supporting technologies within the internal IT and data systems at healthcare organizations, a secure communications gap exists between healthcare organizations and their patients. While some larger organizations do support document access portals, these portals are not ubiquitous and almost never span multiple organizations, leading patients or their caregivers to often resort to email and fax when coordinating care across a team of professionals at different organizations.

Even where organizations offer more advanced portals, these portals and their underlying databases provide a single point of attack for malicious individuals, as evidenced by a recent spate of hospital data being encrypted and ransomed. While this activity affects the organization’s bottom line, it also exposes patients to privacy breaches and prevents them from accessing their own medical records.

A team of professionals from the MIT and Harvard communities has launched a new initiative to address these issues for coordinated healthcare and education of children. They are developing a cross-organization document sharing and collaboration portal that is built on a foundation of distributed control and privacy.

OBJECTIVES

BigR.io was engaged to architect and implement the secure, distributed document storage system that underpins this secure portal. Specific design objectives included:

  • Distributed storage such that Protected Health Information (PHI) and other sensitive documents are stored in parent-controlled datastores. A breach of the core portal cannot expose PHI, and a breach of any specific datastore exposes PHI for only one patient.
  • Flexible storage architecture allowing these parent-controlled datastores to exist on top of heterogeneous technologies such as AWS S3, Google Drive, and local file storage.
  • Encryption of all PHI using a distributed key paradigm.
  • Secure collaboration on PHI compatible with these storage and security constraints.
  • Mechanism for removing Personally Identifiable Information (PII) and aggregating PHI data into (opt-in) data sets to support scientific research.
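
As a rough illustration of the last objective, the sketch below drops PII fields from opted-in records and replaces the patient identifier with a keyed hash, so that records for the same patient can still be grouped. The field names, the opt_in flag, and the secret handling are all hypothetical. Keyed hashing is pseudonymization and is not, on its own, a complete de-identification method.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List

# Hypothetical field names; a real record layout would differ.
PII_FIELDS = {"name", "email", "address"}
CONTROL_FIELDS = {"patient_id", "opt_in"}


def pseudonymize(patient_id: str, secret: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 hash (HMAC)."""
    return hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(records: Iterable[Dict], secret: bytes) -> List[Dict]:
    """Keep only opted-in records, strip PII, and pseudonymize the patient id."""
    cleaned = []
    for record in records:
        if not record.get("opt_in"):
            continue  # research data sets are opt-in only
        row = {
            key: value
            for key, value in record.items()
            if key not in PII_FIELDS and key not in CONTROL_FIELDS
        }
        row["subject"] = pseudonymize(record["patient_id"], secret)
        cleaned.append(row)
    return cleaned


records = [
    {"patient_id": "p-1", "name": "A. Example", "email": "a@example.org",
     "diagnosis_code": "F80.1", "opt_in": True},
    {"patient_id": "p-2", "name": "B. Example", "email": "b@example.org",
     "diagnosis_code": "F90.0", "opt_in": False},
]
research_rows = anonymize(records, secret=b"demo-secret-not-for-production")
```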

 

ENGAGEMENT

This engagement was addressed in three phases:

  • Design & Architecture – The BigR.io team worked with stakeholders to understand specific use cases, constraints, and pain points. During this phase, candidate architectures were developed and presented with a description of relevant tradeoffs, and stakeholders were guided through a collaborative decision-making process culminating in a single target architecture. An implementation plan was developed allowing the customer to divide the project into parallel paths, offshoring the low-risk portal front-end development while continuing to engage BigR.io to develop the more complex data storage design.
  • Back-End Implementation – The back-end architecture was implemented by the BigR.io team, subjected to third-party security testing, and deployed to beta organizations. Additional roadmap features and future risks were identified, designed, and backlogged for ongoing development.
  • Production Support and Transition – BigR.io has assisted with the transition of technology and development to the customer’s internal development team, providing both general knowledge transfer for ongoing development and production support for early portal rollouts.

 

RESULTS

The architecture that was selected is exposed as a REST API deployed on Amazon Web Services (AWS) and has the following characteristics:

  • The REST API was implemented using AWS Lambda, API Gateway, and SNS.
  • AWS Cognito and OAuth 2.0 were used to provide flexible authentication across data storage systems.
  • Initial distributed storage implementations were created using Google Drive (primary) and AWS S3 (secondary).
  • An asymmetric encryption scheme protects the symmetric keys, supporting a mode of collaboration in which no unencrypted key is stored in any data system and every document is encrypted with a unique key for each user who has access to that document.
  • The security of this architecture exceeds that offered by any centralized document store, and carries the additional benefit of releasing the customer from the HIPAA requirement of Business Associate Agreements (BAAs) with infrastructure providers such as AWS and Google.
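
One way to picture the heterogeneous storage layer described above is a small common interface with one implementation per backend. The Python sketch below is an assumption-laden illustration, not the delivered design: the DocumentStore and LocalFileStore names are hypothetical, only a local-file backend is shown, and Google Drive or S3 backends would implement the same two methods using their own client libraries. The bytes passed in are assumed to be encrypted already.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class DocumentStore(ABC):
    """Hypothetical interface every parent-controlled datastore would implement."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store already-encrypted bytes under the given key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the encrypted bytes stored under the given key."""


class LocalFileStore(DocumentStore):
    """Backend that keeps documents in a directory on local disk."""

    def __init__(self, root: Path) -> None:
        root.mkdir(parents=True, exist_ok=True)
        self.root = root.resolve()

    def _path(self, key: str) -> Path:
        path = (self.root / key).resolve()
        # Reject keys such as "../x" that would escape the store root.
        if self.root not in path.parents:
            raise ValueError(f"invalid document key: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


store: DocumentStore = LocalFileStore(Path(tempfile.mkdtemp()))
store.put("reports/evaluation.enc", b"ciphertext-bytes")
```

Calling code depends only on DocumentStore, so a parent's choice of backend does not change how the portal reads or writes a document.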

eBay Enterprise Marketing Solutions (EEMS) contracted with BigR.io to build fundamental components of a new Customer Engagement Engine (CEE) for their Commerce Marketing Platform.

The CEE, a demand generation marketing solution, creates feedback loops that are extremely data rich and integrated with marketing analytics. eBay Enterprise Marketing Solutions and its clients leverage those analytics to improve the accuracy of customer profiles and preferences and to maximize campaign results.

BigR.io’s expertise enabled the CEE to handle data from tens of millions of end users every day with 99.999% availability at peak traffic. The state-of-the-art system integrates the eCommerce operations of eBay partners, such as major retailers, Fortune 500 consumer brands, professional sports leagues, and major U.S. airlines.

THE SITUATION

eBay Enterprise was formerly known as GSI Commerce, which eBay acquired in 2011 and renamed as eBay Enterprise two years later. eBay Enterprise Marketing Solutions, a division of eBay Enterprise, creates, develops, and operates online shopping sites for more than 1,000 brands and retailers.

The company delivers consistent omnichannel experiences across all retail touch points to attract and engage new customers, convert browsers into loyal buyers, and deliver products with speed and quality. Among the services that eBay Enterprise Marketing Solutions provides are marketing, consumer engagement, customer care, payment processing, fulfillment, fraud detection, and technology integration. The company has offices in North America, Europe, and Asia.

The CEE is the central component of the EEMS Commerce Marketing Platform (CMP).

The CMP Contains Nine Components:

  • Customer Engagement Engine
  • Display Engine
  • Affiliate Engine
  • Attribution Engine
  • Media Planner
  • Audience Insights Engine
  • Social Engine
  • Optimization Engine
  • Loyalty Engine

Using eBay Enterprise Marketing Solutions, the parent company wanted to reestablish its competitive advantage as the premier e-commerce vendor by re-architecting GSI’s legacy e-commerce platform from a hodgepodge of disparate technologies and frameworks into a single, powerful, and efficient platform.

The new platform had to outperform its predecessor in terms of transactional data velocity, while reasserting the company’s dominant position as one of the most reliable, highest-volume e-commerce vendors in the world.

BigR.io’s Roles and Responsibilities in the Project

As the developer of CEE’s core module, the Data Ingest, BigR.io had the task of architecting and building an edge server and data ingest component that had three primary functions:

  • Transparently redirect customer browsers from a vanity URL embedded within their messages to a final destination,
  • Log customer metadata to a NoSQL store for marketing analysis, and
  • Perform additional logging of open detect and click-to-sale conversion events.
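
The first two functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the handle_click function, the vanity path format, the example.com destination, and the in-memory list standing in for the NoSQL store are all hypothetical, and a production edge server would sit behind load balancers and write to a distributed store.

```python
import time
from typing import Dict, List, Tuple


def handle_click(
    path: str,
    user_agent: str,
    links: Dict[str, str],
    event_log: List[dict],
) -> Tuple[int, Dict[str, str]]:
    """Resolve a vanity path, log the click, and return (status, headers)."""
    destination = links.get(path)
    if destination is None:
        return 404, {}  # unknown vanity URL: nothing to log or redirect to
    event_log.append({
        "event": "click",
        "path": path,
        "destination": destination,
        "user_agent": user_agent,
        "timestamp": time.time(),
    })
    # The browser follows the Location header to the final destination.
    return 302, {"Location": destination}


links = {"/c/abc123": "https://www.example.com/spring-sale"}  # hypothetical mapping
event_log: List[dict] = []  # stand-in for the NoSQL store

status, headers = handle_click("/c/abc123", "ExampleBrowser/1.0", links, event_log)
```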

Because the Data Ingest Module is the most critical component in the CEE, it must operate at higher availability (99.999%), higher performance and with more robust security than the other components. BigR.io’s responsibility included architecting and developing the data store (CEE’s system of record for security metadata, encryption keys, and critical marketing metadata), as well as the load balancing and disaster recovery functionality.

BigR.io had to build all this functionality to the following constraints:

  • Capable of handling on the order of 120 billion messages annually
  • High-Availability: 99.999 percent uptime
  • Multi-layered security measures to thwart phishing attacks
  • Distributed operation across multiple data centers and logical pods
  • Integration with numerous other system components produced both in-house and by third-party vendors
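
To put the first two constraints in perspective, the short calculation below converts them into an average message rate and a yearly downtime budget. It assumes a 365-day year and an even spread of traffic, so peak rates would be higher than the average shown.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

messages_per_year = 120_000_000_000  # "on the order of 120 billion"
availability = 0.99999               # 99.999 percent uptime

avg_messages_per_second = messages_per_year / SECONDS_PER_YEAR
downtime_minutes_per_year = (1 - availability) * MINUTES_PER_YEAR

print(f"{avg_messages_per_second:,.0f} messages/second on average")
print(f"{downtime_minutes_per_year:.2f} minutes of downtime allowed per year")
# 3,805 messages/second on average
# 5.26 minutes of downtime allowed per year
```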

The project business case for the CEE included the following requirements:

  • A foundation characterized by a scalable, extendable, and maintainable architecture
  • Increased marketing campaign management control and functionality
  • The capability to funnel and capture more, and more detailed, customer data than was possible at the project outset
  • A richer marketing analytics feature set

To meet eBay Enterprise Marketing Solutions’ objectives for the CEE, BigR.io first determined appropriate technologies and frameworks for the engine. This was a significant challenge, as EEMS had been operating a mixed bag of legacy systems pieced together from several corporate acquisitions. BigR.io architected, deployed, and managed the following:

  • A Portfolio of Apache & Other Open Source Projects
  • Modern DevOps (Continuous Integration)
  • Cloud Computing
  • Content Management Systems
  • Marketing Analytics Systems
  • NoSQL
  • Cutting Edge Development Tools, Languages, & Databases

[Figure: eBay Redirector]

THE RESULTS

BigR.io’s Data Ingest module is the functional center of the CEE. This ensures that incoming and archived data are available to CEE’s powerful analytics technologies – providing the commerce insights and demand generation necessary to optimize the relevance and value of each customer’s purchasing journey.

With the CEE in place, eBay Enterprise Marketing Solutions enabled its marketers to drive and optimize one-to-one commerce at scale. Both EEMS and their clients now capture and utilize more customer data than ever before. As a result of BigR.io’s work on the CEE, EEMS expects to double performance across a number of metrics.

Performance Metrics

[Figure: eBay Results]

Implications of Big Data on Future Applications

The business case for the investment in the CEE was based upon the enormous, yet previously unrealized value in the data represented by the customers of eBay and its downstream business partners. That data enables the targeting of marketing campaigns more precisely than ever before. With so much data available from customers, marketers can speak to them virtually on a one-on-one basis.

BigR.io helps organizations exploit Big Data by specializing in the development and support of enterprise level data projects. We can help your company pool and normalize your data with our cutting-edge custom data integration, cleansing, and validation services. Thanks to the volumes of transactional and behavioral data that customers are sharing through their multichannel interactions, there are tremendous opportunities for digital organizations to identify new ways to drive internal efficiencies and optimize the customer experience. Our teams of data scientists and engineers will pave the way to efficient analytics, new business insights, and better decision making.

For every engagement, BigR.io brings depth of experience in all aspects of data management, from integration, access, and governance to security and cluster operations. Our infrastructure expertise provides best practices in Apache Hadoop, Massively Parallel Processing (MPP), and NoSQL systems. Our structured and unstructured data strategies include integration expertise in pooling and partitioning, and quality expertise in cleansing and validation. Our analytics capabilities ensure more usable and accurate data to drive better business insights.