The complexity of a Decision Support System (DSS) varies widely between use cases. Sometimes the DSS requires data from a single, independent source, such as streaming U.S. stock market Nasdaq or S&P 500 data. Other DSSs, which are the topic of this post, require data from multiple disparate sources, as found in many large corporate enterprise environments with multiple lines of business (LOB), such as Gartner's "Logical Data Warehouse" design pattern.
Depending on the DSS requirements, there can be a variety of structured, semi-structured, and unstructured data. Regardless of the types of data, maximizing their usability and value requires that the data be organized and documented properly so that analysts and other end users can quickly access what they need. Once deployed, it's too late to be working through performance issues caused by faulty design, non-scalable components, and data management shortcuts. The equipment for a DSS can be a huge capital expense, but the greater risk, and cost, lies in achieving a proper system design that in the end provides real business value (return on investment). This post attempts to present some of the critical "Design Considerations for Building a DSS Data Pipeline" that actually work and provide real business value.
*The term DSS is used here because of its generic definition. A DSS, in essence, is any information system that supports the decision-making process. A DSS can consist of many types of platforms, architectures, and end-user applications; therefore, the data pipeline can differ significantly from one DSS to another. For the purposes of this post, the DSS consists of multiple RDBMSs and a Hadoop platform.
A Typical Framework:
Too many companies today are discovering the one-way data lake, where data goes in but nothing of real value comes out. We have all heard of the three V's used to define Big Data: Volume, Velocity, and Variety. For Big Data management best practices, many are now adding two more V's to the list: Veracity and Value.
Integration of mastered data is the key to veracity and value, and without those two V's, what's the point? The lack of this integration is what turns a data lake into a one-way data lake. Notice the term "mastered data," which enables accurate integration. Mastered data refers to the identification of Key Data Elements (KDEs) across the enterprise, which are then assigned a Globally Unique Identifier (GUID). This GUID is also called a Universally Unique Identifier (UUID), as well as other names used by various Master Data Management (MDM) vendors.
The key point is that the KDEs are assigned a unique identifier. Once accomplished, this process enables integration by providing a method for mapping and joining KDEs across System of Record (SOR) databases. For example, if a customer in Consumer Banking is also a Wealth and Investments customer, that customer should be assigned the same GUID in both databases. Master data management, as well as master reference data management, are complex topics that would warrant several posts to address adequately. The point here is that integrating data without mastered data is problematic and will compromise the veracity of any reporting and analysis derived from the DSS.
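To make the idea concrete, here is a minimal Python sketch of deterministic GUID assignment. Everything in it is hypothetical: the namespace, the matching attributes, and the records. A real MDM tool handles matching, survivorship, and stewardship far more rigorously; this only illustrates why one shared identifier enables the cross-LOB join.

```python
import uuid

# Hypothetical MDM namespace; a real MDM platform would manage this centrally.
MDM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "mdm.example.com")

def master_guid(matching_attributes: str) -> str:
    """Derive a deterministic GUID from the attributes used to match a KDE.

    uuid5 yields the same UUID for the same input, so a customer matched in
    Consumer Banking and in Wealth and Investments resolves to one identifier.
    """
    return str(uuid.uuid5(MDM_NAMESPACE, matching_attributes))

# The same person appearing in two SOR databases (illustrative records)...
consumer_banking = {"name": "jane q. public", "ssn_hash": "9f2c1a"}
wealth_invest = {"name": "jane q. public", "ssn_hash": "9f2c1a"}

# ...resolves to one GUID, which each SOR record then carries.
guid_cb = master_guid(f"{consumer_banking['name']}|{consumer_banking['ssn_hash']}")
guid_wi = master_guid(f"{wealth_invest['name']}|{wealth_invest['ssn_hash']}")
assert guid_cb == guid_wi  # the join key across LOB databases
print(guid_cb)
```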
Metadata Management is a Must:
Additionally, this integrated data must carry metadata that defines the technical details, such as file and column names, data types, etc., as well as business metadata, such as file and column definitions as they pertain to their business context. Capturing data and dumping it into a file system, whether distributed or not, is of little value if the content is unknown or impossible to identify. To demonstrate the point, without the metadata, could anyone determine with certainty how the following data might be used (and this is an easy one):
If you're a baseball fan and keep up with the Boston Red Sox, you might recognize the pitcher's name in column 1, and 2010 is probably the year in column 2, and you might figure out that BOS and AL refer to the Boston Red Sox in the American League, but what are columns 3, 6, 7, 8, and 9 referring to? As I said, this is an easy one, but how long would it take to figure out what each column refers to with 100% certainty? Metadata would tell us the file location, the file name (pitching_stats), and the columns: 1) player_id, 2) year_id, 3) stint, 4) team_id, 5) league_id, 6) wins, 7) losses, 8) games, and 9) games_started. In addition, the metadata would provide the data type for each column, whether the field is a primary or foreign key, whether the field may contain NULLs, etc. The business metadata also provides the definitions for each table/file and column in the given context.
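As an illustration (a hedged sketch, not the format of any particular metadata tool), the technical metadata for the pitching_stats file could be captured with a PyArrow schema, with the business definitions attached as field-level metadata; the types and definitions below are my assumptions from the column names:

```python
import pyarrow as pa

# Technical metadata for pitching_stats: names, types, nullability, plus
# business definitions attached as field-level metadata.
pitching_stats = pa.schema([
    pa.field("player_id", pa.string(), nullable=False,
             metadata={"definition": "Unique player identifier (key)"}),
    pa.field("year_id", pa.int16(), nullable=False,
             metadata={"definition": "Season year"}),
    pa.field("stint", pa.int8(), nullable=False,
             metadata={"definition": "Order of stints with teams in a season"}),
    pa.field("team_id", pa.string(), nullable=False,
             metadata={"definition": "Team code, e.g. BOS"}),
    pa.field("league_id", pa.string(), nullable=False,
             metadata={"definition": "League code, e.g. AL"}),
    pa.field("wins", pa.int16(), metadata={"definition": "Games won"}),
    pa.field("losses", pa.int16(), metadata={"definition": "Games lost"}),
    pa.field("games", pa.int16(), metadata={"definition": "Games pitched"}),
    pa.field("games_started", pa.int16(),
             metadata={"definition": "Games started as pitcher"}),
])
print(pitching_stats)
```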
The previous example obviously referred to structured data, but what about unstructured data? Metadata is required there as well. What if your business performs analytics for insurance companies and keeps images of accident photos from claims, or documents containing the adjuster's notes? How would this data be organized? Search engines like Elasticsearch or Solr are, at bottom, databases of files built on metadata. Without metadata, searching this data would be impossible.
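Continuing the insurance example, here is a hedged sketch of how searchable metadata might be attached to an unstructured claim photo, assuming an Elasticsearch 8.x cluster and its official Python client; the index name, fields, file path, and endpoint are all assumptions:

```python
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# The image itself lives in a file system or object store; what makes it
# findable is the metadata document indexed alongside its path.
claim_photo_metadata = {
    "file_path": "/claims/2023/claim-000123/photo-01.jpg",
    "content_type": "image/jpeg",
    "claim_id": "000123",
    "adjuster": "j.smith",
    "captured_on": date(2023, 4, 17).isoformat(),
    "tags": ["rear bumper", "collision", "sedan"],
}

es.index(index="claim-photos", document=claim_photo_metadata)

# Later, an analyst can find every bumper photo without opening a single file.
hits = es.search(index="claim-photos", query={"match": {"tags": "bumper"}})
```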
The data pipeline must be planned meticulously, and great care must be taken to ensure that the data structures are usable by end users. The data must be organized and accurate to ensure its veracity and value.
Streaming Data Design:
There are decisions to be made regarding streaming feeds and how best to architect the framework for real-time analytics. The figure below shows the prevalent architectures; for more information on the Lambda and Kappa architectures, I refer you to a post on the Talend website. Streaming real-time metrics are addressed in an upcoming post on design by benchmark testing.
The Real Design Consideration:
Ultimately, the primary concern is: does the design satisfy the requirement(s)? If you're like me, you look at diagrams like Figures 1, 2, and 3 and end up with more questions than answers. From a simplistic perspective, data is either generated through company processes (e.g., new accounts, customers, products) or imported from external sources (e.g., financial feeds, demographics, social media). The data are then stored, moved, transformed/aggregated, consumed, and eventually archived. From a not-so-simplistic perspective, data comes in various formats and via various methods, and decisions must be made regarding the following (a few of these choices are sketched in code below):
- capacity planning and bandwidth availability;
- security requirements (e.g., masking, tokenization, encryption, entitlements);
- partitioning and compression (e.g., Snappy, gzip);
- file formats (e.g., Avro, Parquet);
- movement of data from one data store to the next (data flows and processes);
- where data transformation will occur (Extract Transform Load vs. Extract Load Transform);
- data integration, master data management, and metadata management;
- optimizing data structures.
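For instance, here is a hedged sketch of how three of these decisions (file format, compression codec, and partitioning) look in practice with pandas and PyArrow; the path, columns, and partition key are illustrative:

```python
import pandas as pd

# A toy stand-in for a feed landing in the pipeline.
df = pd.DataFrame({
    "team_id": ["BOS", "BOS", "NYA"],
    "year_id": [2009, 2010, 2010],
    "wins": [95, 89, 95],
})

# A columnar format (Parquet), a lightweight codec (Snappy), and partitioning
# by a commonly filtered column are three of the design decisions listed above.
df.to_parquet(
    "pitching_stats.parquet",    # written as a partitioned directory tree
    engine="pyarrow",
    compression="snappy",
    partition_cols=["year_id"],  # directory-style partitions: year_id=2009/...
)

# Readers that filter on the partition column can skip whole directories.
recent = pd.read_parquet("pitching_stats.parquet", filters=[("year_id", "=", 2010)])
```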
Designing a DSS can be very involved and complicated, and notice that establishing performance metrics is not mentioned in the paragraph above. This topic is frequently overlooked and should be addressed as well. Design decisions such as choosing Parquet over Avro, or Snappy compression over gzip, can greatly impact performance, and the choices should be validated through benchmark testing, because the results are not always the same; they depend on the data and the use cases.
Design Decisions Based on Performance Metrics:
Frequently, design decisions are based on PowerPoint presentations and possibly a scaled-down proof of concept. To optimally design a DSS data pipeline, however, design decisions should be evaluated using quantifiable benchmark performance metrics. These metrics should come from rigorous queries at various levels of complexity, characterized by high CPU and I/O load: the same type of queries typically used to answer real-world problems in a DSS. If you don't know which design pattern is optimal, test the candidates and let the performance metrics decide.
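As a minimal sketch of what "quantifiable" means here (the synthetic data, the query, and the two codecs compared are all illustrative), the same scan-filter-aggregate query can be timed under two candidate designs:

```python
import time
import numpy as np
import pandas as pd

# Synthetic workload; a real benchmark would use representative DSS queries
# against production-scale data.
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, size=1_000_000),
    "value": np.random.rand(1_000_000),
})

results = {}
for codec in ("snappy", "gzip"):  # two candidate compression designs
    path = f"bench_{codec}.parquet"
    df.to_parquet(path, engine="pyarrow", compression=codec)

    start = time.perf_counter()
    # The measured "query": scan, filter, and aggregate, as a DSS query would.
    out = pd.read_parquet(path).query("key < 100").groupby("key")["value"].mean()
    results[codec] = time.perf_counter() - start

print(results)  # wall-clock seconds per codec; let the numbers decide
```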
Conclusion:
All of the design decisions discussed here impact the data pipeline and its performance, and ultimately determine whether the system is a one-way pipeline or a system that provides real business value and analysis that can be trusted and used in the decision-making process. Your comments and questions are welcome.