As technologies have evolved over the years, data modeling has come to be seen as a less and less important skill. It is impossible to count how many times the phrase has been uttered: “we no longer need data models because now we have _____________ (fill in the blank)”. Everything from a data lake, to a NoSQL database, to schema on read, to the still popular “we will manage it in the application”, to magic. Unfortunately, all of these claims are false, and failures abound due to a poor understanding of basic, fundamental data management practices. The value of data modeling in the Big Data era cannot be overstated, and it is the subject of this post.
The Work that goes Into Data Modeling:
Briefly, the first place a data modeler begins, hopefully, is with a set of requirements. From there, conceptual, logical, and physical data models are developed using a data modeling tool such as Erwin (my personal favorite). Barring requirements, the data modeler reverse engineers the database and begins a process of data profiling to determine what the existing structures look like. Is it a well-organized database following conventional standards for modeling an Online Analytical Processing (OLAP) database? Has dimensional modeling been used, with a standard star or snowflake schema, following an accepted standard such as those defined by Ralph Kimball or Bill Inmon (personally, I have a preference for Bill Inmon’s Corporate Information Factory)? If it is an Online Transaction Processing (OLTP) database, are the rules of at least 3rd Normal Form followed, using E. F. Codd’s relational modeling techniques, which are mathematically based on relational algebra and calculus? Whether OLAP or OLTP: are naming standards defined and adhered to, can you easily identify reference tables by their names, are there associative tables, are there circular relationships, is there any type of hierarchy for parent-child relationships (e.g., hospital, ward, clinic), and, most importantly, is there documentation for the database, like a data dictionary? Armed with this information, a data modeler can develop a data model to satisfy business requirements.
Schema On Read:
Let me be clear: I am not saying that schema on read is not a viable option in the big data world. What I am saying is that the term itself leads to the misconception that no schema is defined before the developer/analyst begins to extract data. I have done my share of data analysis and profiling, and one thing is certain: without a defined schema and a relatively complete data dictionary, your first dive into the data is a fishing expedition in unknown waters.
Schema on read can work IF there is a data schema predefined. In the big data environment, this is often referred to as metadata. Schema on read is probably a better example of why we still need data models than an argument against them. From Techopedia: “Schema on read refers to an innovative data analysis strategy in new data-handling tools like Hadoop and other more involved database technologies. In schema on read, data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes in.” That last sentence is where the whole concept falls apart unless there is already a pre-existing schema, or metadata that defines the data being extracted. How can you apply a schema to data as it is extracted if no schema is defined? The other problem: is there just one person using this tool, or do multiple developers/analysts use this data and define their own schemas? Even working collaboratively, there would need to be extensive testing to determine that the data was used the same way by both developers. Bottom line: schema on read requires a data model or schema definition. Without a data model, from which metadata is derived, schema on read leaves all of the data modeler’s work to each individual developer/analyst, which is sure to introduce inconsistencies, not to mention the inefficiencies of such a process. A minimal sketch follows.
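To make that concrete, here is a minimal sketch in Python of schema on read done properly: the schema (metadata) exists before any data is pulled, and each raw record is coerced and validated against it as it is read. The file name, field names, and types are hypothetical, invented for illustration rather than taken from any particular product.

    import json

    # Predefined schema (metadata) -- in practice this would be derived
    # from the data model and its data dictionary, not invented ad hoc.
    CUSTOMER_SCHEMA = {
        "customer_id": str,
        "name": str,
        "review_count": int,
        "average_stars": float,
    }

    def apply_schema(raw, schema):
        # "Schema on read": the schema is applied as the record is
        # pulled out of storage, not when it was written.
        typed = {}
        for field, field_type in schema.items():
            if field not in raw:
                raise ValueError("missing field: " + field)
            typed[field] = field_type(raw[field])  # coerce to declared type
        return typed

    with open("customers.json") as source:  # assumes one JSON object per line
        records = [apply_schema(json.loads(line), CUSTOMER_SCHEMA)
                   for line in source]

Note that two analysts reading through the same CUSTOMER_SCHEMA will, by construction, interpret the data the same way; that consistency is exactly what the data model provides.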
Side note: the lack of a data model, even for a data lake, is the main reason data scientists/analysts spend 80% of their time cleaning up data and only 20% doing analysis. Figure 1 depicts a good example of where you want to go (step 5), and a good reason why a company without data models (or automated integration, which requires extensive knowledge of the data) will never get there.
Key Value Pairs:
Again, a common misconception is that using data structures like JavaScript Object Notation (JSON, an object built from key:value pairs) removes the need for a data model; I must disagree. The following is a subset from a commercial site with a published API. There are 5 main JSON files in this dataset: 1) business.json, 2) checkin.json, 3) review.json, 4) tip.json, 5) user.json. Each file contains key:value pairs, with the key and value separated by a colon (:). You will notice that the first line of the business.json file contains a key followed by a colon; following the colon is a description of what the value should be. I can’t show the actual data, but you get the idea. Notice that there are keys in the business.json file whose values are arrays of strings. Such an array could easily be a nested list of tuples, or another key:value structure, like a Python dictionary. In fact, a set of JSON files can be just as complicated as a 100-table 3rd Normal Form data model. The problem is that referential integrity and constraints are not enforced; it is up to the developer to do this. Before some of you developers go off the deep end, I realize there are kludges for trying to identify primary and foreign keys, but they are not enforceable, and not supported by the JSON standard.
The business.json file structure looks similar to the following:
{ "business_id":"encrypted business id", "name":"business name", "neighborhood":"hood name", "address":"full address", "city":"city", "state":"state -- if applicable --", "postal code":"postal code", "latitude":latitude, "longitude":longitude, "stars":star rating, rounded to half-stars, "review_count":number of reviews, "is_open":0/1 (closed/open), "attributes":["an array of strings: each array element is an attribute"], "categories":["an array of strings of business categories"], "hours":["an array of strings of business hours"], "type": "business" }
The tip.json object structure:
{ "text":"text of the tip", "date":"date formatted like 2009-12-19", "likes":compliment count, "business_id":"encrypted business id", "user_id":"encrypted user id", "type":"tip" }
The user.json file structure:
{ "user_id":"encrypted user id", "name":"first name", "review_count":number of reviews, "yelping_since": date formatted like "2009-12-19", "friends":["an array of encrypted ids of friends"], "useful":"number of useful votes sent by the user", "funny":"number of funny votes sent by the user", "cool":"number of cool votes sent by the user", "fans":"number of fans the user has", "elite":["an array of years the user was elite"], "average_stars":floating point average like 4.31, "compliment_hot":number of hot compliments received by the user, "compliment_more":number of more compliments received by the user, "compliment_profile": number of profile compliments received by the user, "compliment_cute": number of cute compliments received by the user, "compliment_list": number of list compliments received by the user, "compliment_note": number of note compliments received by the user, "compliment_plain": number of plain compliments received by the user, "compliment_cool": number of cool compliments received by the user, "compliment_funny": number of funny compliments received by the user, "compliment_writer": number of writer compliments received by the user, "compliment_photos": number of photo compliments received by the user, "type":"user" }
The review.json file structure:
{ "review_id":"encrypted review id", "user_id":"encrypted user id", "business_id":"encrypted business id", "stars":star rating, rounded to half-stars, "date":"date formatted like 2009-12-19", "text":"review text", "useful":number of useful votes received, "funny":number of funny votes received, "cool": number of cool review votes received, "type": "review" }
Just from looking at the files above, you can easily see what appear to be primary keys: review_id, user_id, and business_id for each of the respective files. What about the tip.json file? There is no tip_id field, so what uniquely identifies a tip.json record? Given the structures, the combination of the date field, user_id, and business_id is probably as close as you can get, since the data carries just the date (not a timestamp). So how many ways could you join this data to get what you are looking for? There are numerous ways, and all but a few of them would be incorrect. While JSON is a great structure for passing data over the internet, and it is neither the bandwidth hog nor as complicated as XML is, it is not a database. Therefore, modeling JSON structures, as depicted below, can be very helpful for developers to better understand the data and use it consistently across the enterprise.
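As an illustration, here is a short Python sketch of the kind of work a developer inherits when the model is only implied: deriving a composite key for tip records and enforcing referential integrity by hand, since JSON will not do it. It assumes one JSON object per line per file, which is a common convention, not something the JSON standard guarantees.

    import json

    def load_jsonl(path):
        # One JSON object per line (an assumption about these files).
        with open(path) as f:
            return [json.loads(line) for line in f]

    business_ids = {b["business_id"] for b in load_jsonl("business.json")}
    user_ids = {u["user_id"] for u in load_jsonl("user.json")}

    seen_keys = set()
    for tip in load_jsonl("tip.json"):
        # No tip_id exists, so the closest thing to a primary key is this
        # composite -- and because "date" has no timestamp, two tips by the
        # same user for the same business on the same day still collide.
        key = (tip["user_id"], tip["business_id"], tip["date"])
        if key in seen_keys:
            print("possible duplicate tip:", key)
        seen_keys.add(key)

        # Referential integrity, enforced by the developer, not the data:
        if tip["business_id"] not in business_ids:
            print("orphaned tip, unknown business:", tip["business_id"])
        if tip["user_id"] not in user_ids:
            print("orphaned tip, unknown user:", tip["user_id"])

In a relational database, a declared primary key and two foreign key constraints would make every one of those checks unnecessary.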
NoSQL and Hadoop Data Structures:
I have worked on several big data projects, and with several NoSQL databases. As for NoSQL, the data modeling approach varies slightly depending on the type of NoSQL database. Let’s start with Neo4j and Apache Giraph. As discussed in previous posts, there is very little difference between these NoSQL databases and modeling for a relational database. The nodes are entities, each node can have attributes, the nodes are connected to related nodes through edges, and these edges can have names as well as attributes (e.g., a person HAS A address). For a more in-depth review of that discussion, please see the post: “Graphs, What Are They?”
There are 4 types of NoSQL databases: 1) key:value (Oracle, Redis, Riak, and numerous others), 2) key:columnar (HBase, Cassandra, . . .), 3) document based (CouchDB and MongoDB, although there are numerous others), and 4) graph databases (Neo4j, Giraph). There are differences among these types, and the modeling requirements are different for each, although document databases are a subtype of key:value. The document format typically used is the JSON object, which was discussed above.
The primary difference between the various NoSQL databases, and relational databases too for that matter, is that the data is the same but the structures are different (the sketch below shapes one record four ways). Also, most NoSQL databases follow the BASE model as opposed to ACID, as discussed in the referenced article above on graphs (Neo4j is ACID compliant). The point I was trying to make above with JSON objects, though, is that even though a JSON object is not relational per se, there are implied relationships and constraints that must be enforced. All that said, NoSQL databases are typically much faster than relational databases, and have many advantages as long as they are applied to the appropriate use cases.
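To see “same data, different structures” concretely, here is a Python sketch of one customer record shaped for each of the four NoSQL types. The record itself, and the column-family and edge names, are invented for illustration.

    # One customer, four physical shapes:

    # 1) key:value -- an opaque value looked up by key
    kv_store = {"customer:42": '{"name": "Acme Corp", "city": "Boston"}'}

    # 2) key:columnar -- a row key mapped to column families
    columnar = {"42": {"info:name": "Acme Corp", "info:city": "Boston"}}

    # 3) document -- a self-describing JSON-style document
    document = {"_id": "42", "name": "Acme Corp", "city": "Boston"}

    # 4) graph -- nodes (entities) joined by named edges (relationships)
    graph = {
        "nodes": [{"id": "42", "label": "Customer", "name": "Acme Corp"},
                  {"id": "boston", "label": "City", "name": "Boston"}],
        "edges": [{"from": "42", "to": "boston", "type": "LOCATED_IN"}],
    }

Notice that the modeling questions do not go away: every shape above still required deciding what the entities, keys, and relationships are.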
As for data modeling, NoSQL structures focus more on what the application needs. In a relational database, and in a dimensional database as well, the data is stored in structures very different from what ends up on the application UI screen (ergo all the joins to shape the data). NoSQL databases lend themselves to a model more conducive to object-oriented programs. That said, the models in many cases are simply views of a well-designed relational model, as the sketch below illustrates. This topic deserves more specific detail, but I want to address data traceability.
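Here is a minimal Python sketch of that idea: normalized relational rows on one side, and the application-shaped document, which is really just a view over them, on the other. The table and field names are hypothetical.

    # Normalized relational rows (two tables, related by customer_id):
    customer_row = {"customer_id": 42, "name": "Acme Corp"}
    account_rows = [
        {"account_id": 1, "customer_id": 42, "balance": 100.00},
        {"account_id": 2, "customer_id": 42, "balance": 250.00},
    ]

    # The application-shaped document a NoSQL store would hold -- in
    # effect a materialized view of the relational model above:
    customer_doc = {
        "customer_id": customer_row["customer_id"],
        "name": customer_row["name"],
        "accounts": [a for a in account_rows
                     if a["customer_id"] == customer_row["customer_id"]],
    }

The join logic did not disappear; it simply ran when the document was shaped (at write time) instead of at read time.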
Data Traceability:
The ultimate goal, in my opinion, for an enterprise data management system is data traceability. If an API is published, the governance team, as well as the enterprise architects, should know exactly where the data comes from and the data’s path through the enterprise (preferably documented). When designed properly, there is a layer of abstraction that protects the data from direct access. A microservices architecture is a good choice in my opinion, and illustrates the concept well.
Data flows into the enterprise through various means; it is transformed, consumed, and presented in the form of an API. That API consists of services, and each of these services MUST own the data it provides. For example, the design might call for a ‘get_customer’ service. If customer data is needed, there would be a call to the ‘get_customer’ microservice. No other service is written to retrieve customer data, because that would introduce potential inconsistencies and greatly complicate change management. These services could be split between public/anonymized data and personally identifiable information (PII). If the customer service needs account information, for example, a call would be made to the ‘get_account’ microservice. Each service, as well as the data it owns and manages, is thoroughly documented, and the governance team maintains the service contracts so that the impact of changing a service signature is understood. So how does this fit into a data modeling discussion?
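Before answering that, here is a bare-bones ‘get_customer’ service sketched in Python with Flask, for illustration only. The route, the in-memory store, and the delegation of account lookups to ‘get_account’ are assumptions chosen to show single ownership, not a production design.

    from flask import Flask, jsonify

    app = Flask(__name__)

    # This service is the single owner of customer data; no other service
    # reads the customer store directly.
    _CUSTOMERS = {"42": {"customer_id": "42", "name": "Acme Corp"}}

    @app.route("/customers/<customer_id>")
    def get_customer(customer_id):
        customer = _CUSTOMERS.get(customer_id)
        if customer is None:
            return jsonify({"error": "customer not found"}), 404
        # If account data were needed, this service would call the
        # get_account service through its published contract -- never
        # the account database directly.
        return jsonify(customer)

    if __name__ == "__main__":
        app.run()

Because ‘get_customer’ is the only path to customer data, the governance team has exactly one contract to trace and one signature whose changes require impact analysis.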
Conclusion:
Data modeling is one necessary process in any enterprise data management endeavor, but data management involves more than just storing data in a database and wiping your hands clean. Enterprise data architects must extend their reach beyond the database and ensure that the data consumption processes (the service layer) are managed properly as well. And as for data modeling, it is a fundamental process, whether automated or not, that provides the foundation for a strong enterprise data management framework. Saying that you don’t need a data model is like saying you don’t need blueprints before you start building that 3,000-square-foot house. This whole process can be summed up in the data pipeline. Once the pipeline is defined and operational, you have data traceability. With data traceability, you are then managing your data as you would any other important asset. Would a jeweler not keep track of every precious stone that came through his store, or a Lamborghini dealer not know where each and every car was located and keep records of its sale? Data must be treated as that kind of asset, because it is.