Ask any data scientist how they spend most of their time and they will tell you “understanding the data, then cleaning and organizing that data into a usable format,” or just plain “data wrangling.” The bottom line is that most of the data wrangling problem is caused by a lack of metadata management, or more broadly, a lack of proper data management.
At a minimum, there must be enough metadata to enable search capabilities. It follows that proper data management practices can all but eliminate data wrangling for data scientists, but is it possible to automate the data wrangling process itself? I believe the answer is yes, but it is not a simple solution and requires the application of numerous concepts, processes, and technologies. Imagine how much more efficient organizations would be if the path from raw data to analysis were essentially automated.
The Problem:
The massive proliferation of data has strained legacy systems to capacity, and the problem is only getting worse. These systems were originally designed to solve very specific problems with little thought for scalability or extensibility. As a result, organizations have ended up with data silos that frequently contain redundant and/or inconsistent data.
To extract meaningful insights from this data, many organizations are turning to cloud-based big data solutions, which requires moving their data to a clustered data store such as Hadoop. This can be accomplished in one of two ways: 1) integrate the data through multiple ETL jobs and design a model for the integrated data (typically called something like a centralized or common repository), or 2) load the data directly from the source without integration or transformation and perform what is called “schema on read” analysis. Unfortunately, both approaches are problematic and frequently lead to extended time to value, or to flawed or misleading analysis. Data integration is time-consuming and expensive, while moving the data without integration or data profiling puts the burden on the data scientist: data wrangling.
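As a rough illustration of the difference between the two paths, the sketch below uses PySpark with hypothetical file paths and column names: the first branch conforms the data to a schema designed up front (the ETL/common-repository path), while the second simply lands the raw files and defers interpretation to query time.

```python
# A minimal sketch of the two ingestion paths, assuming PySpark;
# the paths and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Path 1: schema on write -- the schema is designed up front and the data
# is conformed to it as it is loaded into the common repository.
claims_schema = StructType([
    StructField("member_id", StringType(), nullable=False),
    StructField("diagnosis_code", StringType(), nullable=True),
    StructField("claim_amount", DoubleType(), nullable=True),
])
claims = spark.read.csv("hdfs:///raw/claims/*.csv", header=True, schema=claims_schema)
claims.write.mode("overwrite").parquet("hdfs:///warehouse/claims/")

# Path 2: load as-is -- every column arrives untyped and the meaning of
# each field is deferred to whoever queries it (schema on read).
raw = spark.read.csv("hdfs:///raw/claims/*.csv", header=True)
raw.write.mode("overwrite").parquet("hdfs:///lake/claims/")
```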
Schema on Read:
Schema on read is a valid approach, and in some use cases it will shorten time to value. However, its usefulness is often exaggerated. Many believe that schema on read means there is no need for data profiling, integration, or modeling, when in reality it still requires at least some metadata and places the majority of the work on the data scientist. Additionally, not every data scientist will interpret the data the same way, so it also introduces the possibility of inconsistent results.
For those interested, there is a post on AWS Big Data Blog called “Build a Schema-On-Read Analytics Pipeline Using Amazon Athena” that does a nice job of defining a use case where schema on read is appropriate.
However, if you read through the article carefully, you will notice the proposed solution is demonstrated using data from the Centers for Disease Control and Prevention (CDC) called the Behavioral Risk Factor Surveillance System (BRFSS). The CDC website provides a great deal of information on this data set, including a data dictionary with specific instructions on how to use the data.
You can read the article for more information, but my point is that even schema on read depends on someone providing at least some technical and content metadata, whether embedded or not. If the business analysts, data architects, and data engineers don’t do this prep work, then it falls on the shoulders of the data scientist.
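To make that concrete, here is a minimal sketch of schema on read with pandas. The file name, column names, types, and value labels are illustrative stand-ins for what a real codebook such as the BRFSS data dictionary would supply; the point is that someone still has to capture this metadata before the numbers mean anything.

```python
# A minimal sketch of schema on read driven by a hand-built data dictionary,
# assuming pandas; the file and fields below are illustrative, not the
# actual BRFSS layout.
import pandas as pd

# Metadata someone still has to supply -- here, distilled from a codebook.
data_dictionary = {
    "_STATE": "FIPS code of the respondent's state",
    "GENHLTH": "Self-reported general health (1=Excellent ... 5=Poor)",
}
dtypes = {"_STATE": "Int64", "GENHLTH": "Int64"}

df = pd.read_csv("brfss_extract.csv", usecols=list(data_dictionary), dtype=dtypes)

# Without the codebook, GENHLTH = 3 is just a number; with it, it is "Good".
genhlth_labels = {1: "Excellent", 2: "Very good", 3: "Good", 4: "Fair", 5: "Poor"}
df["GENHLTH_LABEL"] = df["GENHLTH"].map(genhlth_labels)
print(df.head())
```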
The Solution:
Regardless of who does the work, integration and proper data management are essential requirements for developing a reliable decision support system. Data volumes are growing, and requirements are changing, too fast for traditional data management processes to keep pace; the answer lies in automation through the use of industry-specific ontologies. As stated above, this is not a simple solution and requires the application of numerous concepts, processes, and technologies.
In my opinion, this is where the next big leap in analytics will come: the ability to immediately identify the data you have and extract information in near real time. Without metadata, the data in the figure below is unknown, even though it is obviously organized in a table-like structure with rows and columns. What if there were a way to open a file like the one represented in Figure 1 and immediately report the percentage of Medicaid claims by ICD-10 diagnosis and procedure code within each region, or immediately predict fraud, epidemics, or an increase in mortality rates?
Figure 1: Undocumented, unstructured data – just one of 100s of similar files.
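As a rough sketch of a first step in that direction, the snippet below profiles an undocumented, delimited file and flags columns whose values look like ICD-10 codes. The file name is hypothetical and the pattern is deliberately simplified; a production system would combine many such detectors with ontology lookups.

```python
# A minimal sketch, assuming pandas, of profiling an undocumented,
# pipe-delimited file and flagging columns that resemble ICD-10 codes;
# the file name is hypothetical and the pattern is simplified.
import pandas as pd

# Simplified ICD-10 shape: a letter, two digits, optional dotted extension.
ICD10_PATTERN = r"[A-Z][0-9]{2}(\.[0-9A-Z]{1,4})?"

df = pd.read_csv("unknown_file_042.dat", sep="|", header=None, dtype=str)

for col in df.columns:
    values = df[col].dropna().str.strip()
    if values.empty:
        continue
    match_rate = values.str.fullmatch(ICD10_PATTERN).mean()
    if match_rate > 0.8:
        print(f"Column {col}: ~{match_rate:.0%} of values look like ICD-10 codes")
```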
The Role of Ontologies:
What if the ontology approach were extended to not only provide the vocabulary, definitions, synonyms, and relationships, but also to provide a way of identifying domain-specific entities and attributes (i.e., the data)? As discussed in the post “Anscombe’s Quartet”, descriptive statistics on the data alone would not be sufficient for identifying it. There are sophisticated processes available now that use schema and record matching algorithms to integrate multiple disparate data sources, but these assume that the data being integrated has some structure. What we are talking about here is finding a table of data in a data lake, or going through a hundred dimensional tables, each with 50-100 columns, analyzing them against an ontology, and automagically determining similar information, as depicted in Figure 2 below.
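A minimal sketch of that idea, using nothing more than string similarity between raw column names and an ontology’s labels and synonyms, might look like the following. The vocabulary and column names are hypothetical, and real schema-matching systems would also weigh data types, value distributions, and constraints rather than names alone.

```python
# A minimal sketch of matching raw column names against an ontology's
# labels and synonyms; the vocabulary and column names are hypothetical.
from difflib import SequenceMatcher

ontology_vocab = {
    "DiagnosisCode": ["icd10", "icd_10_code", "dx_code", "diagnosis"],
    "ProcedureCode": ["icd10_pcs", "proc_code", "procedure"],
    "MemberRegion":  ["region", "service_area", "geography"],
}

raw_columns = ["DX_CD", "PROC_CODE", "SVC_REGION", "COL_17"]

def best_match(column, vocab):
    """Return (concept, score) for the vocabulary term most similar to column."""
    column = column.lower().replace("_", " ")
    best = ("unknown", 0.0)
    for concept, synonyms in vocab.items():
        for term in [concept] + synonyms:
            score = SequenceMatcher(None, column, term.lower().replace("_", " ")).ratio()
            if score > best[1]:
                best = (concept, score)
    return best

for col in raw_columns:
    concept, score = best_match(col, ontology_vocab)
    print(f"{col:12s} -> {concept:15s} (similarity {score:.2f})")
```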
Conclusion:
In the next post I will discuss some of the concepts, processes, and technologies needed to automate data wrangling. Building a good ontology requires a strong command of a domain-specific vocabulary, and to assist with this I will use WordNet, the lexical database of the English language from Princeton University [1]. I will also parse data from both WordNet and domain-specific ontologies using a couple of Python packages (NLTK and rdflib) [2], and will demonstrate how to examine ontologies using either the TopBraid or Protégé editor.
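As a small preview of that tooling, the snippet below looks up WordNet synonyms with NLTK and lists the classes in a domain ontology with rdflib. It assumes the WordNet corpus has been downloaded (nltk.download('wordnet')) and that a local Turtle file exists; the ontology file name is hypothetical.

```python
# A minimal sketch of the WordNet and rdflib lookups mentioned above;
# the ontology file name is hypothetical.
from nltk.corpus import wordnet as wn
from rdflib import Graph

# WordNet: look up senses and synonyms for a domain word.
for synset in wn.synsets("claim"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", ", ".join(synset.lemma_names()))

# rdflib: load a domain ontology and list its classes.
g = Graph()
g.parse("healthcare_claims.ttl", format="turtle")
query = """
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cls ?label WHERE {
        ?cls a owl:Class .
        OPTIONAL { ?cls rdfs:label ?label }
    }
"""
for cls, label in g.query(query):
    print(cls, label)
```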
The requirement to manage data and integrate disparate data sources does not go away as cloud and big data technologies evolve; if anything, data management becomes more complex. To solve this problem, the processes of data modeling and integration must be automated. With this automation, data scientists can focus on analysis instead of data wrangling.
References:
[1] Princeton University. “About WordNet.” WordNet. Princeton University, 2010. http://wordnet.princeton.edu
[2] Bird, S., Klein, E., and Loper, E. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/