Poor Data Management Practices
Most people from large organizations can relate to the picture of grain silos representing isolated data stores, and my guess is that even owners and employees of small-to-medium-sized companies can see their companies evolving to have these silos of data. When you see this, hopefully it makes you ask yourself, “are we being left behind due to poor data management practices?” With all the promise that comes with the latest technologies, you probably also ask yourself, “is my company capable of achieving the level of sophistication necessary to capitalize on these new technologies?” These are serious questions, and their answers could determine how long your company will be around.
We have all heard of the recent advances in Artificial Intelligence (AI) systems, big data, cloud computing, and data science. The promise of these technologies is being realized today by companies like Amazon, Google, IBM, John Deere, and many smaller companies as well. AI systems are being deployed as chatbots, manufacturing jobs have been disappearing for years, and big data is used by some companies for everything from daily chores like Extract, Transform, Load (ETL), storage, and data analysis to more sophisticated applications such as collaborative filtering (recommendation systems), predictive analytics, and sentiment analysis.
We have also heard that there are not enough qualified people in the job market today, leaving pundits to wonder how we, as a society, will address this problem. However, what you don’t hear as often is how many companies will remain competitive. These technologies are called disruptive technologies for a reason.
Granted, companies go out of business for many different reasons, but not having a good understanding of your customers, risk, exposure, and opportunities must be at the top of the list. Making good decisions based on bad information is just as catastrophic as making bad decisions based on good information. What are the consequences of doing nothing about poor data management? If your company has a 2020 or 2025 plan in which the data stores metaphorically resemble a row of grain silos rather than a cloud of networked data, chances are your company won’t be around to worry about the lack of qualified candidates in the job market.
So how does your company achieve a level of sophistication that will make it competitive? The answer lies partly in returning to fundamentals. When you hear a football coach talk about rebuilding a team, you will likely hear them refer to the fundamentals of the game: tackling, blocking, and so on. When NASA gets ready to design a new spacecraft, it is still restricted by the laws of physics; Newton’s laws still apply. In the same way, basic best practices in data management still apply.
Data Management Practices:
Anyone who has worked with me probably knows what I am about to write: data integrity is the most important and fundamental concept in properly managing data. And, yes, this includes applying the rules of relational data established by EF Codd. However, this does not mean that everything needs to be a 3rd normal form Relational Database Management System (RDBMS). There are ways of managing the integrity of data without an RDBMS, but even Hadley Wickham, Chief Scientist at RStudio, as well as many other respected data architects, data analysts, and statisticians, will tell you that data must be clean and properly consolidated for an analysis to be accurate. Dr. Wickham calls this “tidy” data, and in his article on tidy data he references none other than EF Codd. The algebraic rules and fundamental principles that Codd applied to relational data must be enforced, whether during schema-on-read or schema-on-write. If we are not enforcing these principles in an RDBMS, then the application developer must enforce them during schema-on-read processes.
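To make the schema-on-read point concrete, here is a minimal sketch of enforcing basic integrity rules when data is read rather than when it is written. It assumes pandas and two hypothetical CSV extracts; the checks mirror what an RDBMS would enforce with primary and foreign key constraints.

```python
# A sketch of enforcing relational integrity during schema-on-read.
# File paths and column names are hypothetical.
import pandas as pd

def load_orders(orders_path: str, customers_path: str) -> pd.DataFrame:
    orders = pd.read_csv(orders_path)
    customers = pd.read_csv(customers_path)

    # Entity integrity: the key must be present and unique (no duplicate orders).
    if orders["order_id"].isna().any() or orders["order_id"].duplicated().any():
        raise ValueError("order_id violates entity integrity (null or duplicate keys)")

    # Referential integrity: every order must reference an existing customer.
    unknown = ~orders["customer_id"].isin(customers["customer_id"])
    if unknown.any():
        raise ValueError(f"{unknown.sum()} orders reference unknown customers")

    # Domain integrity: enforce expected types before any analysis.
    orders["order_date"] = pd.to_datetime(orders["order_date"], errors="raise")
    return orders
```

The point is not the particular library; it is that someone, somewhere, must apply Codd's rules before the data is trusted for analysis.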
Data quality frequently takes a back seat to performance, yet there are alternatives that are both responsive and preserve the veracity of the data. The problem of data silos has existed for years, and the technical response to it was the advent of data warehousing. The idea was that data would be moved from these disparate data sources and consolidated in a single reporting repository called a data warehouse. For the time, this was probably the most technically sound approach available. From the era of data warehouses, Ralph Kimball and Bill Inmon emerged as the leaders of data warehouse management, and dimensional data modeling was born. However, what they both provided was an architectural framework for managing data. Dimensional modeling is a set of techniques and concepts; it is not based on scientific principles, laws of physics, or mathematics, as EF Codd’s relational algebra was for relational models.
NOTE: Bill Inmon’s Corporate Information Factory (CIF), if implemented faithfully, is sustainable and a potential solution to the data silos problem. However, I have worked in numerous data warehouse environments and have yet to see a CIF implemented properly. I won’t go into the details, but at the heart of a CIF is a 3rd Normal Form data model. If you are interested in reading more about it, I will refer you to Bill Inmon’s web site.
Data warehousing is not the cause of the data silos problem; it was an interim solution whose time has come and gone. In my opinion, big data developers are likely to experience the same problems as data warehousing without a course correction. I will intentionally repeat myself: data integrity, or the veracity of data, is the single most important consideration when dealing with data. Whether you are preparing an analysis of projected growth, a sales report for management, or designing and implementing an analytics platform or an Enterprise Information Architecture (EIA), data quality is the reigning factor.
Technical Debt Never Gets Paid Back
One of the problems encountered with the approach in figure 2, and data warehouses in general, is that each LOB database almost always has its own unique structures. Merging data from two or more disparate databases requires integration. This is where the “T” in ETL comes from: the transformation process. Additionally, the data modeler must now create one model that will accommodate the data from two or more databases. Once the model is complete, the data modeler and the ETL developers work together to define a source-to-target mapping. This mapping should contain the transformation rules. This sounds like it would be a relatively simple task, but it’s not. The success of this process depends entirely on the skills and knowledge of the data architect and the ETL developers.
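To illustrate what a source-to-target mapping involves, here is a toy sketch in which the mapping itself is data, so the transformation rules are explicit rather than buried inside ETL code. The system names, columns, and rules are hypothetical.

```python
# A toy source-to-target mapping: target column -> (source column, transformation rule).
import pandas as pd

MAPPING_CRM = {
    "customer_id":  ("cust_no",    lambda s: s.astype(str).str.zfill(10)),
    "full_name":    ("cust_name",  lambda s: s.str.strip().str.title()),
    "country_code": ("country",    lambda s: s.str.upper().str[:2]),
}

MAPPING_BILLING = {
    "customer_id":  ("account_id", lambda s: s.astype(str).str.zfill(10)),
    "full_name":    ("name",       lambda s: s.str.strip().str.title()),
    "country_code": ("iso_ctry",   lambda s: s.str.upper()),
}

def transform(source: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    # Apply each rule to its source column and conform the result to the target shape.
    return pd.DataFrame({target: rule(source[col])
                         for target, (col, rule) in mapping.items()})

# The "T" in ETL: two line-of-business extracts conformed to one target structure.
# customers = pd.concat([transform(crm_df, MAPPING_CRM),
#                        transform(billing_df, MAPPING_BILLING)]).drop_duplicates("customer_id")
```

Even in this trivial form, notice how much institutional knowledge the rules encode; that knowledge is exactly what walks out the door when architects and developers move on.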
The complexity is partially due to how databases are developed and evolve over time. Data architects and ETL developers come and go, undocumented changes are made to databases, one database is more normalized than the other, and standards, if they exist, are not followed. As a database ages, columns are frequently added to accommodate evolving business requirements. However, because the consequences of removing a column are not well understood, or because the project schedule does not allow for adequate regression testing, legacy columns are left in the database. The typical Agile project calls this incurring technical debt: the team accepts that the column will remain and schedules its removal for some future sprint that never comes, thereby incurring technical debt that never gets paid back.
Are Data Warehouses Obsolete?
Given the technologies available today, and the systemic problems that seem to accompany most data warehouse projects, my answer would be yes, data warehouses are obsolete. If obsolete sounds too severe, let’s go with “an unnecessary corporate burden.” At the very least, Kimball-based data warehouses have seen their day, or at least I hope so.
So, how do you fix the problem of data silos? Most companies can’t afford to start over and build from scratch. They are stuck with making legacy systems work in sync with modern systems that are more flexible, extensible, and scalable. How can this be done? There are many viable alternatives, and one possible implementation uses a data lake where data is ingested, staged, augmented, and transformed as part of the solution. However, this does not really fix the problem; it simply moves the data warehouse to a more powerful platform. It does have the added advantages of augmenting with external data sources and taking advantage of schema-on-read applications. Also, unless the data volume and velocity are sufficient to warrant a distributed computing platform, there should probably still be a reporting or analytics platform for data consumption. That said, when specifically addressing the silos problem, there are a couple of viable alternatives.
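For readers who have not worked with a data lake, here is a minimal sketch of the ingest-then-stage flow just described (the augment/curate step is omitted). The zone paths, the JSON Lines format, and the column names are assumptions; a real lake would sit on object storage with a distributed engine, but the layering is the point.

```python
# Ingest -> stage: raw data is landed untouched, and structure plus integrity
# rules are applied on read when the data is staged.
import json
import pathlib
import pandas as pd

RAW = pathlib.Path("lake/raw")        # data landed exactly as received
STAGED = pathlib.Path("lake/staged")  # cleaned and typed, schema applied on read

def stage(raw_file: pathlib.Path) -> pd.DataFrame:
    # Schema-on-read: parse the raw JSON Lines file and impose structure here.
    records = [json.loads(line) for line in raw_file.read_text().splitlines()]
    df = pd.json_normalize(records)
    df["ingested_from"] = raw_file.name        # keep basic lineage
    df = df.dropna(subset=["customer_id"])     # enforce integrity on read
    STAGED.mkdir(parents=True, exist_ok=True)
    df.to_parquet(STAGED / f"{raw_file.stem}.parquet")  # requires pyarrow
    return df

# stage(RAW / "crm_customers_2019-01-01.jsonl")
```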
NOTE: The remainder is not intended to be a completed framework for a replacement architecture, but rather to provide some ideas on how solving this problem might be approached.
SOA and Microservices:
Service-Oriented Architecture (SOA) was developed in response to the legacy system problem. I personally believe SOAs are a good choice given sufficient time and money. The complexity increases with the addition of an Enterprise Service Bus (ESB), and development of a canonical model can take a lot of time and resources. Microservices, on the other hand, make up for some of the pitfalls associated with SOAs. This is especially true if the microservices are designed before they are built, data lineage is maintained, and data ownership by services is assigned and enforced. Without proper design and governance, microservices will introduce just as many problems as a SOA. The primary consideration in properly implementing a microservices architecture is data ownership. Two services should not be capable of providing the same data element(s), as implied by the arrow going from the Employee service to the Customer data in figure 4. Services can request data from another service, but cannot request data from the database unless they own that data.
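A small sketch of the data-ownership rule: the Employee service never reads the customer tables directly; it asks the Customer service, which alone owns that data. The service URL and payload fields here are hypothetical.

```python
# The Employee service requesting customer data from its owner.
import requests

CUSTOMER_SERVICE_URL = "http://customer-service.internal/api/customers"

def get_employee_accounts(employee_id: str) -> list[dict]:
    # Allowed: a service-to-service request to the owner of the customer data.
    resp = requests.get(CUSTOMER_SERVICE_URL,
                        params={"account_manager": employee_id},
                        timeout=5)
    resp.raise_for_status()
    return resp.json()

# Not allowed: SELECT ... FROM customer_db.customers WHERE account_manager = ?
# Running that query from the Employee service would make it a second,
# competing source for customer data.
```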
Ultimately, what is really being discussed here is a Master Data Management (MDM) system without the vendor applications typically associated with MDM. Likewise, there could be a services layer between the neo4j database and the silos, which is not depicted in figure 4. Adding an RDBMS is an alternative, and the feeds from the legacy systems would eventually be retired (which implies an application). These services would get data from each silo and put it in the neo4j database, or whatever platform was chosen. This would be a one-time cost and would eliminate the ETL jobs.
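As a rough illustration of one of those services, here is a minimal sketch that pushes a silo's customer record into neo4j as a master node, using the official Python driver. The node label, properties, and matching key are assumptions; real master-data matching rules would be richer.

```python
# A silo-facing service upserting master customer records into neo4j.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

UPSERT_CUSTOMER = """
MERGE (c:Customer {customer_id: $customer_id})
SET   c.full_name = $full_name,
      c.source_system = $source_system
"""

def upsert_customer(record: dict) -> None:
    # MERGE makes the load idempotent: each silo can push the same customer
    # repeatedly and the graph still holds a single master node.
    with driver.session() as session:
        session.run(UPSERT_CUSTOMER, **record)

# upsert_customer({"customer_id": "0000012345",
#                  "full_name": "Jane Doe",
#                  "source_system": "CRM"})
```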
As stated above, the framework described is not intended to be a complete solution, but rather to provide an alternative to traditional silos and data warehouses. Whether it is this framework or some other, it is clear that we must start looking at data differently than before. To truly capitalize on modern capabilities, we must have good data quality, and maintaining data in disparate silos will not get us there. This discussion ties into the earlier discussions on ontologies, as does neo4j, and I will let you connect the dots here.
I realize that I just dropped neo4j out there and didn’t really talk about it. neo4j is the leading graph database and is an alternative consideration for MDM solutions. In my opinion, packaged MDM solutions are a waste of money. Technologies exist today that provide a very solid framework for an in-house MDM solution, and most corporations already own them. I will come back and discuss neo4j later, but it has a very good website with tutorials for those who are interested.