A network is a collection of things, like computers, that interact with one another. This can be a physical network with routers, switches, and hubs, or the World Wide Web, where documents are linked to one another. While the documents themselves might be interesting, a vast amount of information can be extracted by studying the interactions, or links, between those documents. The same holds true for social networks and people's relationships with one another, and for words, where one word is associated with another. According to Wolfram, there are two definitions for a graph. If you ever took an analytic geometry class, you plotted coordinates and came up with something that looked similar to Figure 1. They were usually more complicated than this, and might even have a third axis for three-dimensional graphs, but Figure 1 is not the type of graph we are interested in. The kind of graph we are talking about consists of nodes, or vertices (the terms are synonymous), connected via edges. So instead of the visual in Figure 1, the graphs being discussed here look more like what you see in Figure 2. As mentioned, these graphs can depict any number of networks, as well as any number of objects related to one another via relationships, and I will use graph databases in this project to automate the association between words and data.
Graph Databases:
As it turns out, there is a lot of information that can be extracted from graphs for various types of problems, and that information can be retrieved extremely fast. The reason is that graph data translates naturally into matrices, and matrix math is extremely fast.
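To make the matrix point concrete, here is a minimal sketch using a small hypothetical four-node graph (the nodes and edges are made up for illustration). Once a graph is stored as an adjacency matrix, matrix multiplication answers path questions directly: squaring the matrix counts the two-hop paths between every pair of nodes.

```python
# A hypothetical directed graph stored as an adjacency matrix:
# entry A[i][j] = 1 means node i has an edge to node j.
nodes = ["A", "B", "C", "D"]
A = [
    [0, 1, 1, 0],  # A -> B, A -> C
    [0, 0, 1, 0],  # B -> C
    [0, 0, 0, 1],  # C -> D
    [0, 0, 0, 0],  # D has no outgoing edges
]

def matmul(x, y):
    """Plain-Python matrix product, enough to square a small adjacency matrix."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# (A x A)[i][j] counts the distinct two-hop paths from node i to node j.
A2 = matmul(A, A)
print(A2[0][3])  # two-hop paths A -> D: only A -> C -> D, so 1
```

For an undirected graph the matrix is simply symmetric, and real graph engines apply the same idea with heavily optimized sparse-matrix routines rather than this toy loop.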
The main takeaways from Figure 2 for now are that graphs consist of nodes, or vertices, connected to one another via edges (an edge being a line joining two nodes). If the pair of nodes an edge joins is unordered, the edge is undirected; if the pair is ordered, the edge is directed, pointing from one node to the other. According to Wolfram, the terms “arc,” “branch,” “line,” “link,” and “1-simplex” are sometimes used instead of edge (e.g., Skiena 1990, p. 80; Harary 1994).
Graph-like databases have existed in various forms, going back to something similar to the network databases of the 1970s. The industry moved from there to the relational database, but because of the speed of graph algorithms and the exponential growth of data, managing data with graph structures is evolving into a viable alternative. In many cases graph databases are reported to be anywhere from 10 to 1,000 times faster than more traditional databases. Currently, the most popular and widely used graph database is neo4j, but for an alternative within a Hadoop ecosystem, Apache Giraph is a good choice.
As you might expect, designing a graph database is different from designing relational structures in third normal form, dimensional structures (which include star and snowflake), and cube data structures. However, you might be surprised to discover that in many ways they rely on similar concepts.
Brief Terminology:
To avoid any terminology debates, the following definitions apply to this discussion. If the definition, or usage of the chosen word materially modifies the intended meaning, please let me know.
- Entity: a structure that contains attributes or characteristics for a noun object, whether real or virtual. For example, PERSON, ACCOUNT, PRODUCT, APPOINTMENT, PRACTITIONER, and NETWORK. When the word entity is used, it will be in reference to the conceptual or logical structure, not the physical implementation.
- Attribute: a single characteristic of an entity. For example, FIRST NAME, LAST NAME, GENDER, and BIRTH DATE, are all attributes shared by a PERSON.
- Row/Tuple/Observation: will be used synonymously for the logical and physical implementation to mean a group of attributes within an entity that collectively define an instance of that entity.
- Table: the actual physical structure in a database that contains the data as defined in the logical entity. Once the entity has been converted to DDL, and built, then it is a table.
- Column: the physical implementation of an attribute.
- Field: the intersection of a column and a row in a database, or model.
Apache Giraph:
For this project neo4j is used, but for extremely large volumes of data, as in petabytes, Apache Giraph might be a better solution. Giraph is an iterative graph processing system built for high scalability. It originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in the paper titled “Pregel: A System for Large-Scale Graph Processing.” A search for this paper on Google will turn up a free copy without going to ACM and paying $15.00 for it.
There are several papers online that walk you through the process of setting up Giraph on a single node cluster for testing and development: one from Apache and the other from Cloudera. I won’t say much more on Giraph for this article. When I get more time I will set it up and demo it.
neo4j Graph Database:
neo4j is a NoSQL, ACID-compliant graph database that is currently the market leader, used by companies like Walmart and eBay. It is a very well documented database with a large user and support community. Beyond the database engine itself, the neo4j team has developed its own query language, called Cypher. I will validate their claim that it requires far less code to extract data than the equivalent SQL query once I get the OpenMRS data loaded. The reason neo4j is so much faster is that there are no joins: the relationships defined in the graph model are stored directly as edges between nodes, so a query traverses those edges instead of computing joins at query time. In addition to Cypher, there are many interfaces that allow the use of SPARQL, Spark, Python, and several other languages for manipulating data.
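As a taste of the brevity claim, here is a hedged sketch of what such a query might look like. The node labels, relationship names, and the `person_id` value are my assumptions based on the model discussed later in this article, not the actual loaded OpenMRS schema:

```cypher
// Illustrative only: find a patient's encounters and their locations.
// The traversal follows stored edges; no join is computed.
MATCH (p:PATIENT)-[:assignedTo]->(e:ENCOUNTER)-[:hasA]->(l:LOCATION)
WHERE p.person_id = 42
RETURN e, l

// A rough SQL equivalent walks the foreign keys with explicit joins:
//   SELECT e.*, l.*
//   FROM encounter e
//   JOIN patient p  ON p.patient_id  = e.patient_id
//   JOIN location l ON l.location_id = e.location_id
//   WHERE p.patient_id = 42;
```

The Cypher pattern reads left to right as the relationship itself, which is a large part of why graph queries tend to be shorter than their join-heavy SQL counterparts.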
Atomicity, Consistency, Isolation, and Durability (ACID) compliance is one of the claims that distinguishes neo4j from other NoSQL databases. Here is a link to an article on ACID vs. BASE (Basically Available, Soft state, Eventual consistency), which is broadly pro-BASE, if you are interested in learning or refreshing yourself on the concepts. In my opinion, for certain applications BASE databases are the way to go, but if you can overcome the performance issues of traditional RDBMSs and still maintain ACID, that is the best of both worlds.
I won’t cover installation of neo4j since it is a very well documented process. I tested it on OS X 10.12.5, Ubuntu Linux 16.04, and Windows 7 and had no problems with any of the operating systems.
The Graph Data Model:
As you might recall from the last article “Words Have Power: Part 2,” I used the open source healthcare application OpenMRS primarily because they provided a test data set. I will use this data for the baseline in the project. The data has already been loaded into a MySQL instance, and reverse engineered in Erwin 9.7. I will use this model to assist in building the graph data model.
The first step in modeling a graph database, just as in developing any data model, is to define your requirements. Even in Agile development processes, the whole needs to be broken down into subject areas (SAs), and the SAs developed over the sprints against the required “stories,” or requirements. For example, the model below depicts the ENCOUNTER subject area. One of the artifacts from this exercise is the conceptual data model. As defined above, this model provides the entities and relationships; for a graph data model, this is where the nodes and relationships are defined. From the reverse-engineered OpenMRS database, here is a small portion of the relational logical data model:
A few quick critiques of this model. What I would consider a major problem is the circular relationship between patient, encounter, encounter_provider, and provider: the model does not ensure that the provider is assigned to the same encounter the patient is assigned to. Less major, but still an issue, there are NOT NULL constraints on many attributes that don't make a lot of sense. For example, in the PERSON entity, GENDER is not required (it has no NOT NULL constraint), yet “dead,” which I assume is a Y/N indicator field, is required. In the same entity, deathdate_estimated is a required attribute (NOT NULL); if the person is not dead, do they put some future date in there to satisfy the constraint? Also, in the PERSON_NAME entity, the only non-key information that is required is the person's PREFERRED (name?). Even with this set to NOT NULL, the entry can be omitted entirely by simply never inserting a row into the table. There are several inconsistencies like this that could be a problem, but the exercise here is not to critique the OpenMRS data model. As stated before, the fact that it is not perfect makes it a better candidate for the exercise.
From this model, with the relationships having Parent-Child and Child-Parent Phrases defined in Erwin 9.7, we can derive the following information about the nodes and relationships:
PERSON -- hasA --> PERSON_ATTRIBUTE -- associatedW (associated with) --> PERSON
PERSON -- hasA --> PERSON_NAME -- identifiesA --> PERSON
PERSON -- hasA --> PERSON_ADDRESS -- associatedW --> PERSON
PERSON -- isA --> PROVIDER -- instanceOf --> PERSON
PROVIDER -- assignedTo --> ENCOUNTER_PROVIDER* -- assignedA --> PROVIDER
PERSON -- isA --> PATIENT -- instanceOf --> PERSON
PATIENT -- assignedTo --> ENCOUNTER -- hasA --> PATIENT
ENCOUNTER -- associatedW --> ENCOUNTER_PROVIDER -- assignedTo --> ENCOUNTER
LOCATION -- placeOf --> ENCOUNTER -- hasA --> LOCATION
PARENT_LOCATION -- contains --> LOCATION -- componentOf --> PARENT_LOCATION
With this information, there is enough to define the nodes and relationships; the only thing remaining is to associate the appropriate attributes with them. In Figure 4, the ER diagram above is depicted as a graph data model. At first glance it doesn't appear possible that these structures could contain everything the Erwin model does, but from an entity/attribute/relationship perspective, they do. I believe a tool like Erwin could even be used to create the same structures. They wouldn't look the same, but the ER model is more than capable of capturing the information in the graph model below.
From the model, you can see that a person is a patient, and a patient is an instance of a person. Also, a person has an address of type home and/or work, and the address is associated with a person. Each of the attributes, just like in relational models, is associated with a noun (an entity, or a node/vertex), and the relationships themselves can have attributes, such as home and work. The modeling is very straightforward, but I don't see a lot of difference in difficulty. Just as data modelers debate for hours over whether an entity should be a subtype, I am sure some debate is needed to settle on relationship names. Is “hasA” a good name for a relationship, or “assignedTo”?
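The home/work attribute on a relationship can be expressed directly in Cypher. This is only a sketch; the labels and property names are my assumptions matching the model above, not the final load script:

```cypher
// Sketch: a property ("type") lives on the relationship itself,
// not on either node. Labels and properties are illustrative.
CREATE (p:PERSON {person_id: 1, given_name: 'Jane'})
CREATE (a:PERSON_ADDRESS {city: 'Boston'})
CREATE (p)-[:hasA {type: 'home'}]->(a)
```

Putting the home/work distinction on the edge keeps PERSON_ADDRESS reusable and makes queries like "find everyone's home address" a simple pattern match on the relationship property.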
To convert the data from the OpenMRS database in MySQL and format it as depicted in the graph model, I would create a view of the data and export it as a csv file; neo4j has a utility that loads the data. I didn't have time to complete the process, but below is what the view looks like, and it certainly confirms the complexity of having to do a bunch of joins in a relational database:
CREATE OR REPLACE VIEW vPERSON (
    person_id, prefix, given_name, middle_name, family_name_prefix,
    family_name, family_name2, family_name_suffix, preferred, degree, gender, birthdate,
    birthdate_estimated, dead, death_date, cause_of_death, deathdate_estimated, birthtime,
    race, birthplace, citizenship, mother_name, civilStatus
) AS
SELECT p.person_id, pn.prefix, pn.given_name, pn.middle_name, pn.family_name_prefix,
       pn.family_name, pn.family_name2, pn.family_name_suffix, pn.preferred, pn.degree,
       p.gender, p.birthdate, p.birthdate_estimated, p.dead, p.death_date, p.cause_of_death,
       p.deathdate_estimated, p.birthtime, paRace.value, paBirthplace.value,
       paCitizenship.value, paMotherName.value, paCivilStatus.value
FROM openmrs.person p
LEFT JOIN openmrs.person_name pn
       ON p.person_id = pn.person_id
LEFT JOIN openmrs.person_attribute paRace
       ON p.person_id = paRace.person_id
      AND paRace.person_attribute_type_id = 1
LEFT JOIN openmrs.person_attribute paBirthplace
       ON p.person_id = paBirthplace.person_id
      AND paBirthplace.person_attribute_type_id = 2
LEFT JOIN openmrs.person_attribute paCitizenship
       ON p.person_id = paCitizenship.person_id
      AND paCitizenship.person_attribute_type_id = 3
LEFT JOIN openmrs.person_attribute paMotherName
       ON p.person_id = paMotherName.person_id
      AND paMotherName.person_attribute_type_id = 4
LEFT JOIN openmrs.person_attribute paCivilStatus
       ON p.person_id = paCivilStatus.person_id
      AND paCivilStatus.person_attribute_type_id = 5;
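Once the view is exported, the rows still have to land in the header layout neo4j's bulk import tool expects for node files: an `:ID` column and a `:LABEL` column alongside the ordinary properties. Here is a minimal sketch of that step, using made-up rows in place of the real view output:

```python
# Sketch: write PERSON rows into the CSV header format used by neo4j's
# bulk import tool (":ID" identifies the node, ":LABEL" names its label).
# The rows below are placeholders standing in for the vPERSON view output.
import csv
import io

rows = [
    (1, "John", "Doe", "M"),
    (2, "Jane", "Doe", "F"),
]

buf = io.StringIO()  # in practice this would be a real file passed to --nodes
writer = csv.writer(buf)
writer.writerow(["person_id:ID", "given_name", "family_name", "gender", ":LABEL"])
for person_id, given, family, gender in rows:
    writer.writerow([person_id, given, family, gender, "PERSON"])

print(buf.getvalue())
```

Relationship files follow the same idea with `:START_ID`, `:END_ID`, and `:TYPE` columns, which is exactly what the WordNet/ConceptNet load below relies on.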
Courtesy of Tom Dickinson, I did load WordNet 3.1 and ConceptNet 5 into a local neo4j instance. His site provides the Python scripts that merge WordNet with ConceptNet 5 and creates three csv files that are then loaded into a neo4j database. The load script for the csv files is rather straightforward:
./neo4j-admin import \
    --database={NEO4J_LOCATION}/data/databases/wordnet_conceptnet.db \
    --nodes={PATH_TO_DATASET_FOLDER}/synsets.csv \
    --relationships={PATH_TO_DATASET_FOLDER}/relationships.csv \
    --nodes={PATH_TO_DATASET_FOLDER}/words.csv
Once loaded, the neo4j.conf file must be modified to point the server at the new database, wordnet_conceptnet.db, and the server restarted. neo4j has a web interface that is really cool and (somewhat) intuitive. Here is the neo4j interface running in a Chrome browser with the WordNet database loaded:
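For reference, the change is a one-line setting. This sketch assumes the neo4j 3.x configuration key, which matches the `neo4j-admin` tooling used above; check your installed version's conf file for the exact name:

```
# neo4j.conf -- point the server at the newly imported store
# (setting name assumed from the neo4j 3.x era; verify against your version)
dbms.active_database=wordnet_conceptnet.db
```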
There is a lot of work yet to do on the project, but I believe the use of graph databases is certainly a logical choice for the technology stack.