As discussed in the original post, we are trying to answer the question, "Can we automate the data wrangling process, and more specifically, can data and words be associated dynamically through the use of ontologies?" On the surface, this appears to be an extremely complicated problem, and it is. However, NASA has tackled the same problem [1], and while I'm not a rocket scientist, I doubt that everything NASA does requires one; we will see.
The challenge I had in completing this post originally was the lack of data needed to develop training and testing data sets for the predictive algorithms. However, I have since competed in a contest where data was made available, and my team won. The exercise had a very limited scope, but we did successfully prove that the concept is valid. The synset, or dependent variable, must have representative independent variables in adequate numbers to provide statistically significant results, and with sufficient data we were able to accomplish this.
The Tools Needed:
The tools used are WordNet, a lexical database of the English language from Princeton University; erwin, to visualize the structures; Python 3.x, along with two libraries, NLTK and rdflib; and, as ontology editors, TopBraid from TopQuadrant and Protégé. I use Protégé for smaller ontologies, but the WordNet RDF is too large for it; it is probably my lack of knowledge, but I couldn't get the WordNet RDF to load. TopBraid also supports connections to numerous databases, and I used the Jena TDB database. Depending on your machine, you may have to extend the memory allotted to TopBraid's Eclipse platform.
Some Quick Terminology:
The following terms were taken directly from the website Vocabulary.com:
- synset: a set of one or more synonyms
- holonym: a word that names the whole of which a given word is a part
- hyponym: a word that is more specific than a given word
- hypernym: a word that is more generic than a given word
- hypernymy: the semantic relation of being superordinate or belonging to a higher rank or class
- lemma: the heading that indicates the subject of an annotation or a literary composition or a dictionary entry
- meronym: a word that names a part of a larger whole
- pertainym: meaning relating to or pertaining to (relational adjective: as in criminal and crime)
- synonymy: the semantic relation that holds between two words that can (in a given context) express the same meaning
- troponym: a word that denotes an increasingly specific manner of doing something, like communicate-talk-whisper.
WordNet:
WordNet contains nouns, verbs, adjectives, and adverbs, each grouped into sets of synonyms (Princeton uses the term "cognitive synonyms" and calls the sets synsets). These synsets are linked to other conceptually related synsets, and this is the real power of WordNet.
At first glance, you might think WordNet is just a glorified thesaurus. However, WordNet interlinks specific senses of words, which lets us tell apart words that look alike but carry different meanings. For example, "motorcar" is linked downward to specific types such as "landrover" and "patrol car", and upward to "motor vehicle" and ultimately "wheeled vehicle". WordNet calls this being semantically disambiguated. While this alone is quite an accomplishment, in my opinion the most powerful aspect of WordNet is its definition of semantic relationships between words. If you recall, the Resource Description Framework (RDF) consists of triples: Subject, Predicate, and Object. In a previous post, "Machine Readable Ontologies," the following was extracted from a wines ontology:
s = http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Port
p = http://www.w3.org/2000/01/rdf-schema#subClassOf
o = http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#RedWine
The above reads: Port is a subclass of Red Wine. In WordNet, "port" has several senses (5 nouns, 8 verbs, and 1 adverb); the wine sense is the second noun listed, and it is linked to Port Wine, a type of sweet dark-red dessert wine.
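As a quick illustration of how such a triple reads, the full URIs above can be reduced to their local names (the fragment after "#"). The helper below is my own, not part of any RDF library:

```python
# Reduce full-URI triple parts to readable local names by taking the
# fragment after '#'. The helper name is my own illustration.

def local_name(uri):
    return uri.rsplit('#', 1)[-1]

s = "http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Port"
p = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
o = "http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#RedWine"

print(local_name(s), local_name(p), local_name(o))
# Port subClassOf RedWine
```

Libraries such as rdflib provide richer namespace handling, but this is all it takes to see the statement hiding inside the URIs.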
Conceptual Relationships:
In WordNet there are 117,000 synsets linked to other synsets by "conceptual relations." Most relations between synsets are hierarchical, super-subordinate relations. For example, Port Wine (a hyponym, more specific) is a (ISA) Red Wine (a hypernym, more generic). If we were looking at the healthcare industry, the synsets {health_professional, primary_care_provider, PCP, health_care_provider, caregiver} are hyponyms of professional. The hyponyms of {health_professional, health_care_provider} include {medical_practitioner, . . . }, and a medical practitioner's hyponyms are {Dr., MD, dental_practitioner, dentist, doc, doctor, inoculator, medic, medical_officer, medico, physician, tooth_doctor, vaccinator}. Even though WordNet is not an ontology, it provides much of the vocabulary and many of the relationships we would need in building our ontology. As you would expect, in WordNet all nouns ultimately roll up to the root node {entity}:
['entity.n.01', 'physical_entity.n.01', 'causal_agent.n.01', 'person.n.01', 'adult.n.01', 'professional.n.01', 'health_professional.n.01', 'medical_practitioner.n.01']
As the WordNet documentation states, the "hyponymy relation is transitive": if professional is a kind of adult, and adult is a kind of person, then professional is a kind of person. In WordNet, as in ontologies, there is a distinction between classes, or types, and instances of those classes. Thus, a doctor is a type of medical_practitioner, and Dr. Marcus Welby (fictitious in this case) is an instance of a doctor. Instances are always leaf nodes in the hierarchy. Basically, this is the data that we need to associate with the word.
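The transitivity described above is easy to sketch: treat the hypernym path as a chain of ISA links and walk upward. The dictionary below is hand-copied from the path shown above, not queried live from WordNet:

```python
# Hyponymy is transitive: walk a toy ISA hierarchy (hand-copied from the
# hypernym path above) to confirm that professional "is a" person.

ISA = {
    "medical_practitioner": "health_professional",
    "health_professional": "professional",
    "professional": "adult",
    "adult": "person",
    "person": "causal_agent",
    "causal_agent": "physical_entity",
    "physical_entity": "entity",
}

def is_a(word, ancestor):
    """Follow ISA links upward; True if ancestor lies on the path to the root."""
    while word in ISA:
        word = ISA[word]
        if word == ancestor:
            return True
    return False

print(is_a("professional", "person"))          # True, via adult
print(is_a("medical_practitioner", "entity"))  # True, all nouns reach entity
```

WordNet's `hypernym_paths()` returns exactly this kind of chain, which is why transitive "kind of" queries fall out of it so naturally.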
There are numerous types of relationships in WordNet, and these can be very helpful in building the ontology, as well as in validating assumptions. The two most commonly used relations are hypernymy and hyponymy, which navigate the hierarchy through "ISA" relationships between one synset and another. Obviously, not all words fall into these categories. For example, we know a hospital and a ward are related, but we cannot say a hospital is a ward, or that a hospital is a type of ward, or vice versa. We need other relationship types, and these are meronyms (components of) and holonyms (contained in). If you run the Python script below and enter the word "hospital", it will report that the component meronyms of a hospital are coronary_care_unit and intensive_care_unit. Obviously, hospitals contain more than this, but it provides examples, and it is one reason why a domain-specific ontology is necessary: WordNet is not an ontology, and it could not exhaustively contain all words and their relations for every domain. If you want to play with what we have discussed so far, the Python script below will ask you to enter a word and return its synonyms, hyponyms, hypernyms, and meronyms/holonyms. Obviously, not every word will have values for each.
```python
from nltk.corpus import wordnet as wn

def main():
    synsetWord = input('Provide a word to receive its list of synonyms:')
    synsetsAll = wn.synsets(synsetWord)
    for i in range(len(synsetsAll)):
        print('Lemma names: ', synsetsAll[i].lemma_names())
        print('Definition:', synsetsAll[i].definition())
        if synsetsAll[i].examples() != []:
            print('Examples:', synsetsAll[i].examples())
        print('Lemmas:', synsetsAll[i].lemmas())
        # Hyponyms:
        types_of = synsetsAll[i].hyponyms()
        print('Types of/Hyponyms: ',
              sorted(lemma.name() for synset in types_of for lemma in synset.lemmas()))
        # Hypernyms:
        if synsetsAll[i].hypernyms() != []:
            print('Hypernyms:', synsetsAll[i].hypernyms())
            word_paths = synsetsAll[i].hypernym_paths()
            print('Path to root:', [synset.name() for synset in word_paths[0]])
        # Meronyms
        if synsetsAll[i].part_meronyms() != []:
            print('Components Meronyms:', synsetsAll[i].part_meronyms())
        if synsetsAll[i].substance_meronyms() != []:
            print('Substance Meronyms:', synsetsAll[i].substance_meronyms())
        if synsetsAll[i].member_meronyms() != []:
            print('Collections of Meronyms:', synsetsAll[i].member_meronyms())
        # Holonyms
        if synsetsAll[i].part_holonyms() != []:
            print('Components Holonyms:', synsetsAll[i].part_holonyms())
        if synsetsAll[i].substance_holonyms() != []:
            print('Substance Holonyms:', synsetsAll[i].substance_holonyms())
        if synsetsAll[i].member_holonyms() != []:
            print('Collections of Holonyms:', synsetsAll[i].member_holonyms())

if __name__ == '__main__':
    main()
```
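The script above lists every sense of a word, but choosing among those senses, the "semantic disambiguation" mentioned earlier, is its own problem. Here is a toy sketch of definition-overlap disambiguation, in the spirit of the classic Lesk algorithm; the sense names and glosses are hand-written stand-ins, not live WordNet lookups:

```python
# Toy word-sense disambiguation by definition overlap (Lesk-style).
# The senses below are hand-written stand-ins, not live WordNet lookups.

def best_sense(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    scored = []
    for name, gloss in senses:
        overlap = len(context & set(gloss.lower().split()))
        scored.append((overlap, name))
    return max(scored)[1]

# Two hypothetical senses of "port":
senses = [
    ("port.harbor", "a place on a waterway where ships load and unload"),
    ("port.wine", "sweet dark red dessert wine originally from portugal"),
]

print(best_sense(["red", "wine", "dessert"], senses))     # port.wine
print(best_sense(["ships", "load", "waterway"], senses))  # port.harbor
```

Real disambiguators weight the overlap and consult the related synsets as well, but the intuition, matching context words against each sense's gloss, is the same.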
So far, all we have discussed are nouns, which is all we will cover for now. If you want to learn more about the other parts of speech in WordNet, and their relationships, their website is extremely thorough. However, if I don’t stop here, I will either have to write a book, or drop the last two topics, which is the ontology itself along with the editors, and the data.
Importing WordNet into TopBraid:
TopBraid has a free, open-source version of their program, as does Protégé, and both can be downloaded from their websites. I had a hard time loading WordNet into Protégé, so I used TopBraid. If you would like to reproduce the process, you can download WordNet in N-Triples format from the Princeton website. I converted the N-Triples format to RDF/XML format (smaller and more compact) using the Python 'rdflib' package:
```python
import sys
from rdflib import Graph

def graphNT(fileIn, fileOut):
    g = Graph()
    g.parse(fileIn, format='nt')
    g.serialize(destination=fileOut, format='xml')

def main():
    fileIn = sys.argv[1]
    fileOut = sys.argv[2]
    graphNT(fileIn, fileOut)

if __name__ == '__main__':
    main()
```
It would not be possible to load WordNet into either ontology editor without a database, or at least I was not able to do so. As I mentioned earlier, I used TopBraid’s Jena TDB database to store WordNet and was able to store it rather easily.
TopBraid is a very powerful tool, and allows you to automatically generate SPARQL queries from your graphs, and run them against your data. Protégé supports SPARQL queries as well.
Healthcare Domain Ontology:
For the domain-specific ontology we will go with healthcare. There aren't many of these available, but I was able to find one on the Protégé Ontology Library page called the Universal Electronic Healthcare Record, shown loaded into Protégé in the figure below:
So now we have introduced WordNet and some of its capabilities, loaded WordNet into TopBraid, and loaded the Electronic Healthcare Record ontology into Protégé. The question now is how this all ties together: can we use the ontology to translate data into words, which in turn will allow us to utilize the relations to define data structures?
Healthcare Data:
There is a lot of interest in healthcare ontologies, so I searched for a representative data set from an Electronic Medical Records (EMR) system. As you might imagine, with Health Insurance Portability and Accountability Act (HIPAA) regulations and the proprietary nature of most EMR systems, this was difficult to come by. Fortunately, I ran across OpenMRS:
“OpenMRS is both software and a community. As a software it serves as an electronic medical record system (EMR) originally designed for developing countries. Through its open source community it has grown into a medical informatics platform used on every continent. This page will provide an introduction to the OpenMRS software: our electronic medical record and the platform supporting it.”
And even more fortunately, OpenMRS provides demo data with 5000 anonymized patient records and over 500,000 observations. I believe this will be adequate for an initial trial. At some point, there will be a need for multiple data sets with differing data structures to prove the point.
Loading the OpenMRS Data:
The data from OpenMRS was loaded into a MySQL database and reverse engineered using Erwin 9.7. The database contains 102 tables, with 40 tables actually containing data. The tables with data and their counts are shown below:
```
mysql> SELECT TABLE_NAME, TABLE_ROWS FROM `information_schema`.`tables`
    -> WHERE `table_schema` = 'openmrs' and table_rows > 0;
+-------------------------+------------+
| TABLE_NAME              | TABLE_ROWS |
+-------------------------+------------+
| care_setting            |          2 |
| concept                 |       2450 |
| concept_answer          |        844 |
| concept_class           |         16 |
| concept_datatype        |         12 |
| concept_description     |       2433 |
| concept_map_type        |         70 |
| concept_name            |       3513 |
| concept_numeric         |        749 |
| concept_set             |        234 |
| concept_stop_word       |         10 |
| drug                    |          6 |
| encounter               |      14365 |
| encounter_provider      |      14134 |
| encounter_type          |          4 |
| field                   |        215 |
| field_type              |          5 |
| form                    |          4 |
| form_field              |        366 |
| global_property         |        288 |
| liquibasechangelog      |        697 |
| location                |         19 |
| obs                     |     473656 |
| order_type              |          2 |
| patient                 |       5284 |
| patient_identifier      |       5305 |
| patient_identifier_type |          2 |
| person                  |       5230 |
| person_address          |       5276 |
| person_attribute        |       5348 |
| person_attribute_type   |          7 |
| person_name             |       5256 |
| privilege               |        263 |
| relationship_type       |          4 |
| role                    |          4 |
| role_privilege          |         25 |
| scheduler_task_config   |          7 |
| user_property           |          7 |
| user_role               |          2 |
| users                   |          2 |
+-------------------------+------------+
40 rows in set (0.00 sec)
```
Figure 3: The Populated Tables in the Demo Dataset in OpenMRS
A deeper dive into the data reveals that the USERS table is related to almost every record in the database, as you can see in Figure 2. This appears to be the mechanism for tracking which user of the system created, updated, or retired a record. This is not particularly interesting, or required, and the USERS table clutters up the model, so for now I decided to exclude it and focus on the ENCOUNTER and PATIENT subject areas.
After removing the USERS table from the Subject Area (SA), I created two diagrams, one for encounter and one for concept. This called another table into question, "form", which appears to be where the application gathers initial data from the patient. Its "xslt" field (XSLT is a language for transforming XML documents) can be extracted with the following Python code if you're interested:
```python
#!/usr/bin/python3
import pymysql as my
import lxml.etree as ET

# Connection parameters are placeholders; substitute your own credentials.
db = my.connect("localhost", "uname", "password", "openmrs")
cursor = db.cursor()
count = cursor.execute("SELECT xslt FROM form")
data2 = cursor.fetchall()

for row in data2:
    # The column holds the XSLT document itself, so parse it from the string.
    xslt = ET.fromstring(row[0].encode())
    transform = ET.XSLT(xslt)
    # Applying the stylesheet to itself just pretty-prints it here; in
    # practice you would apply it to a patient-data XML document.
    newdom = transform(ET.fromstring(row[0].encode()))
    print(ET.tostring(newdom, pretty_print=True))

db.close()
```
After looking at the data provided, the "form" table did not seem relevant at this point, so it too was dropped from the SA view. From my perspective, this model has areas I would have designed differently, but that makes it an even better candidate for the exercise: the purpose is not to see whether data can be extracted from a perfectly normalized data structure with perfect naming standards and documented data elements. If the ontology has the proper definitions and relationships, the extracted data should be at least as organized as the source data, and hopefully more so. The encounter SA that I decided on for now is shown in the following figure:
Conclusion:
How do we pull this all together? The process I envision is that the ontology development would come first, using as many existing ontology sources and models as possible. As the ontology development proceeds, as many synsets as possible would be identified. There would be synsets for facilities, addresses, ICD-9/10 codes, encounters, etc. and the data linked in this manner. Also, notice that some of these synsets would be for entities and some would be for attributes. The training and testing data sets would be different for each. Hopefully, I am making sense here and you are following my logic. The key is developing synsets with training and testing data sets for both entities and attributes (e.g., PERSON is an entity, GENDER is an attribute).
For example, the following information was extracted from WordNet using the code provided earlier:
```
/Users/RPy/virt_env/bin/python /Volumes/G-RAID/Dropbox/0-NWU/Blog/6-June2017/code/triesWordnet.py
Provide a word to receive its list of synonyms:blood_pressure
Lemma names:  ['blood_pressure']
Definition: the pressure of the circulating blood against the walls of the blood vessels; results from the systole of the left ventricle of the heart; sometimes measured for a quick evaluation of a person's health
Examples: ['adult blood pressure is considered normal at 120/80 where the first number is the systolic pressure and the second is the diastolic pressure']
Lemmas: [Lemma('blood_pressure.n.01.blood_pressure')]
Types of/Hyponyms:  ['arterial_pressure', 'diastolic_pressure', 'systolic_pressure', 'venous_pressure']
Hypernyms: [Synset('pressure.n.01'), Synset('vital_sign.n.01')]
Path to root: ['entity.n.01', 'physical_entity.n.01', 'process.n.06', 'phenomenon.n.01', 'natural_phenomenon.n.01', 'physical_phenomenon.n.01', 'pressure.n.01', 'blood_pressure.n.01']
Components Holonyms: [Synset('circulation.n.02')]

Process finished with exit code 0
```
Notice the number of hyponyms for blood pressure. Each of these synsets would require data sets collected in numerous formats. The names associated with the data at the source would probably be irrelevant, since most data is stored under cryptic names of little value. As an example, there would be a synset tagged to blood pressure. Some sources would store two fields, one for systolic and one for diastolic pressure, while others would overload both into a single field (e.g., 120/80). The synsets would be mapped to the data in training and testing data sets. The key is defining the synsets and then having a large enough sampling of data to create data sets for each synset.
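As a sketch of the kind of rule that might sit behind a blood_pressure synset mapping, here is a minimal normalizer for the overloaded "120/80" format; the function name and regular expression are my own illustration, not part of any library:

```python
# Sketch: normalize an overloaded blood-pressure field like "120/80" into
# separate systolic/diastolic values, the kind of rule a trained model or
# hand-written mapping would attach to the blood_pressure synset.
import re

def split_blood_pressure(value):
    """Return (systolic, diastolic) from '120/80'-style strings, else None."""
    m = re.fullmatch(r"\s*(\d{2,3})\s*/\s*(\d{2,3})\s*", value)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(split_blood_pressure("120/80"))    # (120, 80)
print(split_blood_pressure("systolic"))  # None
```

A field whose values consistently match this pattern is strong evidence for the blood_pressure synset, regardless of how cryptically the source column is named.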
As mentioned at the beginning of the post, the problem with continuing my original post on this subject was the lack of real data from a specific domain; that data was critical to perform the analysis and develop trained predictive models. Since then, however, I was able to participate in a competition where my team successfully mapped trained models to an ontology of insurance data, integrating property and casualty claims to provide a complete view of the claims against a given policy. Again, this was a small sample, accomplished with a very narrow scope and an ontology covering only a few classes, but I believe it was sufficient to prove that the concept is valid: ontologies can be used to automate data integration and greatly reduce, if not eliminate, the burden of data wrangling.
- Earley, S. (2016). Really, Really Big Data: NASA at the Forefront of Analytics. IEEE Computer Society Computing Edge. Retrieved from: http://doi.ieeecomputersociety.org/10.1109/MITP.2016.10
- Bou-Ghannam, A. (2013). Foundational Ontologies for Smarter Industries. IBM Red Paper. Retrieved from: https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/redp5081.html
- Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999). What Are Ontologies, and Why Do We Need Them? IEEE Intelligent Systems, 14(1). Retrieved from: https://www.csee.umbc.edu/courses/771/papers/chandrasekaranetal99.pdf
- Earley, S. (2016). There Is No AI Without IA. IEEE IT Professional, 18(3). Retrieved from: http://ieeexplore.ieee.org/document/7478581/
- Noy, N. F., & McGuinness, D. L. (2001). Ontology Development 101: A Guide to Creating Your First Ontology. Retrieved from: http://protege.stanford.edu/publications/ontology_development/ontology101.pdf