Logistic Regression: Modeling an Expert
MIT, Johns Hopkins, Stanford, Harvard, and several other prominent universities provide training through Massive Open Online Course (MOOC) sites. If you haven’t tried one yet, you need to check them out. The courses are taught by some brilliant professors, and provide excellent training tools for employees, as well as sources of well documented methodologies for solving problems. Of course, like anything, you only get out of them what you put in. Another advantage of taking these courses is that they often provide data sets that allow you to demonstrate the methodologies. I have experience in several industries, but obviously don’t have access to their data to play with. Using data obtained from MIT’s “The Analytics Edge” course delivered on edx.org, I demonstrate how to build an Expert System.
The Expert System:
An Expert System provides answers to binary questions that typically can only be answered by an expert in the field. When the volume of data is large, it is not practical to have an expert poring through it to make a simple determination for potentially millions of events. In this case study, the Expert System performs a task that a physician would normally have to do, which is to answer one question: is a patient receiving quality care? There are thousands of patients and only a handful of physicians, so a manual review would be expensive, and beyond a certain patient-to-physician ratio it would not be possible at all. This case study uses healthcare data, but the solution is applicable across numerous domains. For example, an expert (e.g., CEO, CFO, Physician, Attorney, Director/Manager) in any industry (e.g., Manufacturing, Healthcare, Retail, Banking, Insurance) can use this method to answer questions like: 1) will next month be profitable, 2) will profit be higher this year than last, 3) is a claim suspicious, 4) how will a particular judge rule on cases like this one, or anything else that requires a yes/no answer and the expertise of a subject matter expert in the field.
What is Logistic Regression:
Logistic regression, also known as logit regression or the logit model, is a regression model where the dependent variable is categorical, most commonly binary. The predicted outcome could take the form of zero or one, as in the Expert System case study below, but it could be any pair of options: win/lose, high/low, pass/fail, etc. When the dependent variable has more than two possible outcomes, the applicable model is called multinomial logistic regression. When multiple categories exist and they are ordered, it is known as ordinal logistic regression.
D2Hawkeye, a market leader in transforming healthcare data into actionable information, needed to assess claims data from millions of patients. The goal of this exercise is to develop a predictive model that determines the quality of a patient’s care on a large scale using claims data; is the patient receiving quality care?
I have worked in the healthcare industry for several years, and just about every organization I have worked with gathers data from healthcare groups that includes patient medical records as well as claims data. This claims data is generated during an encounter (when a patient visits a doctor). Claims data includes ICD-9 diagnosis codes, procedure codes, and costs associated with the encounter. Likewise, pharmacy claims are part of the data set, defining drugs prescribed and the associated costs.
This data is typically gathered using Electronic Medical Record (EMR) systems; it is somewhat standardized, and the codes are clearly defined. The problem comes when the healthcare group uses several EMR systems and the data must be integrated. Also, humans generate the data, so it is not always accurate. Under-reporting is a problem as well, since filing claims is tedious. For this study, MIT used a large insurance claims database, and randomly selected patients as follows:
131 diabetes patients
Costs were approximately $10,000 to $20,000
Dates: September 1, 2003 to August 31, 2005
For each of these patients, an expert physician reviewed the claims and provided descriptive notes such as “ongoing use of narcotics”, “only on Avandia, not a good first choice drug”, “had regular visits, mammogram, and immunizations”, and “was given home testing supplies.” Using information like this, the physician rated the care as being either poor, or good. Based on the physician’s notes, variables were extracted and used for this analysis.
The dependent variable is the quality of care, and the independent variables are those based on the variables extracted as described above, along with patient demographics, health care utilization, providers, claims, and prescriptions. As stated earlier, an Expert System provides a yes/no answer; a categorical variable that takes only two possible values. In this case, a “1” was assigned to patients that received poor care, and a “0” was assigned for patients that received good care.
Simple linear regression could be used for this exercise if the outcome were rounded to either a “0” or a “1”, but its predictions are not confined to the range between 0 and 1, so logistic regression is more appropriate given that the dependent variable is categorical (0/1).
Logistic regression predicts the probability of the outcome variable being true. As described above, the case study model will predict the probability that a patient received poor care. Since a 1 represents poor care in our study, our model will predict $P(y = 1)$. The probability that the patient is receiving good care, or $P(y = 0)$, is then simply $1 - P(y = 1)$.
The independent variables for this problem are identified by $x_1, x_2, \ldots, x_k$, where k represents the total number of independent variables. To predict $P(y = 1)$ we use the Logistic Regression Function:
$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}$$
The result will be a value between 0 and 1 which represents the probability of $y = 1$. A positive coefficient increases the linear regression term $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ as its variable grows, and therefore the probability that $y = 1$, which in our case is the probability that the patient is receiving poor care. Negative coefficients have the opposite effect: they decrease the probability of $y = 1$ and increase the probability of $y = 0$, or that the patient is receiving good care.
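To make the effect of coefficient signs concrete, here is a minimal sketch of the logistic response function in R. The coefficients and feature values are hypothetical, chosen only for illustration; they are not fitted values from the case study.

```r
# Logistic response function: maps a linear predictor to a probability in (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients (assumptions for illustration, not fitted values).
beta0 <- -2.0   # intercept
beta1 <-  0.5   # positive coefficient: raises P(y = 1) as x1 grows
beta2 <- -0.3   # negative coefficient: lowers P(y = 1) as x2 grows

# Probability of y = 1 for one observation with features x1 and x2.
prob_y1 <- function(x1, x2) logistic(beta0 + beta1 * x1 + beta2 * x2)

p_base    <- prob_y1(x1 = 1, x2 = 2)
p_more_x1 <- prob_y1(x1 = 5, x2 = 2)   # larger x1, same x2
p_more_x2 <- prob_y1(x1 = 1, x2 = 6)   # larger x2, same x1

print(c(p_base = p_base, p_more_x1 = p_more_x1, p_more_x2 = p_more_x2))

# Base R provides the same function as plogis().
print(all.equal(logistic(0.7), plogis(0.7)))
```

Raising x1 (positive coefficient) pushes the probability up, and raising x2 (negative coefficient) pushes it down, while every result stays strictly between 0 and 1.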
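You can also think of the logistic response function in terms of odds, which is simply the probability of 1 divided by the probability of 0 (that is, dividing by the probability of the outcome 0, not by zero): $\text{Odds} = P(y = 1) / P(y = 0)$. If the odds are greater than 1, then the likelihood of 1 occurring is greater than the likelihood of 0 happening. Likewise, if the odds are less than 1, then the likelihood of 0 occurring is greater (patient receiving good care). If the odds are 1, then the outcomes are equally likely.
Substituting the logistic response function gives $\text{Odds} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}$. If you take the log of both sides, which gives what is called the Logit, the right side of the equation is the linear regression equation:
$$\log(\text{Odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$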
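The relationship between probability, odds, and the logit can be checked numerically. The sketch below uses three hypothetical probabilities (assumptions for illustration) and base R only.

```r
# Hypothetical probabilities that y = 1 (illustration only).
p <- c(0.2, 0.5, 0.8)

odds  <- p / (1 - p)   # P(y = 1) / P(y = 0)
logit <- log(odds)     # log-odds: the scale on which the model is linear

print(data.frame(p = p, odds = odds, logit = logit))

# Base R's qlogis() is the logit and plogis() is its inverse.
print(all.equal(logit, qlogis(p)))
print(all.equal(plogis(logit), p))
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, the point at which both outcomes are equally likely.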
Regression in general is about defining correlations between the outcome (dependent variable), and the set of regressors x (the independent variables), which are also sometimes called features. In regression analysis the statistical relation between y and x is determined.
How can we accurately use x to predict y, and how does the predicted value of y change if we change a component of x (assuming the rest of the components of x are fixed)?
In our current problem, the prediction questions become:
- How can relevant patient characteristics, such as InpatientDays, ERVisits, OfficeVisits, and others, be used to accurately predict whether the patient is receiving poor or good care?
- What are the inferences that can be drawn? How does the application of Narcotics affect the predicted quality of care?
Establish the Baseline:
How do we know that our model will provide any better results than if we just guessed? My assumption, without looking at the data is that more patients receive good care than receive poor care. I loaded the data and looked at the 131 observations using R, and did a count. The count revealed that 98 received good care, and 33 received poor care. If I had just gone with my assumption, and since more patients actually did receive good care, how often would I be correct if I predicted that all 131 patients receive good care? I would be correct 98 out of 131 times, for an accuracy rate of 98/131, or 0.74809. This means that our model must do a better job of predicting than 75% for it to be any better than just guessing. This is called establishing the baseline.
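The baseline arithmetic can be reproduced in a few lines of R. This sketch starts from the two counts reported above rather than from the data file; with the data loaded, something like table(quality$PoorCare) would be the natural way to obtain them.

```r
# Counts reported in the text (131 patients in total).
good_care <- 98
poor_care <- 33
total     <- good_care + poor_care

# The baseline model always predicts the most frequent outcome.
baseline_accuracy <- max(good_care, poor_care) / total

print(total)
print(round(baseline_accuracy, 5))
```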
During the analysis phase, the objective is to determine the independent variables that appear to have the greatest predictive value. This is usually done by plotting the data and looking for patterns that imply some sort of relationship between the dependent and independent variables. It also involves running a regression analysis and looking at the sign and size of the coefficients, at low p-values, and at other factors like the Akaike Information Criterion (AIC). AIC provides a measure of model quality, which is an aid in model selection.
For the sake of space and time, I will not plot the data here. Instead, I will begin by fitting a Generalized Linear Model (glm) in R against all regressors, and from that model, select the variables that appear most relevant given their coefficients and p-values. The results initially indicate that Narcotics, StartedOnCombinationTRUE, and AcuteDrugGapSmall are the most relevant features, since they are the only ones significant at the 0.05 level.
# Read in dataset
quality = read.csv("quality.csv")

# Fit a logistic regression model against all regressors
firstMod <- glm(PoorCare ~ ., data = quality, family = binomial)
summary(firstMod)

Call:
glm(formula = PoorCare ~ ., family = binomial, data = quality)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7886  -0.5458  -0.3843  -0.0513   2.3407

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -1.747648   1.051081  -1.663  0.09637 .
MemberID                 -0.002472   0.007522  -0.329  0.74245
InpatientDays             0.025651   0.058217   0.441  0.65949
ERVisits                 -0.040921   0.194901  -0.210  0.83370
OfficeVisits              0.072159   0.039211   1.840  0.06573 .
Narcotics                 0.085728   0.041640   2.059  0.03952 *
DaysSinceLastERVisit     -0.002389   0.001418  -1.685  0.09200 .
Pain                     -0.007394   0.014855  -0.498  0.61870
TotalVisits                     NA         NA      NA       NA
ProviderCount             0.025349   0.028652   0.885  0.37631
MedicalClaims            -0.001547   0.020746  -0.075  0.94057
ClaimLines               -0.005297   0.006272  -0.845  0.39836
StartedOnCombinationTRUE  3.322318   1.285513   2.584  0.00975 **
AcuteDrugGapSmall         0.222019   0.091300   2.432  0.01503 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 147.88  on 130  degrees of freedom
Residual deviance:  96.70  on 118  degrees of freedom
AIC: 122.7

Number of Fisher Scoring iterations: 6
Preparing the Data:
When performing predictive analysis, a common practice is to divide the data set into a training and a testing data set. There are other checks that normally accompany the data preparation process, like checks for collinearity, near-zero variance, and missing values, along with imputation and others, but for the sake of time and space they are omitted here (recall that I got the data from MIT and it has already been preprocessed). However, we do need to split the data into train and test data sets.
# Randomly split data
library(caTools)
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)

# Create training and testing sets
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
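The split above keeps the proportion of poor care cases roughly equal in both sets. To show the idea without the caTools package, here is a rough base R sketch of a stratified 75/25 split. The stratified_split helper and the stand-in outcome vector are hypothetical; this is not the algorithm caTools uses internally, and the rows it selects will differ from those chosen by sample.split.

```r
set.seed(88)  # same seed value as the article, but the draws will differ from caTools

# Stand-in outcome vector with the class counts from the text (98 good, 33 poor).
poor_care <- c(rep(0, 98), rep(1, 33))

# Hypothetical helper: sample `ratio` of the rows within each outcome class.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx     <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

in_train <- stratified_split(poor_care, ratio = 0.75)

# Rows: actual outcome; columns: whether the row went to the training set.
print(table(outcome = poor_care, in_train = in_train))
```

Under these assumptions the training set gets 99 rows and the test set 32, with 8 poor care cases held out for testing, which matches the set sizes reported in the rest of the article.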
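With the data split, and the initial features selected, several nested models are built on the training set and compared using their AIC values and an analysis of deviance (R's anova function with a chi-squared test):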
# The model with the minimum AIC is the preferred model
fit1 <- glm(PoorCare ~ StartedOnCombination, data=qualityTrain, family = binomial)
fit1$aic
[1] 111.3223
fit2 <- update(fit1, . ~ . + Narcotics)
fit2$aic
[1] 99.26025
fit3 <- update(fit1, . ~ . + Narcotics + AcuteDrugGapSmall)
fit3$aic
[1] 94.00769
fit4 <- update(fit1, . ~ . + Narcotics + AcuteDrugGapSmall + OfficeVisits)
fit4$aic
[1] 88.35658
anova(fit1, fit2, fit3, fit4, test="Chisq")

Analysis of Deviance Table

Model 1: PoorCare ~ StartedOnCombination
Model 2: PoorCare ~ StartedOnCombination + Narcotics
Model 3: PoorCare ~ StartedOnCombination + Narcotics + AcuteDrugGapSmall
Model 4: PoorCare ~ StartedOnCombination + Narcotics + AcuteDrugGapSmall + OfficeVisits
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1        97    107.322
2        96     93.260  1  14.0621 0.0001769 ***
3        95     86.008  1   7.2526 0.0070800 **
4        94     78.357  1   7.6511 0.0056737 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
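The numbers in this output are linked by simple arithmetic, which the sketch below reproduces from the printed residual deviances alone. It relies on two facts: for a 0/1 outcome the AIC of a binomial glm equals the residual deviance plus twice the number of estimated coefficients, and each p-value in the table comes from comparing the drop in deviance to a chi-squared distribution with one degree of freedom. Because the inputs are rounded printed values, the results agree with the output only to about three decimal places.

```r
# Residual deviances as printed in the analysis of deviance table.
resid_dev <- c(fit1 = 107.322, fit2 = 93.260, fit3 = 86.008, fit4 = 78.357)

# Estimated coefficients per model: intercept plus 1, 2, 3, and 4 predictors.
n_coef <- c(fit1 = 2, fit2 = 3, fit3 = 4, fit4 = 5)

# AIC = residual deviance + 2 * (number of coefficients) for 0/1 outcomes.
aic <- resid_dev + 2 * n_coef
print(aic)

# Drop in deviance from each added variable, and its chi-squared p-value.
dev_drop <- -diff(resid_dev)
p_values <- pchisq(dev_drop, df = 1, lower.tail = FALSE)
print(data.frame(dev_drop = dev_drop, p_value = p_values))
```

Each added variable lowers the AIC and produces a significant drop in deviance, which is the basis for preferring fit4 among these four candidates.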
The outcome of a logistic regression model is a probability with a value between 0 and 1. To convert this probability into a prediction we use what is called a “threshold t value.” It simply means that we establish a cutoff: probabilities above t are predicted as a 1 (poor care), and the rest as a 0 (good care). The key is to select the right value for t since you can easily skew your results in one direction or another. For example, if you set the t value to 0.80, and all probabilities are less than 0.80, then all predictions will be 0 (good care), which is equivalent to the baseline model.
It comes down to which type of error you would prefer to make. For example, in this case there might be a strong desire to identify all patients that are receiving poor care, even if it means that some patients will be identified as having poor care when in reality they are receiving good care. If we pick a large threshold value, then we will rarely predict poor care; with a t value of 0.8, only patients with a predicted probability above 0.8 would be flagged. That can still be useful if the goal is to pick out the patients with the worst care (assuming a higher probability means that their care is worse than that of someone with a lower probability).
On the other hand, picking a small threshold value would improve the chances of identifying all patients receiving poor care, but the model would predict a lot of patients as receiving poor care, when in fact they are receiving good care.
There are often good reasons to favor one error type over the other. For example, if having a highly contagious disease was the outcome, you would want a low t value to ensure that you identified everyone that had the disease. This of course would mean there would be a high false positive rate.
Barring a preference for one type of error or the other, setting the threshold to t = 0.5 will simply predict the most likely outcome.
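The effect of the threshold is easy to see on a handful of values. In this sketch the predicted probabilities are hypothetical and the classify helper is introduced only for illustration; it uses the same strict "greater than t" rule as the prediction code later in the article.

```r
# Hypothetical predicted probabilities of poor care (illustration only).
pred_prob <- c(0.05, 0.22, 0.31, 0.48, 0.55, 0.76)

# Hypothetical helper: predict 1 (poor care) when the probability exceeds t.
classify <- function(prob, t) as.integer(prob > t)

pred_t05 <- classify(pred_prob, t = 0.5)   # most likely outcome
pred_t03 <- classify(pred_prob, t = 0.3)   # lower threshold flags more patients
pred_t08 <- classify(pred_prob, t = 0.8)   # no probability exceeds 0.8 here

print(rbind(t_0.5 = pred_t05, t_0.3 = pred_t03, t_0.8 = pred_t08))
```

With these values, lowering t from 0.5 to 0.3 flags two more patients as poor care, while t = 0.8 flags none and collapses to the baseline prediction.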
To quantify the results of the predictions, there is a way to present the data in a format called a confusion matrix.
When the predictions are made, some of them correctly predict a positive outcome and some correctly predict a negative outcome. For example, a TRUE POSITIVE in this analysis is a patient who received poor care and was predicted to receive poor care, and a TRUE NEGATIVE is a patient who received good care and was predicted to receive good care. A FALSE POSITIVE is a good care patient predicted as poor care, and a FALSE NEGATIVE is a poor care patient predicted as good care. All four counts are determined by comparing the actual outcomes against the predicted outcomes.
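There are several measures that are computed from the confusion matrix that help us determine the types of errors we’re getting. With TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$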
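As a small illustration of how these measures fall out of a confusion matrix, the sketch below builds one from hypothetical actual outcomes and predicted probabilities (both made up for this example) and computes sensitivity, specificity, and accuracy from its four cells.

```r
# Hypothetical data: 1 = poor care, 0 = good care (illustration only).
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
pred_prob <- c(0.92, 0.61, 0.23, 0.12, 0.41, 0.72, 0.33, 0.05, 0.57, 0.15)
threshold <- 0.5

# Fix both factor levels so the table is 2 x 2 even if one prediction never occurs.
predicted <- factor(pred_prob > threshold, levels = c(FALSE, TRUE))
cm <- table(actual = actual, predicted = predicted)
print(cm)

TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]

sensitivity <- TP / (TP + FN)          # share of poor care cases caught
specificity <- TN / (TN + FP)          # share of good care cases left alone
accuracy    <- (TP + TN) / sum(cm)     # share of all predictions that are right

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```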
So what is the correct answer? There is a more quantitative method for selecting the threshold value, and it involves using something called the Receiver Operating Characteristic (ROC) curve.
In Figure 2, the x axis represents the FALSE POSITIVE RATE, or 1 - SPECIFICITY, and the y axis represents the TRUE POSITIVE RATE, or SENSITIVITY. The ROC curve always starts at the point (0, 0), which corresponds to a threshold of 1: the model predicts good care (a 0) for every patient, so it catches none of the poor care cases (a sensitivity of 0), but it also makes no false positive errors (a false positive rate of 0). The opposite is true when you select a threshold of 0: the model predicts poor care for every patient, so the sensitivity is 1, but every good care patient is mislabeled, so the false positive rate is also 1. All of this simply means that as the threshold decreases from 1 to 0, you move from (0, 0) to (1, 1).
The ROC curve captures all thresholds simultaneously. From looking at the graph, you can intuit that the closer a point is to (0, 0), the lower the false positive rate (the higher the specificity) and the lower the true positive rate, or sensitivity. The closer a point is to (1, 1), the higher both rates are.
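To see how each threshold maps to one point on the ROC curve, here is a minimal base R sketch that sweeps a few thresholds over hypothetical outcomes and probabilities (made up for this example; the roc_point helper is also an assumption introduced for illustration). It computes the points by hand rather than using a plotting package.

```r
# Hypothetical data: 1 = poor care, 0 = good care (illustration only).
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
pred_prob <- c(0.92, 0.61, 0.23, 0.12, 0.41, 0.72, 0.33, 0.05, 0.57, 0.15)

# Hypothetical helper: false and true positive rates at one threshold.
roc_point <- function(thr) {
  flagged <- pred_prob > thr
  c(threshold = thr,
    fpr = sum(flagged & actual == 0) / sum(actual == 0),
    tpr = sum(flagged & actual == 1) / sum(actual == 1))
}

# Sweep from the strictest threshold (1) to the most lenient (0).
thresholds <- c(1, 0.8, 0.6, 0.4, 0.2, 0)
roc <- as.data.frame(t(sapply(thresholds, roc_point)))
print(roc)
```

The first row is the point (0, 0) and the last is (1, 1); both rates can only grow as the threshold falls. Calling plot(roc$fpr, roc$tpr, type = "b") on the result would draw the curve.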
So which threshold value should you pick? Unfortunately, it is a trade-off and depends on the problem being addressed, and the effect of either a false positive, or negative.
For the purposes of this exercise, let’s go with attempting to pick a threshold value that predicts all “poor care” patients, and minimizes the number of false negatives. Obviously, this was the objective from the beginning, but let’s run the code and see which threshold values come closest to meeting this objective.
Using the data provided, and model 4, we get the following ROC Curve (code for plot below):
Run Against Test Data:
It was determined that fit4 was the best model. With further time we probably could have improved the results by looking at other combinations, but as it is, the test data had 8 patients receiving poor care, and as you can see from the results below, this model accurately predicted 7 out of the 8.
> summary(fit4)

Call:
glm(formula = PoorCare ~ StartedOnCombination + Narcotics + AcuteDrugGapSmall +
    OfficeVisits, family = binomial, data = qualityTrain)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6015  -0.5872  -0.4013  -0.1340   2.4677

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)              -3.30744    0.65470  -5.052 4.38e-07 ***
StartedOnCombinationTRUE  2.51088    1.42061   1.767  0.07715 .
Narcotics                 0.02109    0.03850   0.548  0.58388
AcuteDrugGapSmall         0.26986    0.09701   2.782  0.00541 **
OfficeVisits              0.08976    0.03532   2.541  0.01105 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 111.888  on 98  degrees of freedom
Residual deviance:  78.357  on 94  degrees of freedom
AIC: 88.357

Number of Fisher Scoring iterations: 6
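One way to read these coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of poor care are multiplied when that variable increases by one unit, holding the others fixed. The sketch below copies the estimates from the summary above; the example patient is hypothetical and only illustrates how a prediction is assembled.

```r
# Coefficient estimates copied from the summary(fit4) output.
coefs <- c("(Intercept)"            = -3.30744,
           StartedOnCombinationTRUE =  2.51088,
           Narcotics                =  0.02109,
           AcuteDrugGapSmall        =  0.26986,
           OfficeVisits             =  0.08976)

# Odds ratio per one-unit increase in each predictor (intercept excluded).
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical patient: not started on a combination, 2 narcotics
# prescriptions, 3 small acute drug gaps, 10 office visits.
patient <- c(1, 0, 2, 3, 10)   # the leading 1 multiplies the intercept

linear_predictor <- sum(coefs * patient)
p_poor_care      <- plogis(linear_predictor)
print(round(p_poor_care, 3))
```

With the fitted model in memory, exp(coef(fit4)) would produce the same odds ratios without retyping the estimates.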
# Analyze predictions using model fit4
library(ROCR)  # provides prediction() and performance(), used for the AUC below
predictTestF4 <- predict(fit4, type="response", newdata = qualityTest)

# Confusion matrix (rows: actual PoorCare, columns: predicted probability > 0.3)
table(qualityTest$PoorCare, predictTestF4 > 0.3)

    FALSE TRUE
  0    20    4
  1     1    7

cmF4 <- table(qualityTest$PoorCare, predictTestF4 > 0.3)

# Sensitivity: share of poor care patients correctly identified
sensitivityF4 <- cmF4[2,2]/sum(cmF4[2,])
sensitivityF4
[1] 0.875

# Specificity: share of good care patients correctly identified
specificityF4 <- cmF4[1,1]/sum(cmF4[1,])
specificityF4
[1] 0.8333333

# Total Accuracy
(cmF4[1,1]+cmF4[2,2])/(sum(cmF4[1,]) + sum(cmF4[2,]))
[1] 0.84375

# AUC Area Under Curve
ROCRpredTestF4 <- prediction(predictTestF4, qualityTest$PoorCare)
aucF4 <- as.numeric(performance(ROCRpredTestF4, "auc")@y.values)
aucF4
[1] 0.8333333
The above R code runs the logistic regression model (fit4) against the test data that was split out earlier. The analysis resulted in an overall accuracy of 0.84375, which beats the baseline accuracy of 0.74809 by approximately 10 percentage points (0.09566). While this might not seem like much, the expert’s task is now automated, and can be applied to millions of records with a fairly high degree of certainty, and it did not consume the resource of an expert to do it. This type of analysis is not restricted to the medical field either. As discussed earlier, this process can be applied to any system that requires the opinion of an expert, and since this process applies the knowledge and skills of an expert across potentially millions of records, the Expert System can make things possible that were not possible before.