# Logistic Regression Case Study: The Challenger

As most of us know, Challenger was one of NASA’s space shuttle orbiters. It suffered a catastrophic failure on January 28, 1986, resulting in the death of all seven people on board: 5 astronauts and 2 payload specialists. It was determined that the catastrophic failure occurred after all five O-ring seals in its right solid rocket booster failed at liftoff. Apparently, the O-rings were not designed for use in cold weather such as that on the day of the tragedy, when the temperature was 36º Fahrenheit. NASA had experienced O-ring failures before, but as Figure 1 shows, at first glance there is no apparent relationship between temperature and O-ring failure.

As you can see from Figure 1, the red vertical line shows the average temperature when an O-ring failure occurred, and the blue vertical line the average temperature for all launches. Furthermore, there were just as many failures above the red line as there were below it, five. This supports the idea that temperature did not play a role in the O-ring failures.

Selection of the statistical technique and proper interpretation is always a challenge, and numerous analyses have fallen short when closely scrutinized. As for the question being addressed here, it is clearly a binary classification problem. Which independent variables could answer the question: would the O-ring FAIL or NOT FAIL? This seems like a perfect application for logistic regression analysis.

The next problem is to determine which independent variables should be used. Was it the material, temperature, speed of the shuttle, fuel, vapor, or possibly atmospheric pressure? The list could be endless, and temperature, which we now know was at least a contributing cause, does not appear to be the determining factor.

Another challenge when performing any analysis is what to do with missing data. There are methods called imputation used for replacing missing values with plausible values and sometimes this is a good solution, and other times it’s better to simply remove the observations with missing values. In this case, we have all the data we need regarding temperature and O-ring failure, and there is no missing data. However, in the exploratory data analysis if we do not consider both FAILURE and NON-FAILURE of the O-ring, then we are omitting observations from our data. In “Applied Predictive Modeling,” the authors discuss something they call “informative missingness,” and how this data provides its own patterns, and since they are missing, they are sometimes easily overlooked. There is no missing data here, but by plotting only the cases of O-ring FAILURE, the information presented by O-ring NON-FAILURE is omitted. Once the values for both cases are plotted, Figure 2 leaves a different impression than Figure 1.

As you can see from Figure 2, once the launches with no O-ring failures are included, the failures appear to be concentrated at lower temperatures.

## The Data:

A shuttle has 5 O-rings, and there were 24 launches, which gives us a total of 120 observations. In the data, one launch experienced 3 O-ring failures, one experienced 2 O-ring failures, and every other launch with a failure experienced exactly 1 (the point at 70ºF represents two separate launches).
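To make the 120-observation layout concrete, here is a minimal sketch of how launch-level counts could be expanded into one binary row per O-ring. The function name `expand_launches` and the three toy launches are hypothetical placeholders for illustration, not the Challenger data or the author's preprocessing.

```python
# Each launch contributes one row per O-ring: Y = 1 for a failure, 0 otherwise.
O_RINGS_PER_LAUNCH = 5


def expand_launches(launches, rings_per_launch=O_RINGS_PER_LAUNCH):
    """Turn (temperature, failed_count) pairs into (X, Y) rows, one per O-ring."""
    rows = []
    for temp, n_failed in launches:
        for ring in range(rings_per_launch):
            rows.append((temp, 1 if ring < n_failed else 0))
    return rows


# Toy values only (NOT the real data): (temperature in ºF, failed O-rings).
toy_launches = [(60, 2), (70, 1), (75, 0)]
rows = expand_launches(toy_launches)

print(len(rows))                # number of observations (launches x 5)
print(sum(y for _, y in rows))  # total number of failures
```

With 24 launches, the same expansion yields the 24 × 5 = 120 rows used in the model.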

## The Model:

While the plots themselves do not provide enough information to establish any causal relationship between temperature and O-ring failure, they certainly provide enough visual cues to warrant more investigation. Since the outcome is binary, logistic regression appears to be the appropriate method for determining whether there is a statistically significant relationship between temperature and O-ring failure.

Dependent Variable: O-ring No FAILURE/FAILURE (0/1)

Independent Variable: Temperature in Degrees Fahrenheit

The temperature on the morning of the catastrophe was 36ºF.

Prior to the incident, and as our data reflects, there had been 10 FAILURES and 110 NO FAILURES.

The question addressed by the model is: what is the likelihood of an O-ring failure at a given temperature?

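The probability of failure at a given temperature is given by the logistic function, where β₀ is the intercept and β₁ is the coefficient of temperature:

$$P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

More specifically, we are interested in the probability of failure given that X equals 36:

$$P(Y = 1 \mid X = 36) = \frac{1}{1 + e^{-(\beta_0 + 36\,\beta_1)}}$$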

The results from the logit.fit() command in Python 3.5.2 (complete code below):

                               Logit Regression Results
    ==============================================================================
    Dep. Variable:                      Y   No. Observations:                  120
    Model:                          Logit   Df Residuals:                      118
    Method:                           MLE   Df Model:                            1
    Date:                Fri, 07 Jul 2017   Pseudo R-squ.:                  0.1549
    Time:                        06:52:11   Log-Likelihood:                -29.089
    converged:                       True   LL-Null:                       -34.420
                                            LLR p-value:                  0.001094
    ==============================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept      7.4049      3.041      2.435      0.015         1.445    13.365
    X             -0.1466      0.047     -3.104      0.002        -0.239    -0.054
    ==============================================================================
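Several of the summary statistics can be cross-checked by hand. The sketch below is an illustration using only Python's standard library, not part of the original analysis, and the variable names are my own. It recomputes the null log-likelihood from the 10 failures and 110 non-failures, then derives McFadden's pseudo R-squared and the likelihood-ratio p-value from the reported log-likelihood. Small differences from the printed figures are expected because the inputs are rounded.

```python
import math

# Counts from the text and the fitted log-likelihood from the summary.
n_fail, n_ok = 10, 110
ll_model = -29.089

# An intercept-only model predicts the overall failure rate for every O-ring.
p_bar = n_fail / (n_fail + n_ok)
ll_null = n_fail * math.log(p_bar) + n_ok * math.log(1 - p_bar)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic, chi-square distributed with 1 degree of freedom.
# For 1 df the upper-tail probability equals erfc(sqrt(stat / 2)).
lr_stat = 2 * (ll_model - ll_null)
llr_p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(ll_null, 3), round(pseudo_r2, 4), round(llr_p_value, 6))
```

The recomputed values should land close to the LL-Null, Pseudo R-squ., and LLR p-value entries in the summary.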

The results from the regression analysis are shown above. There are 120 observations. The 95% confidence interval for β₁, the coefficient of X, runs from -0.239 to -0.054. Since this range does not include 0, β₁ is statistically significant at the 5% level. The p-value of 0.002 is considerably less than 0.05, which leads to the same conclusion. We therefore reject the null hypothesis that temperature has no effect on the likelihood of O-ring failure.
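The interval and p-value for the temperature coefficient can be approximately reproduced from the rounded coefficient and standard error in the summary. This is a sketch of the standard Wald calculation using only the standard library. The constant `Z_975` and the variable names are my own, and the results differ slightly from the printed ones because the standard error is rounded to three decimals.

```python
import math

# Rounded values for the X (temperature) row of the summary.
coef, std_err = -0.1466, 0.047
Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

# Wald statistic: how many standard errors the estimate lies from zero.
z = coef / std_err

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval: estimate plus or minus about 1.96 standard errors.
ci_low = coef - Z_975 * std_err
ci_high = coef + Z_975 * std_err

print(round(z, 2), round(p_value, 4), round(ci_low, 3), round(ci_high, 3))
```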

The negative sign of the coefficient for X indicates an inverse relationship: the odds of O-ring failure decrease by approximately 14% for every 1ºF increase in temperature.

The exponential of the X coefficient is called the odds ratio, and is 0.86.
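As a quick check of the arithmetic, the odds ratio and the percentage change in odds follow directly from the coefficient. The sketch below uses the rounded coefficient and confidence bounds from the summary; the variable names are illustrative.

```python
import math

# Rounded coefficient and 95% confidence bounds for X from the summary.
beta_x = -0.1466
ci_low, ci_high = -0.239, -0.054

# Exponentiating a logit coefficient gives the odds ratio per 1ºF increase.
odds_ratio = math.exp(beta_x)

# Percentage change in the odds of failure per 1ºF increase.
pct_change = (odds_ratio - 1) * 100

# Exponentiating the interval bounds gives an interval for the odds ratio.
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(round(odds_ratio, 2), round(pct_change, 1))
print(round(or_low, 3), round(or_high, 3))
```

Because the whole interval for the odds ratio lies below 1, it points the same way as the coefficient: warmer launches have lower odds of O-ring failure.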

Calling result.predict((1, 36)) returns 0.89349705 (confirmed in R: 0.89349705), so the estimated probability of O-ring failure at a temperature of 36ºF is 89.35%, or roughly 90%.
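The same prediction can be worked out by hand from the fitted coefficients. The following is a minimal sketch, with a hypothetical helper `failure_probability` that plugs the rounded intercept and slope from the summary into the logistic function. Because the coefficients are rounded, the result agrees with result.predict only to about four decimal places.

```python
import math


def failure_probability(temp_f, intercept=7.4049, slope=-0.1466):
    """Logistic model: P(failure) = 1 / (1 + exp(-(intercept + slope * temp)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * temp_f)))


p_36 = failure_probability(36)
print(round(p_36, 4))

# The predicted probability falls as the temperature rises.
for temp in (36, 50, 60, 70, 80):
    print(temp, round(failure_probability(temp), 3))
```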

This is the probability of one O-ring failing. What is the probability of all 5 O-rings failing? If we treat the failure of each O-ring as an independent event, the probability of several independent events all occurring is the product of their probabilities. As with dice, the probability of rolling a 1 on a fair die is 1/6, and the probability of rolling two 1’s is 1/36. Therefore, the probability of all five O-rings failing is 0.8935^5 ≈ 0.57, or about 57%.
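The product rule is easy to verify numerically. This sketch takes the predicted single O-ring probability from the text and relies on the article's independence assumption, which may not hold for O-rings sharing the same booster and conditions.

```python
from fractions import Fraction

# Dice analogy: two independent rolls of a fair die both showing a 1.
p_two_ones = Fraction(1, 6) ** 2
print(p_two_ones)  # exact fraction

# Predicted failure probability for one O-ring at 36ºF, taken from the text.
p_single = 0.89349705
N_RINGS = 5

# Under the independence assumption, multiply the individual probabilities.
p_all_fail = p_single ** N_RINGS
print(round(p_all_fail, 4))
```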

## Conclusion:

There are certainly some caveats you would want to add to this analysis, but based on the limited amount of data available, I would not want to be on a rocket that has a 57% probability of catastrophic failure, and I am sure had the NASA management realized the risk, they would have aborted the mission. The important caveats are that this is a very simple logistic regression model with only one predictor. Also, in the data set, there were no observations that were close to the 36ºF that existed on the day of the failure. There are probably numerous caveats, but the fact remains that this evidence alone probably would have convinced management to abort the mission and save the seven lives that were lost that day.

NASA now has a statistician available for every launch.

## Python Code:

    import pandas as pd
    import numpy as np
    from patsy import dmatrices
    import statsmodels.discrete.discrete_model as sm
    from matplotlib import pyplot as plt

    data = pd.read_csv("challenger-data.csv")
    print(data.describe())
    total_num_failures = sum(data.Y)

    # subsetting data
    failures = data.loc[(data.Y == 1)]
    no_failures = data.loc[(data.Y == 0)]

    avg_temp = np.mean(data.X)
    avg_temp_failures = np.mean(failures.X)
    avg_temp_no_failures = np.mean(no_failures.X)

    # frequencies: number of observations at each temperature
    failures_freq = failures.X.value_counts()
    no_failures_freq = no_failures.X.value_counts()

    # plotting
    fig, ax = plt.subplots()
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.minorticks_on()
    plt.xlim(50, 80)  # xlim/ylim take only the lower and upper bounds
    plt.ylim(-1, 4)
    ax.scatter(failures_freq.index, failures_freq, c='red', s=40)

    # label each failure point, alternating the offset so labels do not overlap
    count = 0
    for x, y in failures_freq.items():
        if count % 2 == 0:
            ax.annotate('({0}, {1})'.format(x, y), xy=(x + 0.1, y + 0.1))
        else:
            ax.annotate('({0}, {1})'.format(x, y), xy=(x - 0.2, y - 0.2))
        count += 1

    plt.xlabel('Temperature (ºF)')
    plt.ylabel('Number of O-ring Failures')
    plt.title('Challenger: Temperature (ºF) vs O-ring Failures ')

    plt.axvline(avg_temp_failures, linewidth=2.0, ls='--', color='red')
    # plt.axvline(avg_temp, linewidth = 2.0, ls='-.', color = 'blue')
    plt.axvline(avg_temp_no_failures, linewidth=2.0, ls='-.', color='green')

    # launches without failures are drawn along y = 0
    plt.scatter(no_failures_freq.index, np.zeros(len(no_failures_freq)),
                c='blue', s=40)

    ax.annotate('Avg Temp failures occurred: {0}ºF'.format(avg_temp_failures),
                xy=(avg_temp_failures, 3), xytext=(65, 3.5),
                arrowprops=dict(facecolor='red', shrink=0.05), fontsize=12)
    # ax.annotate('Avg Temp all launches: {0}ºF'.format(avg_temp),
    #             xy=(avg_temp,3), xytext=(65,2.5),
    #             arrowprops=dict(facecolor='blue', shrink=0.05),fontsize=12)
    ax.annotate('Avg Temp no failures: {0}ºF'.format(round(avg_temp_no_failures, 2)),
                xy=(avg_temp_no_failures, 1), xytext=(67, 0.5),
                arrowprops=dict(facecolor='green', shrink=0.05), fontsize=12)
    plt.show()

    # get the data in the correct format (adds an Intercept column)
    y, X = dmatrices('Y ~ X', data, return_type='dataframe')

    # build the model
    logit = sm.Logit(y, X)
    result = logit.fit()

    # summarize the model
    print(result.summary())

    # odds ratio for temperature (select the coefficient by label, not position)
    print(round(np.exp(result.params['X']), 2))

    # odds ratios and 95% CI
    params = result.params
    conf = result.conf_int()
    conf['OR'] = params
    conf.columns = ['2.5%', '97.5%', 'OR']
    print('Confidence Interval Odds: ', np.exp(conf[1:]))