Anscombe’s quartet actually has nothing to do with music, although when I hear the word quartet I associate it with music. This particular quartet refers to four datasets with very similar descriptive statistics that, when plotted, are obviously very different. The quartet was constructed by Francis Anscombe in 1973 to demonstrate a very important concept: descriptive statistics alone have limited value and can even be very misleading. The lack of data visualization can lead to misleading conclusions and overlooked critical information, like those elusive outliers. What I find so interesting about this simple demonstration is how well it highlights the need for a thorough understanding of the data before modeling of any kind begins, whether data, statistical, or mathematical.
Data modelers, as well as analysts, tend to rely heavily on some very basic descriptive statistics for their data profiling. This might be adequate if there is some a priori knowledge of the data, but it can be very risky when developing integration models. Most widely used vendor tools produce the same reports, which are for the most part inadequate. For example, a data profiling report will typically consist of data type, length, value range (min/max), domain values (e.g., a field only contains specific values like M, F, UK for gender), cardinality (uniqueness, not relational), number of nulls, and, for numeric fields, the mean, median, and standard deviation. Analysts and statisticians might even run a regression analysis and find that the fitted regression equation for each dataset is identical.
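As a sketch of the kind of report described above, the snippet below computes those same basic profiling statistics in R. It uses the `anscombe` data frame that ships with R's datasets package as stand-in input; the helper name `profile_column` is just an illustration, not a vendor tool's API:

```r
# A minimal data-profiling sketch: the basic statistics a typical
# vendor profiling report produces, computed per column.
profile_column <- function(v) {
  c(type     = class(v),
    n_null   = sum(is.na(v)),
    n_unique = length(unique(v)),          # cardinality
    min      = min(v, na.rm = TRUE),
    max      = max(v, na.rm = TRUE),
    mean     = round(mean(v, na.rm = TRUE), 2),
    median   = median(v, na.rm = TRUE),
    sd       = round(sd(v, na.rm = TRUE), 2))
}

# One row per column; mixing types coerces everything to character,
# which is fine for a human-readable report.
profiles <- t(sapply(anscombe, profile_column))
print(profiles)
```

Run against Anscombe's data, a report like this makes x1 through x3 look indistinguishable from one another, which is exactly the trap the quartet was built to expose.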
From these reports, it would seem reasonable to assume that data elements with very similar, if not identical, results represent the same attribute. Unfortunately, that conclusion could be incorrect. To be fair, if data modelers are provided with the information they are supposed to have, like process and data flow diagrams, data dictionaries, and the opportunity to interview data owners, the risk of drawing the incorrect conclusion is reduced. Too often, though, data modelers and analysts alike are not so fortunate and are left to their own devices, which leads to relying on vendor-generated profiling reports or inadequate personal analysis.
To demonstrate Anscombe’s concept, let’s assume that you have been given data from four different but related lines of business and asked to develop an integration model for an analytics platform. A small sample of the data is listed in the table below (Figure 1: Anscombe’s Quartet):
Just glancing at the data, each set appears relatively similar. The fourth dataset, with all x values identical except for one outlier, might raise concern, but otherwise the data appear very similar. To alleviate any concerns, you run summary statistics and a regression analysis on the data. The summary statistics in Figure 2 reveal that the x values for all four datasets have a mean of 9.0 and a median of 9.0 (8.0 for the fourth dataset), with all but one dataset having a minimum value of 4 and a maximum value of 14. The y values have similar summary statistics as well. During your regression analysis, you also discover that the proportion of response variance explained, or multiple R-squared, is 0.67 for all four datasets (see the R code below).
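You can verify that the fits really are this close with a few lines of R. This sketch loops over the four dataset pairs in R's built-in `anscombe` data frame (the same data used in this article) and collects the mean of x, the mean of y, and the R-squared of each regression:

```r
# For each of the four datasets, fit y ~ x and collect the
# summary statistics that make them look interchangeable.
fit_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = round(mean(y), 2),
    r_squared = round(summary(fit)$r.squared, 2))
})
colnames(fit_stats) <- paste0("dataset_", 1:4)
print(fit_stats)
```

All four columns report a mean x of 9, a mean y of about 7.5, and an R-squared of 0.67, despite the datasets being wildly different shapes.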
So far, everything looks good, and you are about to move forward with your analysis when you decide to plot the datasets along with their regression lines to validate your conclusions. The following figure shows a scatterplot of each dataset with its linear regression line. As you can see, they all have similar regression lines, but the datasets themselves are quite different. As they say, a picture is worth a thousand words.
For those interested, here is the R code used to generate the figures above:
anscombe <- data.frame(
  x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
  y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)
)

png(filename = "Anscombe.png", width = 480, height = 480, units = "px")
par(mfrow = c(2, 2), oma = c(1, 1, 2, 1), mar = c(5, 4, 2, 2))
limx <- c(0, 20)
limy <- c(0, 14)

# Plot each dataset with its least-squares regression line.
# Note: the regression must be y ~ x, not x ~ y, or abline()
# draws the wrong line.
plot(anscombe$x1, anscombe$y1, pch = 21, bg = "purple", xlab = "x1", ylab = "y1",
     bty = "n", xlim = limx, ylim = limy, main = "Dataset I")
abline(lm(y1 ~ x1, data = anscombe), lwd = 2, lty = 2, col = "red")

plot(anscombe$x2, anscombe$y2, pch = 21, bg = "purple", xlab = "x2", ylab = "y2",
     bty = "n", xlim = limx, ylim = limy, main = "Dataset II")
abline(lm(y2 ~ x2, data = anscombe), lwd = 2, lty = 2, col = "red")

plot(anscombe$x3, anscombe$y3, pch = 21, bg = "purple", xlab = "x3", ylab = "y3",
     bty = "n", xlim = limx, ylim = limy, main = "Dataset III")
abline(lm(y3 ~ x3, data = anscombe), lwd = 2, lty = 2, col = "red")

plot(anscombe$x4, anscombe$y4, pch = 21, bg = "purple", xlab = "x4", ylab = "y4",
     bty = "n", xlim = limx, ylim = limy, main = "Dataset IV")
abline(lm(y4 ~ x4, data = anscombe), lwd = 2, lty = 2, col = "red")

title(main = "Anscombe Quartet", outer = TRUE, cex = 0.5)
graphics.off()

# Print the regression summaries and the overall summary statistics.
print(summary(lm(y1 ~ x1, data = anscombe)))
print(summary(lm(y2 ~ x2, data = anscombe)))
print(summary(lm(y3 ~ x3, data = anscombe)))
print(summary(lm(y4 ~ x4, data = anscombe)))
summary(anscombe)