# Comparing Bivariate Plots Under Different Assumptions

I’m a strong proponent of graphical comparisons before diving into models, but which exploratory plot to use depends heavily on the underlying distributions of the data and which signals you’re looking for. Below I compare 4 kinds of bivariate plots for continuous variables (binning into 10 quantiles with boxplots, scatter-plot with a lowess fit, hexbins with a greyscale color scheme, and 2d density with a rainbow color scheme). I show they each have strengths and weaknesses recovering five functional forms (Piece-wise linear stair-step, piece-wise linear, exponential, quadratic, and sinusoidal) and three distributions of the covariates (uniform, normal, and mixed exponential and normal).

## The Setup

Plotting a set of visualizations for a given relationship is handled all at once with a single function. We specify the covariate X and predictions from a prespecified model Y, and it returns five combined plots that we can later combine.

```path = "your/path/here"
library(ggplot2); library(gridExtra); require(grid)
colfunc <- colorRampPalette(c("darkblue", "lightblue", "green", "yellow", "red"), space = "Lab")

finalplot <- function(true, e_mean=0, e_sd=.5) {

xplot <- ggplot(data=as.data.frame(x), aes(x=x)) +
geom_histogram(colour = "darkgreen", fill = "white", binwidth = 0.01) + theme_bw() +
ggtitle("X Covariate (Uniform Distribution)") + xlab("") +
theme(plot.title = element_text(size = rel(2)))
ggsave(xplot, file=paste0(path, "xplot_unif.png"))

y = true + rnorm(n, mean = e_mean, sd = e_sd)
up=max(y); bottom=min(y)
data = as.data.frame(cbind(x,y,true))

g <- theme_bw() + theme(plot.margin=unit(c(0,0,0,0),"mm"), axis.text.x = element_blank(), axis.line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.title.x = element_blank(),  axis.title.y = element_blank())

#True Functional Form without Noise
a <- ggplot(data, aes(x=x,y=true)) + geom_line() + g + ylim(bottom,up)

#Boxplots with 10 quantiles
b <- ggplot(data, aes(x=cut(x, breaks=quantile(x,probs = seq(0, 1, 0.1)), include.lowest = T),y=y)) + stat_smooth() + g + ylim(bottom,up) + geom_point(col="grey",position = "jitter") + geom_boxplot(alpha=.6)

#Scatter-plot with Lowess
c <- ggplot(data, aes(x=x,y=y)) + geom_point(col="grey", alpha=.6) + g + ylim(bottom,up) + stat_smooth()

#Hexbins Density
d <- ggplot(data, aes(x=x,y=y)) + stat_binhex(bins = 10) + scale_fill_gradient(low = "white", high = "black") + g + stat_smooth()

#2d Density
e <- ggplot(data, aes(x=x,y=y)) + stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +

gg <- arrangeGrob(a,b,c, d,e, ncol=5, main="")
return(gg)

}
```

For each set of plots, we pick the number of samples, the distribution of those samples and then a model. In this case, I use a moderately large n of 2000 on the domain of 0 to 1. In this case, the covariate is drawn from a uniform distribution.

```n=2000
x=runif(n)

#Piece-wise mean shifting
true = rep(0,n); true[x<.33] <- -1; true[x>.66] <- 1;g1 <- finalplot(true)

#Piece-wise linear
true=5*x; true[x>.5] <- -5*x[x>.5] + 5; g2 <- finalplot(true)

#Exponential
true=2*x^2; g3 <- finalplot(true)

true = 2*x - 2*x^2; g4 <- finalplot(true)

#Sinusoidal
true = sin(x*10); g5 <- finalplot(true)

total <- grid.arrange(g1,g2,g3,g4,g5, nrow=5)

```

## Uniform Distribution

In this example, samples of the covariate are drawn from a uniform distribution. A true functional form appears in the first column, to which gausian noise is added (drawn from a normal distribution with mean zero and a standard deviation of .5).  Several interesting things pop out immediately. First, all four types of plots do a good job capturing the underlying functional form. Second, a uniformly distributed covariate provides the best chance of observing the function across the full domain, but even then there will necessarily be hot spots that can be distracting in the two density plots. Third, with constant variance, a boxplot does less to convey the underlying relationship than just the scatterplot with a lowess curve.

## Normal Distribution

More typically, a covariate could be normally distributed (or normalized to a normal distribution). In this case, there are few observations in the tails which leads to misleading turns in the lowess fit and the boxplot to completely misrepresent a sinusoidal relationship as a quadratic one. Both of the density plots perform especially bad, overemphasizing the center of the distribution at the expense of visual detail on the tails.

## Mixed Distribution (Exponential for low values and normal for high)

In a more interesting case, the covariate is generated by multiple processes. In estimating population, for example, I’ve found that areas of high population density appear approximately log normal but areas of low population density are exponentially distributed with most of the mass at no population, fewer with one person, fewer with two, etc. In contrast to the other two examples, the two density plots do a good job illustrating the bimodal divide. If we had not plotted the histogram separately,  the scatter-plot with a lowess fit would not have tipped us off that there might be two different populations with different data generating processes.

## Summing Up

The results suggest a couple of useful guidelines.  First, as should be expected, no single plot proves universally superior to the others. Second, if you are at all worried about multiple populations/data generating processes then check the marginal distribution and one of the two density plots. Third, always add a lowess fit and keep in mind that it may be misleading in low density areas. Finally, 2d hexagon binning is a nice compromise between a scatter plot and a density plot. By altering the size of the hexagons and the color scale for density, you can dial it closer to one or the other to fit your particular situation.