animating regression, part 2

Returning to the question of animating a regression model and its residuals: Part 1 stepped through basic animation with d3 toward a simple regression model and a first view of residuals. Let's extend that with two new views, a Q-Q plot and a plot of potential outliers using Cook's Distance. In this part we'll complete a full cycle through these four plots, and in a final piece we'll review and tweak the look of it all and add some more interesting data to the mix.

Picking up where we left off, we have a regression line transitioning to a view of residuals. Let's start by emphasizing that those residual distances are most valuable in the latter view, and remove them from the model view.

Showing residuals properly

Let's improve on this by showing appropriate axes for each view. Then, during the transitions, we will re-scale the y-axis to the residual values and back to the true data points again.

Some quick notes about this:

To relocate the residuals and the corresponding scale, the residual values are now explicitly calculated:

var residuals = [];
data.forEach(function(d, i) {
    // residual = observed value minus the model's expected value
    residuals.push(d - expected(i));
});

Then, we look for the largest absolute residual to define a symmetric domain for the y-scale:

var max_residual = d3.max(residuals, function(d) { return Math.abs(d); });
var y_residuals = d3.scale.linear()
    .domain([-max_residual, max_residual])
    .range([height - padding, padding]);

Finally, whereas in the first version above, I just picked an arbitrary y-location (200) to anchor the residual bars after the transition...
```
o.transition()
    .duration(duration)
    .attr("transform", "translate(0, " + (200 - y(expected(i))) + ")");
```
...now we can locate these correctly according to the residuals scale. We just have to replace the arbitrary location with the zero point of the y_residuals scale, which is its exact midpoint because the domain is symmetric:
```
o.transition()
    .duration(duration)
    .attr("transform",
        "translate(0, " + (y_residuals(0) - y(expected(i))) + ")");
```

Adding Cook's Distance

In simple linear regression models like this, outlier values can influence the slope of the model significantly, making predictions from a model fit to data with strong outliers less accurate than desired. A standard way to evaluate whether any influential outliers exist in a dataset is to examine Cook's Distance. It's easy enough to calculate in R:

    fit <- lm(formula=d ~ seq(0, 10))
    cd <- cooks.distance(fit)
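
The same values can also be computed in the browser. What follows is a minimal sketch rather than the code behind the animation: it assumes the x values are simply the indices 0 through n - 1 (mirroring `seq(0, 10)`), and the name `cooksDistance` and the sample values are made up for illustration. It uses the usual closed form for an ordinary least-squares fit, combining each point's squared residual with its leverage.

```javascript
// Cook's Distance for a simple linear regression of values on their index.
// Assumes x = 0, 1, ..., n - 1, mirroring seq(0, 10) in the R call.
function cooksDistance(values) {
    var n = values.length;
    var p = 2; // fitted parameters: intercept and slope
    var xs = values.map(function(d, i) { return i; });
    var xMean = (n - 1) / 2;
    var yMean = values.reduce(function(a, b) { return a + b; }, 0) / n;

    // Least-squares slope and intercept.
    var sxx = 0, sxy = 0;
    xs.forEach(function(x, i) {
        sxx += (x - xMean) * (x - xMean);
        sxy += (x - xMean) * (values[i] - yMean);
    });
    var slope = sxy / sxx;
    var intercept = yMean - slope * xMean;

    // Residuals (observed minus fitted) and mean squared error.
    var residuals = values.map(function(d, i) {
        return d - (intercept + slope * xs[i]);
    });
    var sse = residuals.reduce(function(a, e) { return a + e * e; }, 0);
    var mse = sse / (n - p);

    // D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with leverage h_i.
    return residuals.map(function(e, i) {
        var h = 1 / n + (xs[i] - xMean) * (xs[i] - xMean) / sxx;
        return (e * e) / (p * mse) * h / ((1 - h) * (1 - h));
    });
}

var sample = [1, 3, 2, 5, 4]; // made-up values, not the post's data
var cd = cooksDistance(sample);
```

Note that a perfectly linear dataset has zero residual error, so the division by the mean squared error is undefined there; real data should not hit that case.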

The basic rule of thumb is to look for any Cook's Distance values of 1 or greater. With that in mind it's easy enough to plot: typical graphs for Cook's D show the values as vertical bars with a horizontal line at 1. We'll need to transition the y scale and axis again, and there's a detail to notice: the Cook's D values might all be well below 1. If none of the values are particularly large (e.g. all below 0.10, as in the example diagnostics image at the top of part 1) and we scale the y axis all the way to 1, the variation among the small values will blur down into nothing. So we need to make a choice: if the values are well below 1, leave the line off completely; if any data points approach or pass 1, show the line. To handle this, we'll pick a mid-range cutoff like 0.33 or 0.5. If the max Cook's D is below that cutoff, we'll scale the axis to the narrower domain of the values; otherwise, we'll scale it up through 1.
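
That decision can be sketched as a small helper. This is an illustration under assumptions, not the code from the animation: the name `cooksAxis` and the returned field names are hypothetical, and using the largest observed value as the top of the narrower domain is one reading of the rule.

```javascript
// Pick the top of the Cook's D y-axis and decide whether to draw the line at 1.
// cutoff is the mid-range value (e.g. 0.33 or 0.5); it is a judgment call.
function cooksAxis(cooks, cutoff) {
    var maxD = Math.max.apply(null, cooks);
    if (maxD < cutoff) {
        // All values are small: zoom in on them and leave the line off.
        return { domainMax: maxD, showLine: false };
    }
    // Some value approaches or passes 1: scale through 1 and show the line.
    return { domainMax: Math.max(1, maxD), showLine: true };
}

var small = cooksAxis([0.02, 0.08, 0.05], 0.5);
var large = cooksAxis([0.1, 0.8], 0.5);
```

The `domainMax` value would then feed the upper end of the y scale's domain, with the lower end fixed at 0.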

This took some fiddling - ultimately, removing the transform/translate() bits from the section described earlier made this simpler. Instead of translating the values, directly setting the y values based on the appropriate scale functions for each view mode is more direct. As it turns out, the translation above wasn't even correct, so this anchors us back in cleaner code with less of a cognitive gap to verify that the values are accurate. Lesson learned: don't fall back to SVG translate() when simpler (and higher-order) d3 scales will do the job.
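
To make the idea concrete, here is a rough sketch of setting y directly from a per-view scale. It is not the post's actual code: `linearScale` is a stand-in for `d3.scale.linear()` so the example runs without d3, and the layout numbers, data, and fitted values are all made up.

```javascript
// Stand-in for d3.scale.linear().domain(domain).range(range), so the
// sketch runs without d3.
function linearScale(domain, range) {
    return function(v) {
        var t = (v - domain[0]) / (domain[1] - domain[0]);
        return range[0] + t * (range[1] - range[0]);
    };
}

var height = 300, padding = 30;            // hypothetical layout values
var data = [2, 4, 3, 7, 6];                // made-up observations
var fitted = [2.2, 3.3, 4.4, 5.5, 6.6];    // made-up model values

var residuals = data.map(function(d, i) { return d - fitted[i]; });
var maxResidual = Math.max.apply(null, residuals.map(Math.abs));

// One scale per view mode, all sharing the same pixel range.
var scales = {
    model: linearScale([0, Math.max.apply(null, data)], [height - padding, padding]),
    residuals: linearScale([-maxResidual, maxResidual], [height - padding, padding])
};

// Pixel y for point i in the given view: no translate() involved.
function yFor(mode, i) {
    return mode === "model" ? scales.model(data[i]) : scales.residuals(residuals[i]);
}
```

In a d3 transition this would be used along the lines of `.attr("cy", function(d, i) { return yFor(mode, i); })`, so each point's position comes straight from the scale for the current view.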

Once it started working correctly, an immediate benefit of this animation approach became clear. Look at the data point at x=6. It looks like a big outlier, especially when we shift into the residuals view, where it carries the largest residual. But when shifting into the Cook's Distance view, its influence proves to be much less than that of the point at x=10, whose Cook's D is roughly 0.8. This makes intuitive sense when shifting back to the model fit view; x=10 drags the slope of the model down substantially, enough to be wary of, even if not enough to consider throwing the value out.

Adding the Q-Q Plot

To add the Q-Q plot we have to calculate the quantile of each value and plot that against normal quantiles. To make it work with the animation loop, though, we further have to sort the quantiles and plot them all in the correct order.
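
As a rough sketch of that calculation (not the code used in the animation), the following pairs each value with a theoretical normal quantile while remembering its original position, so a point can keep its identity across views. The function names, the `(rank + 0.5) / n` plotting position, and the choice to invert the normal CDF by bisection are all assumptions made for this example; the CDF uses the well-known Abramowitz and Stegun approximation for erf.

```javascript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (absolute error around 1.5e-7, plenty for plotting).
function normalCdf(z) {
    var x = Math.abs(z) / Math.SQRT2;
    var t = 1 / (1 + 0.3275911 * x);
    var poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
               t * (-1.453152027 + t * 1.061405429))));
    var erf = 1 - poly * Math.exp(-x * x);
    return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Invert the CDF by bisection; slow but simple.
function normalQuantile(p) {
    var lo = -10, hi = 10;
    for (var k = 0; k < 80; k++) {
        var mid = (lo + hi) / 2;
        if (normalCdf(mid) < p) { lo = mid; } else { hi = mid; }
    }
    return (lo + hi) / 2;
}

// Pair each value, in sorted order, with the normal quantile for its rank.
function qqPoints(values) {
    var n = values.length;
    var order = values.map(function(d, i) { return i; })
        .sort(function(a, b) { return values[a] - values[b]; });
    return order.map(function(index, rank) {
        return {
            index: index,                                  // position in the original data
            theoretical: normalQuantile((rank + 0.5) / n), // x-axis
            observed: values[index]                        // y-axis
        };
    });
}

var qq = qqPoints([3, -1, 2]); // made-up values
```

In practice the input would most likely be the residuals (possibly standardized) rather than the raw data.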

The plot itself should be straightforward, with the normal quantiles and observed values on the x- and y-axis, respectively, and a normal reference line running through it all.

Because the axes and points are moving around so much, we'll add a simple title label and update it as we switch plots.
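
One way to keep the title in step with the plots is to drive both from a single list of views. This is only a sketch: the mode names, the title strings, and the order of the cycle are illustrative guesses, and `nextView` is a hypothetical helper.

```javascript
// The four views in the order they are cycled; names and titles are illustrative.
var views = [
    { mode: "model", title: "Model fit" },
    { mode: "residuals", title: "Residuals" },
    { mode: "qq", title: "Normal Q-Q" },
    { mode: "cooks", title: "Cook's Distance" }
];

// Return the view that follows the given mode, wrapping around at the end.
// An unknown mode falls through to the first view.
function nextView(mode) {
    var i = views.map(function(v) { return v.mode; }).indexOf(mode);
    return views[(i + 1) % views.length];
}
```

Each step of the animation loop could then look up `nextView(current)`, set the label with something like `title.text(view.title)`, and transition the points and axes for `view.mode`.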

That rounds out the sketch.

Now: to clean this up enough to be able to use it to render multiple regressions side-by-side. Stay tuned...

data.onebiglibrary.net
