data.onebiglibrary.net

Choosing a Favicon

2026-02-25T00:00:00-05:00

This Pelican blog has sat still since about 2014. I used to publish it to an S3 bucket with static hosting turned on. I've been thinking about establishing a new place to write lately, then remembered this was already here, and the domain kinda covers my professional interests still, so I went with it. But retconning it into contemporary versions of things wasn't a straightforward task. Not an overwhelmingly hard task, or very large, but it was a bunch of stuff that had to change a little all at once.

Then there's Claude.

I'm not going to go on about Claude itself a lot, there are plenty of people doing that already. But suffice it to say that while I'm not employed I have a lot of time on my hands and a lot of experience managing software teams. Tools this good - at least in terms of performance in building software systems, setting aside the much bigger questions at play - are really still brand-spanking new. They weren't this good more than three months ago. They just let me build pretty much whatever I feel like building, as if I had my own team that works really quickly and lets me operate in the plan/scope/refine/review mode I've been in professionally for a while now.

So while I have this time available and this ridiculous toolkit and before it all crumbles somehow or the prices go sky-high (both of which will probably happen sooner than we expect) and I have a bunch of ideas of stuff to build, I'm building stuff.

Fiddly tiny blog migration project? I know a thing that will help me get that done a lot faster than I could myself. It created this pull request to modernize the environment, and this pull request to retcon old posts into the new structure. It coached me on getting migrated to a new hosting platform and updating DNS. Darned thing just worked, dang it all.

I was testing the retconned content and noticed a warning about lacking a favicon. What to do? I asked for proposals. Thoughtfully it reached into my older content and came up with good ideas, so I asked it to draft the rest of this post for content.

I went with "four patch grid" because it looked best as a favicon, and there was something familiar about it, beyond just the color choices (see Albers color study posts from 2014). Took a few minutes to figure it out, but it's basically the same design as the old del.icio.us icon, just with the Albers study colors taken from those prior posts. Seemed like a good fit. I don't know if joshu is still out there, but I hope he's been good, and I hope he doesn't mind.

Everything below this sentence was bot-generated.

The Brief

The site needed a favicon. Given that so much of the early content here is built around Josef Albers' color studies rendered in D3, it made sense to draw from that same visual language: simple geometric shapes, deliberate color choices, and the interplay between them.

The constraint is severe. A favicon lives at 16x16 or 32x32 pixels. Whatever we choose has to read clearly at those sizes, in a browser tab alongside a dozen others, often against both light and dark chrome. That rules out anything with fine detail or subtle gradation. What survives is flat color and strong shape.

With that in mind, here were four candidates, each inspired by a different Albers study from the earlier posts.

Option 1: Nested Squares

Three concentric squares: deep purple, red-orange, yellow. This echoes the very first Albers exercise on the site — placing one color inside another to see how the surround changes your perception of the inner color. The nesting reads well at small sizes because the contrast between the three layers is high. At 16px the innermost square is only a few pixels across, but the warm-to-cool progression still registers.

Option 2: Offset Overlap

Two overlapping rectangles — teal and yellow — with a mint intersection zone. This one references the transparency and optical mixture studies, where Albers demonstrated that you can simulate the appearance of overlapping translucent layers using nothing but flat, opaque color. The intersection is the key: it's a third color that your eye reads as the blend of the other two, even though it's just a solid fill.

At 128px this tells the richest story of any of the options. At 16px the three-zone structure starts to muddy — the intersection becomes hard to distinguish from its neighbors, and the asymmetric layout loses its spatial logic. The idea is stronger than its execution at favicon scale.

Option 3: Four-Patch Grid

A 2x2 grid of colored squares: pink, red, green, grey. These are the four colors from the juxtaposition study in the second Albers post, where every permutation of layering was laid out in a grid to show how quantity and ordering affect the feel of a color combination.

This one holds up the best at small sizes. Four equal quadrants is about the simplest spatial structure you can have beyond a single block, and the four colors are different enough in both hue and value that they stay distinct even at 16px. The 2x2 grid also has a nice visual rhythm — it reads as a deliberate pattern, not a smudge.
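A 2x2 grid is also about as simple as an icon gets to generate. Here's a minimal sketch that builds one as an SVG string. The function name, the named CSS colors standing in for the study's actual values, and the order of the quadrants are all assumptions made for illustration; it isn't the code behind the real icon.

```javascript
// Build a four-patch icon as an SVG string.
// colors: four fill values, placed left-to-right, top-to-bottom.
// size: width and height of the viewBox in user units.
function fourPatchSvg(colors, size) {
    var half = size / 2;
    var rects = colors.map(function (color, i) {
        var x = (i % 2) * half;             // column 0 or 1
        var y = Math.floor(i / 2) * half;   // row 0 or 1
        return '<rect x="' + x + '" y="' + y + '" width="' + half +
               '" height="' + half + '" fill="' + color + '"/>';
    });
    return '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ' + size +
           ' ' + size + '">' + rects.join("") + '</svg>';
}

// Placeholder colors only; the real icon uses the Albers study colors.
var icon = fourPatchSvg(["pink", "red", "green", "grey"], 32);
```

Because the output is vector, the same string scales to 16px or 128px without change.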

Option 4: Stacked Bands

Four horizontal bands progressing from dark olive to bright yellow. This references the middle mixture and light intensity studies, where Albers explored how adjacent colors of similar value seem to merge while colors with strong value contrast maintain their boundaries.

The bands are clean and simple, but at 16px the four-step gradient compresses into something that reads more like a generic color swatch than a distinctive mark. The progression is too smooth — there's no focal point for the eye to grab.

The Choice

The four-patch grid won. It has the clearest identity at the sizes that matter, the colors are directly drawn from the Albers work on the site, and the 2x2 structure is simple enough to be iconic without being generic. It's the one you're seeing in your browser tab right now.

And We're Back

2026-02-24T00:00:00-05:00

I've reanimated the site and updated all the bits, and will be publishing here again. Hooray.

The short-form update: I spent 10 years as a data scientist, doing work that I enjoyed a great deal and that had meaningful impact and reach, but then federal government funding priorities shifted. Now there's Claude, so I'm working on stuff. If you're reading this, you will see more soon!

Thanks for sticking around. Seriously. It means a lot.

animating Anscombe's Quartet regression diagnostics

2014-11-12T00:00:00-05:00

Using the sketch developed in animating regression parts 1 and 2, let's take a look at Anscombe's Quartet. What makes these datasets useful, as Wikipedia points out, is their near-equivalent stats: the x and y sets share the same mean, sample variance, correlation and simple linear regression model. It's instructive as a clear example of what to watch out for when developing simple linear regressions, and the issues each dataset highlights become clear in the different diagnostic plots.

The technical challenge here is to use the sketch developed in part 2 four times. That code is a mess; it reflects my learning process, but it's not anything I'd want to reuse. The simplest approach to solving this is to turn the viz element into a reusable chart (and to-read: Exploring Reusability with D3.js).

I'm under several class deadlines just now, so I won't go as far as possible in making this nice and cleanly configurable and modifiable, but I certainly don't want to write the same code out four times, so I'll look for a middle ground that achieves some code cleanup and a modicum of reuse.

First off, we need to pull the source data into this page. The Anscombe datasets and their summary statistics are readily available, but their linear model residuals and Cook's distance values require a little calculation. There are JavaScript stats libraries that can handle the regression, but they don't seem to ship with a Cook's distance implementation. (to-do: pull request.) Fortunately R ships with the anscombe data pre-loaded, and it's easy to put all this together and write it out as JSON for easy use here:

library(rjson)
a1 <- data.frame(anscombe$x1, anscombe$y1)
names(a1) <- c("x", "y")
a1fit <- lm(y ~ x, a1)
a1$cooks <- cooks.distance(a1fit)
a1$error <- a1fit$residuals
a1$quantile <- scale(a1$error)
# repeat for a2, a3, a4
aout <- vector(mode="list", length=4)
names(aout) <- c("a1", "a2", "a3", "a4")
aout$a1 <- a1
aout$a2 <- a2
aout$a3 <- a3
aout$a4 <- a4
toJSON(aout)

This can be written to a file for later use, like here.
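For the one-predictor case, Cook's distance is compact enough to sketch in plain JavaScript. This is a minimal sketch from the textbook formula (squared residual over p times the mean squared error, weighted by leverage), not a port of R's cooks.distance(); the function name and the tiny example dataset are made up for illustration.

```javascript
// Cook's distance for a simple linear regression y ~ x.
// A sketch from the textbook formula; the name cooksDistance is hypothetical.
function cooksDistance(xs, ys) {
    var n = xs.length;
    var p = 2; // parameters in the model: intercept and slope
    var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
    var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;

    // least-squares fit
    var sxx = 0, sxy = 0;
    for (var i = 0; i < n; i++) {
        sxx += (xs[i] - meanX) * (xs[i] - meanX);
        sxy += (xs[i] - meanX) * (ys[i] - meanY);
    }
    var slope = sxy / sxx;
    var intercept = meanY - slope * meanX;

    // residuals (observed minus fitted) and mean squared error
    var residuals = xs.map(function (x, j) {
        return ys[j] - (intercept + slope * x);
    });
    var sse = residuals.reduce(function (a, e) { return a + e * e; }, 0);
    var mse = sse / (n - p);

    return xs.map(function (x, j) {
        // leverage of observation j
        var h = 1 / n + (x - meanX) * (x - meanX) / sxx;
        return (residuals[j] * residuals[j]) / (p * mse) *
               h / ((1 - h) * (1 - h));
    });
}

// Made-up three-point example: approximately [2.5, 0.25, 2.5]
var cooksExample = cooksDistance([0, 1, 2], [0, 1, 0]);
```

When an observation has a leverage of 1 and a residual of 0, the formula works out to 0/0. In floating point that can surface as NaN, Infinity, or a very large number, so the result still needs a guard before it goes anywhere near a scale.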

The plan is to make at least the following changes to the chart:

define function regcycle() as the reusable chart and invoke it four times
instead of creating multiple scales and axes, rewrite each within each plot/view mode function instead so they're self-contained
move the axis updates to the top of each plot function
load the source Anscombe data and initialize the charts using a d3.json() callback
bind the selections and the data to each of the four charts
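For reference, the reusable chart pattern boils down to a closure that returns a function with getter/setter accessors hung off it. Here's a minimal sketch of that shape. Only the name regcycle comes from the plan above; the width, height and duration options are hypothetical, and the drawing code is left as a comment.

```javascript
// Closure-with-accessors sketch of a reusable chart.
// The options below are hypothetical defaults, not values from the post.
function regcycle() {
    var width = 400;
    var height = 300;
    var duration = 1500;

    // Called once per selection; in a browser with D3 loaded this would be
    // something like d3.select("#a1").datum(data).call(chart).
    function chart(selection) {
        selection.each(function (data) {
            // drawing code for one chart instance would go here,
            // using width, height, duration and the bound data
        });
    }

    // Each accessor reads with no argument, and with one argument
    // writes and returns the chart so calls can be chained.
    chart.width = function (value) {
        if (!arguments.length) return width;
        width = value;
        return chart;
    };
    chart.height = function (value) {
        if (!arguments.length) return height;
        height = value;
        return chart;
    };
    chart.duration = function (value) {
        if (!arguments.length) return duration;
        duration = value;
        return chart;
    };

    return chart;
}

// Two independently configured instances
var smallChart = regcycle().width(300).duration(1000);
var largeChart = regcycle().width(600);
```

Each call to regcycle() gets its own private settings, which is what makes invoking it four times safe.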

Let's see how it goes.

This seems about right. Some additional changes proved necessary:

color! highlighting each data point with a color from the color brewer "spectral" should support a viewer's ability to follow any specific observation through the four plots.

the JSON output from R I described above was more awkward to work with than a more row- or observation-oriented dataset shape, so there's a quick reshaping step. This results in simple references to the data values.

// reshape the column-oriented data into one object per observation
// (data is assumed to be an array declared earlier)
for (var i = 0; i < seldata.x.length; i++) {
    var obs = {
        x: seldata.x[i],
        y: seldata.y[i],
        residual: seldata.error[i],
        cooks: seldata.cooks[i],
        quantile: seldata.quantile[i]
    };
    data.push(obs);
}

the Cook's Distance calculation for dataset four results in a NaN value for the far-right value, so I added a check for that to result in shooting the data point straight up way off the viewpane. This is perhaps not viable statistically but it feeds the animation well, specifically in the transition to the Q-Q plot, making the story told clearer to my eye. Following the bottom right pane, watch the green dot snap back into place at the very end of the transition to Q-Q and you get the effect. The data check for the NaN is simple but effective:

var c = svg.selectAll(".observation");
c.each(function(d, i) {
    var o = d3.select(this);
    o.select(".data-point").transition()
        .duration(duration)
        .attr("cx", x(i + 1))
        .attr("cy", y(isNaN(d.cooks) ? 50 : d.cooks));
    o.select(".residual-bar").transition()
        .duration(duration)
        .attr("x1", x(i + 1))
        .attr("y1", y(isNaN(d.cooks) ? 50 : d.cooks))
        .attr("x2", x(i + 1))
        .attr("y2", y(0));
});

added a 1.1 buffer factor around the input domain for several of the axes to draw the data inside the axis lines.
bringing the residual and distance bars into the two relevant plots proved to require more attention. because the data points move around, the bars can be left in a position from an earlier plot that doesn't make sense two plots later. that could lead to the bars whooshing in from odd angles as they reappear, which is awkward, detracting from the intended narrative. This temporary mistake evokes that awkwardness:

Left as an exercise for the writer

There are several unresolved issues I would like to revisit:

synchronizing the transitions among four different chart instances doesn't seem to have a single obvious solution. you can see them fall out of sync if you watch the cycle long enough, and avoiding that might require some sort of clock check or simple communication pattern. even so, the eye can only meaningfully follow one plot at a time, so it doesn't ruin the effect, and if you believe google analytics few readers spend more than one minute per page on this site, so it's not a serious problem here, now.
i can see a case for pulling the axis resetting back out of the individual plot modes again; it's a little cumbersome to keep reassigning each time. on the other hand, this way all the logic for a plot is self-contained, so it would feel a little cleaner to add more plots to the reel without having to bounce around and keep track of a dozen different scale and axis variables.
would be nice to jitter or spread out the residuals so the bars don't overlap like on the fourth dataset.
no configuration accessors keeps this from being particularly reusable by anybody else, but that's the other side of that line i drew in going for that middle ground. other homework awaits!
several details are hard-coded, like the color scale and the size and styling of different elements.
if jstat or simple statistics had a cook's distance function we could take arbitrary datasets and render them all inline, or at least as part of the reusable graph.
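On the synchronization point, one possible "clock check" is to derive both the current plot mode and the wait until the next switch from the wall clock, rather than chaining fixed delays. This is only a sketch of the idea and hasn't been tried against the charts here; the function and parameter names are hypothetical.

```javascript
// Derive the plot mode and the time to the next switch from a shared clock,
// so independent chart instances agree without talking to each other.
// nowMs: current time in ms; stepMs: length of one plot's turn;
// modeCount: number of plots in the cycle.
function cyclePosition(nowMs, stepMs, modeCount) {
    var step = Math.floor(nowMs / stepMs);
    return {
        mode: step % modeCount,           // which plot should be showing
        wait: stepMs - (nowMs % stepMs)   // ms until the next boundary
    };
}

var cyclePos = cyclePosition(10500, 4000, 4);
// cyclePos.mode is 2, cyclePos.wait is 1500
```

Each chart would then call something like setTimeout(next, cyclePosition(Date.now(), stepMs, 4).wait), so any drift gets corrected at every boundary instead of accumulating.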

Always good to have something to work on next.

animating regression, part 2

2014-10-18T00:00:00-04:00

Returning to the question of animating a regression model and its residuals. Part 1 stepped forward through basic animation with d3 toward a simplistic regression model and a first view of residuals. Let's take that through two new views, a Q-Q plot and a plot of potential outliers with Cook's Distance. In this part we'll complete a full cycle through these four plots, and in a final piece we'll review and tweak the look of it all and add some more interesting data into the mix.

Picking up where we left off, we have a regression line transitioning to a view of residuals. Let's start by emphasizing that those residual distances are most valuable in the latter view, and remove them from the model view.

Showing residuals properly

Let's improve on this by showing appropriate axes for each view. Then, during the transitions, we will re-scale the y-axis to the residual values and back to the true data points again.

Some quick notes about this:

To relocate the residuals and the corresponding scale, the residual values are now explicitly calculated:

// residual = observed value minus the value the model expects
var residuals = [];
data.forEach(function(d, i) {
    residuals.push(d - expected(i));
});

Then, we look for the maximum residual value to define the domain of the y-scale:

var max_residual = d3.max(residuals, function(d) { return Math.abs(d); });
var y_residuals = d3.scale.linear()
    .domain([-max_residual, max_residual])
    .range([height - padding, padding]);

Finally, whereas in the first version above, I just picked an arbitrary y-location (200) to anchor the residual bars after the transition...
```
o.transition()
    .duration(duration)
    .attr("transform", "translate(0, " + (200 - y(expected(i))) + ")");
```
...now we can locate these correctly according to the residuals scale. We just have to replace the arbitrary location with the exact midpoint of the y_residuals scale:
```
o.transition()
    .duration(duration)
    .attr("transform",
        "translate(0, " + (y_residuals(0) - y(expected(i))) + ")");
```

Adding Cook's Distance

In simple linear regression models like this, outliers can influence the fitted slope significantly, making predictions from the resulting model less accurate than desired. A standard way to evaluate whether any observations are unduly influential is to examine Cook's Distance. It's easy enough to calculate in R:

    fit <- lm(formula=d ~ seq(0, 10))
    cd <- cooks.distance(fit)

The basic rule of thumb is to look for any Cook's Distance values of 1 or greater. It's easy enough to plot with this in mind: typical graphs for Cook's D show the values as vertical bars with a horizontal line at 1. We'll need to transition the y scale/axis again, and there's a choice to make because the Cook's D values might all be well below 1. If none of the values are particularly large (e.g. all below 0.10, as in the example diagnostics image at the top of part 1) and we scale the y axis all the way to 1, the variation among the small values will blur down into nothing. To handle this, we'll pick a mid-range cutoff like 0.33 or 0.5. If the max Cook's D is below that cutoff, we'll scale the axis to the narrower domain of the values and leave the line at 1 off completely; otherwise, we'll scale the axis up through 1 and show the line.
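The axis choice described above can be sketched as a small helper that returns both the y domain and whether to draw the line at 1. The function name, the shape of the return value, and the fallback for an empty input are assumptions for illustration, not the code running in the sketch.

```javascript
// Pick the y-axis domain for the Cook's D view.
// cooksValues: array of Cook's distances; cutoff: e.g. 0.33 or 0.5.
function cooksAxis(cooksValues, cutoff) {
    // ignore NaN and Infinity so they can't stretch the axis
    var finite = cooksValues.filter(function (v) { return isFinite(v); });
    if (!finite.length) {
        return { domain: [0, 1], showLine: true }; // assumed fallback
    }
    var maxD = Math.max.apply(null, finite);
    if (maxD < cutoff) {
        // everything is small: zoom in and leave the line at 1 off
        return { domain: [0, maxD], showLine: false };
    }
    // something approaches or passes 1: scale through 1 and show the line
    return { domain: [0, Math.max(1, maxD)], showLine: true };
}

var quietAxis = cooksAxis([0.02, 0.08], 0.5);  // domain [0, 0.08], no line
var alertAxis = cooksAxis([0.1, 0.8], 0.5);    // domain [0, 1], line shown
```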

This took some fiddling - ultimately, removing the transform/translate() bits from the section described earlier made this simpler. Instead of translating the values, directly setting the y values based on the appropriate scale functions for each view mode is more direct. As it turns out, the translation above wasn't even correct, so this anchors us back in cleaner code with less of a cognitive gap to verify that the values are accurate. Lesson learned: don't fall back to SVG translate() when simpler (and higher-order) d3 scales will do the job.

Once it started working correctly an immediate benefit of this animation approach became clear. Look at the data point at x=6. It looks as if it's a big outlier, especially when we shift into the residuals view, where it carries the largest residual error. But when shifting into the Cook's Distance view, its impact as an outlier proves to be much less than that of the point at x=10, whose Cook's Distance is roughly 0.8. This makes intuitive sense when shifting back to the model fit view; x=10 drags the slope of the model down substantially, enough to be wary of, even if not enough to consider throwing the value out.

Adding the Q-Q Plot

To add the Q-Q plot we have to calculate the quantile of each value and plot that against normal quantiles. To make it work with the animation loop, though, we further have to sort the quantiles and plot them all in the correct order.

The plot itself should be straightforward, with the normal quantiles and observed values on the x- and y-axis, respectively, and a normal line running through it all.
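Here's a minimal sketch of the calculation behind that, in plain JavaScript. It leans on two assumptions that aren't from the sketch itself: the inverse normal CDF uses the Abramowitz and Stegun approximation (formula 26.2.23, good to roughly three decimal places), and the plotting position for each rank is (rank + 0.5) / n, which is one common convention among several. The function names are made up.

```javascript
// Approximate inverse of the standard normal CDF for 0 < p < 1
// (Abramowitz & Stegun 26.2.23; absolute error below about 4.5e-4).
function normalQuantile(p) {
    var q = p < 0.5 ? p : 1 - p;
    var t = Math.sqrt(-2 * Math.log(q));
    var z = t - (2.515517 + 0.802853 * t + 0.010328 * t * t) /
                (1 + 1.432788 * t + 0.189269 * t * t + 0.001308 * t * t * t);
    return p < 0.5 ? -z : z;
}

// Pair each observation with a theoretical normal quantile while keeping
// its original index, so the same dot can be followed between plots.
function qqPoints(values) {
    var n = values.length;
    var order = values
        .map(function (v, i) { return { index: i, value: v }; })
        .sort(function (a, b) { return a.value - b.value; });
    order.forEach(function (o, rank) {
        // assumed plotting position: (rank + 0.5) / n
        o.theoretical = normalQuantile((rank + 0.5) / n);
    });
    // hand the points back in their original order to match the bound data
    return order.sort(function (a, b) { return a.index - b.index; });
}

var qqExample = qqPoints([3, 1, 2]);
```

Returning the points in their original order is the part that matters for the animation loop: the sort decides where each point lands on the x-axis, but the element bound to each observation stays the same.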

Because the axes and points are moving around so much, we'll add a simple title label and update it as we switch plots.

That rounds out the sketch.

Now: to clean this up enough to be able to use it to render multiple regressions side-by-side. Stay tuned...

year's worth of dots

2014-10-01T00:00:00-04:00

For a project at work we've collected a year's worth of samples from a major non-US social media site. The samples are taken every 30 seconds, a snapshot of the most recent 200 public posts from all users. This created a lot of files, and along the way we missed some in chunks for various reasons (network outage, service error, reboot, etc.). The researcher we're supporting has happily taken a copy of the 100+ GB (compressed) of data to start poring through, but asked that we help prepare a simple visualization of the data that's present - or more importantly, what's missing.

Because it's natural to miss a few files here and there over a year's time, it's not a problem unless there are big chunks missing or patterns of errors that make sampling from this data problematic. An image of what's there and what's not there needs to hit a few key points:

cover the entire collection period (actually ~13 months)
show missing files
show empty files
easily spot large gaps
easily spot significant patterns

In addition to the immediate use (the researcher's own knowledge of what they have) this visualization needs to work for their advisors and others interested in the work, so it should be readily digested, by which I mean:

should fit on one screen
shouldn't require much explanation

To give a sense of volume, this set of files should be roughly 365 day/yr * 24 hr/day * 120 files/hr = 1,051,200 files. It's a good number. It's too many to read from disk in realtime, so this will require preprocessing.

First sketch

Let's start with a rough picture of what it will take to fit the dots onto one screen. One day's worth is 24 * 120 = 2880 dots, which is too much for one screen width of pixels, but if we can divide it at least in half, we're getting closer. The 365-day year is easier; we can multiply it by two or three and still fit a good number of pixels in. So with this in mind, here's a 1440x730 grid.

Yah ok that's too wide.

Let's try again at half the width, and just for fun (and because vertical scrolling isn't so hard) let's make it taller.

One more time, with a grid and some date scales to shape it all out better:

Adding real data

Okay, now we're getting somewhere. It's time to work with some real data and place it onto the scales using dates and times. As a first cut, I've extracted a file count for each hour in the dataset. This resulted in a JSON file with content like this:

...    
"2014-09-29 08:00:00Z": 120, 
"2014-09-29 09:00:00Z": 120, 
"2014-09-29 10:00:00Z": 119, 
"2014-09-29 11:00:00Z": 120, 
"2014-09-29 12:00:00Z": 120, 
"2014-09-29 13:00:00Z": 120, 
...
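The extraction step itself isn't shown in this post. As a rough sketch of the idea, assuming each file can be reduced to an ISO 8601 timestamp string (an assumption; the real file naming isn't described), counting per UTC hour into keys of the same shape as above could look like this:

```javascript
// Count timestamps per UTC hour, keyed like "2014-09-29 10:00:00Z".
// The function name and the ISO-string input are hypothetical.
function hourlyCounts(timestamps) {
    var counts = {};
    timestamps.forEach(function (ts) {
        // toISOString() is always UTC, e.g. "2014-09-29T10:00:30.000Z"
        var iso = new Date(ts).toISOString();
        var key = iso.slice(0, 10) + " " + iso.slice(11, 13) + ":00:00Z";
        counts[key] = (counts[key] || 0) + 1;
    });
    return counts;
}

var sampleCounts = hourlyCounts([
    "2014-09-29T10:00:30Z",
    "2014-09-29T10:59:30Z",
    "2014-09-29T11:00:00Z"
]);
// two files in the 10:00 hour, one in the 11:00 hour
```

Note that an hour with no files at all never gets a key this way, so fully missing hours would have to be filled in with 0 separately.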

Loading this into a sketch is easy with d3.json(). The keys are sorted, but just to be thorough I'll also use d3.min() and d3.max() to get the first and last date/times from the set.

The next piece of all this is to set the scales to use the dates. I created that data file knowing that javascript should be able to parse the dates cleanly; hopefully this will feed right into the d3 time scaling functions.

Finally, it'll all come together with line segments drawn in for each hour. The percentage of files available (should be 120 total for each hour) will feed into a color scale. To see the contrast of missing files well, the scale will have to be nonlinear rather than linear (earlier discussion of which via Albers is written up in this post). Once again, d3 helps us out, with the d3.scale.pow() power scaling function. To scale the input domain to the output range using a power of two, we just set the exponent on the scale as well, and use colors as the range:

var color_scale = d3.scale.pow()
    .exponent(2)
    .domain([0, 120])
    .range(["#fff", "#000"]);

This should make hours with missing files lighter: a missing file or two will be barely noticeable, but more than a dozen or so should stand out.
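To see what an exponent of 2 does to the grays, here's a plain-JavaScript stand-in for that scale. It only mimics the idea (square the fraction of files present, then blend from white to black) and isn't D3's own interpolation code; the function name is made up.

```javascript
// Map a file count to a gray hex color, white (none) to black (all present),
// with the fraction raised to a power first.
function powGray(count, maxCount, exponent) {
    var t = Math.pow(count / maxCount, exponent); // 0 = white, 1 = black
    var level = Math.round(255 * (1 - t));
    var hex = level.toString(16);
    if (hex.length < 2) hex = "0" + hex;
    return "#" + hex + hex + hex;
}

powGray(120, 120, 2); // "#000000"
powGray(60, 120, 2);  // "#bfbfbf"
powGray(0, 120, 2);   // "#ffffff"
```

An hour with only half its files comes out a fairly light gray rather than a mid gray, which is the contrast the squared scale is there to provide.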

That's the trick. This meets the purpose, but has two problems:

The time zones are off. See the way the first day starts at 00:00 but stops at 20:00? That's the four-hour adjustment for Eastern (US) time, which is happening in a way I'm not controlling properly. You can see this for yourself by mousing over a 00:00 block; it will show 04:00 as the hour. It doesn't make sense to do that because it introduces a discontinuity. More importantly, it's unclear what time we're looking at for any given block, and for this particular case the data was collected from a Chinese service, so it's doubly annoying for the researcher to have to correct for two offsets. Looks wrong, is wrong.
Not working outside of Chrome. Need to debug.
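One direction worth trying for both problems, offered as a guess and not a diagnosis: the keys use a space between date and time ("2014-09-29 08:00:00Z"), which isn't the ISO 8601 form the language specification defines, and how Date parses other formats is left up to each browser. Converting to the "T" form before parsing, and then reading hours in UTC, takes the local offset out of the picture. The function names here are hypothetical.

```javascript
// Turn a key like "2014-09-29 08:00:00Z" into a Date by way of the
// ISO 8601 form ("2014-09-29T08:00:00Z").
function parseUtcKey(key) {
    return new Date(key.replace(" ", "T"));
}

// Read the hour in UTC so the viewer's local offset never shifts a block.
function utcHour(key) {
    return parseUtcKey(key).getUTCHours();
}

utcHour("2014-09-29 08:00:00Z"); // 8, regardless of the local time zone
```

On the D3 side, the version used here also has d3.time.scale.utc() as the UTC counterpart of d3.time.scale(), which may be the cleaner fix for the axis. That's untested against this sketch.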

For now I have to leave this aside to get back to other projects. It will be good to circle back around to these to figure out how to get it right. Can't leave it hanging.

FYI, I posted this up as a gist with similar text to be visible at bl.ocks.org/dchud. If you want to poke at the code without futzing with the rest of all this text, follow that link through or go right to the original gist.

animating regression

2014-09-18T00:00:00-04:00

When performing a simple linear regression, it's important to review all the diagnostic plots that come with it. If the residual errors aren't normally distributed, you will have to rethink your model. As I noted in an earlier post, you can't just stop at the fit plot, even if it is pretty (here courtesy of SAS):

You have to review its diagnostics:

Typically in a set of diagnostic plots like this, you look first at the top left chart to see if the residuals balance around 0. The Q-Q plot below that should be close to the 45° line, the histogram below that should look normal the way most of us know, and the Cook's distance plot at middle right should show no outliers near or above 1. Any of these plots going wrong should be a sign that there's something amiss with your model. And this is all in addition to reviewing the numbers that come out of the model, like the p-value on the F test of the model, the R-square, the p-value on the t test of the independent variable's coefficient, and the p-value of a normality test on the residual errors.

The trick is, though, it can take time to develop an intuitive feel for how to read all these numbers and plots, even for a model that's as (relatively) simple as linear regression.

D3.js offers something better, the chance to animate the relationships between these plots, with object constancy. Maybe it would be useful to do something like the transitions in the showreel with the main fit plot and the diagnostic charts, with constancy among the points in the dataset to show which lie where in the various plots I mentioned above from the fit through the diagnostic set. Let's try that.

Simple transitions

The first step is to wrap our heads around the timed transitions in that showreel; I haven't done those before. The key seems to be the use of setTimeout(callback_function, delay) calls at the end of each function in the showreel source. Note that setTimeout() is a JavaScript timer, not a D3 function.

This should be easy to replicate. To try it out, let's just draw a box, then move it around.

This is pretty straightforward: we draw a rect, then we set the first timeout to call one of four similar functions that does what you'd expect:

setTimeout(move_right, duration);

function move_right() {
    test1.select("#box").transition()
        .duration(duration)
        .attr("x", 100);
    setTimeout(move_down, delay + duration);
}

move_right() uses d3's transition() to shift the x over to 100, then sets a timeout for move_down(), which shifts y to 100, then sets a timeout with a similar callback to move_left(), then we go to move_up(), which goes back to move_right(), and we have an endless loop of timed transitions. This might not be a model for building UI event-driven animations, of course, but we can settle on this kind of showreel-style series of repeating transitions to show a cycle of plots.

Note that the setTimeout delay on each callback isn't just delay but is rather delay + duration. A transition takes duration milliseconds to run, but JavaScript doesn't wait for it: the following call to setTimeout executes immediately, so the timer and the transition run concurrently. If we passed delay alone, the timer could fire before the transition had finished (or at nearly the same moment, if the two values are similar), and the box would never appear to "pause" between transitions.
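To make the timing concrete, here's a small sketch that works out when each step starts and ends under the delay + duration rule, without needing a browser. The function name and the example values of 1000 ms and 500 ms are hypothetical.

```javascript
// Work out the start and end time (in ms) of each step in the loop.
// steps: step names in order; count: how many steps to list.
function timeline(steps, duration, delay, count) {
    var events = [];
    var start = duration; // mirrors the initial setTimeout(move_right, duration)
    for (var i = 0; i < count; i++) {
        events.push({
            step: steps[i % steps.length],
            start: start,
            end: start + duration // the transition runs for duration ms
        });
        // the next timer is set as this step starts, for delay + duration
        start += delay + duration;
    }
    return events;
}

var reel = timeline(["right", "down", "left", "up"], 1000, 500, 5);
// reel[0] runs 1000-2000 and reel[1] runs 2500-3500: a 500 ms pause between
```

With delay alone as the timer value, each step would start before the previous one had finished, which is the missing pause described above.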

Adding constancy

The next trick is to do the same thing but with multiple moving points based on data. To do this, we'll expand our model above to include a simple three-value dataset. We'll still move it right, down, left, up, then right again, but in each of these quadrants we'll use a different set of scales to position each element. It's important to use d3's data binding for this rather than, say, a few circle and rect elements we could draw by hand because ultimately we will want to bind real data from a regression.

We'll use the same structure - four functions with obvious names. The first time through, we'll place circles using the straight values as their x and y positions in the upper left quadrant, then for each of the other functions we'll use different scales to slide them around inside each following quadrant. I've added lines to help distinguish the quadrants.

This works pretty well once you are clear about the scope of the object you want to operate on. At first we create a set of svg g group objects, and place circles inside of each:

var g = test2.selectAll("g")
    .data(data)
    .enter().append("g")
    .attr("class", "object");

g.each(function(d, i) {
    var o = d3.select(this);
    o.append("circle")
        .attr("r", 15)
        .attr("cx", d)
        .attr("cy", d)
        .attr("fill-opacity", ".80")
        .attr("fill", color_scale(i));
});

This sets us up with the basic set of "data points" we'll move around. Then we just start firing up transitions like before, using setTimeout(), but with each move function doing a little more:

function move_left2() {
    var x = d3.scale.linear()
        .domain([0, 100])
        .range([60, 40]);

    var c = test2.selectAll(".object");
    c.each(function(d, i) {
        var o = d3.select(this);
        o.select("circle").transition()
            .duration(duration)
            .attr("cx", x(d))
            .attr("cy", x(d))
            .attr("transform", "translate(0, 100)");
    });
    setTimeout(move_up2, delay + duration);
}

First we define a new scale for each quadrant, changing the output range; in this case, it reverses the ordering, and places the circles in a narrow band just 20 pixels wide. Next, we select the .object elements we created, which pulls up those g elements we started with, then loop through the set of them, firing off a transition that moves the cx and cy of each according to the new scale, and also resets the coordinate space to each quadrant in turn.
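If the reversed range looks odd, it may help to see what a linear scale does with it. This is a plain-JavaScript stand-in for the arithmetic, not D3's implementation (which also handles clamping, ticks and more); the function name is made up.

```javascript
// Minimal linear scale: map a value from the domain onto the range.
function linearScale(domain, range) {
    return function (value) {
        var t = (value - domain[0]) / (domain[1] - domain[0]);
        return range[0] + t * (range[1] - range[0]);
    };
}

// Same domain and range as the x scale in move_left2()
var xLeft = linearScale([0, 100], [60, 40]);
xLeft(0);   // 60
xLeft(50);  // 50
xLeft(100); // 40
```

Larger inputs land further left, and everything is squeezed into the 20 pixels between 40 and 60.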

Simulating a regression

We'll use a more substantial dataset when we put it all together, but for now let's assemble a small dataset and sketch a fit plot and residual plot transitioning back and forth. I've made up some values and used R to generate a regression (d is just the same data as in the javascript below):

Call:
lm(formula = d ~ seq(0, 10))

Coefficients:
(Intercept)   seq(0, 10)  
     18.000        9.036

We can use this regression line to get a feel for transitioning more elements together, and for some of the extra elements we'll want to add to make things pop a bit.
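As an aside, the coefficients don't have to come from R. A minimal least-squares sketch in JavaScript, using each value's index as x the way lm(d ~ seq(0, 10)) does, could look like the following. The function name and the four-value example are hypothetical, and this isn't how the numbers in this post were produced.

```javascript
// Ordinary least squares for values indexed 0..n-1.
function fitLine(values) {
    var n = values.length;
    var meanX = (n - 1) / 2; // mean of the indices 0..n-1
    var meanY = values.reduce(function (a, b) { return a + b; }, 0) / n;
    var sxx = 0, sxy = 0;
    values.forEach(function (y, i) {
        sxx += (i - meanX) * (i - meanX);
        sxy += (i - meanX) * (y - meanY);
    });
    var slope = sxy / sxx;
    return { slope: slope, intercept: meanY - slope * meanX };
}

var lineFit = fitLine([1, 3, 5, 7]); // slope 2, intercept 1
```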

For the regression, we apply the results R gave us to define the slope, intercept, and a function that returns expected values from the model:

var slope = 9.036;
var intercept = 18;
function expected(index) {
    return (slope * index) + intercept;
}

This function lets us put in an index number for a data value and get back what the model expects the data value to be. We can then use this whenever we need to plot the residual, here in the original rendering of the data points and residual lines against the model:

g.each(function(d, i) {
    var o = d3.select(this);
    o.attr("class", "observation");
    o.append("line")
        .attr("x1", x(i))
        .attr("y1", y(d))
        .attr("x2", x(i))
        .attr("y2", y(expected(i)))
        .attr("stroke-width", 2)
        .attr("stroke", "gray");
    o.append("circle")
        .attr("r", 5)
        .attr("cx", x(i))
        .attr("cy", y(d))
        .attr("stroke", "black")
        .attr("fill", "darkslategrey");
});

The line is vertical, so the x-scale places it horizontally using the index number. The vertical line segment representing the residual error starts at the actual value d and ends at the expected value expected(i), with both adjusted to the y-scale using y(). Then, in the residual view/function, we just have to rotate the model line to "level" (height/2) and translate the g-wrapped residual line and data point to level minus the y-scale-adjusted expected value from the model:

.attr("transform", "translate(0, " + (200 - y(expected(i))) + ")");

And when we switch back to the "fit" view, we just translate them back again to (0, 0), and rotate the model line back to the original regression slope.

This feels like a good stopping point for today. Next time, we'll pick up from here, add the additional diagnostic plots, and fill out each stage with axes and other niceties as appropriate.

Albers color studies in D3.js, part 2

2014-09-04T00:00:00-04:00

(See also part one, simple color relationships w/d3.)

Picking up where we left off, in the middle of Josef Albers' Interaction of Color (Yale Press's iPad edition), his study of the "middle mixture" affords a chance to bring in D3.js support for animations and transitions.

In this study, Albers chooses a trio of colors where the middle is a mixture in the middle of the other two. He recommends sliding the lowest part up slowly, so we can observe how the increased ratio of the darker color draws out how that darker color contributes to the mix, and then as you slide it back away again, you can see the top (lighter) color come through in the middle mixture. Concentrate on the middle block as the lower one moves up and down, and you can also see an illusory gradient effect near the top and bottom.

This study demonstrates how varying the quantity of each color present affects the relationships between colors and the overall feeling of a design even when the structure isn't altered in any other way. All of these use the same four colors and the same overall shape.

Each has its own distinct feel, right? Taken together they seem to dance chaotically, and it's not particularly pleasant, but its goal is instructive, of course, not aesthetic. Albers suggests using sheets of paper or your hands to block out smaller sets to look at in turn: a row, a column, etc., and considering which combinations are your favorite and why.

In writing this one up I waffled between writing a routine to generate the color permutations and laying them out explicitly like the example in the book, and I ended up matching the book explicitly. The rest of these exercises have tried to match the book closely, so it seemed okay to just iterate over an array of arrays that had been lined up by hand. I also did a lot of pixel-nudging to get the boxes to line up just so (hence punting on fixing the extra white space at the bottom).
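For the record, the routine I decided against might have started from something like this. It's only a sketch of the "generate every ordering" idea; the `permutations` name is mine, the color names are placeholders, and which orderings and proportions the book actually uses would still have to be matched by hand.

```javascript
// Returns every ordering of the items in an array (n! results).
function permutations(items) {
    if (items.length <= 1) {
        return [items.slice()];
    }
    const result = [];
    items.forEach(function (item, i) {
        // everything except the item at position i
        const rest = items.slice(0, i).concat(items.slice(i + 1));
        permutations(rest).forEach(function (tail) {
            result.push([item].concat(tail));
        });
    });
    return result;
}

// Placeholder names, not the colors from the book
const orderings = permutations(["a", "b", "c", "d"]);
console.log(orderings.length); // 24
```

Four colors give 24 orderings, which is more than a hand-built array of arrays needs to hold, so lining them up by hand was not much extra work.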

This next study is a similar look at color mixture. The individual lines can be laid out with scaling easily enough with D3, but to make them look uneven/wobbly is a bit of a challenge.

This mostly recreates the effect of the study in the book but is unsatisfying on a few counts. The "wobble" of the individual patches is decent, but they should scale and skew a little more. The ordering of the stacking throws off the effect, and the y-skew is a little too great. Perhaps the biggest issue is that using translate() to locate each strip in its place pins the top-left (x,y) to too fixed a point; it needs to vary more. There is a lot going on in the SVG transform attributes that I don't fully understand, largely around the shift in coordinate systems, and that's holding me back from developing the right approach to skewing and placing each strip correctly. I'll have to revisit this to wrap my head around it more fully.
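One direction to try next time is to jitter the translate along with the skew and scale. This is an untested-against-the-book sketch, not what the study above does; `wobbleTransform`, the jitter amounts, and the `rng` parameter are all assumptions made for illustration.

```javascript
// Hypothetical helper: builds an SVG transform string for one strip.
// rng is any function returning a number in [0, 1), e.g. Math.random.
function wobbleTransform(x, y, rng) {
    // jitter in the range [-amount, +amount]
    function jitter(amount) {
        return (rng() * 2 - 1) * amount;
    }
    const tx = x + jitter(4);        // vary the anchor point, in pixels
    const ty = y + jitter(4);
    const skew = jitter(3);          // degrees
    const scale = 1 + jitter(0.05);  // within 5% of full size
    // In an SVG transform list the rightmost entry is applied to the shape
    // first: here the strip is scaled, then skewed, then moved into place.
    return "translate(" + tx + ", " + ty + ") " +
        "skewY(" + skew + ") " +
        "scale(" + scale + ")";
}

console.log(wobbleTransform(100, 50, Math.random));
```

Keeping translate() leftmost means the jittered (x,y) is still expressed in the outer coordinate system, which may sidestep some of the confusion about shifting coordinates.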

This is definitely the most disappointing recreation of studies from the book I've done so far.

The Weber-Fechner Law

Wikipedia points out discrepancies in the term "Weber-Fechner law", but that's how Albers referred to the difference between the quantitative and perceptual effects of layering color, so I'll use it here: linear additions seem to lead to logarithmic effects, and exponential additions seem to lead to linear effects. In the book this study uses translucence, so I'll stick with SVG's opacity support to recreate it.

Arithmetic increases in application of color here, by way of stacking, lead to only slight shifts in the perceived color.

Ah, this recreates the effect of the study in the book much more effectively than the previous one (a relief). With 75% fill-opacity we can trace the distinct shades of color as 2, 3, and 4 patches are overlaid in different spots. The difference from one to two is much greater than the difference from three to four.
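The arithmetic behind that is short enough to sketch. Assuming each patch is a plain layer at 75% fill-opacity composited over whatever is beneath it, every added layer covers 75% of what was still showing through; the `stackedOpacity` name is mine.

```javascript
// Combined opacity of n identical layers, each with the given fill-opacity.
// Each layer lets (1 - opacity) of what is beneath it show through.
function stackedOpacity(layers, opacity) {
    let showThrough = 1;
    for (let i = 0; i < layers; i++) {
        showThrough *= (1 - opacity);
    }
    return 1 - showThrough;
}

[1, 2, 3, 4].forEach(function (n) {
    console.log(n, stackedOpacity(n, 0.75));
});
// 1 0.75
// 2 0.9375
// 3 0.984375
// 4 0.99609375
```

Going from one patch to two adds 0.1875 of coverage, while going from three to four adds only about 0.012, so each equal addition buys a smaller change.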

This next study repeats a similar process, showing off the difference between linear and exponential layer addition. At left, each succeeding strip from top to bottom has one additional layer beyond the one above it; at right, the number of layers doubles with each strip. So at left, it is {1, 2, 3, 4, 5} layers, and at right, {1, 2, 4, 8, 16}.

I misread this one, getting it completely wrong at first. The ground is a red, and the added layers are blacks, using SVG fill-opacity. I had thought at first that the layers were all red, but the part at right never converged to black until I re-read that they are indeed black layers added on top, on both sides.
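As a rough check on why the two sides read so differently, here is a sketch that takes the law at face value and treats the perceived step as the base-2 logarithm of the layer count. That model is an idealization I'm assuming for illustration, not something measured from the study, and `perceivedSteps` is a made-up name.

```javascript
const linearLayers = [1, 2, 3, 4, 5];
const exponentialLayers = [0, 1, 2, 3, 4].map(function (k) {
    return Math.pow(2, k); // 1, 2, 4, 8, 16
});

// Assumed model: perceived darkness grows with the log of the layer count.
// Rounded to two decimals for readability.
function perceivedSteps(layerCounts) {
    return layerCounts.map(function (n) {
        return Math.round(Math.log2(n) * 100) / 100;
    });
}

console.log(perceivedSteps(linearLayers).join(", "));      // 0, 1, 1.58, 2, 2.32
console.log(perceivedSteps(exponentialLayers).join(", ")); // 0, 1, 2, 3, 4
```

Under that assumption the left column's steps shrink as layers are added, while the right column's steps stay even.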

I haven't been able to recreate the subtlety of the shift to barely perceptible on the left, but this is fairly close.

This final study is a look at near-equality of light intensity, and Albers warns us carefully about how difficult it is to choose examples for it. If chosen correctly, when the two sawtooths come together, two colors with similar light intensity should start to blend into each other, even though they are dissimilar otherwise.

This works nicely - for that brief instant when the two sides touch it seems like the sawtooth pattern at their mutual boundary disappears and the colors start to merge.
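Picking candidate pairs by eye is the hard part. A crude way to screen them is to compare a weighted brightness estimate for each color. This sketch uses the Rec. 601 luma weights, which only approximate what Albers means by light intensity; the function names and the tolerance are my own assumptions.

```javascript
// Parses "#rrggbb" into [r, g, b], each 0-255.
function hexToRgb(hex) {
    const n = parseInt(hex.slice(1), 16);
    return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Rec. 601 luma: a rough, weighted estimate of brightness, 0-255.
function luma(hex) {
    const rgb = hexToRgb(hex);
    return 0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2];
}

// Hypothetical check; how close is "close enough" is a judgment call.
function similarIntensity(hexA, hexB, tolerance) {
    return Math.abs(luma(hexA) - luma(hexB)) <= tolerance;
}

// Pure red and pure green are far apart in luma despite both being saturated
console.log(similarIntensity("#ff0000", "#00ff00", 10)); // false
```

A pair that passes a check like this still needs to be judged on screen, since the number says nothing about how the boundary actually looks.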

This has been a great exercise, both in learning about color relativity and digging deeper into the basics of D3. Just makes me want to do more. There is a lot of code in the studies I replicated here and in part one that could be much cleaner, but I gave up on writing clean code in service of getting it done and keeping things simple. In future posts I'll be working with real datasets more often than not, and cleaner code will always help there. I aimed for staying true to the exact studies in the book, too, to have a target to aim towards, rather than taking the opportunity to do the exercises for myself, finding colors that would be a good match, because I wanted to learn about D3 at the same time, and reproduction is easier than original work. The app version of the book allows for creating your own studies, and I've played around with that some, so I don't feel like I'm missing out too much.

I hope you'll stay tuned, it feels like it's just getting started.

7±2 things to know about data science

2014-08-12T00:00:00-04:00

For a talk given at code4lib DC 2014.

Background

I am a professional librarian and software developer with 17 years in the job post master's. I studied at a strong school, worked at some great institutions and worked with many great people and between all this I've learned a lot about being a hacker / librarian, enough that the good people at GW Libraries saw fit to hire me to manage a team exactly three years ago.

I am a student of data science, halfway through a two-year program at GW School of Business. So far I have learned enough to understand a fair amount about what it is I need to be able to do to apply data science, but I am not yet very good at doing that.

As a manager in tech in a research library, my job is to work to ensure that our team and our library do meaningful work well, reliably. I intend to develop my professional skill at working with data to meet this same goal: do meaningful work reliably well. With that in mind, I have a rough sense of what librarian and archivist colleagues might need to know about what data science means, but I still have an awful lot to learn.

Defining "data science" and "business analytics"

Like many aspects of data science, this is best communicated visually.

Here is a canonical industry view of required skills many of us like:

by Drew Conway, see http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

A layer-cake view of analytics tasks that is also helpful:

by Gavin Blackett, see http://www.theorsociety.com/Media/Images/Users/CaraQuinton01011978/ActualSize/17_05_2012-16_06_56.jpg

The questions we ask at these levels, phrased simply:

by MorganFranklin Consulting, see http://www.morganfranklin.com/insights/article/4-types-of-business-analytics-capabilities

Lisa Kurt defined a helpful paradigm for this view at code4lib 2012 in Seattle:

by Mu Sigma, see http://www.mu-sigma.com/analytics/ecosystem/dipp.html

Most simply, we can speak of data science as the application of statistics to support decisions, to understand patterns in data, and to reduce or at least clarify uncertainty in a wide range of domains.

Applying data science

From where I sit (halfway through a degree program) the ability to apply data science techniques meaningfully comes down to something more like this:

Can you identify the Danger Zone?

I am weak on Science, but improving. I am confident in the hacking side of my Skill, but not yet in applying statistical models. I would like to believe I have good sense, but there is an art to applying it here.

Asking the right questions

Much of this work comes down to knowing which questions to ask and being steadfast in attempting to answer them honestly:

What goal do we wish to achieve?
What data do we have to work with?
What gaps in data do we need to fill, and how can we fill them?
What assumptions are we working under, and are they acceptable?
Which of the many available models fits our data and goals well?
What bias is inherent in our data, and what bias are we introducing?
With what level of certainty can we make a claim?

It is particularly risky to learn one model (e.g. linear regression) and one tool (e.g. R) and take whatever data you have and only ever attempt linear regressions with R without asking and answering these other questions. It's not about R (or SAS or Python or SPSS or Julia or Stata or Excel or ...) being a magic tool, and linear regression might be a poor fit for your data.

Data context switching, aka munging

There is no such thing as a clean data set.

This is important.

Any data you start with will have been collected or prepared with a particular purpose. That purpose might or might not have anything to do with your goals. You will most likely need to reframe the data you start with to fit your needs. This might involve ETL pipeline processing, recontextualizing, extracting, summarizing, merging, splitting, and otherwise reshaping data.

Any decent data person will need to become proficient at some or all of these tasks.

There are even style guides for data, such as Hadley Wickham's Tidy Data, which proposed the following principles of tidiness:

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table.

Some tools have their own preferences; in SAS, some procedures like "wide form" (each variable a column) and others "long form" (variable names parameterized). All the more reason to develop munging skills.
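Moving between the two forms is a typical small munging task. Here is a minimal sketch in plain JavaScript; the `wideToLong` name, the column names, and the numbers are all made up for illustration.

```javascript
// Hypothetical wide-form rows: one row per library, one column per year.
const wide = [
    { library: "A", y2012: 10, y2013: 12 },
    { library: "B", y2012: 7, y2013: 9 }
];

// Reshape to long form: one row per (id, variable, value).
function wideToLong(rows, idKey) {
    const long = [];
    rows.forEach(function (row) {
        Object.keys(row).forEach(function (key) {
            if (key !== idKey) {
                long.push({ id: row[idKey], variable: key, value: row[key] });
            }
        });
    });
    return long;
}

console.log(wideToLong(wide, "library").length); // 4
```

Two wide rows with two measured columns each become four long rows, with the old column names carried along as values of `variable`.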

Data munging is often the most time-consuming part of statistics work.

Sound familiar?

Applying models

There are many, many types of models. Different models can be used for different tasks as shown in the diagram above. Some, like simple regression, are widely applicable and easy to understand. Many are narrowly applicable and hard to understand, but prove to be far more effective for certain use cases.

Work with most models requires similar steps:

Prep sample data to apply the model
Use 2-3 visualizations to explore the data
Re-munge data to apply the model
Run the model, evaluating results
Review residuals/errors
Check model assumptions, bias
Lather, rinse, repeat

After all that, you'll probably want to try all of the above again with another model. Or two or three.

Half of understanding a model is understanding what to look for in results, and how to evaluate assumptions and results. It is easy to think you might have a great model, but if you don't know how to evaluate residuals and check basic model assumptions, your work might not be meaningful.
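The review-the-residuals step is easier to see with numbers in hand. This is a minimal sketch of a simple least-squares fit, its residuals, and R-squared, in plain JavaScript rather than SAS. The function names and the five data points are made up for illustration and have nothing to do with the age and weight data below.

```javascript
// Ordinary least squares for y = slope * x + intercept.
function fitLine(xs, ys) {
    const n = xs.length;
    const meanX = xs.reduce((a, b) => a + b, 0) / n;
    const meanY = ys.reduce((a, b) => a + b, 0) / n;
    let sxy = 0;
    let sxx = 0;
    for (let i = 0; i < n; i++) {
        sxy += (xs[i] - meanX) * (ys[i] - meanY);
        sxx += (xs[i] - meanX) * (xs[i] - meanX);
    }
    const slope = sxy / sxx;
    return { slope: slope, intercept: meanY - slope * meanX };
}

// Residual = observed value minus what the model expects.
function residuals(xs, ys, model) {
    return ys.map((y, i) => y - (model.slope * xs[i] + model.intercept));
}

// Share of the variation in ys that the model accounts for.
function rSquared(ys, resids) {
    const meanY = ys.reduce((a, b) => a + b, 0) / ys.length;
    const ssTotal = ys.reduce((sum, y) => sum + (y - meanY) * (y - meanY), 0);
    const ssResidual = resids.reduce((sum, r) => sum + r * r, 0);
    return 1 - ssResidual / ssTotal;
}

// Made-up numbers, for illustration only
const xs = [1, 2, 3, 4, 5];
const ys = [2, 4, 5, 4, 5];
const model = fitLine(xs, ys);
const resids = residuals(xs, ys, model);
console.log(model.slope.toFixed(2), model.intercept.toFixed(2)); // 0.60 2.20
console.log(rSquared(ys, resids).toFixed(2));                    // 0.60
```

The fit and the R-squared are the easy part; looking at `resids` for patterns, outliers, or uneven spread is where the checking of assumptions begins.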

Here is an example of what this might look like, using SAS.

First, we import data in an attempt to find a relationship between age and weight. The data looks like this (thanks, csvkit!):

A simple regression offers these results:

Every stats app has a report format like this; SAS likes HTML tables. Important details in here are the p-value of the F test result, the R-Square, and the p-value of the t test on the independent variable age.

The plot is pretty:

But we have to review its diagnostics:

And check residuals precisely:

To be a good data scientist, you have to work any model through all of these steps, knowing which tests to run on results. Every model has its own characteristics.

Applying tools

In some cases, there are straightforward models that can be applied with straightforward tools. For example, this is a time series of the amount of recent airline travel in the US. In just a few lines of R, you can produce this decomposition of seasonal and trend lines:

This is wonderful, but keep in mind that there is always more to the story. The simplest-seeming models and tools often require a lot of subtlety to wield reliably well.

For more on this particular example, see Using R for Time Series Analysis.

Learning the craft

On this part I'm shaky, but my hunch is that like learning to be a competent programmer, applying meaningful data science reliably well is a craft that takes time to learn. It took me a good five years to learn many basic lessons about programming that no CS prof ever taught me (granted, I don't have a CS degree, but I've taken many of the basic courses in formal settings). After five years, I was a good enough programmer to get a job on a great team, where I really started to learn the craft - all the details you need to attend to if you want to build systems that scale up, and if you want to sustain projects over years, through staff turnover and changing technologies.

Like with CS, the science is critical, but it seems like it will take me several years and a lot of repetition to develop the kind of intuitive feel for choosing models, checking assumptions, and explaining results. It's that "good sense" I know I need to strive for, but I don't have it yet. Before I start applying any of this on data in my workplace, I'll be sure to find someone more experienced than me to run ideas by, someone who knows the craft well already.

What can we do to help?

Fill in gaps on campus
Support critical thinking in data selection, munging, and application
Encourage a well-rounded view, especially with Ethics
Apply our experience with workflows and conventions
Learn and apply for ourselves

What next?

Probability and statistics on Khan Academy
Johns Hopkins Data Science Specialization on Coursera
Leek et al., How to share data with a statistician
Provost and Fawcett, Data Science for Business

simple color relationships w/d3

2014-08-08T00:00:00-04:00

I've been reading Josef Albers' Interaction of Color (Yale Press's iPad edition) and am learning quite a lot from it. I particularly enjoy his details about what to expect in student reactions to particular exercises; you know he must have anticipated and savored these reactions each time, with every class.

The basic principles of the first few chapters should be easy to demonstrate using d3.

In "Chapter IV: A color has many faces" we see the first of several color plates and we are quickly drawn into what he has to teach us about the relativity of color, that "color is the most relative medium in art." Let's mimic the first experiment, making one color look different from itself, using different background colors. I'm guessing (poorly!) at colors somewhat close to those in the prepared studies in the text itself, using this color picker.

This use of d3 demonstrates several features which make it an appealing toolkit, even for a beginner:

it's just javascript
it's just SVG
you can do simple things very simply

I learned SVG many years ago, back in 2004, when it was still a fairly new web standard and had very few useable implementations. The good news is that it is much more widely implemented now, and that it hasn't changed much since back then (there's only been one new revision, a "second edition" of the first version), so if you know a few SVG basics then it's easy to see that d3 just uses an API defined in javascript to generate SVG. This is a lot easier than generating SVG by hand yourself; I know from first-hand experience a decade ago.

Another way to think of d3 is as a "domain specific language" for dynamic documents on the web. It's just javascript, but it's a flavor of types and techniques specific to generating SVG using javascript that lends itself well to visualizing data.

In any case, this copied "plate" demonstrates the basic principle well: the inner color is exactly the same in both rectangles, and it is the interaction between this color and the differing surrounding / background colors that makes it look different from itself from one to the next.

Changing colors

To make this a little more dynamic (it is the web after all) let's add the ability to change colors by clicking on the inner boxes. The code will be the same, but with the "click" method defined on each.

Click on the top inner box to make both inner boxes lighter. Click on the bottom inner box to make both inner boxes darker.
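The color arithmetic behind a click like that can be sketched in a few lines. This is an illustration under my own assumptions, not necessarily how the boxes here are wired: `shiftLightness` is a made-up helper, and the step of 16 is arbitrary.

```javascript
// Hypothetical helper: shift every channel of "#rrggbb" by delta,
// clamped to the 0-255 range.
function shiftLightness(hex, delta) {
    const n = parseInt(hex.slice(1), 16);
    const channels = [(n >> 16) & 255, (n >> 8) & 255, n & 255];
    return "#" + channels.map(function (c) {
        const shifted = Math.max(0, Math.min(255, c + delta));
        // pad single-digit hex values, e.g. "a" -> "0a"
        return ("0" + shifted.toString(16)).slice(-2);
    }).join("");
}

console.log(shiftLightness("#808080", 16));  // #909090
console.log(shiftLightness("#808080", -16)); // #707070
```

A click handler attached with d3's `.on("click", ...)` could call something like this and set the result as the `fill` of both inner boxes. D3's own `d3.rgb(color).brighter()` and `.darker()` may be a tidier route to the same effect.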

This further reinforces the effect; at some points as you click to ratchet the intensity up or down the two inner boxes look like wholly different colors, and at other points (especially the extremes) it is clear that they are the same.

Of course this isn't quite what Albers had in mind with the lovely physical interactions designed into his text (which the Yale Press folks very creatively transposed to the iPad app) but perhaps we can put the dynamic aspect of the web, made so easy by d3, to good use in embodying some of the same lessons he taught.

Lighter and/or darker

To focus us in on light intensity, Albers presents several exercises in subtle and not-so-subtle gradations of light. SVG's gradient support should help to recreate them.

This really comes alive as the intensities of the two gradients pass each other on the way up/down. It all seems to merge! And little shadows seem to appear around the frame at the top and bottom just past the strips' ends.

The gradients above are explicit. In this next example from Albers, the gradients are illusions.

Every one of the individual rectangles above is a solid color, even though it looks like each has its own gradient. It's the effect of the proximity to slightly lighter and darker colors above and below that makes the contrasts between them appear to form two ends of a gradient in each rectangle. It seems to be most pronounced in the corners.

Transparence and Optical Mixture

Albers teaches that we can simulate transparency and the apparent ordering/stacking of layers with color mixtures; SVG allows for specific opacity settings. Let's try it both ways, first with explicit color changes:

Note that none of the transparent-seeming sections are actually transparent; it is only simulated by shifting the color mix. Even so, it appears that the one at top is "behind" the black, and the one at bottom is "in front of" the black.

Let's try doing it again, but this time with SVG opacity variations.

Looks very similar, right?

If you look at the source, you'll see that the structure of this second version is exactly the same. Only two things change: first, the right halves of the strips are set to the same initial color as the left halves, #eee, whereas in the first version each is set to an explicitly different color on the grey scale; second, the fill-opacity is varied for each of these three from 0.15 at the top (so more of the background black comes through) to 0.85 at the bottom (so more of the white stays "on top"). Just like the first version, each of the "strips" is actually rendered as two separate rect elements.

So there it is, you can truly simulate transparency and ordering / stacking just by varying colors, and achieve results almost exactly like using actual transparency, as demonstrated by the second diagram.
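The link between the two versions is plain alpha blending. As a sketch, assuming the background is pure black (#000) and the usual per-channel blend, this works out the solid grey that #eee at a given fill-opacity should look like; the helper names are mine.

```javascript
// Blend one channel value over a background at the given opacity.
function blendChannel(foreground, background, opacity) {
    return Math.round(opacity * foreground + (1 - opacity) * background);
}

// Format a 0-255 grey level as "#rrggbb".
function greyHex(value) {
    const h = ("0" + value.toString(16)).slice(-2);
    return "#" + h + h + h;
}

const stripGrey = 0xee;  // #eee, the strip color
const background = 0x00; // assuming a pure black background

console.log(greyHex(blendChannel(stripGrey, background, 0.15))); // #242424
console.log(greyHex(blendChannel(stripGrey, background, 0.85))); // #cacaca
```

So an explicit fill of about #242424 should pass for the 0.15 strip and about #cacaca for the 0.85 one, which is the simulation Albers describes done with arithmetic instead of paint.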

One more example from the book exhibiting the effects of "optical mixture". There are four colors in this example: white, blue, olive, and mint (for lack of better terms). The individual circles and their "donut holes" are all the same size, but the color mixing makes it appear otherwise. Also, changing contrast in the background colors relative to the foreground creates its own effects, shifting the sense of what's foreground and background.

Wow, that turned out better than I thought, but it took a while. This was a good exercise in framing scaled elements with padding in d3. I had tried to eyeball the inner frame shape and circle diameters based on calculations based on padding, width, and height, but it didn't line up right until I realized it's just an exact 10 x 24 grid.

Once I reset the scaling to use that grid (worked right away), I rewrote the outer/inner circle rendering bits using one function for each; it could be taken a step further with one function for both that would allow the diameter as a parameter too, and the rows and colors could just be one simple data structure to loop over, but it's good enough as is.
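That further step might look roughly like this. It's a sketch with several assumptions: the grid is taken to be 10 columns by 24 rows, the pixel dimensions are invented, `linearScale` is a stand-in for a D3 linear scale so the example runs on its own, and `circleAttrs` is a hypothetical name.

```javascript
// Minimal stand-in for a D3 linear scale: maps [d0, d1] onto [r0, r1].
function linearScale(d0, d1, r0, r1) {
    return function (value) {
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0);
    };
}

// Assumed pixel size for the frame; the real dimensions may differ.
const width = 300;
const height = 720;
const x = linearScale(0, 10, 0, width);  // 10 columns
const y = linearScale(0, 24, 0, height); // 24 rows

// One function for both circle sizes: diameter is given in grid units.
function circleAttrs(col, row, diameter, color) {
    return {
        cx: x(col + 0.5), // center of the grid cell
        cy: y(row + 0.5),
        r: x(diameter) / 2,
        fill: color
    };
}

console.log(circleAttrs(0, 0, 1, "white")); // outer circle filling the first cell
```

The outer circles and the "donut holes" would then differ only in the `diameter` and `color` arguments, and the rows could be driven from one array.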

Finally, the colors were a bear to get right. I eyeballed a match to the colors in the iPad app but the contrast just didn't pop the way it does in the Yale-produced ebook. After playing with the colors a lot I remembered: I use the flux app on my desktop, and was working on this at night, so everything was completely wrong! After turning flux off I was able to get a lot closer, though the ebook version is still much better.

Summary

This has been a great exercise in working with the lessons in color Albers lays out so elegantly in his book. If this interests you at all I recommend you get a copy for yourself (the iPad ebook is worth every penny). A colleague at our library told me we have an early print edition with all the fold-outs and flaps, so I will have to take a look at that as well.

It's also been a good lesson in using d3 to render simple shapes and colors, and remembering to look in the d3 docs for a cleanly defined function I'd have otherwise more awkwardly wired up myself in javascript. Even something as simple to do by hand as what d3.range() offers has a familiar feel and semantic specificity that makes d3 just make all the more sense.
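For comparison, this is roughly the loop that d3.range() saves me from writing. It's a hand-rolled sketch covering only the positive-step case of `d3.range(start, stop, step)`, not a copy of D3's implementation.

```javascript
// Values from start up to, but not including, stop, in increments of step.
// Only handles a positive step.
function range(start, stop, step) {
    const values = [];
    for (let v = start; v < stop; v += step) {
        values.push(v);
    }
    return values;
}

console.log(range(0, 5, 1).join(", "));  // 0, 1, 2, 3, 4
console.log(range(0, 10, 5).join(", ")); // 0, 5
```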

I am about halfway through the text and could use a lot more d3 practice, so before I move on to rendering data more explicitly I might take a stab at a "part two" post along these same lines.

If any of the specifics interest you I'd suggest you look at the source directly in your browser or using the github links to view or edit the full markdown+javascript file I'm writing here and feeding into pelican. Pull requests welcome, especially if you spot mistakes or just plain bad ideas, I know I still have a lot to learn.

(See also part two, Albers color studies in D3.js, part 2)