For a project at work we've collected a year's worth of samples from a major non-US social media site. The samples are taken every 30 seconds, each a snapshot of the most recent 200 public posts from all users. This created a lot of files, and along the way we missed some chunks of them for various reasons (network outage, service error, reboot, etc.). The researcher we're supporting has happily taken a copy of the 100+ GB (compressed) of data to start poring over, but asked that we help prepare a simple visualization of the data that's present - or more importantly, what's missing.
Missing a few files here and there over a year's time is natural, so it's not a problem unless there are big chunks missing or patterns of errors that would make sampling from this data problematic. An image of what's there and what's not there needs to hit a few key points:
- cover the entire collection period (actually ~13 months)
- show missing files
- show empty files
- easily spot large gaps
- easily spot significant patterns
In addition to the immediate use (the researcher's own knowledge of what they have), this visualization needs to work for their advisors and others interested in the work, so it should be readily digestible, by which I mean:
- should fit on one screen
- shouldn't require much explanation
To give a sense of volume, this set of files should be roughly 365 days/yr * 24 hr/day * 120 files/hr = 1,051,200 files. That's a lot of files - too many to read from disk in real time, so this will require preprocessing.
First sketch
Let's start with a rough picture of what it will take to fit the dots onto one screen. One day's worth is 24 * 120 = 2880 dots, which is too many for one screen's width of pixels, but if we can divide it at least in half, we're getting closer. The 365-day year is easier; we can multiply it by two or three and still fit a good number of pixels in. So with this in mind, here's a 1440x730 grid.
Yah ok that's too wide.
Let's try again at half that width, and just for fun (and because vertical scrolling isn't so hard) let's make it taller.
One more time, with a grid and some date scales to sketch it all out better:
Adding real data
Okay, now we're getting somewhere. It's time to work with some real data and place it onto the scales using dates and times. As a first cut, I've extracted a file count for each hour in the dataset. This resulted in a JSON file with content like this:
...
"2014-09-29 08:00:00Z": 120,
"2014-09-29 09:00:00Z": 120,
"2014-09-29 10:00:00Z": 119,
"2014-09-29 11:00:00Z": 120,
"2014-09-29 12:00:00Z": 120,
"2014-09-29 13:00:00Z": 120,
...
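For reference, the counts file could be produced by a small offline script. Here's a minimal node.js sketch, assuming a hypothetical layout where each hour's samples live in a directory like data/2014-09-29/08/ - the real collection layout may well differ:

// Hypothetical preprocessing step: walk data/YYYY-MM-DD/HH/ directories
// and write one count per hour, keyed like the JSON excerpt above.
var fs = require('fs');
var path = require('path');

var root = 'data';
var counts = {};

fs.readdirSync(root).forEach(function (day) {            // e.g. "2014-09-29"
  var dayDir = path.join(root, day);
  fs.readdirSync(dayDir).forEach(function (hour) {       // e.g. "08"
    var files = fs.readdirSync(path.join(dayDir, hour));
    counts[day + ' ' + hour + ':00:00Z'] = files.length;
  });
});

fs.writeFileSync('counts.json', JSON.stringify(counts, null, 2));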
Loading this into a sketch is easy with d3.json(). The keys are sorted, but just to be thorough I'll also use d3.min() and d3.max() to get the first and last date/times from the set.
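In sketch form, that load step might look something like this (counts.json is just my placeholder name for the file above):

// Load the hourly counts and find the bounds of the collection period.
d3.json("counts.json", function (error, counts) {
  if (error) return console.warn(error);

  // note: d3.time.format() parses in the browser's local time zone,
  // which comes up again below
  var parse = d3.time.format("%Y-%m-%d %H:%M:%SZ").parse;
  var hours = d3.keys(counts).map(parse);   // one Date per hourly key

  var first = d3.min(hours),
      last = d3.max(hours);
  // first and last will anchor the time scale's domain
});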
The next piece of all this is to set the scales to use the dates. I created that data file knowing that JavaScript should be able to parse the dates cleanly; hopefully this will feed right into the d3 time scaling functions.
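Here's a hedged sketch of the scale setup, assuming one thin row per day with hours running left to right; the pixel sizes and dates are placeholders, not the final layout:

// first/last would come from the d3.min()/d3.max() step above; placeholder
// values are used here so the snippet stands on its own.
var first = new Date("2014-09-01T00:00:00Z"),
    last = new Date("2015-10-01T00:00:00Z");

var x = d3.scale.linear()       // hour of day -> x position within a row
    .domain([0, 24])
    .range([0, 720]);           // 720 px wide, per the earlier sketch

var y = d3.time.scale()         // calendar date -> row position, one row per day
    .domain([first, last])
    .range([0, 790]);           // roughly 2 px per day across ~13 months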
Finally, it'll all come together with line segments drawn in for each hour. The percentage of files available (there should be 120 total for each hour) will feed into a color scale. To see the contrast of missing files well, the scale will have to be exponential rather than linear (an earlier discussion of this, via Albers, is written up in this post).
Once again, d3 helps us out, with the d3.scale.pow() exponential scaling function. To scale the input domain to the output range using a power of two, we just set the exponent on the scale as well, and use colors as the range:
// map hourly file counts (0 to 120) onto a white-to-black ramp; squaring the
// normalized count renders hours with missing files lighter than a linear
// scale would, so gaps stand out
var color_scale = d3.scale.pow()
    .exponent(2)
    .domain([0, 120])
    .range(["#fff", "#000"]);
This should make hours with missing files lighter: a missing file or two will be barely visible, but more than a dozen or so should stand out.
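Putting those pieces together, the hourly marks might be drawn something like this - a hedged sketch that leans on the x, y, and color_scale variables set up above, with placeholder sizes:

// Draw one thin rect per hour, colored by how many of the expected
// 120 files actually arrived. x, y, and color_scale come from the
// snippets above.
d3.json("counts.json", function (error, counts) {
  if (error) return console.warn(error);

  var parse = d3.time.format("%Y-%m-%d %H:%M:%SZ").parse;

  var svg = d3.select("body").append("svg")
      .attr("width", 720)
      .attr("height", 800);

  svg.selectAll("rect")
      .data(d3.entries(counts))   // [{key: "2014-09-29 08:00:00Z", value: 120}, ...]
    .enter().append("rect")
      .attr("x", function (d) { return x(parse(d.key).getHours()); })  // local hours - see below
      .attr("y", function (d) { return y(d3.time.day.floor(parse(d.key))); })
      .attr("width", 720 / 24)
      .attr("height", 2)
      .attr("fill", function (d) { return color_scale(d.value); });
});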
That does the trick. This meets the purpose, but it has two problems:
- The time zones are off. See the way the first day starts at 00:00 but stops at 20:00? That's the four-hour adjustment for Eastern (US) time, which is happening in a way I'm not controlling properly. You can see this for yourself by mousing over a 00:00 block; it will show 04:00 as the hour. It doesn't make sense to do that because it introduces a discontinuity. More importantly, it's unclear what time we're looking at for any given block, and for this particular case the data was collected from a Chinese service, so it's doubly annoying for the researcher to have to correct for two offsets. Looks wrong, is wrong. (One possible fix is sketched just after this list.)
- It's not working outside of Chrome. I still need to debug that.
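For the time zone problem, one likely fix (untested, just a sketch) is to stop letting the browser's local offset sneak in: d3's UTC variants parse and format without any local adjustment.

// Parse and display the keys strictly as UTC so nothing shifts with the
// viewer's local time zone; d3.time.scale.utc() does the same for an axis.
var parseUTC = d3.time.format.utc("%Y-%m-%d %H:%M:%SZ").parse;
var formatHourUTC = d3.time.format.utc("%H:%M");

// e.g. formatHourUTC(parseUTC("2014-09-29 08:00:00Z")) is "08:00" no matter
// where the browser happens to be running.

Whether UTC or China Standard Time is the right frame for this data is the researcher's call; the point is to pick one explicitly instead of inheriting whatever the viewer's machine uses.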
For now I have to leave this aside to get back to other projects. It will be good to circle back to these issues and figure out how to get it right. Can't leave it hanging.
FYI, I posted this up as a gist with similar text, visible at bl.ocks.org/dchud. If you want to poke at the code without futzing with the rest of all this text, follow that link through or go right to the original gist.