Census 2010#

datashaderholoviewsdaskbokeh
Published: March 27, 2016 · Modified: December 2, 2024


2010 US Census data#

The 2010 Census collected a variety of demographic information for all the more than 300 million people in the USA. Here we’ll focus on the subset of the data selected by the Cooper Center, who produced a map of the population density and the racial makeup of the USA. Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the census block level, so that the racial category can only be determined to within a rough geographic precision.) The Cooper Center website delivers pre-rendered tiles, which are fast to view but limited to the specific plotting choices they made. Here we will show how to run novel analyses focusing on whatever aspects of the data that you select yourself, rendered dynamically as requested using the Datashader library.

NOTE: This dataset is also explorable through the Datashader example dashboard.

Load data and set up#

The census data has been saved in a Parquet-format file, which can be loaded into a columnar data structure like a Pandas or Dask dataframe.

import datashader as ds, datashader.transfer_functions as tf, numpy as np
import dask.dataframe as dd
%%time
df = dd.read_parquet('./data/census2010.parq', engine='fastparquet')
df = df.categorize('race')
df = df.persist()
CPU times: user 7.65 s, sys: 3.14 s, total: 10.8 s
Wall time: 3.78 s

Here we call .persist() to use Dask’s fast in-core operations, but if you have less than 16GB of RAM, you can omit that line to use Dask’s support for out-of-core (larger than memory) operation. Working out of core will be much slower, but should work even on small machines.

df.head()
easting northing race
0 -12418767.0 3697425.00 h
1 -12418512.0 3697143.50 h
2 -12418245.0 3697584.50 h
3 -12417703.0 3697636.75 w
4 -12418120.0 3697129.25 h

There are 306675004 rows in this dataframe (one per person counted in the census), each with a location in Web Mercator format and a race encoded as a single character (where ‘w’ is white, ‘b’ is black, ‘a’ is Asian, ‘h’ is Hispanic, and ‘o’ is other (typically Native American)). (Try len(df) to see the size, if you want to check, though that always forces the dataset to be loaded so it’s skipped here.)

Let’s define some geographic ranges to look at later, and also a default plot size. Feel free to increase plot_width to 2000 or more if you have a very large monitor or want to save big files to disk, which shouldn’t greatly affect the processing time or memory requirements.

USA           = ((-124.72,  -66.95), (23.55, 50.06))
LakeMichigan  = (( -91.68,  -83.97), (40.75, 44.08))
Chicago       = (( -88.29,  -87.30), (41.57, 42.00))
Chinatown     = (( -87.67,  -87.63), (41.84, 41.86))
NewYorkCity   = (( -74.39,  -73.44), (40.51, 40.91))
LosAngeles    = ((-118.53, -117.81), (33.63, 33.96))
Houston       = (( -96.05,  -94.68), (29.45, 30.11))
Austin        = (( -97.91,  -97.52), (30.17, 30.37))
NewOrleans    = (( -90.37,  -89.89), (29.82, 30.05))
Atlanta       = (( -84.88,  -84.04), (33.45, 33.84))

from datashader.utils import lnglat_to_meters as webm
x_range,y_range = [list(r) for r in webm(*USA)]

plot_width  = int(900)
plot_height = int(plot_width*7.0/12)

Let’s also choose a background color for our results. A black background makes bright colors more vivid, and works well when later adding relatively dark satellite image backgrounds, but white backgrounds (background=None) are good for examining the weakest patterns, and work well when overlaying on maps that use light colors. Try it both ways and decide for yourself!

background = "black"

We’ll also need some utility functions and colormaps, and to make the page as big as possible:

from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9

export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

Population density#

For our first examples, let’s ignore the race data for now, focusing on population density alone (as for the NYC taxi example). We’ll first aggregate all the points from the continental USA into a grid containing the population density per pixel:

%%time
cvs = ds.Canvas(plot_width, plot_height, *webm(*USA))
agg = cvs.points(df, 'easting', 'northing')
CPU times: user 2.02 s, sys: 9.53 ms, total: 2.03 s
Wall time: 949 ms

Computing this aggregate grid will take some CPU power (1 second on a MacBook Pro), because datashader has to iterate through the entire dataset, with hundreds of millions of points. Once the agg array has been computed, subsequent processing will now be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.

If we now plot the aggregate grid linearly, we can clearly see…

export(tf.shade(agg, cmap = cm(Greys9), how='linear'),"census_gray_linear")

…almost nothing. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago, and a few other places. More hotspots can dimly be seen when using a white background than with a black, on most monitors, though they are very dim either way. In any case, for feeding 300 million points in, we’re getting almost nothing back in terms of visualization.

The first thing we can do is to prevent undersampling (see plotting pitfalls). In the plot above, there is no way to distinguish between pixels that are part of the background, and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let’s map all values that are not background to a dimly visible gray, leaving the highest-density values at white. I.e., let’s discard the first 25% of the gray colormap, and linearly interpolate the population densities over the remaining range:

export(tf.shade(agg, cmap = cm(Greys9,0.25), how='linear'),"census_gray_linear")

The above plot reveals at least that data has been measured only within the political boundaries of the continental United States, and also that many areas in the West are so poorly populated that many pixels contained not even a single person. (In datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the colormap range is shown for pixels that do have data.) Some additional population centers are now visible, at least on some monitors. But mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that’s not much information to be getting out out of 300 million datapoints!

The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus this version of the map conveys very little information as well. Because the data are clearly distributed so non-uniformly, let’s instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice:

export(tf.shade(agg, cmap = cm(Greys9,0.2), how='log'),"census_gray_log")

Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England, and the many towns stretched out along roadsides in the midwest (especially those leading to Denver, the hot spot towards the right of the Rocky Mountains).

Clearly, we can now see much more of what’s going on in this dataset, thanks to the logarithmic mapping. Worryingly, though, the choice of 'log' was purely arbitrary, basically just a hunch based on how typical datasets behave. One could easily imagine that other nonlinear functions would show other interesting patterns, or that different functions would be needed for different datasets.

Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities – small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.

With this idea of showing relative differences across a large range of data values in mind, let’s try the image-processing technique called histogram equalization. I.e., given a set of raw counts, map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric – it will equalize any linearly or nonlinearly distributed integer data, regardless of the distribution:

export(tf.shade(agg, cmap = cm(Greys9,0.2), how='eq_hist'),"census_gray_eq_hist")

(Histogram equalization also works for non-integer data, but in that case it will use a finite set of bins to divide the interval between the minimum and maximum values, and will thus not be able to normalize the histogram perfectly for highly non-uniform distributions.) Effectively, this transformation converts the data from raw magnitudes, which can easily span a much greater range than the dynamic range visible to the eye, to a rank-order or percentile representation, which reveals density differences at all ranges but obscures the absolute magnitudes that were used to compute the ordering. In this representation, you can clearly see the effects of geography (rivers, coastlines, and mountains) on the population density, as well as history (denser near the longest-populated areas), and even infrastructure (with many small towns located at crossroads).

Given the very different results from the different types of plot, a good practice when visualizing any dataset with datashader is to look at both the linear and the histogram-equalized versions of the data; the linear version preserves the magnitudes, but obscures the distribution, while the histogram-equalized version reveals the distribution while preserving only the order of the magnitudes, not their actual values. If both plots are similar, then the data is distributed nearly uniformly across the interval. But much more commonly, the distribution will be highly nonlinear, and the linear plot will reveal only the envelope of the data, i.e., the lowest and the highest values. In such cases, the histogram-equalized plot will reveal much more of the structure of the data, because it maps the local patterns in the data into perceptible color differences on the screen.

Because we are only plotting a single dimension, we can use the colors of the display to effectively reach a higher dynamic range, mapping ranges of data values into different color ranges. When doing so, it is crucial to use a perceptually uniform colormap (see this paper for more details). Here we’ll use the “fire” colormap from the colorcet package:

from colorcet import fire
export(tf.shade(agg, cmap = cm(fire,0.2), how='eq_hist'),"census_ds_fire_eq_hist")

This should reveal even more of the detail than the gray colormaps above.

You can also import colormaps directly from matplotlib.cm or bokeh.palettes, but only a few of those will be perceptually uniform. For instance, matplotlib’s “hot” colormap is similar to colorcet’s “fire”, but reveals far less of the detail available in this data because it has long stretches of perceptually equivalent colors:

from matplotlib.cm import hot
export(tf.shade(agg, cmap = hot, how='eq_hist'),"census_ds_mhot_eq_hist")

Also note that the cm() function used here that lets us switch backgrounds and prevent undersampling only supports Bokeh palettes (either provided by Bokeh or from cetcolors), though we do provide Bokeh-palette versions of matplotlib’s perceptually uniform viridis and inferno colormaps from within datashader:

from datashader.colors import viridis
export(tf.shade(agg, cmap=cm(viridis), how='eq_hist'),"census_viridis_eq_hist")

For plotting magnitudes of this type, it should be safe to use any of the “linear” colormaps from the colorcet package, depending on your preference.

Colormaps can also be used to address very specific questions about the data itself. For instance, after histogram equalization, data should be uniformly distributed across the visible colormap. Thus if we want to highlight e.g. the top 1% of pixels, by population density, we can use a colormap divided into 100 ranges, and just change the top one to a different color:

grays2 = cm([(i,i,i) for i in np.linspace(0,255,99)])
grays2 += ["red"]
export(tf.shade(agg, cmap = grays2, how='eq_hist'),"census_gray_redhot1_eq_hist")

The above plot now conveys nearly all the information available in the linear plot, i.e. that only a few pixels have the very highest population densities, while also conveying the structure of the data at all population density ranges via histogram equalization. And in no case was any parameter value chosen based on the data itself, just on the question we wanted to ask about the data, which is crucial for posing and answering questions objectively using visualizations.

Categorical data (race)#

Since we’ve got the racial category for every pixel, we can use color to indicate the category value, instead of just extending dynamic range or highlighting percentiles as above. To do this, we first need to set up a color key for each category label, with different color keys as appropriate to make colors that stand out against the background:

if background == "black":
    color_key = {'w':'aqua', 'b':'lime',  'a':'red', 'h':'fuchsia', 'o':'yellow'}
else:
    color_key = {'w':'blue', 'b':'green', 'a':'red', 'h':'orange', 'o':'saddlebrown'}

We’ll show a key for those near the end of this notebook if you want to scroll down, but it requires plotting tools not yet introduced. We can now aggregate the counts per race into grids, using ds.by, instead of just a single grid with the total counts (via the default aggregate reducer ds.count), and then generate an image by colorizing each pixel using the aggregate information from each category for that pixel’s location:

def create_image(longitude_range, latitude_range, w=plot_width, h=plot_height):
    x_range,y_range=webm(longitude_range,latitude_range)
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'easting', 'northing', ds.by('race'))
    img = tf.shade(agg, color_key=color_key, how='eq_hist')
    return img

The result shows that the USA is overwhelmingly white in geographic terms, apart from some predominantly Hispanic regions along the Southern border, some regions with high densities of blacks in the Southeast, and a few isolated areas of category “Other” in the West (primarily Native American reservation areas).

%%time
export(create_image(*USA),"Zoom 0 - USA")
CPU times: user 2.66 s, sys: 37.4 ms, total: 2.7 s
Wall time: 1.14 s

Interestingly, the racial makeup has some sharp boundaries around urban centers, as we can see if we zoom in:

%%time
export(create_image(*LakeMichigan),"Zoom 1 - Lake Michigan")
CPU times: user 913 ms, sys: 52.4 ms, total: 966 ms
Wall time: 476 ms