Geographical Analysis#

Published: January 29, 2016 · Updated: August 20, 2024


There are a variety of approaches for plotting large datasets, but most of them are very unsatisfactory. Here we first show some of the issues, then demonstrate how Datashader helps make large datasets truly practical.

We’ll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let’s approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.

NOTE: This dataset is also explorable through the Datashader example dashboard.

Load NYC Taxi data#

These data have been transformed from the original database into a Parquet file. It should take about 5 seconds to load (compared to 10-20 seconds for the same data in the much less efficient CSV format).

import dask.dataframe as dd

usecols = ['dropoff_x', 'dropoff_y', 'pickup_x', 'pickup_y', 'dropoff_hour', 'pickup_hour', 'passenger_count']

%%time
# Load only the needed columns and persist the Dask dataframe in memory
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
print(len(df))

11842094
CPU times: user 260 ms, sys: 200 ms, total: 460 ms
Wall time: 459 ms

As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.

df.tail()
dropoff_x dropoff_y pickup_x pickup_y dropoff_hour pickup_hour passenger_count
11842089 -8232492.0 4979234.5 -8232297.5 4980859.5 19 19 2
11842090 -8234856.5 4971131.0 -8235721.0 4972331.0 19 19 2
11842091 -8234202.5 4981092.5 -8235340.5 4975470.0 19 19 1
11842092 -8235618.5 4973722.0 -8237594.0 4973844.0 19 19 1
11842093 -8234151.5 4977120.0 -8233228.5 4977946.5 19 19 1
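
The x/y values are Web Mercator coordinates: meters east of the Greenwich meridian and north of the equator. As a quick sanity check on how they relate to ordinary longitude/latitude, Datashader's lnglat_to_meters performs the projection; the location below is illustrative, not taken from the dataset:

from datashader.utils import lnglat_to_meters

# Project an example longitude/latitude (roughly Times Square, chosen only
# for illustration) into Web Mercator meters
x, y = lnglat_to_meters(-73.986, 40.758)
print(x, y)  # roughly (-8.24e6, 4.98e6), in the same range as the table above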

1000-point scatterplot: undersampling#

Any plotting program should be able to handle a plot of 1000 datapoints. Here the points initially overplot each other, but if you zoom in a bit (using the tools in the toolbar at the top right of the plot), nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even make out the outline of Manhattan Island and Central Park from the pattern of dots. We’ve included geographic map tiles here to help get you situated, though for a genuine data-mining task in an abstract data space you might not have any such landmarks. In any case, because this plot discards 99.99% of the data, it reveals very little of what the dataset might contain, a problem called undersampling.

import numpy as np
import hvplot.dask  # noqa
import holoviews as hv
from holoviews import opts
from holoviews.streams import PlotSize

# Plot dimensions and Web Mercator axis bounds covering the NYC area
plot_width  = 750
plot_height = int(plot_width / 1.2)
x_range, y_range = (-8242000, -8210000), (4965000, 4990000)
PlotSize.scale = 2.0  # scale factor for dynamically rasterized output

opts.defaults(
    opts.Points(width=plot_width, height=plot_height, size=5, color='blue'),
    opts.Overlay(width=plot_width, height=plot_height, xaxis=None, yaxis=None),
    opts.RGB(width=plot_width, height=plot_height),
    opts.Histogram(responsive=True, min_height=250))

# Draw a random 0.01% sample of dropoff locations over a tiled basemap
samples = df.sample(frac=1e-4)
samples.hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet')

10,000-point scatterplot: overplotting#

We can of course plot more points to reduce the amount of undersampling. However, even if we only try to plot 0.1% of the data, ignoring the other 99.9%, we will find major problems with overplotting, such that the true density of dropoffs in central Manhattan is impossible to see due to occlusion:

df.sample(frac=1e-3).hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet')

Overplotting is reduced if you zoom in on a particular region. However, the problem then switches back to serious undersampling: the zoomed-in view reveals how sparsely the datapoints were sampled, even though much more data is available.
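
To see just how much data zooming discards, you can count how many points the full dataset has in a given window; the window bounds below are arbitrary, chosen for illustration:

# Hypothetical zoom window in Web Mercator coordinates (illustrative bounds)
x0, x1 = -8237000, -8233000
y0, y1 = 4974000, 4979000
in_window = df[(df.dropoff_x > x0) & (df.dropoff_x < x1) &
               (df.dropoff_y > y0) & (df.dropoff_y < y1)]
# All the data available in this window, vs. the roughly 1/1000 of it
# that a 0.1% sample would draw
print(len(in_window), len(in_window) // 1000)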

100,000-point scatterplot: saturation#

Making the dots smaller can reduce the overplotting that occurs when you try to combat undersampling. Even so, with enough opaque datapoints, overplotting is unavoidable in popular dropoff locations. You can then adjust the alpha (opacity) parameter available in most plotting programs, so that multiple points must overlap before full color saturation is reached. With enough data, such a plot can approximate the probability density function for dropoffs, showing where dropoffs were most common:

df.sample(frac=1e-2).hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet', alpha=0.1, size=1)
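
The alpha-blended plot above is in effect approximating a 2D histogram of dropoff counts. As a sanity check, you can compute that density directly with NumPy; this is a rough sketch reusing the plot dimensions and axis ranges defined earlier, and it is essentially the count-per-pixel aggregation that Datashader performs efficiently on the full dataset later in this example:

# Bin a 1% sample of dropoffs onto a plot-sized grid of counts
sample = df.sample(frac=1e-2).compute()
counts, xedges, yedges = np.histogram2d(
    sample.dropoff_x, sample.dropoff_y,
    bins=(plot_width, plot_height), range=[x_range, y_range])
print(counts.max())  # peak dropoff count in any single bin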