Geographical Analysis#

Published: January 29, 2016 · Updated: August 20, 2024


There are a variety of approaches for plotting large datasets, but most of them are very unsatisfactory. Here we first show some of the issues, then demonstrate how Datashader helps make large datasets truly practical.

We’ll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let’s approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.

NOTE: This dataset is also explorable through the Datashader example dashboard.

Load NYC Taxi data#

These data have been transformed from the original database into a Parquet file. It should take about 5 seconds to load (compared to 10-20 seconds for the same data in the much less efficient CSV format).

import dask.dataframe as dd

usecols = ['dropoff_x', 'dropoff_y', 'pickup_x', 'pickup_y', 'dropoff_hour', 'pickup_hour', 'passenger_count']

%%time
# Load only the needed columns and persist the Dask dataframe in memory
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
print(len(df))

11842094
CPU times: user 260 ms, sys: 200 ms, total: 460 ms
Wall time: 459 ms

As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.

df.tail()
dropoff_x dropoff_y pickup_x pickup_y dropoff_hour pickup_hour passenger_count
11842089 -8232492.0 4979234.5 -8232297.5 4980859.5 19 19 2
11842090 -8234856.5 4971131.0 -8235721.0 4972331.0 19 19 2
11842091 -8234202.5 4981092.5 -8235340.5 4975470.0 19 19 1
11842092 -8235618.5 4973722.0 -8237594.0 4973844.0 19 19 1
11842093 -8234151.5 4977120.0 -8233228.5 4977946.5 19 19 1
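
The x/y values are Web Mercator coordinates: meters east of the Greenwich meridian and north of the equator. As a quick sanity check on how they relate to ordinary longitude/latitude, Datashader's lnglat_to_meters performs the projection; the location below is illustrative, not taken from the dataset:

from datashader.utils import lnglat_to_meters

# Project an example longitude/latitude (roughly Times Square, chosen only
# for illustration) into Web Mercator meters
x, y = lnglat_to_meters(-73.986, 40.758)
print(x, y)  # roughly (-8.24e6, 4.98e6), in the same range as the table above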

1000-point scatterplot: undersampling#

Any plotting program should be able to handle a plot of 1000 datapoints. Here the points initially overplot each other, but if you zoom in a bit (using the tools in the toolbar at the top right of the plot), nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even make out the outline of Manhattan Island and Central Park from the pattern of dots. We’ve included geographic map tiles here to help get you situated, though for a genuine data-mining task in an abstract data space you might not have any such landmarks. In any case, because this plot discards 99.99% of the data, it reveals very little of what the dataset might contain, a problem called undersampling.

import numpy as np
import hvplot.dask  # noqa
import holoviews as hv
from holoviews import opts
from holoviews.streams import PlotSize

# Plot dimensions and Web Mercator axis bounds covering the NYC area
plot_width  = 750
plot_height = int(plot_width / 1.2)
x_range, y_range = (-8242000, -8210000), (4965000, 4990000)
PlotSize.scale = 2.0  # scale factor for dynamically rasterized output

opts.defaults(
    opts.Points(width=plot_width, height=plot_height, size=5, color='blue'),
    opts.Overlay(width=plot_width, height=plot_height, xaxis=None, yaxis=None),
    opts.RGB(width=plot_width, height=plot_height),
    opts.Histogram(responsive=True, min_height=250))

# Draw a random 0.01% sample of dropoff locations over a tiled basemap
samples = df.sample(frac=1e-4)
samples.hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet')

10,000-point scatterplot: overplotting#

We can of course plot more points to reduce the amount of undersampling. However, even if we only try to plot 0.1% of the data, ignoring the other 99.9%, we will find major problems with overplotting, such that the true density of dropoffs in central Manhattan is impossible to see due to occlusion:

df.sample(frac=1e-3).hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet')

Overplotting is reduced if you zoom in on a particular region. However, the problem then switches back to serious undersampling: the zoomed-in view reveals how sparsely the datapoints were sampled, even though much more data is available.
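
To see just how much data zooming discards, you can count how many points the full dataset has in a given window; the window bounds below are arbitrary, chosen for illustration:

# Hypothetical zoom window in Web Mercator coordinates (illustrative bounds)
x0, x1 = -8237000, -8233000
y0, y1 = 4974000, 4979000
in_window = df[(df.dropoff_x > x0) & (df.dropoff_x < x1) &
               (df.dropoff_y > y0) & (df.dropoff_y < y1)]
# All the data available in this window, vs. the roughly 1/1000 of it
# that a 0.1% sample would draw
print(len(in_window), len(in_window) // 1000)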

100,000-point scatterplot: saturation#

Making the dots smaller can reduce the overplotting that occurs when you try to combat undersampling. Even so, with enough opaque datapoints, overplotting is unavoidable in popular dropoff locations. You can then adjust the alpha (opacity) parameter available in most plotting programs, so that multiple points must overlap before full color saturation is reached. With enough data, such a plot can approximate the probability density function for dropoffs, showing where dropoffs were most common:

df.sample(frac=1e-2).hvplot.points('dropoff_x', 'dropoff_y', tiles='EsriStreet', alpha=0.1, size=1)
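
The alpha-blended plot above is in effect approximating a 2D histogram of dropoff counts. As a sanity check, you can compute that density directly with NumPy; this is a rough sketch reusing the plot dimensions and axis ranges defined earlier, and it is essentially the count-per-pixel aggregation that Datashader performs efficiently on the full dataset later in this example:

# Bin a 1% sample of dropoffs onto a plot-sized grid of counts
sample = df.sample(frac=1e-2).compute()
counts, xedges, yedges = np.histogram2d(
    sample.dropoff_x, sample.dropoff_y,
    bins=(plot_width, plot_height), range=[x_range, y_range])
print(counts.max())  # peak dropoff count in any single bin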