NYC Taxi



Plotting very large datasets meaningfully, using datashader

There are a variety of approaches for plotting large datasets, but most of them are unsatisfactory in one way or another. Here we first show some of the issues, then demonstrate how the datashader library makes working with large datasets truly practical.

We'll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let's approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.

NOTE: This dataset is also explorable through the Datashader example dashboard. From inside the examples directory, run: DS_DATASET=nyc_taxi panel serve --show dashboard.ipynb

Load NYC Taxi data

These data have been transformed from the original database into a Parquet file. It should take about 5 seconds to load, compared to 10-20 seconds for the same data stored in the less efficient CSV format.
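If you are starting from the original CSV data rather than the provided Parquet file, a one-time conversion with Dask looks roughly like the following. This is a minimal sketch, assuming a hypothetical data/nyc_taxi.csv input file with the same column names:

import dask.dataframe as dd

# One-time conversion (hypothetical input filename): read the CSV lazily,
# then write it back out as a much faster-loading Parquet file.
raw = dd.read_csv('data/nyc_taxi.csv')
raw.to_parquet('data/nyc_taxi_wide.parq')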

In [1]:
import dask.dataframe as dd

# Columns needed for this notebook: pickup/dropoff locations (Web Mercator),
# hour of day, and passenger count
usecols = ['dropoff_x','dropoff_y','pickup_x','pickup_y','dropoff_hour','pickup_hour','passenger_count']
%time df = dd.read_parquet('data/nyc_taxi_wide.parq')[usecols].persist()
df.tail()
CPU times: user 1.94 s, sys: 1.37 s, total: 3.31 s
Wall time: 3.31 s
Out[1]:
dropoff_x dropoff_y pickup_x pickup_y dropoff_hour pickup_hour passenger_count
11842089 -8232492.0 4979234.5 -8232297.5 4980859.5 19 19 2
11842090 -8234856.5 4971131.0 -8235721.0 4972331.0 19 19 2
11842091 -8234202.5 4981092.5 -8235340.5 4975470.0 19 19 1
11842092 -8235618.5 4973722.0 -8237594.0 4973844.0 19 19 1
11842093 -8234151.5 4977120.0 -8233228.5 4977946.5 19 19 1

As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.
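To check the size for yourself, note that calling len() on a Dask DataFrame triggers a computation across all partitions:

# Row count across all partitions; this computes eagerly.
print(len(df))         # ~11.8 million trips
print(df.npartitions)  # how many partitions Dask split the data into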

1000-point scatterplot: undersampling

Any plotting program should be able to handle a plot of 1000 datapoints. Here the points initially overplot each other, but if you zoom in a bit (the Reset button at the top right of the plot returns you to the original view), nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even make out the outline of Manhattan Island and Central Park from the pattern of dots. We've included geographic map data here to help get you situated, though for a genuine data-mining task in an abstract data space you might not have any such landmarks. In any case, because this plot discards 99.99% of the data, it reveals very little of what the dataset might contain, a problem called undersampling.

In [2]:
import numpy as np
import holoviews as hv
from holoviews import opts
from holoviews.element.tiles import StamenTerrain
hv.extension('bokeh')  # use the Bokeh plotting backend
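The plotting cell itself follows the pattern below. This is a minimal sketch, assuming a sampling fraction of 1e-4 (about 1200 of the ~12 million rows, close to the 1000 points described above) and illustrative Web Mercator axis ranges that frame New York City:

# Minimal sketch: draw a small random sample and overlay it on map tiles.
# The sampling fraction and axis ranges here are illustrative assumptions.
samples = df.sample(frac=1e-4).compute()          # ~1200 rows, as pandas
tiles   = StamenTerrain().redim.range(x=(-8.25e6, -8.20e6),
                                      y=(4.95e6, 5.00e6))
points  = hv.Points(samples, ['dropoff_x', 'dropoff_y'])
(tiles * points).opts(opts.Points(size=3, color='blue'),
                      opts.Tiles(width=700, height=600))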