Non-geographical Analysis#

Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading will help exploration of any data dimensions. Here let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.

import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))

Load NYC Taxi data#

These data have been transformed from the original database to a parquet file. It should take about 5 seconds to load (compared to 10-20 seconds when stored in the inefficient CSV file format).

import dask.dataframe as dd

usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']
%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
CPU times: user 486 ms, sys: 76.3 ms, total: 562 ms
Wall time: 561 ms
df.tail()
trip_distance fare_amount tip_amount passenger_count
11842089 1.0 5.5 1.25 2
11842090 0.8 6.0 2.00 2
11842091 3.4 13.5 0.00 1
11842092 1.3 10.5 2.25 1
11842093 0.7 5.5 0.00 1

1000 points reveals the expected linear relationship#

samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)

10,000 points show more detailed, systematic patterns in fares and times#

Perhaps there are different metering options, along with granularity in how times and fares are counted; in any case, the times and fares do not uniformly populate any region of this space:

samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)

Datashader reveals additional detail, especially when zooming in#

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))

Here we’re using a histogram-equalized color mapping function (cnorm='eq_hist') to reveal density differences across this space. If we used the default linear mapping, we can mainly see that there are a lot of values near the origin, but all the rest are colored the same minimum (defaulting to light blue) color:

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, dynspread=True, threshold=1,
                  max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))

Fares are discretized to the nearest 50 cents, making patterns less visible, but there is both an upward trend in tips as fares increase (as expected), but also a large number of tips higher than the fare itself, which is surprising:

df.hvplot.scatter('tip_amount', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Tip, $', ylabel='Fare, $', xlim=(0,25), ylim=(0,20))

Interestingly, tips go down when the number of passengers is greater than 1:

df.hvplot.scatter('passenger_count', 'tip_amount', rasterize=True, cnorm='log', x_sampling=0.5,
                  y_sampling=0.5, xlabel='Passengers', ylabel='Tip, $', xlim=(-0.5, 6.5), ylim=(0, 60))

Here we’ve reduced the resolution along the x axis so that instead of getting isolated points for this inherently discrete data, you can see more-visible horizontal line segments.

The above plots use the Hvplot library, which builds Bokeh, Plotly, and Matplotlib plots from high-level specifications.