Non-geographical Analysis

Most of the Datashader examples use geographic data because it is so easily interpreted, but datashading helps with exploring any data dimensions. Let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.

import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))

Load NYC Taxi data

These data have been transformed from the original database to a Parquet file. It should take about 5 seconds to load (compared to 10-20 seconds when stored in the less efficient CSV file format).

import dask.dataframe as dd

usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']
%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
CPU times: user 486 ms, sys: 76.3 ms, total: 562 ms
Wall time: 561 ms
df.tail()
          trip_distance  fare_amount  tip_amount  passenger_count
11842089            1.0          5.5        1.25                2
11842090            0.8          6.0        2.00                2
11842091            3.4         13.5        0.00                1
11842092            1.3         10.5        2.25                1
11842093            0.7          5.5        0.00                1

1,000 points reveal the expected linear relationship

samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)
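
The point counts in these headings are round numbers; the expected sample size is simply the sampling fraction times the number of rows. The sketch below assumes the index is zero-based and contiguous, so that the last index label shown by df.tail() (11842093) implies 11,842,094 rows. The helper name expected_sample_size is ours, for illustration only, and the actual count returned by a sampled Dask dataframe may differ slightly.

```python
# Assumption: the index is zero-based and contiguous, so the last label
# printed by df.tail() (11842093) implies 11,842,094 rows in total.
n_rows = 11_842_093 + 1


def expected_sample_size(n_rows, frac):
    """Expected number of rows when sampling the given fraction of n_rows."""
    return round(n_rows * frac)


# The two fractions used in this notebook
for frac in (1e-4, 1e-3):
    print(f"frac={frac:g}: about {expected_sample_size(n_rows, frac):,} points")
```

Under that assumption, frac=1e-4 gives roughly 1,200 points and frac=1e-3 roughly 12,000.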

10,000 points show more detailed, systematic patterns in fares and distances

Perhaps there are different metering options, along with granularity in how distances and fares are recorded; in any case, the distances and fares do not uniformly populate any region of this space:

samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)
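
One way to probe the granularity idea is to look for the largest step that evenly divides every recorded value. The sketch below is only an illustration: the helper granularity is hypothetical (not part of any library used here), and the inputs are just the five rows printed by df.tail() above, which is far too few to draw conclusions from. On the real data you would pass in a column from samples instead.

```python
from functools import reduce
from math import gcd


def granularity(values, decimals=2):
    """Largest step, at the given decimal precision, that divides every value."""
    scale = 10 ** decimals
    # Work in integer units (e.g. cents) so that gcd is well defined
    ints = [round(v * scale) for v in values]
    return reduce(gcd, ints) / scale


# The five rows shown by df.tail(); an illustration, not a representative sample
fares = [5.5, 6.0, 13.5, 10.5, 5.5]
distances = [1.0, 0.8, 3.4, 1.3, 0.7]

print("fare step, $:", granularity(fares))
print("distance step, miles:", granularity(distances))
```

For these five rows the fares are all multiples of $0.50 and the distances multiples of 0.1 miles, which is the kind of discretization that would produce visible rows and columns of points in the scatterplot.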

Datashader reveals additional detail, especially when zooming in

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))
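
The cnorm='eq_hist' option is a large part of why sparse regions stay visible here: instead of mapping counts to colors linearly, histogram equalization spreads the colormap evenly over the counts that actually occur. The following is a simplified, rank-based sketch of that idea in plain Python, not Datashader's actual implementation; the function names and the per-pixel counts are made up for illustration.

```python
from bisect import bisect_right


def linear(counts):
    """Scale nonzero counts by the maximum; empty bins stay None (unshaded)."""
    top = max(counts)
    return [c / top if c > 0 else None for c in counts]


def eq_hist(counts):
    """Map each nonzero count to the fraction of nonzero bins with a count <= it.

    A simplified, rank-based form of histogram equalization.
    """
    nonzero = sorted(c for c in counts if c > 0)
    n = len(nonzero)
    return [bisect_right(nonzero, c) / n if c > 0 else None for c in counts]


# Made-up per-pixel counts spanning several orders of magnitude
counts = [0, 1, 2, 5, 50, 10_000]

print("linear :", linear(counts))
print("eq_hist:", eq_hist(counts))  # [None, 0.2, 0.4, 0.6, 0.8, 1.0]
```

With linear scaling, every pixel except the densest one lands at the very bottom of the colormap, whereas the equalized values are spread evenly, so a single stray trip remains distinguishable from an empty pixel.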