Non-geographical Analysis#

datashaderpanel
Published: January 29, 2016 · Updated: August 20, 2024


Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading will help exploration of any data dimensions. Here let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.

import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))

Load NYC Taxi data#

These data have been transformed from the original database to a parquet file. It should take about 5 seconds to load (compared to 10-20 seconds when stored in the inefficient CSV file format).

import dask.dataframe as dd

usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']
%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
CPU times: user 456 ms, sys: 123 ms, total: 579 ms
Wall time: 579 ms
df.tail()
trip_distance fare_amount tip_amount passenger_count
11842089 1.0 5.5 1.25 2
11842090 0.8 6.0 2.00 2
11842091 3.4 13.5 0.00 1
11842092 1.3 10.5 2.25 1
11842093 0.7 5.5 0.00 1

1000 points reveals the expected linear relationship#

samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)

10,000 points show more detailed, systematic patterns in fares and times#

Perhaps there are different metering options, along with granularity in how times and fares are counted; in any case, the times and fares do not uniformly populate any region of this space:

samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)

Datashader reveals additional detail, especially when zooming in#

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))