Non-geographical Analysis#
Most of the datashader examples use geographic data because it is so easily interpreted, but datashading helps with exploring any pair of data dimensions. Here let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.
import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))
Load NYC Taxi data#
These data have been converted from the original database to a Parquet file. Loading should take about 5 seconds (compared to 10-20 seconds when the same data are stored in the less efficient CSV file format).
import dask.dataframe as dd
usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']
%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()
CPU times: user 456 ms, sys: 123 ms, total: 579 ms
Wall time: 579 ms
df.tail()
| | trip_distance | fare_amount | tip_amount | passenger_count |
|---|---|---|---|---|
| 11842089 | 1.0 | 5.5 | 1.25 | 2 |
| 11842090 | 0.8 | 6.0 | 2.00 | 2 |
| 11842091 | 3.4 | 13.5 | 0.00 | 1 |
| 11842092 | 1.3 | 10.5 | 2.25 | 1 |
| 11842093 | 0.7 | 5.5 | 0.00 | 1 |
1,000 points reveal the expected linear relationship#
samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)
10,000 points show more detailed, systematic patterns in fares and distances#
Perhaps there are different metering options, along with granularity in how distances and fares are recorded; in any case, the points do not uniformly populate any region of this space:
samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)
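The point counts quoted in the two headings above follow from the sampling fractions. As a rough check, the sketch below assumes the dataset has 11,842,094 rows (inferred from the last index shown by df.tail(), assuming a contiguous zero-based index) and uses a hypothetical helper, expected_sample_size, which is not part of dask or hvplot. Because dask samples each partition separately, the number of rows actually returned may differ slightly from these estimates.

```python
def expected_sample_size(n_rows: int, frac: float) -> int:
    """Approximate number of rows returned by sampling a fraction of n_rows."""
    return round(n_rows * frac)


# Assumption: row count inferred from the last index in df.tail() (11842093),
# taking the index to be contiguous and zero-based.
N_ROWS = 11_842_093 + 1

for frac in (1e-4, 1e-3):
    print(f"frac={frac:g}: about {expected_sample_size(N_ROWS, frac):,} rows")
```

So frac=1e-4 gives on the order of a thousand points and frac=1e-3 on the order of ten thousand, which is where the round numbers in the headings come from.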
Datashader reveals additional detail, especially when zooming in#
You can now see that there are many points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).
df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))
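To make the idea behind rasterize=True and cnorm='eq_hist' more concrete, here is a conceptual NumPy sketch of the two steps: count the points that fall into each pixel, then map those counts to display values by their rank so that sparse and dense regions are both visible. The function names rasterize_counts and eq_hist_rank are hypothetical, the data are synthetic (not the taxi data), and the rank-based equalization is a simplified stand-in for Datashader's own implementation rather than a copy of it. The dynspread step is not modeled here.

```python
import numpy as np


def rasterize_counts(x, y, x_range, y_range, width, height):
    """Count points per pixel on a regular grid; rows correspond to y, columns to x."""
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts


def eq_hist_rank(counts):
    """Map nonzero counts to (0, 1] via their empirical CDF; empty pixels stay 0."""
    out = np.zeros_like(counts, dtype=float)
    mask = counts > 0
    values = counts[mask]
    if values.size == 0:
        return out
    # Empirical CDF over the distinct nonzero count values.
    _, inverse, freq = np.unique(values, return_inverse=True, return_counts=True)
    cdf = np.cumsum(freq) / values.size
    out[mask] = cdf[inverse]
    return out


# Synthetic stand-in data: the slope, offset, and noise level are made up.
rng = np.random.default_rng(0)
distance = rng.exponential(3.0, size=100_000)
fare = 2.5 + 2.5 * distance + rng.normal(0.0, 1.0, size=distance.size)

counts = rasterize_counts(distance, fare, (0, 15), (0, 40), width=80, height=50)
image = eq_hist_rank(counts)
print(counts.shape, image.min(), image.max())
```

With a linear color mapping, a few very dense pixels would dominate the color range; ranking the counts spreads the colors evenly over the occupied pixels, which is why faint structure such as the points below the main band becomes visible.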