Non-geographical Analysis#

hvplot, bokeh, datashader, panel, param, dask
Published: January 29, 2016 · Modified: August 20, 2024


Most of the datashader examples use geographic data because it is so easily interpreted, but datashading helps with exploring any pair of data dimensions. Let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset used in nyc_taxi.ipynb.

import hvplot.dask # noqa
from holoviews import opts
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))

Load NYC Taxi data#

These data have been converted from the original database to a Parquet file, which should take about 5 seconds to load (compared with 10-20 seconds when stored in the less efficient CSV format).

import dask.dataframe as dd

usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']

%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq', engine='fastparquet')[usecols].persist()

CPU times: user 456 ms, sys: 123 ms, total: 579 ms
Wall time: 579 ms

df.tail()

          trip_distance  fare_amount  tip_amount  passenger_count
11842089            1.0          5.5        1.25                2
11842090            0.8          6.0        2.00                2
11842091            3.4         13.5        0.00                1
11842092            1.3         10.5        2.25                1
11842093            0.7          5.5        0.00                1

1,000 points reveal the expected linear relationship#

samples = df.sample(frac=1e-4)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=5)

10,000 points show more detailed, systematic patterns in fares and distances#

Perhaps there are different metering options, along with granularity in how distances and fares are recorded; in any case, the distances and fares do not uniformly populate any region of this space:

samples = df.sample(frac=1e-3)
samples.hvplot.scatter('trip_distance', 'fare_amount', xlabel='Distance, miles',
                       ylabel='Fare, $', xlim=(0,15), ylim=(0,40), s=1)
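The point counts quoted in the two headings above follow from simple arithmetic: the expected sample size is the sampling fraction times the number of rows. The sketch below assumes the index is contiguous and 0-based, so that the last index shown by df.tail() implies about 11.84 million rows; `expected_sample_size` is a hypothetical helper written for this illustration, not part of dask or hvPlot.

```python
# Assumption: the index is contiguous and 0-based, so the last index shown
# by df.tail() (11842093) implies 11,842,094 rows.
N_ROWS = 11_842_094


def expected_sample_size(frac, n_rows=N_ROWS):
    """Expected number of rows kept by df.sample(frac=frac)."""
    return round(n_rows * frac)


for frac in (1e-4, 1e-3):
    print(f"frac={frac:g}: about {expected_sample_size(frac):,} points")
```

Under that assumption the two samples hold roughly 1,200 and 12,000 points, so the round numbers in the headings are approximations.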

Datashader reveals additional detail, especially when zooming in#

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))

Here we’re using a histogram-equalized color mapping function (cnorm='eq_hist') to reveal density differences across this space. With the default linear mapping, we can mainly see that there are a lot of values near the origin, while all the rest are colored with the same minimum color (light blue by default):

df.hvplot.scatter('trip_distance', 'fare_amount', rasterize=True, dynspread=True, threshold=1,
                  max_px=1, xlabel='Distance, miles', ylabel='Fare, $', xlim=(0,15), ylim=(0,40))
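To see why the two color mappings above differ so much, here is a minimal NumPy sketch of the idea. It is not datashader's actual implementation: the `eq_hist` function below is a simplified stand-in written for this illustration, and the data are synthetic (made up to be dense near the origin), not the taxi data. Points are first aggregated into a fixed grid of bins, then the counts are scaled either linearly or by their cumulative frequency.

```python
import numpy as np


def eq_hist(counts):
    """Scale nonzero counts to (0, 1] by their cumulative frequency."""
    counts = np.asarray(counts)
    scaled = np.zeros(counts.shape, dtype=float)
    occupied = counts > 0
    # Distinct nonzero counts, plus each occupied bin's index into them
    values, inverse = np.unique(counts[occupied], return_inverse=True)
    inverse = inverse.ravel()
    # Fraction of occupied bins whose count is <= each distinct value
    cdf = np.cumsum(np.bincount(inverse)) / inverse.size
    scaled[occupied] = cdf[inverse]
    return scaled


# Synthetic, made-up data: dense near the origin with a linear trend
rng = np.random.default_rng(0)
x = rng.exponential(2.0, size=100_000)
y = 2.5 + 2.5 * x + rng.normal(0.0, 1.0, size=x.size)

# Aggregate points into a fixed grid of bins, one count per bin
counts, _, _ = np.histogram2d(x, y, bins=(80, 50), range=((0, 15), (0, 40)))

linear = counts / counts.max()   # linear mapping: dominated by the peak bin
equalized = eq_hist(counts)      # rank-based mapping: spreads bins evenly

occupied = counts > 0
linear_low = np.mean(linear[occupied] < 0.1)
equalized_low = np.mean(equalized[occupied] < 0.1)
print("occupied bins in the lowest 10% of the color range: "
      f"linear={linear_low:.2f}, eq_hist={equalized_low:.2f}")
```

With linear scaling, one very dense bin pushes most other occupied bins into the bottom of the color range, which is why they all look the same. Rank-based scaling spreads the occupied bins across the whole range by construction.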

Fares are discretized to the nearest 50 cents, which makes patterns less visible. Even so, there is an upward trend in tips as fares increase (as expected), along with a surprisingly large number of tips higher than the fare itself:

df.hvplot.scatter('tip_amount', 'fare_amount', rasterize=True, cnorm='eq_hist', dynspread=True,
                  threshold=1, max_px=1, xlabel='Tip, $', ylabel='Fare, $', xlim=(0,25), ylim=(0,20))
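One way to put a number on the tips that exceed the fare is a boolean comparison between the two columns. The sketch below uses a tiny pandas DataFrame as a stand-in: the first four rows copy values from the df.tail() output above, and the last row is made up so that the comparison has something to find. The same expression should carry over to the Dask DataFrame by appending `.compute()`, though that is not run here.

```python
import pandas as pd

# Toy stand-in for the taxi data. The first four rows come from the
# df.tail() output; the last row is invented for illustration.
toy = pd.DataFrame({
    "fare_amount": [5.5, 6.0, 13.5, 10.5, 2.5],
    "tip_amount": [1.25, 2.0, 0.0, 2.25, 10.0],
})

# True where the tip is larger than the fare itself
exceeds = toy["tip_amount"] > toy["fare_amount"]

n_exceeding = int(exceeds.sum())   # number of such trips
share = float(exceeds.mean())      # fraction of all trips

print(f"{n_exceeding} of {len(toy)} toy trips have tip > fare ({share:.0%})")
```

The result on the toy frame says nothing about the real dataset; it only shows the shape of the check.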

Interestingly, tips go down when the number of passengers is greater than 1:

df.hvplot.scatter('passenger_count', 'tip_amount', rasterize=True, cnorm='log', x_sampling=0.5,
                  y_sampling=0.5, xlabel='Passengers', ylabel='Tip, $', xlim=(-0.5, 6.5), ylim=(0, 60))

Here we’ve reduced the resolution along the x axis (x_sampling=0.5) so that, instead of isolated points for this inherently discrete data, the plot shows more visible horizontal line segments.
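The effect of a coarser sampling interval on discrete data can be sketched with a one-dimensional histogram. This is an illustration of the binning idea only, not of how hvPlot or datashader implement x_sampling. The passenger counts are made up, and the 800 fine bins are an assumption based on the 800-pixel plot width set in the defaults at the top of the page.

```python
import numpy as np

# Made-up passenger counts: inherently discrete integer values
passengers = np.array([1, 1, 2, 1, 3, 5, 2, 1, 6, 4])
x_range = (-0.5, 6.5)

# Fine grid: roughly one bin per pixel of an 800-pixel-wide plot (assumption)
fine, fine_edges = np.histogram(passengers, bins=800, range=x_range)

# Coarse grid: bins 0.5 units wide, matching a sampling interval of 0.5
coarse, coarse_edges = np.histogram(passengers, bins=14, range=x_range)

fine_width = fine_edges[1] - fine_edges[0]
coarse_width = coarse_edges[1] - coarse_edges[0]

# Share of the x axis covered by occupied bins in each grid
fine_coverage = np.count_nonzero(fine) / fine.size
coarse_coverage = np.count_nonzero(coarse) / coarse.size

print(f"fine bins:   width {fine_width:.5f}, coverage {fine_coverage:.1%}")
print(f"coarse bins: width {coarse_width:.1f}, coverage {coarse_coverage:.1%}")
```

The same six distinct values occupy six bins in both grids, but each coarse bin spans a much larger stretch of the axis, so the data render as segments rather than near-invisible dots.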

The plots above use the hvPlot library, which builds Bokeh, Plotly, and Matplotlib plots from high-level specifications.

This web page was generated from a Jupyter notebook and not all interactivity will work on this website.