NYC Buildings#

datashaderhvplotspatialpandas

Published: January 27, 2021 · Modified: February 28, 2025

Many plotting libraries can handle collections of polygons, including Bokeh and HoloViews. However, because browser-based libraries like Bokeh and Plotly send all the polygon data to Javascript running in the browser, they can struggle when either the collections or the individual polygons themselves get large. Even natively in Python, typical formats like Shapely for representing polygons scale poorly to large polygon collections, because each polygon is wrapped up as a full, separate Python object, leading to a lot of duplicated storage overhead when many polygons of the same type are defined.

If you want to work with lots of polygons, here you can see how to use SpatialPandas and Dask to represent polygons efficiently in memory, and hvPlot and Datashader to render them quickly in a web browser.

This example plots the outlines of all one million+ buildings in New York City. See nyc.gov for the original data and its description.

import hvplot.dask # noqa
import hvplot.pandas # noqa
import datashader as ds
import colorcet as cc
import spatialpandas as spd
import spatialpandas.io

from holoviews import opts
from holoviews.streams import PlotSize
from dask.distributed import Client
from IPython.display import display

client = Client()
client

Client

Client-73573ed9-f5f0-11ef-8a87-7c1e52098285

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

f393e82b

Dashboard: http://127.0.0.1:8787/status	Workers: 4
Total threads: 4	Total memory: 15.62 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-ad88b300-73a6-45b7-be31-e8da80b3fa99

Comm: tcp://127.0.0.1:40369	Workers: 4
Dashboard: http://127.0.0.1:8787/status	Total threads: 4
Started: Just now	Total memory: 15.62 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:46333	Total threads: 1
Dashboard: http://127.0.0.1:34221/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:33547
Local directory: /tmp/dask-scratch-space/worker-u7y1sv9x

Worker: 1

Comm: tcp://127.0.0.1:40979	Total threads: 1
Dashboard: http://127.0.0.1:35427/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:40965
Local directory: /tmp/dask-scratch-space/worker-fa3zysxa

Worker: 2

Comm: tcp://127.0.0.1:32893	Total threads: 1
Dashboard: http://127.0.0.1:46347/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:43423
Local directory: /tmp/dask-scratch-space/worker-gk8dr4be

Worker: 3

Comm: tcp://127.0.0.1:33353	Total threads: 1
Dashboard: http://127.0.0.1:40299/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:34583
Local directory: /tmp/dask-scratch-space/worker-hf5x4fkd

opts.defaults(opts.Polygons(height=500, xaxis=None, yaxis=None))

ddf = spd.io.read_parquet_dask('./data/nyc_buildings.parq').persist()
print(len(ddf))
ddf.head(3)

	geometry	type	name
hilbert_distance
1078494	MultiPolygon([[[-8277494.696406237, 4938583.71...	<NA>	<NA>
1087198	MultiPolygon([[[-8277324.377585323, 4938712.53...	<NA>	<NA>
1189938	MultiPolygon([[[-8276894.684350859, 4938973.11...	industrial	<NA>

Here you can see that we have 1.1 million “MultiPolygons”, some of which have a type and name declared.

To get a look at this data, let’s plot all the polygons, overlaid on a tiled map of the region:

%%time
display(ddf.hvplot.polygons(tiles='CartoLight', rasterize=True, aggregator='any'))

CPU times: user 24.4 s, sys: 1.29 s, total: 25.7 s
Wall time: 30.6 s

At this scale, the plot looks like a bunch of dots or large colored areas, because each building is smaller than a pixel in the plot. But if you have a live Python server running, you can use the Bokeh tools to zoom in and have the plot dynamically redrawn, showing you the full outline of each polygon.

Alternatively, we can re-plot the data over a smaller selection of the location that can show us the indidvidual polygons using the xlim and ylim parameters:

%%time
display(ddf.hvplot.polygons(tiles='CartoLight', rasterize=True, aggregator='any',
        xlim=(-8231400, -8230900), ylim=(4971900, 4972600)))

CPU times: user 457 ms, sys: 44.4 ms, total: 502 ms
Wall time: 2.41 s

Now let’s make use of the category information. To get a manageable number of types, we’ll compute the top 10 most common categories and drop everything else:

vc = ddf['type'].value_counts(sort=True).compute()
vc

type
garage              100301
shed                  1583
school                1228
apartments            1218
place_of_worship      1184
                     ...  
town_hall                1
transportation           1
wash                     1
water_tower              1
water_well               1
Name: count, Length: 164, dtype: int64[pyarrow]

cats = list(vc.iloc[:10].index.values) + ['unknown']
ddf['type'] = ddf['type'].replace({None: 'unknown'})
ddf = ddf[ddf['type'].isin(cats)]
ddf = ddf.categorize('type')

SpatialPandas lets us build a spatial index for accessing spatially organized regions more quickly, so let’s do that:

%%time
ddf = ddf.build_sindex().persist()

CPU times: user 24.1 ms, sys: 1 ms, total: 25.1 ms
Wall time: 23.8 ms

Now we can view each category separately with a selector widget:

%%time
display(ddf.hvplot.polygons(rasterize=True, tiles='CartoLight', groupby='type', aggregator='any'))

CPU times: user 3.63 s, sys: 311 ms, total: 3.94 s
Wall time: 31 s

If you look at each one, you can see that unfortunately most of the categories are unknown, but there are interesting patterns (e.g. almost no garages in Manhattan, and apparently all the sheds are in New Jersey).

Since these buildings don’t normally overlap, we can actually combine them all into a single plot using color to show all of the categories (though we have to construct a color key manually):

colors    = cc.glasbey_bw_minc_20_maxl_70
color_key = {cat: tuple(int(e*255.) for e in colors[i]) for i, cat in enumerate(cats)}

Now we put it all together, showing the color-coded plot:

%%time
plot = ddf.hvplot.polygons(tiles='CartoLight', datashade=True,
                           aggregator=ds.by('type', ds.any()),
                           cmap=color_key, responsive=True, legend='bottom_right')

display(plot)

CPU times: user 3.17 s, sys: 408 ms, total: 3.58 s
Wall time: 14.7 s

If you zoom into an area of interest, you’ll see an overlay of the building at that location, with the color of the polygon telling you the type of building it is.

Finally, we’ll make this notebook into a shareable app (run with anaconda-project run dashboard):

import panel as pn

text = """
# [1 million buildings in NYC](https://examples.holoviz.org/nyc_buildings)
## Rendered using [Datashader](https://datashader.org) and [hvPlot](https://hvplot.holoviz.org).
"""

pn.Column(text, pn.panel(plot, sizing_mode='stretch_both')).servable();

This web page was generated from a Jupyter notebook and not all interactivity will work on this website.