Exploring AIS vessel-traffic data#
This Jupyter notebook demonstrates how to use the Datashader-based rendering in HoloViews to explore and analyze US Coast Guard Automatic Identification System (AIS) vessel-location data. AIS data includes vessels identified by their Maritime Mobile Service Identity numbers along with other data such as vessel type. Data is provided here for January 2020 (200 million datapoints), but additional months and years of data can be downloaded for US coastal areas from marinecadastre.gov, and with slight modifications the same code here should work for AIS data available for other regions. This notebook also illustrates a workflow for visualizing large categorical datasets in general, letting users interact with individual datapoints even though the data itself is never sent to the browser for plotting.
import os, requests, numpy as np, pandas as pd, holoviews as hv, holoviews.operation.datashader as hd
import dask.dataframe as dd, panel as pn, colorcet as cc, datashader as ds
import spatialpandas as sp, spatialpandas.io, spatialpandas.geometry, spatialpandas.dask
from PIL import Image
from holoviews.util.transform import lon_lat_to_easting_northing as ll2en
from dask.diagnostics import ProgressBar
hv.extension('bokeh', 'matplotlib', width=100)
Vessel categories#
AIS pings come with an associated integer VesselType
, which broadly labels what sort of vessel it is. Different types of vessels are used for different purposes and behave differently, as we will see below when we color-code the location of each ping by the VesselType
using Datshader.
Here, we’ll use type names defined in a separate file constructed using lists of 100+ AIS Vessel Types. We’ve further collapsed those types into a smaller number of vessel categories:
vessel_types=pd.read_csv("AIS_categories.csv")
vessel_types.iloc[34:37]
num | desc | category | category_desc | |
---|---|---|---|---|
34 | 34 | Diving ops | 5 | Diving |
35 | 35 | Military ops | 6 | Military |
36 | 36 | Sailing | 7 | Sailing |
We’ll further reduce the category
to the 6 most common, with the rest as Other
.
categories = {r.num: r.category if r.category in [0,2,3,19,12,18] else 21 for i, r in vessel_types.iterrows()}
categories[np.NaN] = 0
def category_desc(val):
"""Return description for the category with the indicated integer value"""
return vessel_types[vessel_types.category==val].iloc[0].category_desc
vessel_mapping = dict(zip(vessel_types.num.to_list(), vessel_types.category.to_list()))
We can print the resulting categories by number, with their description:
groups = {categories[i]: category_desc(categories[i]) for i in vessel_types.num.unique()}
print(" ".join([f"{k}:{v}" for k,v in sorted(groups.items())]))
0:Unknown 2:Fishing 3:Towing 12:Tug 18:Passenger 19:Cargo 21:Other
We’ll map these categories to colors defined by colorcet and construct a color key and legend to use for plotting:
colors = cc.glasbey_bw_minc_20_minl_30
color_key = {list(groups.keys())[i]:tuple(int(e*255.) for e in v) for i,v in
enumerate(colors[:(len(groups))][::-1])}
legend = hv.NdOverlay({groups[k]: hv.Points([0,0], label=str(groups[k])).opts(
color=cc.rgb_to_hex(*v), size=0)
for k, v in color_key.items()})
legend.options(xaxis='bare',yaxis='bare', title='', toolbar=None)
Load traffic data#
Next we will load the data from disk, either directly from a spatially indexed Parquet file (if previously cached) or from the raw CSV files. We’ll also project the location data to the coordinate system we will use later for plotting. There’s a lot of data to process, so we’ll use Dask to ensure that we use all the cores available on this machine. Dask breaks a dataset into partitions that can be processed in parallel, so here we define a function for dealing with one partition’s worth of data, along with a schema showing what the final dataframe’s columnar structure will be.
def convert_partition(df):
east, north = ll2en(df.LON.astype('float32'), df.LAT.astype('float32'))
return sp.GeoDataFrame({
'geometry': sp.geometry.PointArray((east, north)),
'MMSI': df.MMSI.fillna(0).astype('int32'),
'category': df.VesselType.replace(categories).astype('int32')})
example = sp.GeoDataFrame({
'geometry': sp.geometry.PointArray([], dtype='float32'),
'MMSI': np.array([], dtype='int32'),
'category': np.array([], dtype='int32')})
Next we will define the function that will load our data, reading a much-smaller (and much faster to load) previously cached Parquet-format file from disk if available. To use files covering other date ranges, just download them and change basedir
and/or basename
to match them.
basedir = './data/AIS_2020_01_broadcast.parq/ship_traffic/'
basename = 'AIS_2020_01'
index = 'MMSI'
dfcols = ['MMSI', 'LON', 'LAT', 'VesselType']
vesselcols = ['MMSI', 'IMO', 'CallSign', 'VesselName', 'VesselType', 'Length', 'Width']
def load_data(spatial_index=False):
cache_file = basedir+basename+'_broadcast.parq'
vessels_file = basedir+basename+'_vessels.parq'
if (os.path.exists(cache_file) and os.path.exists(vessels_file)):
print('Reading vessel info file')
vessels = dd.read_parquet(vessels_file).compute()
print('Reading parquet file')
gdf = sp.io.read_parquet_dask(cache_file).persist()
else:
csvs = basedir+basename+'*.csv'
with ProgressBar():
print('Writing vessel info file')
df = dd.read_csv(csvs, usecols=vesselcols, assume_missing=True)
vessels = df.groupby(index).last().reset_index().compute()
vessels[index] = vessels[index].astype('int32')
vessels.to_parquet(vessels_file)
print('Reading CSV files')
gdf = dd.read_csv(csvs, usecols=dfcols, assume_missing=True)
gdf = gdf.map_partitions(convert_partition, meta=example).persist()
print('Writing parquet file')
gdf = gdf.pack_partitions_to_parquet(cache_file, npartitions=64).persist()
with ProgressBar():
if spatial_index:
print('Building spatial index') # Takes a couple of minutes for 1 month's data
gdf = gdf.build_sindex().persist()
gdf['category'] = gdf['category'].astype('category').cat.as_known()
return gdf, vessels
Now let’s actually load the data, using the disk cache and memory cache if available. If you set spatial_index
to True
above it should speed up selection of individual points in the final plot, though load_data
will then take several minutes rather than a few seconds.
%%time
df, vessels = pn.state.as_cached('df', load_data)
Reading vessel info file
Reading parquet file
[ ] | 0% Completed | 174.27 us
[### ] | 9% Completed | 102.89 ms
[######### ] | 24% Completed | 204.37 ms
[############### ] | 38% Completed | 306.04 ms
[###################### ] | 56% Completed | 408.88 ms
[############################## ] | 76% Completed | 511.75 ms
[###################################### ] | 95% Completed | 612.85 ms
[########################################] | 100% Completed | 713.72 ms
CPU times: user 11.4 s, sys: 1.67 s, total: 13 s
Wall time: 3.81 s
Here we can see that this is a collection of points (latitude and longitude projected to Web Mercator) with an associated integer MMSI and category value:
df.head(1)
geometry | MMSI | category | |
---|---|---|---|
hilbert_distance | |||
4673612 | Point([-19029536.0, 775638.5]) | 737898816 | 0 |
Predefined locations#
We’ll provide interactive plots later that let you zoom in anywhere you like, but first let’s highlight a few specific areas of interest for those without a live Python process to interact with:
def ranges(lon_range, lat_range):
x_range, y_range = ll2en(lon_range, lat_range)
return dict(x=tuple(x_range), y=tuple(y_range))
x_range, y_range = ll2en([-54,-132], [15,51])
bounds = dict(x=tuple(x_range), y=tuple(y_range))
loc = {
'Continental US': ranges((-132.0, -54.0), (15.0, 51.0)),
'Vancouver Area': ranges((-126.0, -120.7), (47.5, 49.5)),
'NY and NJ': ranges(( -75.6, -71.3), (39.4, 41.1)),
'Gulf of Mexico': ranges(( -98.0, -81.0), (23.8, 32.0)),
'Gulf Coast': ranges(( -98.0, -87.0), (25.2, 31.0)),
'Louisiana Coast': ranges(( -91.5, -87.8), (28.4, 30.1)),
'Mississipi Delta': ranges(( -90.1, -89.2), (28.65,29.15)),
'Louisiana East Bay': ranges(( -89.37, -89.28),(28.86,28.9)),
'Bering Sea': ranges((-171.0, -159.0), (52.0, 56.0)),
'Hawaii': ranges((-160.0, -154.5), (19.5, 22.1)),
'LA to San Diego': ranges((-120.5, -117.0), (32.6, 34.1)),
'Great Lakes': ranges(( -89.0, -77.0), (41.2, 46.1)),
'Chesapeake Bay': ranges(( -78.0, -71.0), (36.4, 39.6)),
'Pamlico Sound, NC': ranges(( -80.0, -72.5), (33.1, 36.8)),
'Savannah, GA': ranges(( -81.2, -80.3), (31.85,32.25)),
'Florida': ranges(( -90.0, -74.5), (23.3, 31.0)),
'Puerto Rico': ranges(( -68.1, -64.2), (17.4, 19.5))}
We can easily render these to PNGs using HoloViews to call Datashader and render the results using Matplotlib:
hv.output(backend='matplotlib')
hv.opts.defaults(
hv.opts.RGB(xaxis=None, yaxis=None, axiswise=True, bgcolor='black'),
hv.opts.Layout(hspace=0.0, vspace=0.1, sublabel_format=None, framewise=True, fig_size=400))
plots = [hd.datashade(hv.Points(df), color_key=color_key, cmap=cc.fire, width=1000, height=600,
dynamic=True, x_range=ranges['x'], y_range=ranges['y']).relabel(region)
for region, ranges in loc.items()]
hv.Layout(plots).cols(1)