An open source point cloud data infrastructure: introduction, and land.

I’ve recently been exploring the Point Data Abstraction Library (PDAL) and Entwine ecosystem a lot. For work, and for proving some ideas that have been kicking around for a few years now (see my EGU poster from 2016, or FOSS4G talk in 2017). These are not new ideas – other people have had them for a lot longer – and we’ve all evolved in parallel. Right now, a few things seem to be converging based on the work of many visionaries.

My recent workshop at FOSS4G SotM Oceania made something click for me – let me explain:

The concept of point clouds as a foundation data service has occupied some seriously large minds, resulting in infrastructure like OpenTopography, Lidar.io, and Geoscience Australia’s ELVIS. All take a really great stab at the idea in different ways, but each misses components. At the time of writing, a user cannot make a web request and have a product turn up in their GIS or other workflow without some extra wrangling. I made a stab at exactly that using the OGC Web Processing Service (WPS) as an API (PyWPS, to be precise); and it worked. It wasn’t scalable the way I built it, there was no UI to search for data, and the web service backend was barely ready to prototype. But it did the one thing all the other services didn’t: you could make a web request and have data show up in QGIS (or other tools) – using tools which already exist.

There were two more drivers for using WPS as an API: anyone could build a user interface on top of it, and anyone (for example OpenTopography) could federate the data – that is, add a tool for querying the metadata catalogue to their own service, but have the data products made elsewhere.

Unfortunately that prototype died with my departure from that role (you can still see the code here), and awaits reawakening (read: funding to work on it).

After years of my pushing the WPS bandwagon, Entwine + PDAL have now thrown a seriously feisty cat into the bag.

Time to rethink a bit!

Why is the PDAL + Entwine ecosystem disruptive?

Organising data into Entwine octrees gives us the ability to explore the data in 3D, in a web browser. Here’s an example – roughly 30 billion points covering 1600 square kilometres, the entire Australian Capital Territory in a single index, using the Potree viewer (with apologies to Safari users – try Chrome or Firefox):

(A full browser window version can also be explored – and thanks to the ACT Government for making the data and imagery openly accessible)

…but why? Anyone following my work, or Hobu Inc, or Georepublic, or several others for the past two years will have seen that, or something like it, before – using Potree (or perhaps CesiumJS). What’s special about this one? And haven’t we already seen trillions of points in a web browser?

Proving the concept – on land

Let me demonstrate. We’ll grab a building from OpenStreetMap and take a look at it in QGIS, removing everything except the building we’re interested in – in this case the National Museum of Australia. (Note: if you were cleverer at OSM than I am, you could directly query for the exact polygon you’re interested in. Also, this post led directly to an OSM update for this building. Open data wins all round.)

National Museum of Australia, from OpenStreetMap (http://openstreetmap.org)

I can then take the bounding box of the OSM subset and form a pretty simple configuration file to clip out some point cloud data from the resource we’ve just explored in our browser. Let’s save the JSON below as ‘entwine-nma-bbox.json’:

{
  "pipeline": [
    {
        "type": "readers.ept",
        "filename": "http://act-2015-rgb.s3.amazonaws.com/",
        "bounds": "([692738, 692967], [6092255,6092562])"
    },
    "nma-bbox-extract.laz"
  ]
}

…which is great! I can straight away subset some data and have it delivered to my machine using the straightforward:

pdal pipeline entwine-nma-bbox.json
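The bounds string in that pipeline is just the extent of the OSM polygon. A small helper can generate it from any set of coordinate pairs – note `ept_bounds` is a hypothetical convenience function for this post, not part of PDAL:

```python
def ept_bounds(coords, pad=0):
    """Build a readers.ept bounds string from (x, y) coordinate pairs.

    coords: iterable of (x, y) tuples, e.g. a polygon ring exported from OSM.
    pad: optional buffer, in map units, added around the extent.
    """
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    # readers.ept expects "([xmin, xmax], [ymin, ymax])"
    return "([{}, {}], [{}, {}])".format(
        min(xs) - pad, max(xs) + pad, min(ys) - pad, max(ys) + pad)

# the two corners used in the pipeline above:
print(ept_bounds([(692738, 6092255), (692967, 6092562)]))
```

A small `pad` is handy when you crop to a polygon later and still want nearby ground points in the download.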

Using CloudCompare, we’ll inspect the result:

Clipped subset of the ACT 2015 LiDAR survey, as LAZ

Let the awesomeness of this sink in for a moment. What we’ve done is viewed a state-scale LiDAR dataset in our web browser, chosen a chunk we liked, and grabbed it.

We didn’t pay millions for an enterprise license, we didn’t sneakernet a case full of hard drives around, we can do this right now – try it and see!

Moar!

Because amazeballs is never enough in the year 2019, let’s do more. What if my polygon is complex? Or I want to do something else?

OK, we will:

  • Export the polygon we got from OSM as WKT and use it to collect just the LiDAR points of the building the polygon describes; and
  • While we’re at it, we want to know the building height. We don’t care about its absolute elevation – that’s for the abstruse geodesists, and we’re all about Smart Cities/BIM:
{
  "pipeline": [
    {
        "type": "readers.ept",
        "filename": "http://act-2015-rgb.s3.amazonaws.com/",
        "bounds": "([692738, 692967], [6092255,6092562])"
    },
    {
        "type": "filters.hag"
    },
    {
        "type": "filters.ferry",
        "dimensions": "HeightAboveGround=>Z"
    },
    {
        "type": "filters.crop",
        "polygon" : "POLYGON((692809.631984224 6092391.60313353,692799.247843938 6092386.80906168,692813.997862914 6092370.69030712,692799.341142632 6092377.98433249,692783.122811184 6092368.42091656,692802.595638057 6092342.54609695,692819.659141064 6092342.28095851,692809.111903783 6092297.96009778,692795.402837776 6092292.12737559,692814.358454172 6092275.68559426,692844.691312863 6092275.59153365,692843.847865631 6092281.24725395,692860.29008336 6092282.99292535,692874.120409721 6092288.11269778,692884.989834177 6092294.32790882,692894.130786815 6092302.25586516,692900.514437 6092308.88889506,692901.715205042 6092312.70303068,692908.866089757 6092311.3958145,692910.507918866 6092305.63421591,692918.085205178 6092318.72279534,692924.159199383 6092335.11739437,692925.673620082 6092357.8354457,692921.966686887 6092378.49008827,692935.097127742 6092382.35961037,692936.517882442 6092379.02205761,692941.636641646 6092375.01716298,692947.256857479 6092387.63711741,692943.518383503 6092389.81462658,692939.362550937 6092394.59795051,692937.431706935 6092402.58529782,692947.260107573 6092409.46639951,692884.146918496 6092459.02597591,692881.195478416 6092464.39387366,692812.639114262 6092516.44429533,692812.599341542 6092519.26398135,692812.595950649 6092523.35913342,692810.755024584 6092525.77343595,692808.039261596 6092527.27423773,692802.715048715 6092528.48680142,692801.081794346 6092529.97554437,692795.928006554 6092531.07348086,692791.936524957 6092532.91230113,692789.439346739 6092535.27404822,692787.699481856 6092538.58510367,692782.912090064 6092537.24478343,692777.807251062 6092537.65360033,692772.698533279 6092539.58289217,692769.087121801 6092542.60103523,692758.270201297 6092528.63830555,692759.087787452 6092526.02394309,692771.309133484 6092522.26678656,692774.829029429 6092522.20260691,692781.531999808 6092522.4920648,692782.621073981 6092521.93607932,692785.623441834 6092519.79658239,692809.282973204 6092502.04458437,692827.537887849 
6092487.70417834,692814.464804925 6092469.50603599,692808.829360296 6092474.03240084,692790.616820108 6092449.94026553,692833.413719661 6092432.74435001,692837.629857358 6092436.30532033,692843.029760678 6092433.10461002,692840.635557791 6092430.07064991,692856.282313714 6092423.54332098,692850.343301717 6092415.58010867,692894.045225888 6092381.1629562,692900.092838284 6092377.19369717,692891.710326343 6092371.54675326,692895.549863927 6092356.66013804,692896.363214271 6092349.17392137,692890.545769686 6092335.83680528,692883.279967828 6092324.97219394,692878.808247682 6092326.89902172,692874.874576584 6092319.10377162,692863.337346334 6092311.87074282,692853.12725545 6092308.4047461,692840.050371227 6092308.30721371,692837.435988798 6092321.34755383,692822.568932161 6092314.55194212,692831.52798418 6092352.68091387,692838.868724292 6092353.866695,692829.687931178 6092375.54841466,692825.500126318 6092370.3332669,692812.182559382 6092384.93429695,692817.715402173 6092387.52379717,692813.274988467 6092398.57229397,692812.190919008 6092400.63747517,692808.694012639 6092403.47562527,692807.43522688 6092399.0745384,692807.899677916 6092396.96712323,692809.631984224 6092391.60313353))"
    },
    "cropped-data-from-entwine.laz"
  ]
}

…save the JSON as `entwine-nma-polygon.json` and run with:

pdal pipeline entwine-nma-polygon.json

The National Museum main building, as building height.

In this example, we still transfer all the data inside the bounding box across the internet to the machine running PDAL, then discard all the points outside the polygon. We could also reverse the operation – clip first, then operate on points we’ve kept (you might want to do this if your initial bounding box is huge).
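A sketch of that crop-first ordering, with the same stages reordered and the polygon abbreviated for readability (substitute the full WKT from above). One caveat: filters.hag still needs ground-classified points inside the cropped region to compute heights, so a tight building-footprint crop may warrant padding the polygon:

```json
{
  "pipeline": [
    {
        "type": "readers.ept",
        "filename": "http://act-2015-rgb.s3.amazonaws.com/",
        "bounds": "([692738, 692967], [6092255,6092562])"
    },
    {
        "type": "filters.crop",
        "polygon": "POLYGON((... full WKT as above ...))"
    },
    {
        "type": "filters.hag"
    },
    {
        "type": "filters.ferry",
        "dimensions": "HeightAboveGround=>Z"
    },
    "cropped-data-from-entwine.laz"
  ]
}
```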

What else can we do? Create a raster output? Sure.

Here’s an example doing just that – a rasterised surface model, displayed in QGIS, using height above ground (building height – because again, BIM is this Tuesday, and geomatics was last Wednesday):

Rasterised National Museum building height

…and another, this time a mesh model displayed in MeshLab:

Mesh model of the National Museum and surrounding ground

In the tradition of more, what else can we do? Add some custom Python code? OK. Return only ground points? Fine. Whatever we can do with PDAL, we can do here.
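For example, returning only ground points is a one-stage pipeline change – a sketch reusing the reader from above, with filters.range keeping ASPRS class 2 (the output filename `ground-only.laz` is just a placeholder):

```json
{
  "pipeline": [
    {
        "type": "readers.ept",
        "filename": "http://act-2015-rgb.s3.amazonaws.com/",
        "bounds": "([692738, 692967], [6092255,6092562])"
    },
    {
        "type": "filters.range",
        "limits": "Classification[2:2]"
    },
    "ground-only.laz"
  ]
}
```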

Speaking for the trees, too

BIM and smart cities might be hot past next Tuesday, but landscape ecology is our forever.

So let’s look at another datasource – 7610 square kilometres of pretty flat Mallee country in Australia, collected by Geoscience Australia and hosted in their ELVIS system (again, with apologies to Safari users):

As a quick aside – now we’re covering some ground, literally. We’ve jumped from 1600 to 7610 square kilometres. There’s a pattern here!

…and back on topic, let’s do something with it. Choosing a pretty arbitrary 1 x 1 km region of interest we’ll create a 10 metre resolution ‘tree percentage’ map:

{
  "pipeline": [
    {
      "type":"readers.ept",
      "filename":"http://lower-mdb-2013-rgb.s3.amazonaws.com/",
      "bounds":"([566655, 567655], [6254307,6255307])"
    },
    {
        "type":"filters.python",
        "script":"treepercentage.py",
        "function":"treepercentage",
        "module":"anything",
        "pdalargs":"{\"cellsize\":10, \"spatialreference\":\"EPSG:28354\", \"outputfile\":\"treeness-new.tiff\"}"
    },
    {
        "type":"filters.range",
        "limits":"Classification[2:2]"
    },
    {
        "type":"writers.gdal",
        "gdalopts":"t_srs=EPSG:28354",
        "resolution":10,
        "filename":"trees-dem.tiff",
        "output_type":"idw"
    }
  ]
}

What happens here? The Python code below grids the LiDAR into a user-defined grid, counts how many points fall in each cell, counts how many of those points have Classification = 5 (high vegetation), and computes a normalised tree percentage. It then writes out a raster of the results. By the way, feel free to suggest better/faster ways to do the same tasks, as long as imported libraries are kept minimal.

## Python script to make a treeness map...
import numpy as np
import rasterio
from rasterio.transform import from_origin

# grid the trees and not trees
# we want to know how much area the trees cover per input grid cell

def treepercentage(ins, outs):
    #extract our PDAL arguments
    cellsize = pdalargs["cellsize"]
    spatialreference = pdalargs["spatialreference"]
    outputfile = pdalargs["outputfile"]

    #get indices of all the tree points
    treeInds = np.where(ins["Classification"] == 5)

    #get tree dimensions - not necessary, but makes things easier to read
    treesX = ins["X"][treeInds]
    treesY = ins["Y"][treeInds]
    treesZ = ins["Z"][treeInds]

    #get our bounding box dimensions
    boxdims=[]
    boxdims.append(np.max(ins["X"])-np.min(ins["X"]))
    boxdims.append(np.max(ins["Y"])-np.min(ins["Y"]))

    #find out how many histogram bins we can put in the bounding box
    x_bins=int(np.ceil(boxdims[0]+1) * (1/cellsize))
    y_bins=int(np.ceil(boxdims[1]+1) * (1/cellsize))

    # create some histogram bins
    x_grid_points = np.linspace(np.floor(np.min(ins["X"])),np.ceil(np.max(ins["X"])),x_bins)
    y_grid_points = np.linspace(np.floor(np.min(ins["Y"])),np.ceil(np.max(ins["Y"])),y_bins)

    #create 1D X and Y coordinate arrays from the histogram grid.
    # there are probably N better ways; this is the way I thought of first
    x_gridded_coords = []
    y_gridded_coords = []
    for i in y_grid_points:
        for j in x_grid_points:
            y_gridded_coords.append(i)
            x_gridded_coords.append(j)

    #using a 2D histogram, count how many points fall in each grid cell
    counts_all, _, _ = np.histogram2d(ins["Y"], ins["X"], bins=(y_grid_points,x_grid_points))

    #then count how many points labelled 'high vegetation' fall in each grid cell
    counts_t, _, _ = np.histogram2d(treesY, treesX, bins=(y_grid_points,x_grid_points))

    #make a proportion of trees per grid cell area

    treePercent = np.nan_to_num( counts_t / counts_all )

    # set up to write out the grid as a GeoTIFF
    geotransform = from_origin(np.min(x_gridded_coords), np.max(y_gridded_coords), cellsize, cellsize)

    # and write it. Note histogram2d returns (n_edges - 1) bins per axis,
    # so the raster dimensions come from the counts array itself.
    with rasterio.open(outputfile, 'w', driver='GTiff',
                            height=treePercent.shape[0], width=treePercent.shape[1],
                            count=1, dtype='float64',
                            transform=geotransform,
                            crs=spatialreference) as dataset:
        dataset.write(np.flip(treePercent,0), 1)

    #return
    return True
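The core of the script – two 2D histograms and a ratio – is easy to see on toy data, independent of PDAL:

```python
import numpy as np

# nine points in a 2 x 2 grid of 1-unit cells; class 5 = high vegetation
x = np.array([0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 0.5, 1.5, 1.5])
y = np.array([0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 0.5, 1.5, 1.5])
cls = np.array([5,   2,   5,   5,   2,   2,   2,   5,   2])

edges = np.array([0.0, 1.0, 2.0])

# count all points per cell, then just the 'tree' points per cell
counts_all, _, _ = np.histogram2d(y, x, bins=(edges, edges))
counts_t, _, _ = np.histogram2d(y[cls == 5], x[cls == 5], bins=(edges, edges))

# rows follow y, columns follow x; nan_to_num guards empty cells (0/0)
tree_fraction = np.nan_to_num(counts_t / counts_all)
print(tree_fraction)
```

The bottom-left cell of the toy grid holds one tree point out of three, giving 1/3; a cell where every point is class 5 scores 1.0.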

Here’s what it looks like in QGIS, after applying some gdalwarp smoothing love. Dark blue means nearly no points were classified as trees, yellow means nearly all the points in a cell were classified as trees.

We also push out a DTM for good measure – that’s why the range filter and GDAL writer after the Python block exist.

The entire operation (data query, processing, writing out two rasters) took 37 seconds – that’s time to spare within a single synchronous HTTP request. Clearly, scaling to a 100 x 100 km region might slow things down a little.

Curiosity is fine – but what’s the real application here? In a pretty quick and hacky brew of JSON and Python, we have a tool for validating satellite-derived tree coverage against LiDAR datasets. 10 metres roughly corresponds to Sentinel-2; change the cell size to 25 and we have correspondence with DEA ARD; 30 and we have USGS Landsat. I hope the picture is getting clearer.

Summary: A fundamental shift

The fundamental shift here, in my eyes, is that we now have more of the required parts for a now-and-future point cloud data infrastructure in one place than I’ve ever seen. Using a single data archive we can tick off exploratory visualisation and real data processing, all without data loss, and with access to complete metadata – required in an open science/open data landscape. We also have compression, and a spatial index in 3D.

Whether you’re a C++ creator, Pythonista extraordinaire, or a ‘scientific coder’ and command-line hack like myself, this is warp factor 10 toward real data application and utility. It’s getting close to effectively cracking a massive problem – how to manage all these awesome data we collect!

I can’t write this post without a huge shout out to Hobu Inc, and all their supporters/funders – who have really nursed an incredible set of tools into being, and made them open for us all to build with. And I am only scratching at the surface here.

Next…

Like what you see here? Also check out part 2: oceans, and stay tuned for part 3: objects – coming soon.

The sales pitch

Spatialised is a fully independent, full time consulting business. The tutorials, use cases and write-ups here are free for you to use, without ads or tracking.

If you find the content here useful to your business or research, you can support production of more words and open source geo-recipes via Paypal or Patreon; or hire me to do stuff. Enjoy!