I recently finished a proof-of-concept project with FrontierSI, creating some building blocks aimed at scalable, programmatic quality assurance for hydrographic surveys. It was a fantastic opportunity to work on the foundations of some huge future projects on large-scale hydrographic data collection.
I set out thinking ‘great, I can PDAL all the things!’ – but it was a little more complex.
Hydrographic data come in myriad formats. And I’m a latecomer to the party – generations of effort have already gone into methods for making better, more accessible hydrographic data. Not to mention, heaps of toolkits for data QA already exist – so why build more?
The project had a couple of interesting conditions:
- Open source
…which meant that existing work was thin on the ground. MB-System exists, and is scriptable – but is also fairly inaccessible to the local scientific coding community – who are generally pythonistas and not C++ers. It also needed to handle data formats which are generally upstream of MB-System (I’d be happy to be corrected here!) – the results of processes occurring at MB-System levels.
We started looking at the Hydroffice QCTools – but ran into an immediate issue: the project code was not available. It was compiled into NOAA’s Pydro package; but only for Windows. By the end of the project, one of the significant wins was a discussion with Hydroffice resulting in… open code! Go see it here: https://github.com/hydroffice/hyo2_qc
…by the time that happened, it was too late to reconfigure everything – and we’d already grown a really nice little toolkit using Python to glue together Fiona, Rasterio, Shapely and PDAL.
What did we make?
Designing the system from the ground up, we also built for flexible scalability. We explored Docker as a platform for a little while, before settling on Conda as a way to manage all the dependencies across multiple operating systems. I think it will eventually dockerise as well.
We started with a brief of ingesting at least two data formats to perform three tests:
- do the data have CRS information?
- do the data cover the planned survey region?
- does the data density match the reported density?
It ended up being able to ingest five different formats (ASCII, LAZ, LAS, GeoTIFF, BAG); and go above and beyond with tests:
- do the data have CRS information? (except ASCII, which is assumed to always be lon/lat/metres)
- do the data cover any of the planned region?
- if so, how much? and what is the area of intersection?
- if not, what is the distance between survey and planned regions?
- does the data density match the reported density?
- what does the metadata say? what do the data say using both grid metrics and concave hull bounds and number of datapoints/pixels?
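To make the coverage and density questions above concrete, here is a minimal, dependency-free sketch. It is not the project's code: the toolkit works on real footprints (concave hulls) with Shapely, while this version assumes simple bounding boxes in a projected CRS with metre units. `Box`, `coverage_report`, `density_check` and the 10% tolerance are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; assumed to be in a projected CRS (metres)."""
    minx: float
    miny: float
    maxx: float
    maxy: float

    @property
    def area(self) -> float:
        return (self.maxx - self.minx) * (self.maxy - self.miny)


def coverage_report(survey: Box, planned: Box) -> dict:
    """Does the survey cover any of the plan? If so how much, if not how far away?"""
    # Width and height of the overlap; zero or negative means no overlap on that axis.
    overlap_w = min(survey.maxx, planned.maxx) - max(survey.minx, planned.minx)
    overlap_h = min(survey.maxy, planned.maxy) - max(survey.miny, planned.miny)

    if overlap_w > 0 and overlap_h > 0:
        area = overlap_w * overlap_h
        return {
            "intersects": True,
            "intersection_area_m2": area,
            "percent_of_plan_covered": 100.0 * area / planned.area,
            "distance_m": 0.0,
        }

    # No shared area (boxes that only touch land here too, with distance 0).
    gap_x = max(0.0, -overlap_w)
    gap_y = max(0.0, -overlap_h)
    return {
        "intersects": False,
        "intersection_area_m2": 0.0,
        "percent_of_plan_covered": 0.0,
        "distance_m": hypot(gap_x, gap_y),
    }


def density_check(point_count: int, footprint_area_m2: float,
                  reported_density: float, tolerance: float = 0.1) -> tuple:
    """Return (measured points per square metre, whether it is within tolerance)."""
    measured = point_count / footprint_area_m2
    within = abs(measured - reported_density) <= tolerance * reported_density
    return measured, within


planned = Box(0, 0, 1000, 1000)
partial = coverage_report(Box(500, 0, 1500, 1000), planned)
missed = coverage_report(Box(1300, 1400, 2000, 2000), planned)
measured, ok = density_check(2_000_000, 1_000_000.0, reported_density=2.1)
```

With real survey footprints the same questions map onto polygon operations (intersection area and distance between geometries), and ASCII data in lon/lat would need reprojecting before any area or distance in metres makes sense.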
The results live here: https://github.com/frontiersi/qa4mbes-data-pipeline
…and you can see the tests in action using Jupyter notebooks here: https://github.com/frontiersi/qa4mbes-data-pipeline/tree/master/notebooks – along with some background sketching for ideas.
With some luck we’ll be able to collaborate with the Hydroffice QCTools team for future phases of work; and develop an open toolkit that will help the global hydrography community work better, and easier! (I have some ideas about data formatting, also…). Working with FrontierSI on an agile project was easy and professional, and I’d be happy to do so again!