What we’ll be covering
What GeoTrellis is, what it can do
Demo of GeoTrellis in action
Next steps for GeoTrellis: the use cases we’re excited about, and our roadmap
Feel free to ask questions throughout!
Where did GeoTrellis come from?
2011 - 2013
2013 - Present
What is GeoTrellis?
GeoTrellis
A Scala library for geospatial data types and operations.
Adds geospatial capabilities to Spark.
Stores and queries raster data in HDFS, S3, Accumulo, and Cassandra (HBase soon).
Geo +
Rasters +
Rasters, some Vector +
v1.0 Q4 2016
Rasters, Vector, VectorTiles, Point Cloud +
ROADMAP v1.1
w/
Vector Data with GeoTrellis (non-Spark)
Wraps JTS
GeoJson, WKT, WKB reading/writing
Reprojection (Proj4j)
Kriging Interpolation
Rasters with GeoTrellis (non-Spark)
Read GeoTiffs
Map Algebra (local, focal, zonal)
Polygonal Summaries
Generally transform and combine raster data
Kernel Density, rasterization, vectorization
Get histograms
Render via color breaks
GeoTrellis & Spark
Ingest data to local file system, HDFS, Accumulo, S3, or Cassandra
Distributed computation on spatial and spatiotemporal raster data
Map algebra on distributed tile sets
General ways to transform and combine distributed tile sets
BACKGROUND
PROCESSING GEOSPATIAL DATA @ SCALE
Geospatial Data
Core of GIS (geographic information systems)
Raster (images, weather data)
Vector (points of interest, country boundaries)
VectorTiles, Point Cloud
Raster Data
Vector Data (Points)
Vector Data (Lines)
Vector Data (Polygons)
Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/
Vector Data
PROCESSING GEOSPATIAL DATA @ SCALE
Contains
Heatmap (Kernel Density)
Zonal Statistics
Feature Extraction (Image Segmentation)
Source: http://www.professeurs.polymtl.ca/christopher.pal/
Map Algebra
Local Operation
Focal Operation
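The difference between a local and a focal operation can be sketched in a few lines. This is a minimal plain-Python illustration on nested lists, not the GeoTrellis API; the function names `local_add` and `focal_mean` and the sample rasters are assumptions made for the example.

```python
from statistics import mean


def local_add(a, b):
    """Local operation: each output cell depends only on the same cell of each input."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(raster, radius=1):
    """Focal operation: each output cell is computed from a neighborhood.

    The neighborhood here is a square window of the given radius,
    clipped at the raster edges (an assumption for this sketch).
    """
    rows, cols = len(raster), len(raster[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [
                raster[y][x]
                for y in range(max(0, i - radius), min(rows, i + radius + 1))
                for x in range(max(0, j - radius), min(cols, j + radius + 1))
            ]
            out_row.append(mean(window))
        out.append(out_row)
    return out


elevation = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
offset = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

summed = local_add(elevation, offset)   # cell-by-cell sum
smoothed = focal_mean(elevation)        # 3 x 3 moving average
```

A zonal operation would follow the same pattern, except that cells are grouped by a second "zone" raster before being summarized.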
Map Algebra in GeoTrellis
PROCESSING GEOSPATIAL DATA WITH
Polygonal Summary Statistics
PROCESSING GEOSPATIAL DATA @ SCALE
NED 1/3 arc second
Landsat 8
• 170 × 180 km
• 2 GB each
• 11 bands
• 700 scenes per day
• 1.4 TB / day
• 255,500 scenes / year
• 0.25 PB / year
Landsat 8 on AWS
• All Landsat 8 scenes from 2015 and beyond.
• Selection of cloud-free scenes from 2013 and 2014.
645,763 scenes
≈1 Petabyte
64 GB
32 Landsat 8 Scenes
This many people’s phones could hold all the Landsat 8 data that AWS is holding.
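The phone comparison is simple arithmetic, sketched below using only the figures quoted above (64 GB per phone, 32 scenes per phone, 645,763 scenes, roughly 1 PB). Treating 1 PB as 1,000,000 GB is an assumption, and the two estimates differ because the per-scene size is itself a rounded figure.

```python
import math

PHONE_GB = 64                # storage per phone, from the slide
SCENES_PER_PHONE = 32        # scenes that fit on one phone, from the slide
ARCHIVE_SCENES = 645_763     # scenes in the archive, from the slide
ARCHIVE_GB = 1_000_000       # "about 1 Petabyte", assuming decimal units

# Implied size of one scene.
gb_per_scene = PHONE_GB / SCENES_PER_PHONE

# Estimate 1: divide the scene count by scenes per phone.
phones_by_scene_count = math.ceil(ARCHIVE_SCENES / SCENES_PER_PHONE)

# Estimate 2: divide the total volume by the capacity of one phone.
phones_by_volume = math.ceil(ARCHIVE_GB / PHONE_GB)

print(gb_per_scene, phones_by_scene_count, phones_by_volume)
```

Either way the answer is on the order of an arena crowd, which is the point of the comparison.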
PROCESSING GEOSPATIAL DATA @ SCALE
Project to build a better search engine, back in the early 2000s.
Worked for small datasets, but was not scalable.
The Google papers
After reading the papers, Nutch developers added a distributed file system and MapReduce model to Nutch.
In 2006, those portions were spun out of Nutch to form…
Apache Hadoop
Heavily supported by Yahoo, which moved its large data processing to Hadoop.
By 2007, Twitter, Facebook, LinkedIn and many others were doing serious work with Hadoop.
In 2008, Hadoop graduated to a top-level Apache project.
Hadoop
Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
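The MapReduce word count pictured in that figure can be imitated on a single machine. This is a minimal sketch of the map, shuffle, and reduce phases in plain Python, not Hadoop code; the function names and the three input lines are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here all three phases run in one process.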
Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for Machine Learning algorithms and other iterative models.
So in 2009, he created…
Open sourced in 2010 under a BSD license
Maintained by UC Berkeley’s AMPLab
Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0
Graduated to a top level Apache project in 2014
Apache Spark
a distributed computation engine.
An API that lets you work with distributed data as a collection.
Written in Scala, with language bindings for use with Java, Python, and R.
Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones?
[Diagram: one Name Node and three Data Nodes]
[Diagram: one Master and three Tablet Servers]
Accumulo
BigTable clone (columnar database)
Records stored on HDFS
Lexicographically sorted table index
Apache Accumulo
Created by the NSA in 2008
Donated to the Apache Software Foundation in 2011
Graduated to a top level project in 2012
2006
(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.
Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month?
PROCESSING GEOSPATIAL DATA @ SCALE
Hey Flyers Fans, can you take the average pixel value of each scene’s band and derive an EPSG:3857 tile set of PNGs to be served on web maps?
How does GeoTrellis work?
Polygonal Summaries
SPACE FILLING CURVES
Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per month, per country?
Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones per country?
SPACE FILLING CURVES
Z curve
Hilbert Curve
Space Filling Curves
Range Decomposition
70 -> 75, 92 -> 99, 116 -> 121
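Range decomposition turns a rectangular query into a short list of contiguous index ranges on the curve, like the three ranges shown above. The sketch below does this for a Z curve in plain Python. It is a naive illustration that enumerates every cell in the box, not how GeoTrellis implements it; `z_index`, `decompose`, and the bit layout (column bits in even positions, row bits in odd positions) are assumptions.

```python
def z_index(col, row, bits=16):
    """Interleave the bits of col and row into a single Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)        # column bit goes to an even position
        z |= ((row >> i) & 1) << (2 * i + 1)    # row bit goes to an odd position
    return z


def decompose(col_min, row_min, col_max, row_max):
    """Return the sorted, merged (start, end) index ranges covering a bounding box."""
    indices = sorted(
        z_index(c, r)
        for c in range(col_min, col_max + 1)
        for r in range(row_min, row_max + 1)
    )
    ranges = []
    for z in indices:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], z)     # extend the current run
        else:
            ranges.append((z, z))               # start a new run
    return ranges


aligned = decompose(2, 2, 3, 3)      # a box aligned with the curve: one range
straddling = decompose(1, 0, 2, 1)   # a box crossing a quadrant boundary: several ranges
```

Each range maps to one scan over a lexicographically sorted index, so fewer and longer ranges mean fewer scans for the same query.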
on
S3 key: layerName/zoom/[SFC Index (Hilbert or Z order)]
S3 value: Avro-encoded Seq[(K, V)] where
K = Key Type (e.g. SpatialKey)
V = Value Type (e.g. Tile)
Hey Flyers Fans, what is the total count of Landsat 8 Scenes on your phones A) per month, B) per country, C) per both?
Why Spark?
Sharding raster data across the cluster
Caching operation results across cluster
HDFS support
Advanced fault tolerance
Advanced task scheduling
Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
Say we have a large set of imagery, and would like to apply two filters to each band:
First, we want to apply a simple threshold filter: if a value is above 10,000, we want to discard it
Second, we would like to apply a 5 x 5 median filter.
Example Problem: Filtering
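On a single band held in memory, the two filters can be sketched as follows. This is plain Python on nested lists rather than GeoTrellis code. Representing a discarded value as `None` and skipping discarded cells inside the median window are assumptions of this sketch.

```python
from statistics import median

NODATA = None  # assumption: discarded cells are marked with None


def threshold(band, limit=10_000):
    """Discard (set to NODATA) every value above the limit."""
    return [
        [v if v is not NODATA and v <= limit else NODATA for v in row]
        for row in band
    ]


def median_filter(band, size=5):
    """Replace each cell with the median of its size x size window.

    Windows are clipped at the band edges and NODATA cells are ignored.
    """
    half = size // 2
    rows, cols = len(band), len(band[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [
                band[y][x]
                for y in range(max(0, i - half), min(rows, i + half + 1))
                for x in range(max(0, j - half), min(cols, j + half + 1))
                if band[y][x] is not NODATA
            ]
            out_row.append(median(window) if window else NODATA)
        out.append(out_row)
    return out


# A 5 x 5 band holding 1..25, with one outlier in the center.
band = [[r * 5 + c + 1 for c in range(5)] for r in range(5)]
band[2][2] = 20_000

thresholded = threshold(band)
filtered = median_filter(thresholded)
```

The threshold is a local operation and can run on each tile independently. The median filter is focal, so on a distributed tile set each tile also needs border pixels from its neighbors.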
[Diagram: a 3 × 3 grid of tiles keyed by (c, r), from (0, 0) to (2, 2), distributed across Node 1, Node 2, and Node 3]
Example Problem: Querying
We want to retrieve all imagery for the city of Rio de Janeiro taken in March 2016, find the maximum NDVI values for each pixel and save it as a GeoTiff.
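The per-pixel part of this problem can be sketched in plain Python once the matching scenes have been retrieved. NDVI is computed as (NIR - Red) / (NIR + Red). The function names, the `(red_band, nir_band)` scene layout, and the use of `None` where the denominator is zero are assumptions, and the spatial and temporal query and the GeoTiff output are left out.

```python
from functools import reduce


def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    return (nir - red) / total if total != 0 else None


def ndvi_tile(red_band, nir_band):
    """Apply NDVI cell by cell to a pair of bands."""
    return [
        [ndvi(r, n) for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red_band, nir_band)
    ]


def local_max(a, b):
    """Per-pixel maximum of two tiles, ignoring missing (None) values."""
    def pick(x, y):
        if x is None:
            return y
        if y is None:
            return x
        return max(x, y)

    return [[pick(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def max_ndvi(scenes):
    """scenes: non-empty list of (red_band, nir_band) pairs covering the same pixels."""
    return reduce(local_max, (ndvi_tile(red, nir) for red, nir in scenes))


scenes = [
    ([[1, 2]], [[3, 2]]),   # scene 1: red band, NIR band
    ([[1, 1]], [[1, 3]]),   # scene 2: red band, NIR band
]
result = max_ndvi(scenes)
```

Because the maximum is associative, the same reduction can be applied tile by tile across a cluster in any grouping.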
What are uses of GeoTrellis?
100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr
400 CPUs / ≈1.5 TB memory
1 master m3.xlarge on-demand instance @ $0.26 / hr
EMR cluster charge, $0.07 / hr
$4.37 / hr
Rendering elevation with hillshade + NLCD on AWS EMR
NED 1/3 arc second + NLCD
GLOBAL CIRCULATION MODELS
Models for predicting world temperature and precipitation.
NASA NEX Downscaled Climate Projections (NEX-DCP30)
• Monthly data over conterminous US
• Historical from 1950 - 2006
• 4 RCP scenarios from 2006 - 2099
• 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
• 15.3 TB in compressed GeoTiff tiles.
• RCP 8.5, max for datatype/model combo: 90.92 GB
Landsat NDVI/NDWI change detection demo
Static vs Dynamic
Serving static data pre-processed through a batch transformation pipeline vs. serving data dynamically processed on demand from unprocessed source data
Static vs Dynamic
GeoTrellis systems tend to have two major components:
A batch pre-processing pipeline, which processes large amounts of data into some static data at rest.
A dynamic pipeline which processes data at the time the user requests it.
[Diagram: “Raw” Data, Processing Pipeline, Served Data]
[Diagrams: “Raw” Data, Application Data, and Served Data in each of the following variants]
Completely dynamic: processing at request time
Completely static: batch data pre-processing
Mix of static and dynamic: batch data pre-processing and processing at request time (Ingest/ETL, Server)
More static: faster to serve, less flexibility
More dynamic: more flexible, slower to serve
Ingesting Landsat data
Landsat images are pulled off of S3 or Google’s public Earth Engine storage.
In a Spark job run on EMR, these images are reprojected, tiled, indexed, and saved off to Accumulo or HDFS.
The indexed tile set is now ready to be used by the server application.
[Diagram: Landsat GeoTiffs on S3; Ingest/ETL; EPSG:3857 tiled imagery in Accumulo; Server; PNGs, JSON]
DEPLOYMENT
Example Deployment
Servicing User Requests
ROAD MAP
Release Schedule
v1.0 Q4 2016
v1.1 Q2 2017
Graduation
Rasters, Vector, VectorTiles, Point Cloud +
ROADMAP v1.1
w/
DOCUMENTATION!
IMPROVED DEPLOYMENT WITH
Integration work
VECTOR TILES
Image: osm2vectortile
POINT CLOUD
MACHINE LEARNING PIPELINES
http://blog.tomnod.com/finding-pools-with-deep-learning