One of the tasks in the Building Inspector is fixing building footprints. The user is presented a map with an overlaid shape (red dots). The purpose is to draw the correct shape (or shapes, since the red overlay may cover multiple building footprints).
Multiple people receive the same map and overlay. This notebook describes a process to find the resulting consensus (or mean) shape.
Below is an example showing the map, the original polygon shown to each user (red dots) and the resulting polygons drawn (yellow).
IRuby.html '<iframe src="http://jsfiddle.net/mgiraldo/pdkCb/3/embedded/result/" width=500 height=400></iframe>'
It is hard to see but there are 11 yellow polygons: one rectangle in the lower left part, one for the upper right part (both wrong), and 9 for the complete L-shaped building.
The process to find the geometry that best summarizes what is drawn by users has to take into account:
The process described in this notebook makes use of the DBSCAN clustering algorithm to find an unknown amount of dense regions of points and determine the resulting geometries from there. The input to this process will be a GeoJSON FeatureCollection containing all the polygons drawn by contributors that are associated to a given red overlay. the expected output is a list of geo point arrays with the summary shapes determined by the algorithm.
All the necessary code is included and should be executable by any machine that has the required Ruby gems installed. This code was tested on Ruby 2.1.0.
First, we need the RGeo package along with its GeoJSON component:
require 'rgeo'
require 'rgeo-geojson'
true
We will use a Ruby implementation of the DBSCAN clustering algorithm.
require 'dbscan'
true
For visualization convenience in this notebook we will also use the awesome Nyaplot, a D3-powered visualization library. I had to manually build it according to the instructions since it is not yet in RubyGems.org.
require 'nyaplot'
true
Initialize Nyaplot to work in this IRuby Notebook:
Nyaplot.init_iruby
This is the GeoJSON that describes the shapes that have been drawn by the different contributors:
Note: this GeoJSON will not validate in GeoJSONLint because first and last points do not match
geomstr = '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"user_id":638},"geometry":{"type":"Polygon","coordinates":[[[-73.98620970547199,40.7356342514617],[-73.98627072572708,40.735547874977094],[-73.98632504045963,40.73557226364293],[-73.98622445762157,40.73570995781772],[-73.9861835539341,40.73569268254945],[-73.98621775209902,40.735640856717666]]]}},{"type":"Feature","properties":{"user_id":666},"geometry":{"type":"Polygon","coordinates":[[[-73.98620769381522,40.73563526765495],[-73.9862660318613,40.735547874977094],[-73.98632504045963,40.735570739351566],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569217445325],[-73.98621775209902,40.73563933242788]]]}},{"type":"Feature","properties":{"session_id":"79e7ee062a9e0333926e3e1fdc3e92db"},"geometry":{"type":"Polygon","coordinates":[[[-73.98632369935513,40.735570739351566],[-73.98622512817383,40.73570944972167],[-73.98618154227734,40.73569014206842],[-73.98621909320354,40.735640856717666],[-73.98620970547199,40.73563526765495],[-73.98627005517483,40.73554889117169]]]}},{"type":"Feature","properties":{"session_id":"3d3003b26bb6b2f3b9577924b9ed5e0e"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621842265129,40.7356423810074],[-73.98620903491974,40.73563577575159],[-73.98627139627934,40.735547874977094],[-73.98632436990738,40.735571755545806],[-73.98622579872608,40.73570995781772],[-73.98618087172508,40.735689633972214]]]}},{"type":"Feature","properties":{"user_id":596},"geometry":{"type":"Polygon","coordinates":[[[-73.98626938462257,40.73554889117167],[-73.98632369935513,40.735572771740024],[-73.98622445762157,40.73570894162559],[-73.98618154227734,40.73569065016463],[-73.98621775209902,40.735640856717666],[-73.98620836436749,40.735634251461676]]]}},{"type":"Feature","properties":{"session_id":"0afaf74383ce51aceba02fc49ce5a9e3"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621775209902,40.73563984052446],[-73.98620836436749,40.73563272717173],[-73.98626938462257,40.735550415463514],[-73.98632235825062,40.73557124744871],[-73.98622360456956,40.73570641325812],[-73.98618768252459,40.73568957578454]]]}},{"type":"Feature","properties":{"user_id":538},"geometry":{"type":"Polygon","coordinates":[[[-73.98632571101189,40.735571755545806],[-73.98622378706932,40.73570995781772],[-73.98618288338184,40.73569268254945],[-73.98621775209902,40.73564034862108],[-73.9862110465765,40.7356362838482],[-73.98627005517483,40.735550923560815]]]}},{"type":"Feature","properties":{"user_id":580},"geometry":{"type":"Polygon","coordinates":[[[-73.98632436990738,40.73557124744871],[-73.98626066744328,40.7356581319994],[-73.98625999689102,40.7356581319994],[-73.98620903491974,40.735634759558316],[-73.98626804351805,40.735547874977094]]]}},{"type":"Feature","properties":{"user_id":580},"geometry":{"type":"Polygon","coordinates":[[[-73.98626133799553,40.7356581319994],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569166635704],[-73.98621842265129,40.73563984052446]]]}},{"type":"Feature","properties":{"user_id":548},"geometry":{"type":"Polygon","coordinates":[[[-73.98620970547199,40.73563475955834],[-73.98627005517483,40.73554990736624],[-73.98632369935513,40.735571755545806],[-73.98622360456956,40.73570641325812],[-73.9861848950386,40.735689633972214],[-73.98621842265129,40.735640856717666]]]}},{"type":"Feature","properties":{"session_id":"53056025663f6d6564a39975971cb87c"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621909320354,40.735638316234656],[-73.98620836436749,40.7356362838482],[-73.98620769381522,40.73563577575159],[-73.98627005517483,40.73554939926897],[-73.98632302880287,40.73557023125444],[-73.98622360456956,40.73570641325812],[-73.98617953062057,40.735689633972214]]]}}]}'
"{\"type\":\"FeatureCollection\",\"features\":[{\"type\":\"Feature\",\"properties\":{\"user_id\":638},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620970547199,40.7356342514617],[-73.98627072572708,40.735547874977094],[-73.98632504045963,40.73557226364293],[-73.98622445762157,40.73570995781772],[-73.9861835539341,40.73569268254945],[-73.98621775209902,40.735640856717666]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":666},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620769381522,40.73563526765495],[-73.9862660318613,40.735547874977094],[-73.98632504045963,40.735570739351566],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569217445325],[-73.98621775209902,40.73563933242788]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"79e7ee062a9e0333926e3e1fdc3e92db\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632369935513,40.735570739351566],[-73.98622512817383,40.73570944972167],[-73.98618154227734,40.73569014206842],[-73.98621909320354,40.735640856717666],[-73.98620970547199,40.73563526765495],[-73.98627005517483,40.73554889117169]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"3d3003b26bb6b2f3b9577924b9ed5e0e\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621842265129,40.7356423810074],[-73.98620903491974,40.73563577575159],[-73.98627139627934,40.735547874977094],[-73.98632436990738,40.735571755545806],[-73.98622579872608,40.73570995781772],[-73.98618087172508,40.735689633972214]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":596},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98626938462257,40.73554889117167],[-73.98632369935513,40.735572771740024],[-73.98622445762157,40.73570894162559],[-73.98618154227734,40.73569065016463],[-73.98621775209902,40.735640856717666],[-73.98620836436749,40.735634251461676]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"0afaf74383ce51aceba02fc49ce5a9e3\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621775209902,40.73563984052446],[-73.98620836436749,40.73563272717173],[-73.98626938462257,40.735550415463514],[-73.98632235825062,40.73557124744871],[-73.98622360456956,40.73570641325812],[-73.98618768252459,40.73568957578454]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":538},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632571101189,40.735571755545806],[-73.98622378706932,40.73570995781772],[-73.98618288338184,40.73569268254945],[-73.98621775209902,40.73564034862108],[-73.9862110465765,40.7356362838482],[-73.98627005517483,40.735550923560815]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":580},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632436990738,40.73557124744871],[-73.98626066744328,40.7356581319994],[-73.98625999689102,40.7356581319994],[-73.98620903491974,40.735634759558316],[-73.98626804351805,40.735547874977094]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":580},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98626133799553,40.7356581319994],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569166635704],[-73.98621842265129,40.73563984052446]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":548},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620970547199,40.73563475955834],[-73.98627005517483,40.73554990736624],[-73.98632369935513,40.735571755545806],[-73.98622360456956,40.73570641325812],[-73.9861848950386,40.735689633972214],[-73.98621842265129,40.735640856717666]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"53056025663f6d6564a39975971cb87c\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621909320354,40.735638316234656],[-73.98620836436749,40.7356362838482],[-73.98620769381522,40.73563577575159],[-73.98627005517483,40.73554939926897],[-73.98632302880287,40.73557023125444],[-73.98622360456956,40.73570641325812],[-73.98617953062057,40.735689633972214]]]}}]}"
We decode the GeoJSON into a RGeo::GeoJSON
structure (see the RGeo::GeoJSON docs):
geocollection = RGeo::GeoJSON.decode(geomstr, :json_parser => :json)
#<RGeo::GeoJSON::FeatureCollection:0x80d363fc>
We wrap this in a function for convenience:
def parse_geojson(json)
RGeo::GeoJSON.decode(json, :json_parser => :json)
end
:parse_geojson
This structure is now a group of features, each with an RGeo::Geos::CAPIPolygonImpl geometry describing each polygon, among other properties (see the RGeo docs):
geocollection.first.geometry
#<RGeo::Geos::CAPIPolygonImpl:0x80d3be74 "POLYGON ((-73.98620970547199 40.7356342514617, -73.98627072572708 40.735547874977094, -73.98632504045963 40.73557226364293, -73.98622445762157 40.73570995781772, -73.9861835539341 40.73569268254945, -73.98621775209902 40.735640856717666, -73.98620970547199 40.7356342514617))">
The main logic behind this process is as follows:
[1] different polygons might also have similar centroids but we're skipping this for now :)
Since DBSCAN works with number arrays, we need to convert the complex RGeo structures. Below a simple centroid-extraction function:
def get_centroid(poly_feature)
return if (poly_feature.geometry.geometry_type.type_name != "Polygon")
c = poly_feature.geometry.centroid
return [c.x, c.y]
end
:get_centroid
Let's test it with the first polygon in the collection:
centroid = get_centroid(geocollection.first)
[-73.98625268168838, 40.73562601945317]
Now we need a convenience function to get all the centroids of the collection. We will make it a hash because we later need to be able to go back to this list to extract its corresponding set of polygons and a hash was the way I found most convenient:
def get_all_centroids(geom)
centroids = {}
geom.each_with_index do |poly,index|
next if (poly.geometry.geometry_type.type_name != "Polygon")
centroids[index] = get_centroid(poly)
end
return centroids
end
:get_all_centroids
Test again:
centroids = get_all_centroids(geocollection)
{0=>[-73.98625268168838, 40.73562601945317], 1=>[-73.98625173238652, 40.735625569382876], 2=>[-73.9862518966646, 40.73562642272427], 3=>[-73.986252242017, 40.735626656082445], 4=>[-73.98625152460835, 40.735626229414], 5=>[-73.98625207318744, 40.73562418649854], 6=>[-73.98625258509149, 40.7356272053874], 7=>[-73.98626592099406, 40.735602617283476], 8=>[-73.9862216645921, 40.73567482334759], 9=>[-73.98625254867669, 40.735624721075084], 10=>[-73.98625077341322, 40.73562552211442]}
A simple plot of all the centroids using Nyaplot:
plot = Nyaplot::Plot.new
plot.width(400)
plot.height(400)
plot.zoom(true)
plot.rotate_x_label(-60)
points_x = centroids.map { |p| p[1][0] }
points_y = centroids.map { |p| p[1][1] }
df = Nyaplot::DataFrame.new({x:points_x,y:points_y})
# add some padding
xmin = points_x.min - 1e-5
xmax = points_x.max + 1e-5
ymin = points_y.min - 1e-5
ymax = points_y.max + 1e-5
plot.xrange([xmin,xmax])
plot.yrange([ymin,ymax])
# end padding
sc = plot.add_with_df(df, :scatter, :x, :y)
plot.show
dists = []
done = {}
centroids.each_with_index do |cc1,i|
centroids.each_with_index do |cc2,j|
c1 = cc1[1]
c2 = cc2[1]
dists.push({:dist=>Math.hypot(c1[0]-c2[0],c1[1]-c2[1]),:from=>i,:to=>j,:from_centroid=>c1,:to_centroid=>c2}) if (c1 != c2 && !done[[c2,c1]])
done[[c1,c2]] = true
end
end
dists = dists.sort_by!{|k| k[:dist]}
dist_df = Nyaplot::DataFrame.new(dists)
dist | from | to | from_centroid | to_centroid |
---|---|---|---|---|
4.1680249477628687e-07 | 2 | 3 | [-73.9862518966646, 40.73562642272427] | [-73.986252242017, 40.735626656082445] |
4.1927880127312373e-07 | 2 | 4 | [-73.9862518966646, 40.73562642272427] | [-73.98625152460835, 40.735626229414] |
6.476388201422145e-07 | 3 | 6 | [-73.986252242017, 40.735626656082445] | [-73.98625258509149, 40.7356272053874] |
6.919630457708901e-07 | 1 | 4 | [-73.98625173238652, 40.735625569382876] | [-73.98625152460835, 40.735626229414] |
7.154453870992346e-07 | 5 | 9 | [-73.98625207318744, 40.73562418649854] | [-73.98625254867669, 40.735624721075084] |
7.736974578659688e-07 | 0 | 3 | [-73.98625268168838, 40.73562601945317] | [-73.986252242017, 40.735626656082445] |
8.346982305084655e-07 | 3 | 4 | [-73.986252242017, 40.735626656082445] | [-73.98625152460835, 40.735626229414] |
8.690102573992017e-07 | 1 | 2 | [-73.98625173238652, 40.735625569382876] | [-73.9862518966646, 40.73562642272427] |
8.825474076951457e-07 | 0 | 2 | [-73.98625268168838, 40.73562601945317] | [-73.9862518966646, 40.73562642272427] |
9.601375433120375e-07 | 1 | 10 | [-73.98625173238652, 40.735625569382876] | [-73.98625077341322, 40.73562552211442] |
1.0317784775226883e-06 | 4 | 10 | [-73.98625152460835, 40.735626229414] | [-73.98625077341322, 40.73562552211442] |
1.0423498270990819e-06 | 2 | 6 | [-73.9862518966646, 40.73562642272427] | [-73.98625258509149, 40.7356272053874] |
1.0505890250970463e-06 | 0 | 1 | [-73.98625268168838, 40.73562601945317] | [-73.98625173238652, 40.735625569382876] |
1.1759752372883875e-06 | 0 | 4 | [-73.98625268168838, 40.73562601945317] | [-73.98625152460835, 40.735626229414] |
1.1772662180684655e-06 | 1 | 9 | [-73.98625173238652, 40.735625569382876] | [-73.98625254867669, 40.735624721075084] |
1.1898617396243328e-06 | 0 | 6 | [-73.98625268168838, 40.73562601945317] | [-73.98625258509149, 40.7356272053874] |
... | ... | ... | ... | ... |
8.468969718424401e-05 | 7 | 8 | [-73.98626592099406, 40.735602617283476] | [-73.9862216645921, 40.73567482334759] |
We can see here how the centroids reflect the three different basic shapes drawn by contributors above: the lone centroids for the upper-right and lower-left rectangles and the group of nine centroids for the L-shaped polygons in the "center".
The problem now is finding a good minimum distance between centroids:
Let's create a table to see just how close/far these centroids are from each other (standard euclidean distance: $\sqrt{((\Delta x)^2+(\Delta y)^2)}$). Notice that, since geographic metric units have a lot of significant digits (numbers to the right of the decimal point), we are dealing with distances smaller than $10^{-6}$:
From the table (which is sorted by closest points first) we can see that the top 9 results are under $10^{-7}$ units away from each other (0.000001).
To understand how clusters are formed, it is useful to understand how the DBSCAN clustering algorithm works:
DBSCAN requires two parameters: ε (eps) and the minimum number of points (min_points) required to form a dense region. It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
By playing around with different sets of polygons I came to a general ε of $1.8(10^{-6})$ and a min_points
of 2 for centroid clusters (polygon vertex clusters have different input values as we will see below).
This is the resulting centroid-clustering function:
def cluster_centroids(centroids, epsilon=1.8e-06, min_points=2)
dbscan = DBSCAN( centroids.map{|c| c[1]}, :epsilon => epsilon, :min_points => min_points, :distance => :euclidean_distance )
return dbscan.results.select{|k,v| k != -1} # omit the non-cluster
end
:cluster_centroids
Let's test it:
centroid_clusters = cluster_centroids(centroids)
{0=>[[-73.98625268168838, 40.73562601945317], [-73.98625173238652, 40.735625569382876], [-73.9862518966646, 40.73562642272427], [-73.986252242017, 40.735626656082445], [-73.98625152460835, 40.735626229414], [-73.98625258509149, 40.7356272053874], [-73.98625254867669, 40.735624721075084], [-73.98625207318744, 40.73562418649854], [-73.98625077341322, 40.73562552211442]]}
The function returns a hash with whose [-1]
key (if any) contains all the points that did not belong to a cluster and [0..n]
contain the different clusters. In this example there is only one cluster, centroid_clusters[0]
and the rejected [-1]
non-cluster.
Let's define a cluster plotting function and plot this (notice the "disappearance" of the two outliers that are being ignored by the function):
def plot_clusters(clusters)
plot = Nyaplot::Plot.new
plot.width(300)
plot.height(400)
plot.zoom(true)
plot.rotate_x_label(-60)
pts = clusters.map{|c| c[1]}.flatten(1)
# add some padding
xmin = pts.map {|p| p[0]}.min - 1e-5
xmax = pts.map {|p| p[0]}.max + 1e-5
ymin = pts.map {|p| p[1]}.min - 1e-5
ymax = pts.map {|p| p[1]}.max + 1e-5
plot.xrange([xmin,xmax])
plot.yrange([ymin,ymax])
# now plot
clusters.each do |cluster|
if cluster[0] != -1 # ignore cluster -1 because not enough points
cluster_x = cluster[1].map { |c| c[0] }
cluster_y = cluster[1].map { |c| c[1] }
names = cluster[1].map { |c| cluster[0] }
df = Nyaplot::DataFrame.new({x:cluster_x,y:cluster_y,cluster:names})
sc = plot.add_with_df(df, :scatter, :x, :y)
sc.tooltip_contents([:cluster])
color = "#"+ "%06x" % (rand * 0xffffff)
sc.color(color)
end
end
plot.show
return plot
end
:plot_clusters
plot = plot_clusters(centroid_clusters)
plot.show
Now we need to:
Below a function that retrieves the polygons for a given centroid cluster based on the structures we have built so far:
# given a list of centroids (lon,lat), find their poly's index in the centroid list (index => lon,lat)
def get_polys_for_centroid_cluster(cluster, centroids, original_polys)
polys = []
cluster.each do |cl|
index = centroids.select {|k,v| v == cl}.keys.first
polys.push(original_polys[index]) if index != -1
end
return polys
end
:get_polys_for_centroid_cluster
Applying this to the only cluster that has useful centroids:
cluster_polygons = get_polys_for_centroid_cluster(centroid_clusters[0], centroids, geocollection)
[#<RGeo::GeoJSON::Feature:0x80d3bde8 id=nil geom="POLYGON ((-73.98620970547199 40.7356342514617, -73.98627072572708 40.735547874977094, -73.98632504045963 40.73557226364293, -73.98622445762157 40.73570995781772, -73.9861835539341 40.73569268254945, -73.98621775209902 40.735640856717666, -73.98620970547199 40.7356342514617))">, #<RGeo::GeoJSON::Feature:0x80d3b550 id=nil geom="POLYGON ((-73.98620769381522 40.73563526765495, -73.9862660318613 40.735547874977094, -73.98632504045963 40.735570739351566, -73.98622579872608 40.73570944972167, -73.98618154227734 40.73569217445325, -73.98621775209902 40.73563933242788, -73.98620769381522 40.73563526765495))">, #<RGeo::GeoJSON::Feature:0x80d3afc4 id=nil geom="POLYGON ((-73.98632369935513 40.735570739351566, -73.98622512817383 40.73570944972167, -73.98618154227734 40.73569014206842, -73.98621909320354 40.735640856717666, -73.98620970547199 40.73563526765495, -73.98627005517483 40.73554889117169, -73.98632369935513 40.735570739351566))">, #<RGeo::GeoJSON::Feature:0x80d3aa9c id=nil geom="POLYGON ((-73.98621842265129 40.7356423810074, -73.98620903491974 40.73563577575159, -73.98627139627934 40.735547874977094, -73.98632436990738 40.735571755545806, -73.98622579872608 40.73570995781772, -73.98618087172508 40.735689633972214, -73.98621842265129 40.7356423810074))">, #<RGeo::GeoJSON::Feature:0x80d3a59c id=nil geom="POLYGON ((-73.98626938462257 40.73554889117167, -73.98632369935513 40.735572771740024, -73.98622445762157 40.73570894162559, -73.98618154227734 40.73569065016463, -73.98621775209902 40.735640856717666, -73.98620836436749 40.735634251461676, -73.98626938462257 40.73554889117167))">, #<RGeo::GeoJSON::Feature:0x80d37dc4 id=nil geom="POLYGON ((-73.98632571101189 40.735571755545806, -73.98622378706932 40.73570995781772, -73.98618288338184 40.73569268254945, -73.98621775209902 40.73564034862108, -73.9862110465765 40.7356362838482, -73.98627005517483 40.735550923560815, -73.98632571101189 40.735571755545806))">, #<RGeo::GeoJSON::Feature:0x80d36e4c id=nil geom="POLYGON ((-73.98620970547199 40.73563475955834, -73.98627005517483 40.73554990736624, -73.98632369935513 40.735571755545806, -73.98622360456956 40.73570641325812, -73.9861848950386 40.735689633972214, -73.98621842265129 40.735640856717666, -73.98620970547199 40.73563475955834))">, #<RGeo::GeoJSON::Feature:0x80d3a0b0 id=nil geom="POLYGON ((-73.98621775209902 40.73563984052446, -73.98620836436749 40.73563272717173, -73.98626938462257 40.735550415463514, -73.98632235825062 40.73557124744871, -73.98622360456956 40.73570641325812, -73.98618768252459 40.73568957578454, -73.98621775209902 40.73563984052446))">, #<RGeo::GeoJSON::Feature:0x80d3644c id=nil geom="POLYGON ((-73.98621909320354 40.735638316234656, -73.98620836436749 40.7356362838482, -73.98620769381522 40.73563577575159, -73.98627005517483 40.73554939926897, -73.98632302880287 40.73557023125444, -73.98622360456956 40.73570641325812, -73.98617953062057 40.735689633972214, -73.98621909320354 40.735638316234656))">]
We need a method to extract the vertices from each polygon (in a DBSCAN-compatible format):
def get_points(poly_feature)
geom = poly_feature.geometry
return false if (geom.geometry_type.type_name != "Polygon")
pts = []
points = geom.exterior_ring.points
points.each do |point|
pts.push([point.x,point.y])
end
return pts
end
:get_points
Now let's plot what we have so far (vertices from the same polygon are the same color):
def plot_polys(polys)
plot = Nyaplot::Plot.new
plot.width(500)
plot.height(500)
plot.zoom(true)
plot.rotate_x_label(-60)
polys.each do |poly|
plot_poly(poly, plot)
end
plot.show
end
def plot_poly(poly, plot = nil)
showplot = false
if plot == nil
showplot = true
plot = Nyaplot::Plot.new
plot.width(500)
plot.height(500)
plot.zoom(true)
plot.rotate_x_label(-60)
end
points = get_points(poly)
points_x = points.map { |p| p[0] }
points_y = points.map { |p| p[1] }
df = Nyaplot::DataFrame.new({x:points_x,y:points_y})
sc = plot.add_with_df(df, :scatter, :x, :y)
color = "#"+ "%06x" % (rand * 0xffffff)
sc.color(color)
plot.show if showplot
end
:plot_poly
plot_polys(cluster_polygons)
Let's cluster these points. Below is a function that extracts the points from a list of polygons:
def get_all_poly_points(polys)
points = []
polys.each do |poly|
points.push(get_points(poly))
end
return points
end
:get_all_poly_points
cluster_poly_points = get_all_poly_points(cluster_polygons)
[[[-73.98620970547199, 40.7356342514617], [-73.98627072572708, 40.735547874977094], [-73.98632504045963, 40.73557226364293], [-73.98622445762157, 40.73570995781772], [-73.9861835539341, 40.73569268254945], [-73.98621775209902, 40.735640856717666], [-73.98620970547199, 40.7356342514617]], [[-73.98620769381522, 40.73563526765495], [-73.9862660318613, 40.735547874977094], [-73.98632504045963, 40.735570739351566], [-73.98622579872608, 40.73570944972167], [-73.98618154227734, 40.73569217445325], [-73.98621775209902, 40.73563933242788], [-73.98620769381522, 40.73563526765495]], [[-73.98632369935513, 40.735570739351566], [-73.98622512817383, 40.73570944972167], [-73.98618154227734, 40.73569014206842], [-73.98621909320354, 40.735640856717666], [-73.98620970547199, 40.73563526765495], [-73.98627005517483, 40.73554889117169], [-73.98632369935513, 40.735570739351566]], [[-73.98621842265129, 40.7356423810074], [-73.98620903491974, 40.73563577575159], [-73.98627139627934, 40.735547874977094], [-73.98632436990738, 40.735571755545806], [-73.98622579872608, 40.73570995781772], [-73.98618087172508, 40.735689633972214], [-73.98621842265129, 40.7356423810074]], [[-73.98626938462257, 40.73554889117167], [-73.98632369935513, 40.735572771740024], [-73.98622445762157, 40.73570894162559], [-73.98618154227734, 40.73569065016463], [-73.98621775209902, 40.735640856717666], [-73.98620836436749, 40.735634251461676], [-73.98626938462257, 40.73554889117167]], [[-73.98632571101189, 40.735571755545806], [-73.98622378706932, 40.73570995781772], [-73.98618288338184, 40.73569268254945], [-73.98621775209902, 40.73564034862108], [-73.9862110465765, 40.7356362838482], [-73.98627005517483, 40.735550923560815], [-73.98632571101189, 40.735571755545806]], [[-73.98620970547199, 40.73563475955834], [-73.98627005517483, 40.73554990736624], [-73.98632369935513, 40.735571755545806], [-73.98622360456956, 40.73570641325812], [-73.9861848950386, 40.735689633972214], [-73.98621842265129, 40.735640856717666], [-73.98620970547199, 40.73563475955834]], [[-73.98621775209902, 40.73563984052446], [-73.98620836436749, 40.73563272717173], [-73.98626938462257, 40.735550415463514], [-73.98632235825062, 40.73557124744871], [-73.98622360456956, 40.73570641325812], [-73.98618768252459, 40.73568957578454], [-73.98621775209902, 40.73563984052446]], [[-73.98621909320354, 40.735638316234656], [-73.98620836436749, 40.7356362838482], [-73.98620769381522, 40.73563577575159], [-73.98627005517483, 40.73554939926897], [-73.98632302880287, 40.73557023125444], [-73.98622360456956, 40.73570641325812], [-73.98617953062057, 40.735689633972214], [-73.98621909320354, 40.735638316234656]]]
The better ε value I found for these points is a bit more complicated. If it is too big, the L-shape will be lost: points in that corner will be clustered together. After fiddling around I found a decent value of of $6(10^{-6})$.
An important aspect to account for here is that the GeoJSON spec requires that the coordinate array has to begin and end with the same point. Therefore this point would be counted twice if we leave the array as-is. Below the resulting clustering function, corresponding test, and plot:
def cluster_points(points, epsilon=6e-06, min_points=2)
dbscan = DBSCAN( points.flatten(1), :epsilon => epsilon, :min_points => min_points, :distance => :euclidean_distance )
return dbscan.results.select{|k,v| k != -1} # omit the non-cluster
end
:cluster_points
# exclude first item in each poly since it is same as last
unique_points = cluster_poly_points.map{|poly| poly[1..-1]}
vertex_clusters = cluster_points(unique_points)
{0=>[[-73.98627072572708, 40.735547874977094], [-73.9862660318613, 40.735547874977094], [-73.98627005517483, 40.73554889117169], [-73.98627139627934, 40.735547874977094], [-73.98626938462257, 40.73554889117167], [-73.98627005517483, 40.735550923560815], [-73.98627005517483, 40.73554990736624], [-73.98626938462257, 40.735550415463514], [-73.98627005517483, 40.73554939926897]], 1=>[[-73.98632504045963, 40.73557226364293], [-73.98632504045963, 40.735570739351566], [-73.98632369935513, 40.735570739351566], [-73.98632436990738, 40.735571755545806], [-73.98632369935513, 40.735572771740024], [-73.98632571101189, 40.735571755545806], [-73.98632369935513, 40.735571755545806], [-73.98632235825062, 40.73557124744871], [-73.98632302880287, 40.73557023125444]], 2=>[[-73.98622445762157, 40.73570995781772], [-73.98622579872608, 40.73570944972167], [-73.98622512817383, 40.73570944972167], [-73.98622579872608, 40.73570995781772], [-73.98622445762157, 40.73570894162559], [-73.98622378706932, 40.73570995781772], [-73.98622360456956, 40.73570641325812], [-73.98622360456956, 40.73570641325812], [-73.98622360456956, 40.73570641325812]], 3=>[[-73.9861835539341, 40.73569268254945], [-73.98618154227734, 40.73569217445325], [-73.98618154227734, 40.73569014206842], [-73.98618087172508, 40.735689633972214], [-73.98618154227734, 40.73569065016463], [-73.98618288338184, 40.73569268254945], [-73.9861848950386, 40.735689633972214], [-73.98618768252459, 40.73568957578454], [-73.98617953062057, 40.735689633972214]], 4=>[[-73.98621775209902, 40.735640856717666], [-73.98621775209902, 40.73563933242788], [-73.98621909320354, 40.735640856717666], [-73.98621842265129, 40.7356423810074], [-73.98621775209902, 40.73564034862108], [-73.98621842265129, 40.735640856717666], [-73.98621775209902, 40.73563984052446], [-73.98621909320354, 40.735638316234656], [-73.98621775209902, 40.735640856717666]], 5=>[[-73.98620970547199, 40.7356342514617], [-73.98620769381522, 40.73563526765495], [-73.98620970547199, 40.73563526765495], [-73.98620903491974, 40.73563577575159], [-73.98620836436749, 40.735634251461676], [-73.9862110465765, 40.7356362838482], [-73.98620970547199, 40.73563475955834], [-73.98620836436749, 40.73563272717173], [-73.98620836436749, 40.7356362838482], [-73.98620769381522, 40.73563577575159]]}
plot = plot_clusters(vertex_clusters)
plot.show
Now we iterate through each vertex cluster and:
For this we need some extra functions in the Array
object to find the mean value:
class Array
def sum
inject(0.0) { |result, el| result + el }
end
def mean
sum / size
end
end
:mean
Now we need a function that receives the vertex clusters and returns the average vertex for each cluster:
def get_mean_poly(clusters)
means = {}
clusters.each do |cluster|
next if cluster[0] == -1 # ignore cluster -1
lon = cluster[1].map {|c| c[0]}.mean
lat = cluster[1].map {|c| c[1]}.mean
means[cluster[0]] = [lon,lat]
end
return means
end
:get_mean_poly
We test this function with our vertex clusters and plot both (mean vertices as yellow diamonds):
mean_poly = get_mean_poly(vertex_clusters)
{0=>[-73.9862696826458, 40.73554911699269], 1=>[-73.98632407188416, 40.73557147326963], 2=>[-73.98622447129412, 40.73570855047738], 3=>[-73.98618267156186, 40.73569075660959], 4=>[-73.98621819913387, 40.73564040507624], 5=>[-73.98620896786451, 40.735635064416286]}
# plot clusters with overlaid (yellow) mean points
plot = plot_clusters(vertex_clusters)
# add means
m_x = mean_poly.map { |m| m[1][0] }
m_y = mean_poly.map { |m| m[1][1] }
sc = plot.add(:scatter, m_x, m_y)
color = "#ffff00"
sc.color(color)
sc.shape('diamond')
plot.show
So far we have a set of points that seem to be the most likely vertices of the mean polygon drawn by our contributors. However, there are many ways in which these points could be connected to each other.
DISCLAIMER:
What follows is a very primitive process that I used to determine the most likely connection between those points. This process is the best I could come up with given my limited math knowledge and time. If you have a better idea of how to do this in Ruby please tweet me at @mgiraldo.
/DISCLAIMER
Before going through with connections we need to validate that we have a reasonable amount of clusters to work with: some vertices may be drawn far away enough for them to not cluster properly and therefore no cluster will be produced. We do this by determining the mean vertices in each polygon ($\bar{m}$) and comparing it with the cluster count ($\sum c$). Right now: $\bar{m}\leq\sum c$ , so we should have at least as many clusters as we have average points per polygon.
Not perfect but works most of the time:
def validate_clusters(clusters, unique_points)
average = (unique_points.flatten.count.to_f / (unique_points.size * 2).to_f).round
return clusters.select{|k,v| k!=-1}.size >= average
end
:validate_clusters
validate_clusters(vertex_clusters, unique_points)
true
Now that this has been verified we proceed to connect.
The general process to connect mean vertices to each other is:
Below all the corresponding functions:
def find_connected_point(point, original_points)
original_points.each do |poly|
poly.each_with_index do |p,index|
return poly[index+1] if point === p
end
end
return
end
:find_connected_point
def find_cluster_for_point(point, clusters)
clusters.each do |cluster|
cluster[1].each do |p|
return cluster[0] if point === p && cluster[0] != -1
end
end
return
end
:find_cluster_for_point
def connect_clusters(clusters, original_points)
connections = {}
# for each cluster
clusters.each do |cluster|
# for each point in cluster
if cluster[0] != -1 # exclude invalid cluster
cluster_votes = {} # to weigh connection popularity (diff pts might be connected to diff clusters)
cluster[1].each do |point|
# find original point connected to it
connection = find_connected_point(point, original_points)
connected_cluster = find_cluster_for_point(connection, clusters)
# if original point belongs to another cluster
if connected_cluster != nil && connected_cluster != cluster[0]
# vote for the cluster
cluster_votes[connected_cluster] = 0 if cluster_votes[connected_cluster] == nil
cluster_votes[connected_cluster] += 1
end
end
connections[cluster[0]] = cluster_votes.sort_by{|k, v| v}
next if connections[cluster[0]].size == 0
connections[cluster[0]] = connections[cluster[0]].reverse[0][0]
end
end
return connections
end
:connect_clusters
connections = connect_clusters(vertex_clusters, cluster_poly_points)
{0=>1, 1=>2, 2=>3, 3=>4, 4=>5, 5=>0}
As can be seen above this is a directed cycle graph and the end result is a clean path from the first vertex to the last one.
The fact that the points are sorted (0 to 1, 1 to 2, 2 to 3, and so on) is somewhat coincidential. Below is a basic function that checks the graph and returns a sorted list of clusters (the order we need to follow to draw the mean polygon):
def sort_connections(connections)
# does some simple check for non-circularity
sorted = []
seen = {}
as_list = connections.select{|k,v| k}
done = false
first = as_list.first[0]
from = first
while !done do
to = connections[from]
done = true if seen[to] || to == nil || to.size == 0
seen[to] = true
from = to
sorted.push(to)
done = true if seen.size == connections.size
end
return nil if seen.size != connections.size
return sorted
end
:sort_connections
# testing sort function
sort_connections(connections)
[1, 2, 3, 4, 5, 0]
Now we can proceed to build our final mean polygon:
def connect_mean_poly(mean_poly, connections)
connected = []
sorted = sort_connections(connections)
return nil if sorted == nil
sorted.each do |c|
connected.push([mean_poly[c][0], mean_poly[c][1]])
end
# for GeoJSON, last == first
first = sorted[0]
connected.push([mean_poly[first][0], mean_poly[first][1]])
return connected
end
:connect_mean_poly
final_polygon = connect_mean_poly(mean_poly, connections)
[[-73.98632407188416, 40.73557147326963], [-73.98622447129412, 40.73570855047738], [-73.98618267156186, 40.73569075660959], [-73.98621819913387, 40.73564040507624], [-73.98620896786451, 40.735635064416286], [-73.9862696826458, 40.73554911699269], [-73.98632407188416, 40.73557147326963]]
Let's see how all this looks like:
plot = plot_clusters(vertex_clusters)
m_x = final_polygon.map { |m| m[0] }
m_y = final_polygon.map { |m| m[1] }
sc = plot.add(:scatter, m_x, m_y)
color = "#ffff00"
sc.color(color)
sc.shape('diamond')
# add the MEAN POLYGON
final_polygon.each_with_index do |c, i|
next if i >= final_polygon.size-1
from = [ final_polygon[i][0], final_polygon[i+1][0] ]
to = [ final_polygon[i][1], final_polygon[i+1][1] ]
plot.add(:line, from, to)
end
plot.show
To wrap it all up we create a single consensus function that receives a GeoJSON string and returns a list of mean polygons:
def calculate_polygonfix_consensus(geojson)
output = []
geom = parse_geojson(geojson)
centroids = get_all_centroids(geom)
centroid_clusters = cluster_centroids(centroids)
centroid_clusters.each do |ccluster|
next if ccluster[0] == -1
cluster = ccluster[1] # only the set of latlons
sub_geom = get_polys_for_centroid_cluster(cluster, centroids, geom)
next if sub_geom.size == 0
original_points = get_all_poly_points(sub_geom)
next if original_points == nil
unique_points = original_points.map{|poly| poly[1..-1]}
vertex_clusters = cluster_points(unique_points)
next if !validate_clusters(vertex_clusters, unique_points)
mean_poly = get_mean_poly(vertex_clusters)
next if mean_poly == {}
connections = connect_clusters(vertex_clusters, original_points)
next if connections == nil || connections == {}
poly = connect_mean_poly(mean_poly, connections)
next if poly == nil || poly.count == 0
output.push(poly)
end
return output
end
:calculate_polygonfix_consensus
consensus = calculate_polygonfix_consensus(geomstr)
[[[-73.98632407188416, 40.73557147326963], [-73.98622447129412, 40.73570855047738], [-73.98618267156186, 40.73569075660959], [-73.98621819913387, 40.73564040507624], [-73.98620896786451, 40.735635064416286], [-73.9862696826458, 40.73554911699269], [-73.98632407188416, 40.73557147326963]]]
The GeoJSON of all this might look something like:
geo_json = {:type => "FeatureCollection", :features => consensus.map { |f| {:type => "Feature", :properties => { :id => 1 }, :geometry => { :type => "Polygon", :coordinates =>[f] } } } }.to_json
puts geo_json
{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"id":1},"geometry":{"type":"Polygon","coordinates":[[[-73.98632407188416,40.73557147326963],[-73.98622447129412,40.73570855047738],[-73.98618267156186,40.73569075660959],[-73.98621819913387,40.73564040507624],[-73.98620896786451,40.735635064416286],[-73.9862696826458,40.73554911699269],[-73.98632407188416,40.73557147326963]]]}}]}
Now let's plots the resulting GeoJSON on the original map (purple):
IRuby.html '<iframe src="http://jsfiddle.net/mgiraldo/m4XeU/1/embedded/result/" width=500 height=400></iframe>'
Voilà! The mean polygon looks good!
This is a first step towards finding geometric consensus from a list of user contributions to a given starting geometry and a map. It is a work in progress and hopefully other ideas can be added to improve this algorithm.
This code is part of NYPL Labs' Building Inspector. Explore and fork the GitHub repository.
This notebook was created by Mauricio Giraldo Arteaga.