Finding shape consensus among multiple geo polygons

One of the tasks in the Building Inspector is fixing building footprints. The user is presented a map with an overlaid shape (red dots). The purpose is to draw the correct shape (or shapes, since the red overlay may cover multiple building footprints).

Multiple people receive the same map and overlay. This notebook describes a process to find the resulting consensus (or mean) shape.

Below is an example showing the map, the original polygon shown to each user (red dots) and the resulting polygons drawn (yellow).

In [1]:
IRuby.html '<iframe src="http://jsfiddle.net/mgiraldo/pdkCb/3/embedded/result/" width=500 height=400></iframe>'
Out[1]:

It is hard to see but there are 11 yellow polygons: one rectangle in the lower left part, one for the upper right part (both wrong), and 9 for the complete L-shaped building.

Requirements

The process to find the geometry that best summarizes what is drawn by users has to take into account:

  1. an overlay may span multiple polygons (red dots covering more than one building)
  2. polygons may have any number of vertices greater or equal to three
  3. users will not always draw the polygons the same way (eg: use more or fewer points)

The process described in this notebook makes use of the DBSCAN clustering algorithm to find an unknown amount of dense regions of points and determine the resulting geometries from there. The input to this process will be a GeoJSON FeatureCollection containing all the polygons drawn by contributors that are associated to a given red overlay. the expected output is a list of geo point arrays with the summary shapes determined by the algorithm.

All the necessary code is included and should be executable by any machine that has the required Ruby gems installed. This code was tested on Ruby 2.1.0.

Process

First, we need the RGeo package along with its GeoJSON component:

In [2]:
require 'rgeo'
require 'rgeo-geojson'
Out[2]:
true
In [3]:
require 'dbscan'
Out[3]:
true

For visualization convenience in this notebook we will also use the awesome Nyaplot, a D3-powered visualization library. I had to manually build it according to the instructions since it is not yet in RubyGems.org.

In [4]:
require 'nyaplot'
Out[4]:
true

Initialize Nyaplot to work in this IRuby Notebook:

In [5]:
Nyaplot.init_iruby
Out[5]:

This is the GeoJSON that describes the shapes that have been drawn by the different contributors:

Note: this GeoJSON will not validate in GeoJSONLint because first and last points do not match

In [6]:
geomstr = '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"user_id":638},"geometry":{"type":"Polygon","coordinates":[[[-73.98620970547199,40.7356342514617],[-73.98627072572708,40.735547874977094],[-73.98632504045963,40.73557226364293],[-73.98622445762157,40.73570995781772],[-73.9861835539341,40.73569268254945],[-73.98621775209902,40.735640856717666]]]}},{"type":"Feature","properties":{"user_id":666},"geometry":{"type":"Polygon","coordinates":[[[-73.98620769381522,40.73563526765495],[-73.9862660318613,40.735547874977094],[-73.98632504045963,40.735570739351566],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569217445325],[-73.98621775209902,40.73563933242788]]]}},{"type":"Feature","properties":{"session_id":"79e7ee062a9e0333926e3e1fdc3e92db"},"geometry":{"type":"Polygon","coordinates":[[[-73.98632369935513,40.735570739351566],[-73.98622512817383,40.73570944972167],[-73.98618154227734,40.73569014206842],[-73.98621909320354,40.735640856717666],[-73.98620970547199,40.73563526765495],[-73.98627005517483,40.73554889117169]]]}},{"type":"Feature","properties":{"session_id":"3d3003b26bb6b2f3b9577924b9ed5e0e"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621842265129,40.7356423810074],[-73.98620903491974,40.73563577575159],[-73.98627139627934,40.735547874977094],[-73.98632436990738,40.735571755545806],[-73.98622579872608,40.73570995781772],[-73.98618087172508,40.735689633972214]]]}},{"type":"Feature","properties":{"user_id":596},"geometry":{"type":"Polygon","coordinates":[[[-73.98626938462257,40.73554889117167],[-73.98632369935513,40.735572771740024],[-73.98622445762157,40.73570894162559],[-73.98618154227734,40.73569065016463],[-73.98621775209902,40.735640856717666],[-73.98620836436749,40.735634251461676]]]}},{"type":"Feature","properties":{"session_id":"0afaf74383ce51aceba02fc49ce5a9e3"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621775209902,40.73563984052446],[-73.98620836436749,40.73563272717173],[-73.98626938462257,40.735550415463514],[-73.98632235825062,40.73557124744871],[-73.98622360456956,40.73570641325812],[-73.98618768252459,40.73568957578454]]]}},{"type":"Feature","properties":{"user_id":538},"geometry":{"type":"Polygon","coordinates":[[[-73.98632571101189,40.735571755545806],[-73.98622378706932,40.73570995781772],[-73.98618288338184,40.73569268254945],[-73.98621775209902,40.73564034862108],[-73.9862110465765,40.7356362838482],[-73.98627005517483,40.735550923560815]]]}},{"type":"Feature","properties":{"user_id":580},"geometry":{"type":"Polygon","coordinates":[[[-73.98632436990738,40.73557124744871],[-73.98626066744328,40.7356581319994],[-73.98625999689102,40.7356581319994],[-73.98620903491974,40.735634759558316],[-73.98626804351805,40.735547874977094]]]}},{"type":"Feature","properties":{"user_id":580},"geometry":{"type":"Polygon","coordinates":[[[-73.98626133799553,40.7356581319994],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569166635704],[-73.98621842265129,40.73563984052446]]]}},{"type":"Feature","properties":{"user_id":548},"geometry":{"type":"Polygon","coordinates":[[[-73.98620970547199,40.73563475955834],[-73.98627005517483,40.73554990736624],[-73.98632369935513,40.735571755545806],[-73.98622360456956,40.73570641325812],[-73.9861848950386,40.735689633972214],[-73.98621842265129,40.735640856717666]]]}},{"type":"Feature","properties":{"session_id":"53056025663f6d6564a39975971cb87c"},"geometry":{"type":"Polygon","coordinates":[[[-73.98621909320354,40.735638316234656],[-73.98620836436749,40.7356362838482],[-73.98620769381522,40.73563577575159],[-73.98627005517483,40.73554939926897],[-73.98632302880287,40.73557023125444],[-73.98622360456956,40.73570641325812],[-73.98617953062057,40.735689633972214]]]}}]}'
Out[6]:
"{\"type\":\"FeatureCollection\",\"features\":[{\"type\":\"Feature\",\"properties\":{\"user_id\":638},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620970547199,40.7356342514617],[-73.98627072572708,40.735547874977094],[-73.98632504045963,40.73557226364293],[-73.98622445762157,40.73570995781772],[-73.9861835539341,40.73569268254945],[-73.98621775209902,40.735640856717666]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":666},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620769381522,40.73563526765495],[-73.9862660318613,40.735547874977094],[-73.98632504045963,40.735570739351566],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569217445325],[-73.98621775209902,40.73563933242788]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"79e7ee062a9e0333926e3e1fdc3e92db\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632369935513,40.735570739351566],[-73.98622512817383,40.73570944972167],[-73.98618154227734,40.73569014206842],[-73.98621909320354,40.735640856717666],[-73.98620970547199,40.73563526765495],[-73.98627005517483,40.73554889117169]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"3d3003b26bb6b2f3b9577924b9ed5e0e\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621842265129,40.7356423810074],[-73.98620903491974,40.73563577575159],[-73.98627139627934,40.735547874977094],[-73.98632436990738,40.735571755545806],[-73.98622579872608,40.73570995781772],[-73.98618087172508,40.735689633972214]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":596},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98626938462257,40.73554889117167],[-73.98632369935513,40.735572771740024],[-73.98622445762157,40.73570894162559],[-73.98618154227734,40.73569065016463],[-73.98621775209902,40.735640856717666],[-73.98620836436749,40.735634251461676]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"0afaf74383ce51aceba02fc49ce5a9e3\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621775209902,40.73563984052446],[-73.98620836436749,40.73563272717173],[-73.98626938462257,40.735550415463514],[-73.98632235825062,40.73557124744871],[-73.98622360456956,40.73570641325812],[-73.98618768252459,40.73568957578454]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":538},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632571101189,40.735571755545806],[-73.98622378706932,40.73570995781772],[-73.98618288338184,40.73569268254945],[-73.98621775209902,40.73564034862108],[-73.9862110465765,40.7356362838482],[-73.98627005517483,40.735550923560815]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":580},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98632436990738,40.73557124744871],[-73.98626066744328,40.7356581319994],[-73.98625999689102,40.7356581319994],[-73.98620903491974,40.735634759558316],[-73.98626804351805,40.735547874977094]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":580},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98626133799553,40.7356581319994],[-73.98622579872608,40.73570944972167],[-73.98618154227734,40.73569166635704],[-73.98621842265129,40.73563984052446]]]}},{\"type\":\"Feature\",\"properties\":{\"user_id\":548},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98620970547199,40.73563475955834],[-73.98627005517483,40.73554990736624],[-73.98632369935513,40.735571755545806],[-73.98622360456956,40.73570641325812],[-73.9861848950386,40.735689633972214],[-73.98621842265129,40.735640856717666]]]}},{\"type\":\"Feature\",\"properties\":{\"session_id\":\"53056025663f6d6564a39975971cb87c\"},\"geometry\":{\"type\":\"Polygon\",\"coordinates\":[[[-73.98621909320354,40.735638316234656],[-73.98620836436749,40.7356362838482],[-73.98620769381522,40.73563577575159],[-73.98627005517483,40.73554939926897],[-73.98632302880287,40.73557023125444],[-73.98622360456956,40.73570641325812],[-73.98617953062057,40.735689633972214]]]}}]}"

We decode the GeoJSON into a RGeo::GeoJSON structure (see the RGeo::GeoJSON docs):

In [7]:
geocollection = RGeo::GeoJSON.decode(geomstr, :json_parser => :json)
Out[7]:
#<RGeo::GeoJSON::FeatureCollection:0x80d363fc>

We wrap this in a function for convenience:

In [8]:
def parse_geojson(json)
  RGeo::GeoJSON.decode(json, :json_parser => :json)
end
Out[8]:
:parse_geojson

This structure is now a group of features, each with an RGeo::Geos::CAPIPolygonImpl geometry describing each polygon, among other properties (see the RGeo docs):

In [9]:
geocollection.first.geometry
Out[9]:
#<RGeo::Geos::CAPIPolygonImpl:0x80d3be74 "POLYGON ((-73.98620970547199 40.7356342514617, -73.98627072572708 40.735547874977094, -73.98632504045963 40.73557226364293, -73.98622445762157 40.73570995781772, -73.9861835539341 40.73569268254945, -73.98621775209902 40.735640856717666, -73.98620970547199 40.7356342514617))">

Algorithm

The main logic behind this process is as follows:

  1. cluster all the polygons by their centroids (similar-shaped polygons should have similar centroids[1], clustering will let us identify outliers)
  2. only use clusters that have three or more centroids (three or more people drew similar-shaped polygons)
  3. for each cluster:
    1. cluster the vertices of its polygons
    2. find the mean vertex describing each cluster
    3. connect those mean vertices in the most likely order
    4. verify that the connected polygon makes sense (will explain better below)

[1] different polygons might also have similar centroids but we're skipping this for now :)

Since DBSCAN works with number arrays, we need to convert the complex RGeo structures. Below a simple centroid-extraction function:

In [10]:
def get_centroid(poly_feature)
  return if (poly_feature.geometry.geometry_type.type_name != "Polygon")
  c = poly_feature.geometry.centroid
  return [c.x, c.y]
end
Out[10]:
:get_centroid

Let's test it with the first polygon in the collection:

In [11]:
centroid = get_centroid(geocollection.first)
Out[11]:
[-73.98625268168838, 40.73562601945317]

Now we need a convenience function to get all the centroids of the collection. We will make it a hash because we later need to be able to go back to this list to extract its corresponding set of polygons and a hash was the way I found most convenient:

In [12]:
def get_all_centroids(geom)
  centroids = {}
  geom.each_with_index do |poly,index|
    next if (poly.geometry.geometry_type.type_name != "Polygon")
    centroids[index] = get_centroid(poly)
  end
  return centroids
end
Out[12]:
:get_all_centroids

Test again:

In [13]:
centroids = get_all_centroids(geocollection)
Out[13]:
{0=>[-73.98625268168838, 40.73562601945317], 1=>[-73.98625173238652, 40.735625569382876], 2=>[-73.9862518966646, 40.73562642272427], 3=>[-73.986252242017, 40.735626656082445], 4=>[-73.98625152460835, 40.735626229414], 5=>[-73.98625207318744, 40.73562418649854], 6=>[-73.98625258509149, 40.7356272053874], 7=>[-73.98626592099406, 40.735602617283476], 8=>[-73.9862216645921, 40.73567482334759], 9=>[-73.98625254867669, 40.735624721075084], 10=>[-73.98625077341322, 40.73562552211442]}

A simple plot of all the centroids using Nyaplot:

In [14]:
plot = Nyaplot::Plot.new
plot.width(400)
plot.height(400)
plot.zoom(true)
plot.rotate_x_label(-60)
points_x = centroids.map { |p| p[1][0] }
points_y = centroids.map { |p| p[1][1] }
df = Nyaplot::DataFrame.new({x:points_x,y:points_y})
# add some padding
xmin = points_x.min - 1e-5
xmax = points_x.max + 1e-5
ymin = points_y.min - 1e-5
ymax = points_y.max + 1e-5
plot.xrange([xmin,xmax])
plot.yrange([ymin,ymax])
# end padding
sc = plot.add_with_df(df, :scatter, :x, :y)
plot.show
Out[14]:
In [15]:
dists = []
done = {}
centroids.each_with_index do |cc1,i|
  centroids.each_with_index do |cc2,j|
    c1 = cc1[1]
    c2 = cc2[1]
    dists.push({:dist=>Math.hypot(c1[0]-c2[0],c1[1]-c2[1]),:from=>i,:to=>j,:from_centroid=>c1,:to_centroid=>c2}) if (c1 != c2 && !done[[c2,c1]]) 
    done[[c1,c2]] = true
  end
end
dists = dists.sort_by!{|k| k[:dist]}
dist_df = Nyaplot::DataFrame.new(dists)
Out[15]:
distfromtofrom_centroidto_centroid
4.1680249477628687e-0723[-73.9862518966646, 40.73562642272427][-73.986252242017, 40.735626656082445]
4.1927880127312373e-0724[-73.9862518966646, 40.73562642272427][-73.98625152460835, 40.735626229414]
6.476388201422145e-0736[-73.986252242017, 40.735626656082445][-73.98625258509149, 40.7356272053874]
6.919630457708901e-0714[-73.98625173238652, 40.735625569382876][-73.98625152460835, 40.735626229414]
7.154453870992346e-0759[-73.98625207318744, 40.73562418649854][-73.98625254867669, 40.735624721075084]
7.736974578659688e-0703[-73.98625268168838, 40.73562601945317][-73.986252242017, 40.735626656082445]
8.346982305084655e-0734[-73.986252242017, 40.735626656082445][-73.98625152460835, 40.735626229414]
8.690102573992017e-0712[-73.98625173238652, 40.735625569382876][-73.9862518966646, 40.73562642272427]
8.825474076951457e-0702[-73.98625268168838, 40.73562601945317][-73.9862518966646, 40.73562642272427]
9.601375433120375e-07110[-73.98625173238652, 40.735625569382876][-73.98625077341322, 40.73562552211442]
1.0317784775226883e-06410[-73.98625152460835, 40.735626229414][-73.98625077341322, 40.73562552211442]
1.0423498270990819e-0626[-73.9862518966646, 40.73562642272427][-73.98625258509149, 40.7356272053874]
1.0505890250970463e-0601[-73.98625268168838, 40.73562601945317][-73.98625173238652, 40.735625569382876]
1.1759752372883875e-0604[-73.98625268168838, 40.73562601945317][-73.98625152460835, 40.735626229414]
1.1772662180684655e-0619[-73.98625173238652, 40.735625569382876][-73.98625254867669, 40.735624721075084]
1.1898617396243328e-0606[-73.98625268168838, 40.73562601945317][-73.98625258509149, 40.7356272053874]
...............
8.468969718424401e-0578[-73.98626592099406, 40.735602617283476][-73.9862216645921, 40.73567482334759]

1. Clustering centroids

We can see here how the centroids reflect the three different basic shapes drawn by contributors above: the lone centroids for the upper-right and lower-left rectangles and the group of nine centroids for the L-shaped polygons in the "center".

The problem now is finding a good minimum distance between centroids:

  • big enough to cover nearby centroids but also
  • small enough to not group polygons that don't belong with each other

Let's create a table to see just how close/far these centroids are from each other (standard euclidean distance: $\sqrt{((\Delta x)^2+(\Delta y)^2)}$). Notice that, since geographic metric units have a lot of significant digits (numbers to the right of the decimal point), we are dealing with distances smaller than $10^{-6}$:

From the table (which is sorted by closest points first) we can see that the top 9 results are under $10^{-7}$ units away from each other (0.000001).

The DBSCAN algorithm

To understand how clusters are formed, it is useful to understand how the DBSCAN clustering algorithm works:

DBSCAN requires two parameters: ε (eps) and the minimum number of points (min_points) required to form a dense region. It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster.

If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

By playing around with different sets of polygons I came to a general ε of $1.8(10^{-6})$ and a min_points of 2 for centroid clusters (polygon vertex clusters have different input values as we will see below).

This is the resulting centroid-clustering function:

In [16]:
def cluster_centroids(centroids, epsilon=1.8e-06, min_points=2)
  dbscan = DBSCAN( centroids.map{|c| c[1]}, :epsilon => epsilon, :min_points => min_points, :distance => :euclidean_distance )
  return dbscan.results.select{|k,v| k != -1} # omit the non-cluster
end
Out[16]:
:cluster_centroids

Let's test it:

In [17]:
centroid_clusters = cluster_centroids(centroids)
Out[17]:
{0=>[[-73.98625268168838, 40.73562601945317], [-73.98625173238652, 40.735625569382876], [-73.9862518966646, 40.73562642272427], [-73.986252242017, 40.735626656082445], [-73.98625152460835, 40.735626229414], [-73.98625258509149, 40.7356272053874], [-73.98625254867669, 40.735624721075084], [-73.98625207318744, 40.73562418649854], [-73.98625077341322, 40.73562552211442]]}

The function returns a hash with whose [-1] key (if any) contains all the points that did not belong to a cluster and [0..n] contain the different clusters. In this example there is only one cluster, centroid_clusters[0] and the rejected [-1] non-cluster.

Let's define a cluster plotting function and plot this (notice the "disappearance" of the two outliers that are being ignored by the function):

In [18]:
def plot_clusters(clusters)
  plot = Nyaplot::Plot.new
  plot.width(300)
  plot.height(400)
  plot.zoom(true)
  plot.rotate_x_label(-60)
  pts = clusters.map{|c| c[1]}.flatten(1)
  # add some padding
  xmin = pts.map {|p| p[0]}.min - 1e-5
  xmax = pts.map {|p| p[0]}.max + 1e-5
  ymin = pts.map {|p| p[1]}.min - 1e-5
  ymax = pts.map {|p| p[1]}.max + 1e-5
  plot.xrange([xmin,xmax])
  plot.yrange([ymin,ymax])
  # now plot
  clusters.each do |cluster|
    if cluster[0] != -1 # ignore cluster -1 because not enough points
      cluster_x = cluster[1].map { |c| c[0] }
      cluster_y = cluster[1].map { |c| c[1] }
      names = cluster[1].map { |c| cluster[0] }
      df = Nyaplot::DataFrame.new({x:cluster_x,y:cluster_y,cluster:names})
      sc = plot.add_with_df(df, :scatter, :x, :y)
      sc.tooltip_contents([:cluster])
      color = "#"+ "%06x" % (rand * 0xffffff)
      sc.color(color)
    end
  end
  plot.show
  return plot
end
Out[18]:
:plot_clusters
In [19]:
plot = plot_clusters(centroid_clusters)
plot.show
Out[19]:

2. Clustering vertices

Now we need to:

  1. work backwards from the centroid clusters that have three or more centroids (only one in this case)
  2. find the polygons they belong to and, finally,
  3. find their vertices and cluster them

Below a function that retrieves the polygons for a given centroid cluster based on the structures we have built so far:

In [20]:
# given a list of centroids (lon,lat), find their poly's index in the centroid list (index => lon,lat)
def get_polys_for_centroid_cluster(cluster, centroids, original_polys)
  polys = []
  cluster.each do |cl|
    index = centroids.select {|k,v| v == cl}.keys.first
    polys.push(original_polys[index]) if index != -1
  end
  return polys
end
Out[20]:
:get_polys_for_centroid_cluster

Applying this to the only cluster that has useful centroids:

In [21]:
cluster_polygons = get_polys_for_centroid_cluster(centroid_clusters[0], centroids, geocollection)
Out[21]:
[#<RGeo::GeoJSON::Feature:0x80d3bde8 id=nil geom="POLYGON ((-73.98620970547199 40.7356342514617, -73.98627072572708 40.735547874977094, -73.98632504045963 40.73557226364293, -73.98622445762157 40.73570995781772, -73.9861835539341 40.73569268254945, -73.98621775209902 40.735640856717666, -73.98620970547199 40.7356342514617))">, #<RGeo::GeoJSON::Feature:0x80d3b550 id=nil geom="POLYGON ((-73.98620769381522 40.73563526765495, -73.9862660318613 40.735547874977094, -73.98632504045963 40.735570739351566, -73.98622579872608 40.73570944972167, -73.98618154227734 40.73569217445325, -73.98621775209902 40.73563933242788, -73.98620769381522 40.73563526765495))">, #<RGeo::GeoJSON::Feature:0x80d3afc4 id=nil geom="POLYGON ((-73.98632369935513 40.735570739351566, -73.98622512817383 40.73570944972167, -73.98618154227734 40.73569014206842, -73.98621909320354 40.735640856717666, -73.98620970547199 40.73563526765495, -73.98627005517483 40.73554889117169, -73.98632369935513 40.735570739351566))">, #<RGeo::GeoJSON::Feature:0x80d3aa9c id=nil geom="POLYGON ((-73.98621842265129 40.7356423810074, -73.98620903491974 40.73563577575159, -73.98627139627934 40.735547874977094, -73.98632436990738 40.735571755545806, -73.98622579872608 40.73570995781772, -73.98618087172508 40.735689633972214, -73.98621842265129 40.7356423810074))">, #<RGeo::GeoJSON::Feature:0x80d3a59c id=nil geom="POLYGON ((-73.98626938462257 40.73554889117167, -73.98632369935513 40.735572771740024, -73.98622445762157 40.73570894162559, -73.98618154227734 40.73569065016463, -73.98621775209902 40.735640856717666, -73.98620836436749 40.735634251461676, -73.98626938462257 40.73554889117167))">, #<RGeo::GeoJSON::Feature:0x80d37dc4 id=nil geom="POLYGON ((-73.98632571101189 40.735571755545806, -73.98622378706932 40.73570995781772, -73.98618288338184 40.73569268254945, -73.98621775209902 40.73564034862108, -73.9862110465765 40.7356362838482, -73.98627005517483 40.735550923560815, -73.98632571101189 40.735571755545806))">, #<RGeo::GeoJSON::Feature:0x80d36e4c id=nil geom="POLYGON ((-73.98620970547199 40.73563475955834, -73.98627005517483 40.73554990736624, -73.98632369935513 40.735571755545806, -73.98622360456956 40.73570641325812, -73.9861848950386 40.735689633972214, -73.98621842265129 40.735640856717666, -73.98620970547199 40.73563475955834))">, #<RGeo::GeoJSON::Feature:0x80d3a0b0 id=nil geom="POLYGON ((-73.98621775209902 40.73563984052446, -73.98620836436749 40.73563272717173, -73.98626938462257 40.735550415463514, -73.98632235825062 40.73557124744871, -73.98622360456956 40.73570641325812, -73.98618768252459 40.73568957578454, -73.98621775209902 40.73563984052446))">, #<RGeo::GeoJSON::Feature:0x80d3644c id=nil geom="POLYGON ((-73.98621909320354 40.735638316234656, -73.98620836436749 40.7356362838482, -73.98620769381522 40.73563577575159, -73.98627005517483 40.73554939926897, -73.98632302880287 40.73557023125444, -73.98622360456956 40.73570641325812, -73.98617953062057 40.735689633972214, -73.98621909320354 40.735638316234656))">]

We need a method to extract the vertices from each polygon (in a DBSCAN-compatible format):

In [22]:
def get_points(poly_feature)
  geom = poly_feature.geometry
  return false if (geom.geometry_type.type_name != "Polygon")
  pts = []
  points = geom.exterior_ring.points
  points.each do |point|
    pts.push([point.x,point.y])
  end
  return pts
end
Out[22]:
:get_points

Now let's plot what we have so far (vertices from the same polygon are the same color):

In [23]:
def plot_polys(polys)
  plot = Nyaplot::Plot.new
  plot.width(500)
  plot.height(500)
  plot.zoom(true)
  plot.rotate_x_label(-60)
  polys.each do |poly|
    plot_poly(poly, plot)
  end
  plot.show
end
def plot_poly(poly, plot = nil)
  showplot = false
  if plot == nil
    showplot = true
    plot = Nyaplot::Plot.new
    plot.width(500)
    plot.height(500)
    plot.zoom(true)
    plot.rotate_x_label(-60)
  end
  points = get_points(poly)
  points_x = points.map { |p| p[0] }
  points_y = points.map { |p| p[1] }
  df = Nyaplot::DataFrame.new({x:points_x,y:points_y})
  sc = plot.add_with_df(df, :scatter, :x, :y)
  color = "#"+ "%06x" % (rand * 0xffffff)
  sc.color(color)
  plot.show if showplot
end
Out[23]:
:plot_poly
In [24]:
plot_polys(cluster_polygons)
Out[24]:

Let's cluster these points. Below is a function that extracts the points from a list of polygons:

In [25]:
def get_all_poly_points(polys)
  points = []
  polys.each do |poly|
    points.push(get_points(poly))
  end
  return points
end
Out[25]:
:get_all_poly_points
In [26]:
cluster_poly_points = get_all_poly_points(cluster_polygons)
Out[26]:
[[[-73.98620970547199, 40.7356342514617], [-73.98627072572708, 40.735547874977094], [-73.98632504045963, 40.73557226364293], [-73.98622445762157, 40.73570995781772], [-73.9861835539341, 40.73569268254945], [-73.98621775209902, 40.735640856717666], [-73.98620970547199, 40.7356342514617]], [[-73.98620769381522, 40.73563526765495], [-73.9862660318613, 40.735547874977094], [-73.98632504045963, 40.735570739351566], [-73.98622579872608, 40.73570944972167], [-73.98618154227734, 40.73569217445325], [-73.98621775209902, 40.73563933242788], [-73.98620769381522, 40.73563526765495]], [[-73.98632369935513, 40.735570739351566], [-73.98622512817383, 40.73570944972167], [-73.98618154227734, 40.73569014206842], [-73.98621909320354, 40.735640856717666], [-73.98620970547199, 40.73563526765495], [-73.98627005517483, 40.73554889117169], [-73.98632369935513, 40.735570739351566]], [[-73.98621842265129, 40.7356423810074], [-73.98620903491974, 40.73563577575159], [-73.98627139627934, 40.735547874977094], [-73.98632436990738, 40.735571755545806], [-73.98622579872608, 40.73570995781772], [-73.98618087172508, 40.735689633972214], [-73.98621842265129, 40.7356423810074]], [[-73.98626938462257, 40.73554889117167], [-73.98632369935513, 40.735572771740024], [-73.98622445762157, 40.73570894162559], [-73.98618154227734, 40.73569065016463], [-73.98621775209902, 40.735640856717666], [-73.98620836436749, 40.735634251461676], [-73.98626938462257, 40.73554889117167]], [[-73.98632571101189, 40.735571755545806], [-73.98622378706932, 40.73570995781772], [-73.98618288338184, 40.73569268254945], [-73.98621775209902, 40.73564034862108], [-73.9862110465765, 40.7356362838482], [-73.98627005517483, 40.735550923560815], [-73.98632571101189, 40.735571755545806]], [[-73.98620970547199, 40.73563475955834], [-73.98627005517483, 40.73554990736624], [-73.98632369935513, 40.735571755545806], [-73.98622360456956, 40.73570641325812], [-73.9861848950386, 40.735689633972214], [-73.98621842265129, 40.735640856717666], [-73.98620970547199, 40.73563475955834]], [[-73.98621775209902, 40.73563984052446], [-73.98620836436749, 40.73563272717173], [-73.98626938462257, 40.735550415463514], [-73.98632235825062, 40.73557124744871], [-73.98622360456956, 40.73570641325812], [-73.98618768252459, 40.73568957578454], [-73.98621775209902, 40.73563984052446]], [[-73.98621909320354, 40.735638316234656], [-73.98620836436749, 40.7356362838482], [-73.98620769381522, 40.73563577575159], [-73.98627005517483, 40.73554939926897], [-73.98632302880287, 40.73557023125444], [-73.98622360456956, 40.73570641325812], [-73.98617953062057, 40.735689633972214], [-73.98621909320354, 40.735638316234656]]]

The better ε value I found for these points is a bit more complicated. If it is too big, the L-shape will be lost: points in that corner will be clustered together. After fiddling around I found a decent value of of $6(10^{-6})$.

An important aspect to account for here is that the GeoJSON spec requires that the coordinate array has to begin and end with the same point. Therefore this point would be counted twice if we leave the array as-is. Below the resulting clustering function, corresponding test, and plot:

In [27]:
def cluster_points(points, epsilon=6e-06, min_points=2)
  dbscan = DBSCAN( points.flatten(1), :epsilon => epsilon, :min_points => min_points, :distance => :euclidean_distance )
  return dbscan.results.select{|k,v| k != -1} # omit the non-cluster
end
Out[27]:
:cluster_points
In [28]:
# exclude first item in each poly since it is same as last
unique_points = cluster_poly_points.map{|poly| poly[1..-1]}
vertex_clusters = cluster_points(unique_points)
Out[28]:
{0=>[[-73.98627072572708, 40.735547874977094], [-73.9862660318613, 40.735547874977094], [-73.98627005517483, 40.73554889117169], [-73.98627139627934, 40.735547874977094], [-73.98626938462257, 40.73554889117167], [-73.98627005517483, 40.735550923560815], [-73.98627005517483, 40.73554990736624], [-73.98626938462257, 40.735550415463514], [-73.98627005517483, 40.73554939926897]], 1=>[[-73.98632504045963, 40.73557226364293], [-73.98632504045963, 40.735570739351566], [-73.98632369935513, 40.735570739351566], [-73.98632436990738, 40.735571755545806], [-73.98632369935513, 40.735572771740024], [-73.98632571101189, 40.735571755545806], [-73.98632369935513, 40.735571755545806], [-73.98632235825062, 40.73557124744871], [-73.98632302880287, 40.73557023125444]], 2=>[[-73.98622445762157, 40.73570995781772], [-73.98622579872608, 40.73570944972167], [-73.98622512817383, 40.73570944972167], [-73.98622579872608, 40.73570995781772], [-73.98622445762157, 40.73570894162559], [-73.98622378706932, 40.73570995781772], [-73.98622360456956, 40.73570641325812], [-73.98622360456956, 40.73570641325812], [-73.98622360456956, 40.73570641325812]], 3=>[[-73.9861835539341, 40.73569268254945], [-73.98618154227734, 40.73569217445325], [-73.98618154227734, 40.73569014206842], [-73.98618087172508, 40.735689633972214], [-73.98618154227734, 40.73569065016463], [-73.98618288338184, 40.73569268254945], [-73.9861848950386, 40.735689633972214], [-73.98618768252459, 40.73568957578454], [-73.98617953062057, 40.735689633972214]], 4=>[[-73.98621775209902, 40.735640856717666], [-73.98621775209902, 40.73563933242788], [-73.98621909320354, 40.735640856717666], [-73.98621842265129, 40.7356423810074], [-73.98621775209902, 40.73564034862108], [-73.98621842265129, 40.735640856717666], [-73.98621775209902, 40.73563984052446], [-73.98621909320354, 40.735638316234656], [-73.98621775209902, 40.735640856717666]], 5=>[[-73.98620970547199, 40.7356342514617], [-73.98620769381522, 40.73563526765495], [-73.98620970547199, 40.73563526765495], [-73.98620903491974, 40.73563577575159], [-73.98620836436749, 40.735634251461676], [-73.9862110465765, 40.7356362838482], [-73.98620970547199, 40.73563475955834], [-73.98620836436749, 40.73563272717173], [-73.98620836436749, 40.7356362838482], [-73.98620769381522, 40.73563577575159]]}
In [29]:
plot = plot_clusters(vertex_clusters)
plot.show
Out[29]:

3. Finding the mean polygon

Now we iterate through each vertex cluster and:

  1. find the mean vertex
  2. connect the mean vertices into a mean polygon

For this we need some extra functions in the Array object to find the mean value:

In [30]:
class Array
    def sum
      inject(0.0) { |result, el| result + el }
    end

    def mean 
      sum / size
    end
end
Out[30]:
:mean

Now we need a function that receives the vertex clusters and returns the average vertex for each cluster:

In [31]:
def get_mean_poly(clusters)
  means = {}
  clusters.each do |cluster|
    next if cluster[0] == -1 # ignore cluster -1
    lon = cluster[1].map {|c| c[0]}.mean
    lat = cluster[1].map {|c| c[1]}.mean
    means[cluster[0]] = [lon,lat]
  end
  return means
end
Out[31]:
:get_mean_poly

We test this function with our vertex clusters and plot both (mean vertices as yellow diamonds):

In [32]:
mean_poly = get_mean_poly(vertex_clusters)
Out[32]:
{0=>[-73.9862696826458, 40.73554911699269], 1=>[-73.98632407188416, 40.73557147326963], 2=>[-73.98622447129412, 40.73570855047738], 3=>[-73.98618267156186, 40.73569075660959], 4=>[-73.98621819913387, 40.73564040507624], 5=>[-73.98620896786451, 40.735635064416286]}
In [33]:
# plot clusters with overlaid (yellow) mean points
plot = plot_clusters(vertex_clusters)
# add means
m_x = mean_poly.map { |m| m[1][0] }
m_y = mean_poly.map { |m| m[1][1] }
sc = plot.add(:scatter, m_x, m_y)
color = "#ffff00"
sc.color(color)
sc.shape('diamond')
plot.show
Out[33]:

4. Connecting it all

So far we have a set of points that seem to be the most likely vertices of the mean polygon drawn by our contributors. However, there are many ways in which these points could be connected to each other.

DISCLAIMER:

What follows is a very primitive process that I used to determine the most likely connection between those points. This process is the best I could come up with given my limited math knowledge and time. If you have a better idea of how to do this in Ruby please tweet me at @mgiraldo.

/DISCLAIMER

Before going through with connections we need to validate that we have a reasonable amount of clusters to work with: some vertices may be drawn far away enough for them to not cluster properly and therefore no cluster will be produced. We do this by determining the mean vertices in each polygon ($\bar{m}$) and comparing it with the cluster count ($\sum c$). Right now: $\bar{m}\leq\sum c$ , so we should have at least as many clusters as we have average points per polygon.

Not perfect but works most of the time:

In [34]:
def validate_clusters(clusters, unique_points)
  average = (unique_points.flatten.count.to_f / (unique_points.size * 2).to_f).round
  return clusters.select{|k,v| k!=-1}.size >= average
end
Out[34]:
:validate_clusters
In [35]:
validate_clusters(vertex_clusters, unique_points)
Out[35]:
true

Now that this has been verified we proceed to connect.

The general process to connect mean vertices to each other is:

  1. for each mean vertex:
    1. find the cluster of vertices it represents (from_vertices)
    2. for each vertex in from_vertices:
      1. find the vertex it is connected to (to_vertex)
      2. find the cluster to_vertex belongs to (to_cluster)
      3. add a "vote" for to_cluster
    3. tally the votes
    4. the to_cluster with most votes is the connected cluster
  2. connect the clusters
  3. validate that the connection makes sense (eg: is a directed cycle graph)

Below all the corresponding functions:

In [36]:
def find_connected_point(point, original_points)
  original_points.each do |poly|
    poly.each_with_index do |p,index|
      return poly[index+1] if point === p
    end
  end
  return
end
Out[36]:
:find_connected_point
In [37]:
def find_cluster_for_point(point, clusters)
  clusters.each do |cluster|
    cluster[1].each do |p|
      return cluster[0] if point === p && cluster[0] != -1
    end
  end
  return
end
Out[37]:
:find_cluster_for_point
In [38]:
def connect_clusters(clusters, original_points)
  connections = {}
  # for each cluster
  clusters.each do |cluster|
    # for each point in cluster
    if cluster[0] != -1 # exclude invalid cluster
      cluster_votes = {} # to weigh connection popularity (diff pts might be connected to diff clusters)
      cluster[1].each do |point|
        # find original point connected to it
        connection = find_connected_point(point, original_points)
        connected_cluster = find_cluster_for_point(connection, clusters)
        # if original point belongs to another cluster
        if connected_cluster != nil && connected_cluster != cluster[0]
          # vote for the cluster
          cluster_votes[connected_cluster] = 0 if cluster_votes[connected_cluster] == nil
          cluster_votes[connected_cluster] += 1
        end
      end
      connections[cluster[0]] = cluster_votes.sort_by{|k, v| v}
      next if connections[cluster[0]].size == 0
      connections[cluster[0]] = connections[cluster[0]].reverse[0][0]
    end
  end
  return connections
end
Out[38]:
:connect_clusters
In [39]:
connections = connect_clusters(vertex_clusters, cluster_poly_points)
Out[39]:
{0=>1, 1=>2, 2=>3, 3=>4, 4=>5, 5=>0}

As can be seen above this is a directed cycle graph and the end result is a clean path from the first vertex to the last one.

The fact that the points are sorted (0 to 1, 1 to 2, 2 to 3, and so on) is somewhat coincidential. Below is a basic function that checks the graph and returns a sorted list of clusters (the order we need to follow to draw the mean polygon):

In [40]:
def sort_connections(connections)
  # does some simple check for non-circularity 
  sorted = []
  seen = {}
  as_list = connections.select{|k,v| k}
  done = false
  first = as_list.first[0]
  from = first
  while !done do
    to = connections[from]
    done = true if seen[to] || to == nil || to.size == 0
    seen[to] = true
    from = to
    sorted.push(to)
    done = true if seen.size == connections.size
  end
  return nil if seen.size != connections.size
  return sorted
end
Out[40]:
:sort_connections
In [41]:
# testing sort function
sort_connections(connections)
Out[41]:
[1, 2, 3, 4, 5, 0]

Now we can proceed to build our final mean polygon:

In [42]:
def connect_mean_poly(mean_poly, connections)
  connected = []
  sorted = sort_connections(connections)
  return nil if sorted == nil
  sorted.each do |c|
    connected.push([mean_poly[c][0], mean_poly[c][1]])
  end
  # for GeoJSON, last == first
  first = sorted[0]
  connected.push([mean_poly[first][0], mean_poly[first][1]])
  return connected
end
Out[42]:
:connect_mean_poly
In [43]:
final_polygon = connect_mean_poly(mean_poly, connections)
Out[43]:
[[-73.98632407188416, 40.73557147326963], [-73.98622447129412, 40.73570855047738], [-73.98618267156186, 40.73569075660959], [-73.98621819913387, 40.73564040507624], [-73.98620896786451, 40.735635064416286], [-73.9862696826458, 40.73554911699269], [-73.98632407188416, 40.73557147326963]]

Let's see how all this looks like:

In [44]:
plot = plot_clusters(vertex_clusters)
m_x = final_polygon.map { |m| m[0] }
m_y = final_polygon.map { |m| m[1] }
sc = plot.add(:scatter, m_x, m_y)
color = "#ffff00"
sc.color(color)
sc.shape('diamond')
# add the MEAN POLYGON
final_polygon.each_with_index do |c, i|
  next if i >= final_polygon.size-1
  from = [ final_polygon[i][0], final_polygon[i+1][0] ]
  to   = [ final_polygon[i][1], final_polygon[i+1][1] ]
  plot.add(:line, from, to)
end
plot.show
Out[44]:

To wrap it all up we create a single consensus function that receives a GeoJSON string and returns a list of mean polygons:

In [45]:
def calculate_polygonfix_consensus(geojson)
  output = []
  geom = parse_geojson(geojson)
  centroids = get_all_centroids(geom)
  centroid_clusters = cluster_centroids(centroids)
  centroid_clusters.each do |ccluster|
    next if ccluster[0] == -1
    cluster = ccluster[1] # only the set of latlons
    sub_geom = get_polys_for_centroid_cluster(cluster, centroids, geom)
    next if sub_geom.size == 0
    original_points = get_all_poly_points(sub_geom)
    next if original_points == nil
    unique_points = original_points.map{|poly| poly[1..-1]}
    vertex_clusters = cluster_points(unique_points)
    next if !validate_clusters(vertex_clusters, unique_points)
    mean_poly = get_mean_poly(vertex_clusters)
    next if mean_poly == {}
    connections = connect_clusters(vertex_clusters, original_points)
    next if connections == nil || connections == {}
    poly = connect_mean_poly(mean_poly, connections)
    next if poly == nil || poly.count == 0
    output.push(poly)
  end
  return output
end
Out[45]:
:calculate_polygonfix_consensus
In [46]:
consensus = calculate_polygonfix_consensus(geomstr)
Out[46]:
[[[-73.98632407188416, 40.73557147326963], [-73.98622447129412, 40.73570855047738], [-73.98618267156186, 40.73569075660959], [-73.98621819913387, 40.73564040507624], [-73.98620896786451, 40.735635064416286], [-73.9862696826458, 40.73554911699269], [-73.98632407188416, 40.73557147326963]]]

The GeoJSON of all this might look something like:

In [47]:
geo_json = {:type => "FeatureCollection", :features => consensus.map { |f| {:type => "Feature", :properties => { :id => 1 }, :geometry => { :type => "Polygon", :coordinates =>[f] } } } }.to_json
puts geo_json
{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"id":1},"geometry":{"type":"Polygon","coordinates":[[[-73.98632407188416,40.73557147326963],[-73.98622447129412,40.73570855047738],[-73.98618267156186,40.73569075660959],[-73.98621819913387,40.73564040507624],[-73.98620896786451,40.735635064416286],[-73.9862696826458,40.73554911699269],[-73.98632407188416,40.73557147326963]]]}}]}

Now let's plots the resulting GeoJSON on the original map (purple):

In [48]:
IRuby.html '<iframe src="http://jsfiddle.net/mgiraldo/m4XeU/1/embedded/result/" width=500 height=400></iframe>'
Out[48]:

Voilà! The mean polygon looks good!

Conclusion

This is a first step towards finding geometric consensus from a list of user contributions to a given starting geometry and a map. It is a work in progress and hopefully other ideas can be added to improve this algorithm.

This code is part of NYPL Labs' Building Inspector. Explore and fork the GitHub repository.

This notebook was created by Mauricio Giraldo Arteaga.