
Project 4: Wrangle OpenStreetMap Data

This is the fourth project in the Udacity Data Analyst Nanodegree Program.

What Is OpenStreetMap?

OpenStreetMap is an open initiative to create and provide free geographic data, such as street maps, to anyone who wants it. It is supported by the OpenStreetMap Foundation, a UK-registered not-for-profit organization. Currently, OpenStreetMap has an active base of over two million volunteers across the globe.

Project Overview

The project requires me to choose any area of the world from https://www.openstreetmap.org and use data munging techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the data of the area I have chosen. Finally, I had to choose either MongoDB or SQL as the data schema to complete the project.

Chosen Area:

New Delhi, National Capital of India

Explore The Union Territory Of New Delhi, India With An Interactive Map, Courtesy https://developer.mapquest.com.


Early Observations

After getting my hands on the data, the first thing I did was to extract a sample from the original file I had downloaded. I wanted to explore the data and get familiar with it, and doing so with a smaller file was far more efficient. As the project required that I work on a file of not less than 50 MB, I had to make sure my sample did not fall below that limit. The code to extract a sample XML file can be found <a href="make_sample.py">here</a>. I also wanted to document the file sizes in the notebook, so I wrote a small function that prints them in a tabular format. That code is here. In the cell below, you can see the file sizes of both the original and the sample file.

+----------+----------+
|   FILE   |   SIZE   |
+----------+----------+
| Original | 716.5 MB |
|  Sample  | 72.4 MB  |
+----------+----------+
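For readers who want a feel for the sampling step, here is a minimal sketch of how a sample can be extracted by keeping every K-th top-level element. The file names and the value of K are assumptions for illustration; the actual code lives in make_sample.py linked above.

```python
# Sketch: write every K-th top-level element of the OSM file to a sample file.
import xml.etree.ElementTree as ET

OSM_FILE = "new_delhi.osm"            # assumed name of the original download
SAMPLE_FILE = "new_delhi_sample.osm"  # assumed name of the sample
K = 10                                # keep every 10th top-level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top-level element, clearing the tree as we go to keep memory low."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n')
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % K == 0:
            output.write(ET.tostring(element, encoding='utf-8'))
    output.write(b'</osm>')
```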

After I had the sample file in place, I wanted to know things about the data right off the bat. The first thing I checked was the number of tags the XML file had. The code that counted the tags in the file is here.

+----------+--------+
| ELEMENTS | COUNT  |
+----------+--------+
|   osm    |   1    |
| relation |  619   |
|  member  |  2795  |
|   way    | 69426  |
|   tag    | 82169  |
|   node   | 340762 |
|    nd    | 421392 |
+----------+--------+
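Counting tags like this can be done with a streaming parse, so even the 700+ MB original never has to be loaded at once. A minimal sketch (the file name is an assumption):

```python
# Sketch: count how often each XML tag occurs in the file.
import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags(filename):
    tag_counts = defaultdict(int)
    for _, elem in ET.iterparse(filename):   # streams 'end' events, one element at a time
        tag_counts[elem.tag] += 1
        elem.clear()                         # free the element once it has been counted
    return dict(tag_counts)

print(count_tags("new_delhi_sample.osm"))
```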

The next thing I explored in the dataset was the top-level tags. These tags contained attributes like 'id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset' and 'timestamp'. The "uid" attribute holds a unique id for every contributor who has worked on this dataset, and I thought it would be nice to know the count of such unique users. With the help of this code I was able to do so. The actual count of unique users in the sample dataset is printed in the cell below.

Number of unique users: 824
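The count itself comes from collecting the "uid" attribute wherever it appears. A small sketch of the idea (file name assumed):

```python
# Sketch: count unique contributors by collecting every 'uid' attribute.
import xml.etree.ElementTree as ET

def count_unique_users(filename):
    users = set()
    for _, elem in ET.iterparse(filename):
        uid = elem.get('uid')          # only contributed elements carry a 'uid'
        if uid is not None:
            users.add(uid)
    return len(users)

print("Number of unique users:", count_unique_users("new_delhi_sample.osm"))
```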

Next, I went on to explore the second-level tags. Each second-level tag carries a pair of attributes, "k" and "v". The "k" attribute contains values with colons, like "addr:street" or "addr:postcode", or simple lower-cased values like "phone". I took a count of those values with this code. The result of the count is printed below.

+-------+-------------+--------------+-------+
| LOWER | LOWER_COLON | PROBLEMCHARS | OTHER |
+-------+-------------+--------------+-------+
| 80541 |     1590    |      0       |   38  |
+-------+-------------+--------------+-------+
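The four buckets above come from matching each "k" value against a set of regular expressions, roughly as follows (a sketch; the patterns mirror the categories in the table):

```python
# Sketch: classify the 'k' attribute of every <tag> element into four buckets.
import re
import xml.etree.ElementTree as ET

lower = re.compile(r'^([a-z]|_)*$')                           # plain lower-case keys, e.g. "phone"
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')          # colon-separated keys, e.g. "addr:street"
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')  # characters that would break keys

def key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k']
        if lower.search(k):
            keys['LOWER'] += 1
        elif lower_colon.search(k):
            keys['LOWER_COLON'] += 1
        elif problemchars.search(k):
            keys['PROBLEMCHARS'] += 1
        else:
            keys['OTHER'] += 1
    return keys

def classify_keys(filename):
    keys = {'LOWER': 0, 'LOWER_COLON': 0, 'PROBLEMCHARS': 0, 'OTHER': 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

print(classify_keys("new_delhi_sample.osm"))
```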

In order to understand the dataset better, I examined it from various aspects. This later allowed me to write the code that manipulated and reshaped the dataset according to the requirements of the project. If you are interested in knowing how I arrived at the code for this project, this link data_wrangling_project_4_notes.ipynb will give you access to the notes I wrote while preparing for it. Although I have tried to explain what is happening in the code through comments, I must still warn you that these notes are quite tedious and in many places lack proper order.

The New Delhi OSM dataset was quite nicely maintained; nonetheless, there were some minor problems with it here and there. Let me take you through a step-by-step account of those problems and how I was able to resolve them.

Problems Encountered

In this dataset, I was primarily concerned with three main problems I observed. The first problem was with the street names. Street addresses contained misspellings, inconsistent hyphenation (-), mixed casing and inconsistent use of abbreviations. Some examples of such problems are:
  • Roman Numerals -> South City II
  • Lower Casing -> janakpuri
  • Mixed Casing -> Shiv Arcade,aacharya niketan,mayor vihar ph-1
  • Misspelling -> Pahargan
  • Hyphen -> Phi-02

The second problem I ran into was invalid pin code entries in some places. In NCR Delhi, a valid pin code has 6 digits. However, some pin codes in the dataset had more than 6 digits, contained whitespace, or had typos. Here are some examples of errors I found related to pin codes:

  • 100006
  • 1100002
  • 2013010
  • 110 021

The last problem was with phone numbers. The phone numbers themselves were entered correctly, that is, they were valid 10- and 8-digit numbers with proper area codes, but they used whitespace and hyphens inconsistently. Some examples of irregular phone numbers are:

  • +91 11 3955 5000
  • 91-11-2687-6564
  • +91 9958080618
  • 0120 252 0242

Audit Strategy

All three problems I encountered in the dataset were quite distinct from one another. So, my thought process was quite straightforward: handle every problem independently while keeping the solutions as simple as possible.

Normalising Street Names:

Fixing street names was a three step process:
  1. Confirm it is a street name
  2. Check the last word of the street name. If that word exists in an expected list of street types, let it pass; if not, send it on to be updated
  3. Substitute the incorrect street name with the correct one

All the disparate pieces of code required to make this process work can be found here (Mapping), here (audit street types) and here (update street type). I was able to successfully update street names by passing the data through this code.

  • South City II --> South City 2
  • Phi-02 --> Phi 02
  • Shiv Arcade,aacharya niketan,mayor vihar ph-1 --> Shiv Arcade,Aacharya Niketan,Mayor Vihar Ph 1
  • Pahargan --> Paharganj
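To make the three steps concrete, here is a minimal sketch of the audit and update logic. The expected list and the mapping below are illustrative guesses based on the examples in this write-up, not the exact ones in the project files linked above.

```python
# Sketch of the street-name audit/update step.
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)   # grabs the last word of a street name
expected = ["Road", "Marg", "Street", "Lane", "Nagar", "Vihar", "Market", "Enclave"]  # assumed list

# hand-curated corrections for known problem values (taken from the examples above)
mapping = {
    "South City II": "South City 2",
    "Pahargan": "Paharganj",
}

def audit_street_type(street_types, street_name):
    """Collect street names whose last word is not in the expected list."""
    match = street_type_re.search(street_name)
    if match and match.group() not in expected:
        street_types[match.group()].add(street_name)

def update_street_name(name, mapping):
    """Apply the mapping, replace hyphens with spaces and title-case each comma-separated part."""
    if name in mapping:
        return mapping[name]
    name = name.replace("-", " ")
    return ",".join(part.strip().title() for part in name.split(","))

street_types = defaultdict(set)
audit_street_type(street_types, "janakpuri")
print(dict(street_types))   # {'janakpuri': {'janakpuri'}}

print(update_street_name("Shiv Arcade,aacharya niketan,mayor vihar ph-1", mapping))
# Shiv Arcade,Aacharya Niketan,Mayor Vihar Ph 1
```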

Normalising Postal Codes:

Fixing postal codes was also a three step process:
  1. Confirm it is a postal code
  2. Pass all the postal codes through a series of checks. If whitespace, extra digits or any other basic discrepancy is found, update that postal code
  3. Lastly, send it through a validator that checks whether it is a six-digit postal code. Valid pin codes are returned; the rest are collected in a list for further inspection

The code used to normalize postal codes can be found here. After processing the dataset with this code I was able to successfully resolve inconsistencies regarding pin codes.

  • 100006 --> 110006
  • 1100002 --> 110002
  • 2013010 --> 201301
  • 110 021 --> 110021
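A minimal sketch of this postal-code cleanup, assuming the hand-verified corrections shown in the before/after list above:

```python
# Sketch of the postal-code normalisation and validation.
import re

# corrections taken from the examples above; assumed to have been verified by hand
corrections = {
    "100006": "110006",
    "1100002": "110002",
    "2013010": "201301",
}

def update_postcode(postcode):
    """Return a valid six-digit pin code, or None so it can be collected for inspection."""
    cleaned = re.sub(r'\s+', '', postcode)        # '110 021' -> '110021'
    cleaned = corrections.get(cleaned, cleaned)   # fix known bad values
    if re.fullmatch(r'\d{6}', cleaned):
        return cleaned
    return None

for raw in ["110 021", "100006", "1100002", "2013010"]:
    print(raw, "->", update_postcode(raw))
```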

Normalising Phone Numbers:

Phone numbers did not need much fixing. However, their inconsistencies needed to be normalised. I did this in two steps:
  1. Confirm it is a phone number
  2. Return the phone number after stripping all whitespace and hyphens

The code used to update phone numbers is here. I was able to successfully update phone numbers after applying this code to the dataset.

  • +91 11 3955 5000 --> +911139555000
  • +91-120-3830000 --> +911203830000
  • +91 11 4309 0000 --> +911143090000
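Since only the separators were at fault, the fix boils down to a single substitution. A sketch:

```python
# Sketch: strip whitespace and hyphens from otherwise valid phone numbers.
import re

def update_phone(number):
    return re.sub(r"[\s\-]", "", number)

print(update_phone("+91 11 3955 5000"))   # +911139555000
print(update_phone("+91-120-3830000"))    # +911203830000
```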

Reshaping Data And Writing Data To JSON

Before ingesting the dataset into a database management system, the dataset had to be remodelled into proper data structures. Besides that, I also had to make sure that as the data got remodelled, it was simultaneously cleansed of the problems discussed earlier. Finally, because I chose MongoDB as my preferred database, I also had to write the data out in JSON format. By putting all the pieces of code together into the final project code linked here, I was able to pull this off. Here is an example of a JSON document from my dataset.
{'address': {'postcode': '110011', 'street': 'Aurangzeb Road'}, 'created': {'changeset': '20864091', 'timestamp': '2014-03-02T13:16:13Z', 'uid': '1960718', 'user': 'apm-wa', 'version': '3'}, 'id': '370584997', 'name': 'Claridges Hotel', 'operator': 'Claridges Hotels Pvt. Ltd.', 'phone': '+911139555000', 'pos': [28.6006254, 77.2165438], 'tourism': 'hotel', 'type': 'node', 'website': 'http://www.claridges.com/index.asp'}
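For completeness, here is a rough sketch of the reshaping step, modelled on the document structure shown above; the full project code linked in the paragraph handles more cases, and the file names here are assumptions.

```python
# Sketch: reshape node/way elements into dictionaries and write them out as JSON lines.
import json
import xml.etree.ElementTree as ET

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for attr, value in element.attrib.items():
        if attr in CREATED:
            doc["created"][attr] = value
        elif attr not in ("lat", "lon"):
            doc[attr] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        k, v = tag.attrib["k"], tag.attrib["v"]
        # the street, postcode and phone cleaning functions from above would be applied to v here
        if k.startswith("addr:") and k.count(":") == 1:
            doc.setdefault("address", {})[k.split(":", 1)[1]] = v
        elif ":" not in k:
            doc[k] = v
    node_refs = [nd.attrib["ref"] for nd in element.iter("nd")]
    if node_refs:
        doc["node_refs"] = node_refs
    return doc

def process_map(file_in, file_out):
    """Stream the OSM XML and write one JSON document per line."""
    with open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            shaped = shape_element(element)
            if shaped:
                fo.write(json.dumps(shaped) + "\n")

process_map("new_delhi_sample.osm", "new_delhi_sample.osm.json")
```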

JSON File Sizes

After writing the data into JSON, I checked the size of the file. Below you will find the file sizes of all the files related to this project, printed in a tabular format.

+----------+----------+
|   FILE   |   SIZE   |
+----------+----------+
| Original | 716.5 MB |
|  Sample  | 72.4 MB  |
|   JSON   | 83.8 MB  |
+----------+----------+

Creating a MongoDB Database

After the dataset was cleansed, remodelled and formatted, it was ready to be ingested into MongoDB. I was able to import the dataset from the Terminal with the command <code style = "color: green">mongoimport --db delhiOSM --collection OSM_data <new_delhi_sample.osm.json</code>. Then, through Robo 3T, I was able to confirm that the data had made it into the system.
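A quick way to double-check the import from Python, assuming MongoDB is running on the default local port and using the database and collection names from the command above:

```python
# Sketch: confirm the documents landed in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.delhiOSM
print(db.OSM_data.count_documents({}))   # should equal the number of lines in the JSON file
```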

Queries

Below are some queries I ran with pymongo while MongoDB was running locally. In a couple of cases I have also tried to present visualizations of the statistical queries.

Quick stats about the database:
{'db': 'delhiOSM', 'collections': 1, 'views': 0, 'objects': 410188, 'avgObjSize': 232.72264668883537, 'dataSize': 95460037.0, 'storageSize': 28884992.0, 'numExtents': 0, 'indexes': 1, 'indexSize': 3850240.0, 'ok': 1.0}

Number of documents in the database: 410188
Total number of nodes: 340762
Total number of ways: 69411
Total number of unique users: 818

Top 10 amenities:

+------------------+-------+
|     AMENITY      | COUNT |
+------------------+-------+
|      school      |  102  |
| place_of_worship |   31  |
|     parking      |   29  |
|    restaurant    |   25  |
|       bank       |   24  |
|     hospital     |   22  |
|       fuel       |   20  |
|       atm        |   19  |
|     college      |   15  |
|    fast_food     |   10  |
+------------------+-------+

Plot of top 10 amenities

Top 10 users:

+---------------+-------+
|      USER     | COUNT |
+---------------+-------+
|    Oberaffe   | 26656 |
|   premkumar   | 16446 |
|    saikumar   | 15963 |
|    Naresh08   | 13601 |
|    anushap    | 13338 |
|     sdivya    | 12995 |
|    anthony1   | 12564 |
|   himabindhu  | 12315 |
| sathishshetty | 12250 |
|    Apreethi   | 11338 |
+---------------+-------+

Plot of top 10 users
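For reference, here is a sketch of the kind of pymongo calls behind these numbers, using the collection and field names established above:

```python
# Sketch of the queries behind the statistics above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client.delhiOSM.OSM_data

print(client.delhiOSM.command("dbstats"))        # quick stats about the database
print(coll.count_documents({}))                  # total documents
print(coll.count_documents({"type": "node"}))    # nodes
print(coll.count_documents({"type": "way"}))     # ways
print(len(coll.distinct("created.uid")))         # unique users

# top 10 amenities
print(list(coll.aggregate([
    {"$match": {"amenity": {"$exists": 1}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
])))

# top 10 contributors
print(list(coll.aggregate([
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
])))
```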

Suggestions For Improvement

The OpenStreetMap map editors give their users all the tools needed to curate, edit and update a geospatial region in a systematic way. However, it depends on who is sitting on the other end of the editor and using those tools. As OpenStreetMap is fully run by a community of volunteers, these users could be anybody from professional mappers, surveyors and GIS professionals to casual, enthusiastic users. In my opinion, knowing what kind of user is contributing to the data can help curb inconsistencies in the data to some extent. Therefore, a very high-level suggestion to improve the quality of data would be to categorise users as casual or professional. Tagging a user as casual or professional could give a hint about who the user might be. For new users this kind of tagging could be done at the time of joining, while existing users could update their profiles with this information.

In all fairness, like a casual user, a professional user too is likely to make errors. So, tagging alone cannot certify the quality of data being fed into the system. However, it can help in running basic statistics based on user tags. For example, someone analysing data for her region of interest could easily compute statistics based on contributions made by professional versus casual users. Furthermore, because entries can be traced back to user tags, bad entries can be flagged, and the proportion of professional versus casual users responsible for such entries can then be computed. This information could be shared on the OpenStreetMap blog, Users’ Diaries, forums, etc. Such statistics would help the community learn more about itself and improve the system.

Another suggestion would be to emphasise the role of local contributors. Community meet-ups and events should be organised to encourage contributions from people who have knowledge of their locality. Data maintained and updated by such contributors is much more likely to be accurate.

Which brings me to my final suggestion. Besides taking on volunteers only for mapping, the OpenStreetMap Foundation should also look into taking on volunteers for promotional and advertising activities. Responsibilities for such volunteers could include:

  1. Organising community meet ups
  2. Preparing promotional material
  3. Maintaining a social media presence and a YouTube channel
  4. Orienting new users towards ethical mapping practices
  5. Urging technical users to prefer manual edits over automated edits
  6. Encouraging contributions from local contributors

Conclusion

While working on the New Delhi OSM data, I observed incorrect entries, misspellings and other irregularities. I also observed that in some places the data was thoroughly maintained while in others it was not. These limitations made the dataset unreliable for direct use. Nonetheless, the errors and irregularities were not terribly threatening, and after familiarising myself with the dataset I was able to normalise them. This entire process of understanding the XML data structure, observing the errors, cleaning them, reshaping the data structures, writing those structures to JSON, ingesting the JSON into a database and finally querying that database gave me a chance to understand data wrangling quite intimately. Through this project I was also introduced to the OpenStreetMap initiative, which I feel is a tremendous one. They have a great community of really smart contributors all around the world, and their contributions deserve a big round of applause. They have set up a great base for themselves and, with a little more funding and some time, this initiative is going to go really far.