The OpenStreetMap is an open initiative to create and provide free geographic data such as street maps to anyone who wants them. It is supported by the OpenStreetMap Foundation which is a UK-registered not-for-profit organization. Currently, OpenStreetMap has an active base of over two million volunteers all across the globe.
The project requires me to choose any area of the world in https://www.openstreetmap.org and use data munging techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the data of the area I have chosen. Finally, choose either MongoDB or SQL as the data schema to finish the project.
New Delhi, National Capital of India
Explore The Union Territory Of New Delhi, India With An Interactive Map, Courtesy https://developer.mapquest.com.
After getting my hands on the data, the first thing I did was to extract a sample from the original file I had downloaded. I wanted to explore the data and get familiar with it. Doing so with a smaller file was way more efficient. As the project required that I work on a file not less than 50MB, I had to make sure my sample doesn't fall below that limit. The code to extract a sample XML file can be found <a href= "make_sample.py">here</a>. I wanted to document the file size in the notebook. So, I wrote a small function that could print the file size in a tabular format. That code is here. In the cell below, you can see the file sizes of both the original and the sample file.
After I had the sample file in place, I wanted to know things about the data right off the bat. The first thing I checked was the number of tags the XML file had. The code that counted the tags in the file is here.
The next thing I explored in the dataset was the top level tags. These tags contained attributes like 'id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp'
. The "uid"
attribute consisted of a unique id number of every contributor who had worked on this dataset. I thought, it would be nice to know the count of such unique users. With the help of this code I was able to do so. The actual count of unique users in the sample dataset is printed in the cell below.
Next, I went on to explore the second level tags. Iterating over the attributes of the second level tags gave a dictionary. This dictionary contained keys "k"
and "v"
. The key "k"
contained values with colons like "addr:street"
or "addr:postcode"
or just simple lower cased values like "phone"
. I took a count of those values with this code. The result of the count is printed below.
In order to understand the dataset better, I examined it from various aspects. This allowed me later to write codes that helped manipulate and reshape the dataset according to the requirements of the project. If you are interested in knowing how I derived at the codes for this project, this link data_wrangling_project_4_notes.ipynb will give you access to the notes I wrote while preparing for this project. Although, I have tried to explain what is happening in codes through comments but I must still warn you that these notes are quite tedious and in many places lack any proper order.
The New Delhi OSM dataset was quite nicely maintained, nonetheless, there were some minor problem with it here and there. Let me take you through a step by step process of introducing you to those problems and how I was able to resolve them.
- Roman Numerals -> South City II
- Lower Casing -> janakpuri
- Mixed Casing -> Shiv Arcade,aacharya niketan,mayor vihar ph-1
- Misspelling -> Pahargan
- Hyhpen -> Phi-02
The second problem I ran into was of invalid pin code entries in some places. In NCR Delhi, a valid pin code has 6 digits. However, some pin codes in the dataset had more than 6 digits or whitespaces or typos. Here are some examples of errors I found related to pin codes:
- 100006
- 1100002
- 2013010
- 110 021
The last problem was with phone numbers. The phone numbers themselves were entered correctly, that is, they had valid 10 and 8 digits with proper area codes. But, they had inconsistent whitespaces and hyphenation. Some examples of irregualar phone numbers are:
- +91 11 3955 5000
- 91-11-2687-6564
- +91 9958080618
- 0120 252 0242
All the disparate pieces of the code requied to make this process work can be found here (Mapping), here (audit street types) and here (update street type). I was able to successfully update street names by passing the data through these codes.
South City II --> South City 2
Phi-02 --> Phi 02
Shiv Arcade,aacharya niketan,mayor vihar ph-1 --> Shiv Arcade,Aacharya Niketan,Mayor Vihar Ph 1
Pahargan --> Paharganj
The code used to normalize postal codes can be found here. After processing the dataset with this code I was able to successfully resolve inconsistencies regarding pin codes.
100006 --> 110006
1100002 --> 110002
2013010 --> 201301
110 021 --> 110021
The code used to update phone numbers is here. I was able to successfully update phone numbers after applying this code to the dataset.
+91 11 3955 5000 --> +911139555000
+91-120-3830000 --> +911203830000
+91 11 4309 0000 --> +911143090000
{'address': {'postcode': '110011', 'street': 'Aurangzeb Road'},
'created': {'changeset': '20864091',
'timestamp': '2014-03-02T13:16:13Z',
'uid': '1960718',
'user': 'apm-wa',
'version': '3'},
'id': '370584997',
'name': 'Claridges Hotel',
'operator': 'Claridges Hotels Pvt. Ltd.',
'phone': '+911139555000',
'pos': [28.6006254, 77.2165438],
'tourism': 'hotel',
'type': 'node',
'website': 'http://www.claridges.com/index.asp'}