P3: OpenStreetMap Data Case Study. Dubai and Abu-Dhabi. Extraction.

The full version of the project: https://github.com/OlgaBelitskaya/nd002_p3/blob/master/Data_Analyst_ND_Project3_Dubai-AbuDhabi_MongoDB.ipynb

0. Code Resources

0.1. Code Library

The basic resources for this project are software libraries for Python and MongoDB.

0.2. References

0.3. Code for Researching the Imported Files and Creating the Data.

Code snippets from the Udacity courses "Intro to Relational Databases" and "Data Wrangling with MongoDB" (udacity.com) were used here for downloading, analysing and cleaning the data. Several helper functions for the same purposes were built on this basis.

1. Map Area

1.1. The map

I have chosen a map sector covering a dynamically developing area of the UAE. To display the area I used the "folium" package and the coordinates of this area in dubai_abu-dhabi.osm.

In [11]:
import folium
# Setup the coordinates of the map center and the zoom option.
map_osm = folium.Map(location=[25.2048, 55.2708], zoom_start=8)
# Add labels with coordinates.
folium.LatLngPopup().add_to(map_osm)
# Setup the coordinates of the map area.
points=[[23.7350, 53.5800], [23.7350, 56.8870], [26.5390, 56.8870], [26.5390, 53.5800], [23.7350, 53.5800]]
# Setup the border line with options.
folium.PolyLine(points, color="red", weight=5, opacity=0.3).add_to(map_osm)
# Display the map.
map_osm
Out[11]:
Bounds: minlat="23.7350" minlon="53.5800" maxlat="26.5390" maxlon="56.8870".

1.2 Extract with Python

There are several ways to extract geodata. One of them is the Python code cell below. It downloads a file in .osm format using the coordinates of the rectangle corners.

In [10]:
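import urllib
# Python 2 API (in Python 3 the equivalent is urllib.request.urlretrieve).
# Uncomment the two lines below to download the extract.
# file00 = urllib.URLopener()
# file00.retrieve("http://overpass-api.de/api/map?bbox=53.5800,23.7350,56.8870,26.5390", "dubai_abu-dhabi0.osm")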
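
One detail that is easy to get wrong here is the coordinate order: folium takes [lat, lon] pairs, while the bbox parameter of the Overpass map call is ordered west, south, east, north (longitude first). A minimal sketch of a helper that builds the URL from the bounds in § 1.1; the function name overpass_map_url is hypothetical and not part of the original notebook.

```python
def overpass_map_url(minlat, minlon, maxlat, maxlon):
    """Build an Overpass API map URL for a bounding box.

    The bbox value is ordered west, south, east, north,
    i.e. longitude comes before latitude.
    """
    bbox = "{:.4f},{:.4f},{:.4f},{:.4f}".format(minlon, minlat, maxlon, maxlat)
    return "http://overpass-api.de/api/map?bbox=" + bbox


# Bounds of the map area from section 1.1.
url = overpass_map_url(23.7350, 53.5800, 26.5390, 56.8870)
print(url)
```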

1.3 Extract from OpenStreetMaps.org

Another way is to download ready-made extracts in several formats from the website:

https://mapzen.com/data/metro-extracts/metro/dubai_abu-dhabi/

The files dubai_abu-dhabi.osm, dubai_abu-dhabi_buildings.geojson, etc. were downloaded.

1.4. Size of downloaded files (bytes).

  • dubai_abu-dhabi0.osm: 404994999
  • dubai_abu-dhabi.osm: 394382598
  • dubai_abu-dhabi_admin.geojson: 1345560
  • dubai_abu-dhabi_roads.geojson: 86725595
  • dubai_abu-dhabi_waterareas.geojson: 2415039

1.5 Osm files

At 394.4 MB, the dubai_abu-dhabi.osm file is not too large to process, and for me it is a very interesting subject for research for many reasons. For example, it is a constantly and rapidly changing territory with impressive ideas about area development. Using the helper function (§ 0.3), I created the sample_dubai_abu-dhabi.osm file from the dubai_abu-dhabi.osm file.

Size of sample_dubai_abu-dhabi.osm: 3947501
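
The sample is about one hundredth of the full file, which is consistent with keeping every 100th top-level element. The sketch below shows one way such a systematic sample can be written. It is modeled on the sampling snippet commonly used in the Udacity course, not copied from the notebook, so the names get_element and write_sample and the default k=100 are assumptions.

```python
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level OSM element whose tag is in `tags`."""
    context = iter(ET.iterparse(osm_file, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # drop already processed elements to save memory


def write_sample(osm_file, sample_file, k=100):
    """Write every k-th top-level element of osm_file to sample_file."""
    with open(sample_file, "wb") as output:
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b"<osm>\n  ")
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding="utf-8"))
        output.write(b"</osm>")
```

A call such as write_sample("dubai_abu-dhabi.osm", "sample_dubai_abu-dhabi.osm") would then produce the sample; a smaller k gives a larger sample.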

1.6 Geojson files

Several types of files can be downloaded from OpenStreetMap: .osm, .geojson, etc. The "geopandas" package can be useful for displaying the data in .geojson files. As an example, here are fragments of the data frames for administrative borders, roads and water areas.

The dimensionality of the data

  • dataframe for admin borders: (231, 6)
  • dataframe for roads: (130060, 13)
  • dataframe for water areas: (1510, 6)
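
The shapes above are reported for geopandas data frames. As a rough cross-check that needs only the standard library, the number of rows and columns can be estimated directly from the GeoJSON file. This is a sketch under the assumption that every feature property becomes one column and the geometry adds one more; the helper name geojson_shape is hypothetical.

```python
import io
import json


def geojson_shape(path):
    """Return (number of features, number of columns) for a GeoJSON file.

    Columns are the union of all property names plus one geometry column.
    """
    with io.open(path, encoding="utf-8") as f:
        features = json.load(f)["features"]
    columns = {"geometry"}
    for feature in features:
        columns.update(feature.get("properties") or {})
    return len(features), len(columns)
```

For example, geojson_shape("dubai_abu-dhabi_admin.geojson") could be compared with the shape of the corresponding data frame.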

Example for the administrative borders:

    | admin_level | geometry | id | name | osm_id | type
  0 | 2.0 | POLYGON ((56.20800613403377 25.25621456273814,... | 1.0 | عمان | -305138.0 | administrative
  1 | 2.0 | (POLYGON ((53.97770508117634 25.22422729239028... | 2.0 | الإمارات العربيّة المتّحدة | -307763.0 | administrative
  2 | 4.0 | (POLYGON ((54.71539805585797 25.06933869038014... | 3.0 | دبي‎ | -3766483.0 | administrative

1.7 Shapefiles

The "basemap" package can be used to display the data in shapefiles.

1.8 Json file

Using the helper function, I created the dubai_abu-dhabi.osm.json file from the dubai_abu-dhabi.osm file.

2. Data (OSM)

Let's explore the data in the .osm files in detail. They contain a lot of information about geographical objects.

2.1 Tags

OpenStreetMap represents physical features on the ground (e.g., roads or buildings) using tags attached to its basic data structures (its nodes, ways, and relations). Tags help describe an item and allow it to be found again by browsing or searching. They are chosen by the item's creator depending on the data point.

{'node': 1890178, 'nd': 2271372, 'bounds': 1, 'member': 9779, 'tag': 503027, 'relation': 2820, 'way': 234327, 'osm': 1}
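
A dictionary like the one above can be produced by streaming through the file and counting element names. The following is a minimal sketch using iterative parsing so that the whole file never has to sit in memory; the function name count_tags is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(osm_file):
    """Count how often each XML element name occurs in an OSM file."""
    counts = Counter()
    # iterparse reports each element once it has been fully read ("end" event).
    for _, elem in ET.iterparse(osm_file):
        counts[elem.tag] += 1
        elem.clear()  # release the element's attributes and children
    return dict(counts)
```

Calling count_tags("dubai_abu-dhabi.osm") should give counts of the kind shown above.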

2.2 Users

Map data is collected from scratch by volunteers (users). We can get their number and the list of their names for this piece of the data.

Number of users - 1895

['-KINGSMAN-', '0508799996', '08xavstj', '12DonW', '12Katniss', '13 digits', '25837', '3abdalla', '4b696d', '4rch', '66444098', '7dhd', '@kevin_bullock', 'AAANNNDDD', 'AC FootCap', 'AE35', 'AHMED ABASHAR', 'AKazariani', 'ASHRAF CHOOLAKKADAVIL', 'A_Sadath', 'AakankshaR', etc.]

Exploring the digital data in this file, we can get a large number of other statistics and information.

2.3 Keys

{'lower': 479404, 'lower_colon': 20602, 'other': 3001, 'problemchars': 20}
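
The four categories correspond to a simple classification of tag keys by regular expression. The patterns below are the ones commonly used in the Udacity data wrangling exercises and are assumed here, since the notebook excerpt does not show them; key_type is a hypothetical name.

```python
import re

# Only lowercase letters and underscores, e.g. "highway".
LOWER = re.compile(r"^([a-z]|_)*$")
# Lowercase with exactly one colon, e.g. "addr:street".
LOWER_COLON = re.compile(r"^([a-z]|_)*:([a-z]|_)*$")
# Characters that are awkward in database field names.
PROBLEMCHARS = re.compile(r"[=\+/&<>;'\"\?%#$@\,\. \t\r\n]")


def key_type(key):
    """Classify a tag key as lower, lower_colon, problemchars or other."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"
```

Counting key_type(tag.get("k")) over all tag elements would then yield a summary like the one above.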

2.4 Street addresses

Number of street addresses - 1789.

2.5 Places

{'town': 31, 'city': 13, 'island': 67, 'hamlet': 97, 'other': 749, 'village': 608}

2.6 Names in English

The number of names in English: 3413.

The list of examples for this piece of the data:

['Tawi Madsus', 'Bani Umar', 'Jabal Nazifi', 'Umm Al Quwain', 'Le Mart', 'DoT Scrap Store', 'Buraimi Police station', 'CEREEN Textiles', 'Faseela Grocery', 'Al Muwaylah', 'Najman Grocery', 'Ras Huwayni', 'Al Musalall Grocery', 'Huwayniyah', 'Dariush Boulevard', 'OM YAMAN', 'Azerbayejan', 'Ayn al Mahab', 'Saad Pharmacy', 'Nabil house', 'Al Jowar', 'Zarub', 'Dubai Homeopathy Health Centre', etc.].

A large number of duplicate English names can be noted on this map.

2.7 Postal Codes

In the UAE, mail is usually delivered to a P.O. Box. As we can see, practically all postcodes are unique. Let's display the list of P.O. Boxes and the number of occurrences of each (1-2 in most cases).

The number of postcodes: 116.

The list of examples for this piece of the data:

{'00962': 1, '34121': 1, '7819': 1, '108100': 1, 'P.O. Box 5618, Abu Dhabi, U.A.E': 1, '8988': 1, '0': 1, '23117': 2, 'P O BOX 3766': 1, '103711': 2, '549': 1, '38495': 1, 'P.O. Box 4605': 1, 'Muhaisnah 4': 1, '20767': 1, '81730': 1, '2504': 1, 'PO Box 6770': 1, '8845': 1, 'PO Box 43377': 1, '97717': 1, '24857': 3, '232574': 1, 'P.O. Box 9770': 1, '60884': 1, '44263': 1, '277': 1, '16095': 1, 'P. O. Box 31166': 1, '502227': 1, '2666': 1, '41318': 1, 'P. O. Box 123234': 1, '00971': 1, '128358': 1, '79506': 1, '115443': 1, '500368': 1, '473828': 4, etc.}
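
The values above mix bare numbers with free-text variants such as 'P O BOX 3766' or 'PO Box 6770'. One possible normalization, shown here only as a sketch, keeps the box number and rejects values that are clearly not a P.O. Box. The pattern and the function name clean_postcode are assumptions, and the function does not check whether a number is a real box (for example, '00971' would pass unchanged).

```python
import re

# Matches "P.O. Box 123", "P O BOX 123", "PO Box 123", "P. O. Box 123", ...
PO_BOX = re.compile(r"^\s*P\.?\s*O\.?\s*Box\s*(\d+)", re.IGNORECASE)


def clean_postcode(value):
    """Return the P.O. Box number as a digit string, or None if unclear."""
    value = value.strip()
    if value.isdigit():
        return value
    match = PO_BOX.match(value)
    return match.group(1) if match else None
```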

2.8 Street names

Example of the street names:

{'07': set(['07']), '1': set(['20B Street, Safa 1', 'City Walk, Jumeirah 1', 'E 1', 'Hattan Street 1', 'aljurf ind 1']), '10': set(['Street 10', 'ind area 10']), '11': set(['shabiya -11']), '111': set(['P.O.Box 111']), '12': set(['District 12', 'Street 12']), '12A': set(['12A']), '12K': set(['District 12K']), '13': set(['Street 13', 'industrial 13', 'street 13\n']), '14': set(['11th street, Musaffah M 14', 'Musaffah Industrial Area Street 14']), '147': set(['147']), '15': set(['sweet 15']), '153': set(['Community 153']), '166': set(['166']), '17': set(['17']), '18': set(['Street 18', 'street 18']), '19': set(['19']), '19th)': set(["Sa'ada Street (19th)"]), '1D': set(['1D'])}
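
The dictionary above groups street names by their last token. A grouping of this kind can be sketched as follows; the regular expression is the one typically used in the Udacity audit exercise and is assumed here, as is the function name audit_street_names.

```python
import re
from collections import defaultdict

# Last whitespace-free token of the name, optionally ending with a period.
STREET_TYPE = re.compile(r"\b\S+\.?$", re.IGNORECASE)


def audit_street_names(names):
    """Group street names by their last token."""
    groups = defaultdict(set)
    for name in names:
        match = STREET_TYPE.search(name)
        if match:
            groups[match.group()].add(name)
    return dict(groups)
```

Because the key is just the last token, entries such as '19th)' show up as their own group, which makes unusual endings easy to spot.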

Example of the updated list of street names:

  • Al Nayhan => Al Nayhan
  • Al Sufouh Rd => Al Sufouh Road
  • JBR Rd => JBR Road
  • Sheikh Rashed Bin Said Rd => Sheikh Rashed Bin Said Road
  • Oud Metha Rd => Oud Metha Road
  • Al Safouh Rd => Al Safouh Road
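
The replacements in this list amount to expanding an abbreviated last word. A minimal sketch of such an update function is given below. The mapping contains only the two abbreviations that actually appear in this report ("Rd" and "St."); both the mapping and the name update_name are illustrative assumptions rather than the notebook's own code.

```python
# Abbreviations seen in this report and their expanded forms.
MAPPING = {"Rd": "Road", "St.": "Street"}


def update_name(name, mapping=MAPPING):
    """Expand an abbreviated last word; leave all other names unchanged.

    Surrounding and repeated whitespace is normalized as a side effect.
    """
    words = name.strip().split()
    if words and words[-1] in mapping:
        words[-1] = mapping[words[-1]]
    return " ".join(words)


print(update_name("Al Sufouh Rd"))
```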

A more accurate correction is possible by comparing with data from other map sites and by studying the real situation on the ground.

3. JSON & Mongo DB

3.1 Database

Using a set of commands, the database was created in MongoDB from the file dubai_abu-dhabi.osm.json.

Dropping collection: /Users/olgabelitskaya/large-repo/dubai_abu-dhabi
Executing: mongoimport -h 127.0.0.1:27017 --db openstreetmap_dubai --collection /Users/olgabelitskaya/large-repo/dubai_abu-dhabi --file /Users/olgabelitskaya/large-repo/dubai_abu-dhabi.osm.json

3.2 Indicators of the dataset (queries & results).

  • dubai_abu_dhabi.find().count() ### Documents: 2124505
  • dubai_abu_dhabi.find({'type':'node'}).count() ### Nodes: 1890130
  • dubai_abu_dhabi.find({'type':'way'}).count() ### Ways: 234063

Let's have a look at one example document from this database. It shows the structure of the data in our case.

dubai_abu_dhabi.find_one()

{u'pos': [25.148038, 55.3862105], u'_id': ObjectId('581282954d13337626b8da7c'), u'type': u'node', u'id': u'21133779', u'created': {u'changeset': u'7291467', u'version': u'2', u'user': u'Tommy', u'timestamp': u'2011-02-15T02:24:42Z', u'uid': u'18885'}}
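
To make the document structure concrete, here is a sketch of how an OSM XML element could be reshaped into such a dictionary before export to JSON. It is reconstructed from the example documents in this report (the fields pos, created, address, node_refs), so the function name shape_element and the exact handling of tag keys are assumptions; MongoDB adds the _id field itself on import.

```python
import xml.etree.ElementTree as ET

CREATED = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(element):
    """Turn a <node> or <way> element into a dict; return None otherwise."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "id": element.get("id"), "created": {}}
    for key in CREATED:
        if element.get(key) is not None:
            doc["created"][key] = element.get(key)
    if element.get("lat") and element.get("lon"):
        doc["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    for tag in element.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            # "addr:street" -> address.street; deeper keys are skipped.
            if key.count(":") == 1:
                doc.setdefault("address", {})[key[5:]] = value
        else:
            # Note: a tag literally named "type" would overwrite the element type.
            doc[key] = value
    refs = [nd.get("ref") for nd in element.iter("nd")]
    if refs:
        doc["node_refs"] = refs
    return doc
```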

3.3 Users (queries & results).

len(dubai_abu_dhabi.distinct('created.user')) ### Number of users: 1885.

sorted(dubai_abu_dhabi.distinct('created.user'))[:50]

List of the first 50 user names: [u'-KINGSMAN-', u'0508799996', u'08xavstj', u'12DonW', u'12Katniss', u'13 digits', u'25837', u'3abdalla', u'4b696d', u'4rch', u'66444098', u'7dhd', u'@kevin_bullock', u'AAANNNDDD', u'AC FootCap', u'AE35', u'AHMED ABASHAR', u'AKazariani', u'ASHRAF CHOOLAKKADAVIL', u'A_Sadath', u'AakankshaR', u'Aal Ibra240380heem', u'Abbadi', u'Abdalmajeed Najmi', u'Abdelhadi Azaizeh', u'Abdul Noor Bank', u'Abdul Rahim Khan', u'Abdul wahab rashid', u'Abdulaziz AlSweda', u'Abdulla Shuqair', u'Abdullah Al Hany', u'Abdullah Alshareef', u'Abdullah Rana', u'Abdullah777', u'Abdurehman', u'AbeMazid', u'Abhin', u'Abiodun Babalola', u'Aboad Jasim', u'Abood Ad', u'Abrarbhai', u'Absamc', u'AbuFazal', u'Abud', u'Adel alsaad', u'Adib Yz', u'Adil Alsuleimani', u'Adley', u'Adm Vtc', u'Adnaan Abrahams']

The database makes it possible to evaluate the contribution of each individual user to map editing.

dubai_abu_dhabi.find({"created.user": "Ahmed Silo"}).count() ### Number of documents for the user Ahmed Silo: 7.

With simple queries against the database, the user can select information of interest.

Let us list the three most active editors of this map section:

In [ ]:
dubai_abu_dhabi.aggregate([{ "$group" : {"_id" : "$created.user", "count" : {"$sum" : 1} } }, 
                           { "$sort" : {"count" : -1} }, {"$limit" : 3} ] )

{u'_id': u'eXmajor', u'count': 492808}, {u'_id': u'chachafish', u'count': 156874}, {u'_id': u'Seandebasti', u'count': 125767}

The number of users with one document:

In [ ]:
dubai_abu_dhabi.aggregate( [ { "$group" : {"_id" : "$created.user", "count" : { "$sum" : 1} } }, 
                            { "$group" : {"_id" : "$count", "num_users": { "$sum" : 1} } },
                            { "$sort" : {"_id" : 1} }, { "$limit" : 1} ] )

{u'_id': 1, u'num_users': 646}.
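
The two consecutive $group stages can be read as "count documents per user, then count users per document count". The same logic in plain Python, as a small illustration on made-up data (the user names are invented and the function name contribution_histogram is hypothetical):

```python
from collections import Counter


def contribution_histogram(users):
    """Map a document count to the number of users who have that count."""
    per_user = Counter(users)          # first $group: documents per user
    return Counter(per_user.values())  # second $group: users per count


# One entry per document, holding the name of the user who created it.
users = ["a", "b", "a", "c", "a", "d"]
hist = contribution_histogram(users)
print(hist[1])  # 3 users ("b", "c", "d") have exactly one document
```

Sorting by the count and taking the first entry, as the pipeline does, returns the bucket for the smallest count, which here is the bucket for one document.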

3.4 Places (queries & results).

The three most common places:

In [ ]:
dubai_abu_dhabi.aggregate( [ { "$match" : { "address.place" : { "$exists" : 1} } }, 
                            { "$group" : { "_id" : "$address.place", "count" : { "$sum" : 1} } }, 
                            { "$sort" : { "count" : -1}}, {"$limit":3}] )

{u'_id': u'Yas Mall', u'count': 14}, {u'_id': u'Jumeirah Village Triangle', u'count': 10}, {u'_id': u'Deerfields Townsquare Shopping Centre', u'count': 2}.

The list of 10 most common types of buildings:

In [ ]:
dubai_abu_dhabi.aggregate([{'$match': {'building': { '$exists': 1}}}, 
                           {'$group': {'_id': '$building','count': {'$sum':1}}}, 
                           {'$sort': {'count': -1}}, {'$limit': 10}])

{u'_id': u'yes', u'count': 43834}, {u'_id': u'house', u'count': 4216}, {u'_id': u'apartments', u'count': 2910}, {u'_id': u'residential', u'count': 2606}, {u'_id': u'roof', u'count': 1026}, {u'_id': u'hangar', u'count': 825}, {u'_id': u'warehouse', u'count': 380}, {u'_id': u'mosque', u'count': 378}, {u'_id': u'garage', u'count': 314}, {u'_id': u'commercial', u'count': 313}

The list of 10 most common facilities:

In [ ]:
dubai_abu_dhabi.aggregate([{'$match': {'amenity': {'$exists': 1}}}, 
                           {'$group': {'_id': '$amenity', 'count': {'$sum': 1}}},
                           {'$sort': {'count': -1}}, {'$limit': 10}])

{u'_id': u'parking', u'count': 5602}, {u'_id': u'place_of_worship', u'count': 1443}, {u'_id': u'restaurant', u'count': 1372}, {u'_id': u'school', u'count': 489}, {u'_id': u'fast_food', u'count': 442}, {u'_id': u'fuel', u'count': 438}, {u'_id': u'cafe', u'count': 403}, {u'_id': u'bank', u'count': 317}, {u'_id': u'pharmacy', u'count': 311}, {u'_id': u'shelter', u'count': 247}.

The list of the 3 most common postcodes:

In [ ]:
dubai_abu_dhabi.aggregate( [ { "$match" : { "address.postcode" : { "$exists" : 1} } }, 
                            { "$group" : { "_id" : "$address.postcode", "count" : { "$sum" : 1} } },  
                            { "$sort" : { "count" : -1}}, {"$limit": 3}] )

{u'_id': u'811', u'count': 5}, {u'_id': u'473828', u'count': 4}, {u'_id': u'24857', u'count': 3}.
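
The "most common values" queries in this section all share the same match, group, sort and limit shape and differ only in the field name and the limit. A small helper can build the pipeline from those two parameters. This is a sketch; top_n_pipeline is a hypothetical name, and the call shown in the comment assumes the dubai_abu_dhabi collection object used throughout this report.

```python
def top_n_pipeline(field, n):
    """Build an aggregation pipeline for the n most common values of a field."""
    return [
        {"$match": {field: {"$exists": 1}}},
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": n},
    ]


pipeline = top_n_pipeline("address.postcode", 3)
# With a live connection this could be run as:
# list(dubai_abu_dhabi.aggregate(pipeline))
```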

Counting postcodes with one document:

In [ ]:
dubai_abu_dhabi.aggregate( [ { "$group" : {"_id" : "$address.postcode", "count" : { "$sum" : 1} } },
                            { "$group" : {"_id" : "$count", "count": { "$sum" : 1} } },
                            { "$sort" : {"_id" : 1} }, { "$limit" : 1} ] )

[{u'_id': 1, u'count': 85}]

3.5 Update values in Mongo DB (queries & results).

At the preliminary stage of familiarization with the information in the .osm file, problematic and erroneous values in the dataset were found. Now we can replace the wrong values and solve several important tasks at the same time: check the data about specific geo-objects, use additional information in fields, and update values.

An example of a document with one wrong value: dubai_abu_dhabi.find_one({'address.street':'Twam St.'})

{u'_id': ObjectId('58339f5578c5c4115eb124ff'), u'address': {u'city': u'Al Ain', u'street': u'Twam St.'}, u'building': u'residential', u'created': {u'changeset': u'22394079', u'timestamp': u'2014-05-17T18:34:16Z', u'uid': u'2079950', u'user': u'Anna23', u'version': u'1'}, u'id': u'282551277', u'name': u"Maqam 2 Female Students' Accomodation", u'node_refs': [u'2864941597', u'2864941598', u'2864941599', u'2864941600', u'2864942001', u'2864941597'], u'type': u'way'}

The process of replacing:

dubai_abu_dhabi.update_one({'_id': ObjectId('58339c6b78c5c4115e9f47e9')}, {'$set': {'address.street': 'Al Sufouh Road'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339ba178c5c4115e97e58d')}, {'$set': {'address.street': 'Sheikh Rashed Bin Said Road'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339dab78c5c4115ea6fc12')}, {'$set': {'address.street': 'Oud Metha Road'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339c6b78c5c4115e9f47e4')}, {'$set': {'address.street': 'Jumeirah Beach Road'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339c6b78c5c4115e9f47e5')}, {'$set': {'address.street': 'Oud Metha Road'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339f7878c5c4115eb287b1')}, {'$set': {'address.street': 'Al Falak Street'}}, upsert=False)

dubai_abu_dhabi.update_one({'_id': ObjectId('58339f5578c5c4115eb124ff')}, {'$set': {'address.street': 'Twam Street'}}, upsert=False)

An example of checking the result: dubai_abu_dhabi.find_one({'_id': ObjectId('58339f5578c5c4115eb124ff')})

{u'_id': ObjectId('58339f5578c5c4115eb124ff'), u'address': {u'city': u'Al Ain', u'street': u'Twam Street'}, u'building': u'residential', u'created': {u'changeset': u'22394079', u'timestamp': u'2014-05-17T18:34:16Z', u'uid': u'2079950', u'user': u'Anna23', u'version': u'1'}, u'id': u'282551277', u'name': u"Maqam 2 Female Students' Accomodation", u'node_refs': [u'2864941597', u'2864941598', u'2864941599', u'2864941600', u'2864942001', u'2864941597'], u'type': u'way'}

We apply the same manipulations to several other mistakes and fields. The reader can see this in the full version of the project.

4. Problems and errors

  • One of the main problems of public maps is that place names are not duplicated in other languages. If the translation process could be automated by building up a common database of map names in many languages, it would save users from many difficulties and mistakes.
  • The next problem is the presence of a large number of databases (including mapping ones) describing the same map objects. Integration procedures for the already available data would relieve many people of unnecessary work and save time and effort.
  • Obviously, the information about the number of buildings and their purpose is incomplete. The completeness of public maps can be increased by bringing new users into the mapping process. To this end, entering information should be as simple as possible: for example, a choice among available options with automatic filling of the fields for linked options (for example, linking the name of a street to the administrative area in which it is located).
  • There are a number of mistakes and typos, as in any public data. Well-known methods can be proposed for correcting them: automatic comparison with existing data and verification of new data by other users.
  • The lack of a uniform postal code system in this particular dataset complicates the identification and verification of postcodes.
  • While working on the project, I spent a lot of time converting one type of data file to another. Each format has its own advantages and disadvantages. It may be possible to design a universal file type that can store data of any kind, combining the advantages of the existing types and usable in most existing programming languages.
  • It seems to me appropriate to correct errors in the data after loading the files into the database. Sometimes a record that is a mistake in terms of how a particular field is filled in simply contains additional information about geo-objects.

5. Data Overview

5.1 Description of the data structure:

  • nodes - points in space with basic characteristics (lat, long, id, tags);
  • ways - defining linear features and area boundaries (an ordered list of nodes);
  • relations - tags and also an ordered list of nodes, ways and/or relations as members which is used to define logical or geographic relationships between other elements.

5.2 Indicators.

Size of the .osm file: 394.4 MB. Size of the .osm sample file: 3.9 MB.

Nodes: 1890178. Ways: 234327. Relations: 2820. Tags: 503027. Users: 1895.

5.3 MongoDB

With a few commands we can obtain descriptive statistics for the collections and the database.

DB statistics: db.command("dbstats")

{u'avgObjSize': 234.44116488311394, u'collections': 1, u'dataSize': 498071427.0, u'db': u'openstreetmap_dubai', u'indexSize': 19124224.0, u'indexes': 1, u'numExtents': 0, u'objects': 2124505, u'ok': 1.0, u'storageSize': 154611712.0}

We can get the collection statistics as well.

6. Conclusion

This project has been educational for me. I believe that one of the main tasks in this case was to study methods for extracting and exploring openly accessible map data. For example, I used a systematic sample of elements from the original .osm file to try out the processing functions before applying them to the whole dataset. As a result, I have gained some new useful skills in parsing, processing, storing, aggregating and applying data.

During the research I read through quite a lot of other students' projects on this topic. After my own research and a review of the results of other authors, I have formed a definite opinion about the ideas behind OpenStreetMap.

This website can be viewed as a testing ground for the interaction of a large number of people (including non-professionals) in creating a unified information space. The prospects of such cooperation cannot be overemphasized. The success of the project will make it possible to implement ambitious plans in the field of accessible information technologies, the creation of virtual reality and many other areas.

Increasing the number of users leads to many positive effects in this kind of project: 1) a rapid improvement in the accuracy, completeness and timeliness of information; 2) bringing the information space closer to reality and making the evaluation of data more objective; 3) reducing the effort needed to clean erroneous details from the data.

Ideas for improving the OpenStreetMap project are simple and natural. The number of users can be increased with additional options such as rating marks (e.g., the best restaurant or the most convenient parking). The popularity of the project may also grow thanks to temporary pop-up messages from users (displayed for no more than 1-3 hours) with current information about a geographic location (e.g., the presence of traffic jams).

7. Feedback

After the review https://review.udacity.com/#!/reviews/293667 I created an additional notebook to illustrate the preprocessing of one of the data fields.