OpenStreetMap is an open project, which means it's free and everyone can use it and edit as they like. OpenStreetMap is direct competitor of Google Maps. How OpenStreetMap can compete with the giant you ask? It's depend completely on crowd sourcing. There's lot of people willingly update the map around the world, most of them fix their map country.

Openstreetmap is so powerful, and rely heavily on the human input. But its strength also the downfall. Everytime there's human input, there's always be human error.It's very error prone.I choose whole places of Jakarta. Jakarta is the capital of Indonesia.This dataset is huge, over 250,000 examples. It's my hometown, and i somewhat want to help the community.

Problems Encountered in the Map

When I open OpenStreetMap dataset, I notice following issues:

  • Street type abbreviations
  • Incosistent phone number format

Street Type Abbreviations

Take the name of the street for example. People like to abbreviate the type of the street. Street become St. st. In Indonesia, 'Jalan'(Street-Eng), also abbreviated as Jln, jln, jl, Jln. It maybe get us less attention. But for someone as Data Scientist/Web Developer, they expect the street to have generic format.

'Jalan Sudirman' -> Jalan <name> -> name = Sudirman
'Jln Sudirman' -> Jalan <name> -> ERROR!

There are also some users that input street name in two type name, street address and full address. I incorporate all of address name to street address, and result in the following,

In [122]:
pipeline = [{'$match': {'address.street':{'$exists':1}}},
            {'$project': {'_id': '$address.street'}},
            {'$limit' : 5}]
result  = db.jktosm.aggregate(pipeline)['result']
[{u'_id': u'Jalan Sultan Iskandar Muda Kebayoran Lama'},
 {u'_id': u'Jalan Sahari'},
 {u'_id': u'Jalan HR Rasuna Said'},
 {u'_id': u'Jalan HR Rasuna Said'},
 {u'_id': u'Jalan HR Rasuna Said'}]

Inconsistent phone number format

We also have inconsistent phone number:

    {u'_id': u'021-720-0981209'}
    {u'_id': u'(021) 7180317'}
    {u'_id': u'081807217074'}
    {u'_id': u'+62 857 4231 9136'}

This makes difficult for any developer to parse to common format. The country of Jakarta is Indonesia, and it has country code of +62. And we see here that some users prefer to have separator with dash or spaces. Some users even separate the country code and city code(Jakarta: 21) in parantheses. We also see that the numbers prefix with 0, which can be used if you're in Indonesia, but not internationally.

So we have to convert these numbert into common format. Number could benefit by incorporating spaces, that way if developer uses the data, phone number can be extracted by country code, city code, and the rest of the number. Since mobile number doesn't have city code, we can just leave it alone. We can't take prefix all of the number by country code, since operator phone number, like McDonalds, doesn't need country code. So after I solve all of this issues, the results,

In [125]:
pipeline = [{'$match': {'phone':{'$exists':1}}},
            {'$project': {'_id': '$phone'}},
            {'$limit' : 5}]
result  = db.jktosm.aggregate(pipeline)['result']
[{u'_id': u'+62 21 5263137'},
 {u'_id': u'14045'},
 {u'_id': u'+62 21 78834966'},
 {u'_id': u'+62 21 500505'},
 {u'_id': u'+62 21 3140343'}]

Overview of the data

You can see the filesize about the dataset.

In [123]:
!ls -lh dataset/jakarta*
-rwxrwxrwx@ 1 Jonathan  staff    79M Nov  5  2014 dataset/jakarta.osm
-rwxrwxrwx  1 Jonathan  staff    80M Nov 18 06:24 dataset/jakarta_audit.osm
-rwxrwxrwx  1 Jonathan  staff    87M Nov 18 06:27 dataset/jakarta_audit.osm.json

Show the top 5 of contributed users

We also can find the top 5 contributed users. These users are count by how they created the point in the map, and sort descent

In [45]:
pipeline = [
            {'$match': {'created.user':{'$exists':1}}},
            {'$group': {'_id':'$created.user',
            {'$sort': {'count':-1}},
            {'$limit' : 5}
result  = db.jktosm.aggregate(pipeline)['result']
[{u'_id': u'Firman Hadi', u'count': 113770},
 {u'_id': u'dimdim02', u'count': 38860},
 {u'_id': u'riangga_miko', u'count': 36695},
 {u'_id': u'raniedwianugrah', u'count': 30388},
 {u'_id': u'Alex Rollin', u'count': 26496}]

Show the restaurant's name, the food they serve, and contact number

In [114]:
pipeline = [{'$match': {'amenity':'restaurant',
result  = db.jktosm.aggregate(pipeline)['result']
[{u'_id': u'Warung Tekko',
  u'contact': u'+62 21 5263137',
  u'cuisine': u'Indonesian'},
 {u'_id': u'YaUdah bistro',
  u'contact': u'+62 21 3140343',
  u'cuisine': u'german'},
 {u'_id': u'Wabito Ramen',
  u'contact': u'+62 21 3923810',
  u'cuisine': u'japanese'},
 {u'_id': u'Goma ramen',
  u'contact': u'+62 81807217074',
  u'cuisine': u'japanese'}]

Other ideas about the datasets

Earlier, we have found that not only most of the public places can be found, but it also shared what kind of description about these places. Now that we have those data, we can create mobile apps that serve as an assistant where to point the users to their need.

Mobile apps can be opened as a simple text box, and ask the users "What do you want?". If for example, the users say "I want to eat Japanese food", the assistant will find Japanese cuisine restaurant within 1 mile radius. Or if the users ask "I want to eat chicken (Indonesia: Ayam)", the assistant will find restaurant's name with 'Ayam' in it. They will provide these restaurants based on the map location, with user location at the exact center of the map. If the user click on one restaurant, it will be given some pop up that give the description of the restaurant. For example, their kind of cuisine, contact number, and ask users two options, whether they want to be given direction to the restaurant or they want to call the contact number.

Not only restaurant, but other kind of public places. Watch movies, go to gym center, finding a flower, etc. This is a lot of possibilities, but this is not without any risk.

Since this is an open data, that means everyone could become the ones who edit it. What about someone gives false information, and the contact number is a wrong number? Or what about the fake address? A consumer can reach into the location by hours (and in Jakarta, the most traffic jam city in the world, it's possible), turns out that's not the place they're looking for. It will be fatal to our company. So it's good thing to do cross-validation.

I believe that we should do some checking. OpenstreetMap has also some validation check, Osmose. We can actually see that the location/data that we're looking for, in warning color level. If the location is red, or even yellow, we shouldn't incoporate it to our data. It's safe to assume that the green one is pass our validation check. But we also have to account for the fact that Osmose could have false positive.

If we targetting for a big company that rely so much of OpenStreetMap data, it's a good thing to have a team that collaborate to make OpenStreetMap better. Besides of giving back to the community, the team responsible to also do any cross-check of any update location. Either using another map, or call the contact number confirming the owner of the location really has moved.


I actually submit the changes that I made back to the OSM. The changeset is here:

But the changes get reverted back, because I violated some rules that stated I can't change the map with machine code. If you see the changes, I actually made a lot of changes that the community will benefit.